Docs → Guardrails
Design systems ship documentation. Component galleries, usage guidelines, API references, Storybook instances. All designed for a human to read, internalize, and apply through their own judgment. A developer reads the dialog guidelines, understands the pattern, and builds a dialog that follows the rules. The design system trusts the human to enforce its constraints.
That’s not quite true, actually.
The Shift
AI coding tools do read documentation — just not the way humans do. When a coding agent hits something it can’t infer from context, it drops down to docs the way a developer drops down to Stack Overflow: as a secondary source, consulted when the primary signal (the code around it, the examples in context) isn’t enough. Agents can and do consume markdown files, README docs, and reference pages. But they don’t exercise judgment. That’s the key phrase. A senior developer reads the dialog guidelines and makes a call: “this situation is an exception, I’ll break the pattern here because the user need justifies it.” An AI agent follows the pattern or it doesn’t. It can’t weigh trade-offs the way a human does. And if the docs it finds are ambiguous, it guesses.
So the question isn’t “how do we make docs machine-readable?” — that’s necessary but not sufficient. The question is: what form do guardrails take? And honestly, we don’t fully know yet. I see at least three candidates, and the answer is probably all of them.
First: skills — structured instructions that tell an agent how to use the design system. Not documentation to be interpreted, but executable context: “when building a dialog in Fluent, always include DialogTitle, always trap focus, never auto-dismiss without a timeout token.” Stripe’s agent skills are the best example I’ve seen — specific, behavioral, and consumable by an agent as constraints rather than suggestions. The design system ships a skill file alongside its components.
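As a sketch of what "executable context" could look like, here is a hypothetical skill-file shape (the interface and field names are my invention, not Stripe's or Fluent's API): constraints an agent loads as hard rules rather than prose it interprets.

```typescript
// Hypothetical skill shape -- not a real Fluent or Stripe format.
// A skill encodes behavioral constraints the agent must satisfy,
// not guidance for it to weigh.
interface ComponentSkill {
  component: string;
  mustInclude: string[];  // sub-components that are always required
  mustEnforce: string[];  // behaviors the generated code must implement
  never: string[];        // hard prohibitions
}

const dialogSkill: ComponentSkill = {
  component: "Dialog",
  mustInclude: ["DialogTitle", "DialogBody"],
  mustEnforce: ["trap focus while open", "restore focus on close"],
  never: ["auto-dismiss without a timeout token"],
};
```

An agent runtime would load a file like this alongside the component library and treat every entry as a constraint, which is the difference between a skill and a docs page.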
Second: evaluators — automated checks that run after generation. The agent builds the component, and an evaluator scores it against the design system’s rules. Does the dialog have focus trapping? Does it use valid spacing tokens? Does the dismiss behavior match the spec? Evaluators don’t prevent violations — they catch them. Think of it as a design system linter for AI-generated code, runnable by anyone.
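A minimal evaluator might look like the following sketch. The rule names and the shape of the generated-output descriptor are assumptions for illustration; the point is the pattern: run after generation, return a list of violations.

```typescript
// Minimal post-generation evaluator sketch. The descriptor shape and
// rule names are hypothetical; a real evaluator would inspect the
// generated code or rendered DOM directly.
interface GeneratedDialog {
  children: string[];      // sub-component names found in the output
  trapsFocus: boolean;
  spacingTokens: string[]; // tokens used for padding and margins
}

const VALID_SPACING = new Set(["spacing.s", "spacing.m", "spacing.l"]);

function evaluateDialog(d: GeneratedDialog): string[] {
  const violations: string[] = [];
  if (!d.children.includes("DialogTitle")) {
    violations.push("missing DialogTitle");
  }
  if (!d.trapsFocus) {
    violations.push("focus is not trapped");
  }
  for (const t of d.spacingTokens) {
    if (!VALID_SPACING.has(t)) {
      violations.push(`invalid spacing token: ${t}`);
    }
  }
  return violations;
}

// A non-compliant dialog yields a violation per broken rule:
const report = evaluateDialog({
  children: ["DialogBody"],
  trapsFocus: false,
  spacingTokens: ["spacing.m", "17px"],
});
// report: ["missing DialogTitle", "focus is not trapped",
//          "invalid spacing token: 17px"]
```

Because the check is a plain function over generated output, it can run in CI, in a pre-commit hook, or inside the agent loop itself, which is what makes it "runnable by anyone."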
Third: auditor agents — AI that reviews AI. A design system auditor that can see the rendered output, navigate the component, check the accessibility tree, and verify that the generated UI actually behaves the way the system says it should. Not a static lint pass — a runtime audit by an agent that understands the intent behind the rules, not just the rules themselves.
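One audit step can be sketched as follows. The driver interface is entirely hypothetical (a real auditor would sit on top of a browser-automation tool such as Playwright); the fake driver exists only so the sketch runs. What it checks is the intent behind the dismiss rule: Escape closes the dialog and focus returns to the trigger.

```typescript
// Hypothetical driver interface over a running UI; any real
// implementation would wrap a browser-automation tool.
interface UIDriver {
  press(key: string): Promise<void>;
  activeElementRole(): Promise<string>;
  isOpen(selector: string): Promise<boolean>;
}

// One audit step: verify that Escape closes the dialog AND that focus
// returns to the triggering button -- the intent behind the rule, not
// just its static shape.
async function auditDialogDismiss(ui: UIDriver): Promise<boolean> {
  await ui.press("Escape");
  const closed = !(await ui.isOpen("[role=dialog]"));
  const focusRestored = (await ui.activeElementRole()) === "button";
  return closed && focusRestored;
}

// Fake in-memory driver simulating a well-behaved dialog, for illustration.
function makeFakeDriver(): UIDriver {
  let open = true;
  return {
    async press(key) { if (key === "Escape") open = false; },
    async activeElementRole() { return open ? "dialog" : "button"; },
    async isOpen() { return open; },
  };
}

auditDialogDismiss(makeFakeDriver()).then((ok) => console.log(ok)); // true
```

A static linter could confirm that an Escape handler exists; only a runtime audit like this can confirm that it actually does what the design system intends.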
The honest answer is that this dimension is the least settled of all eight. We know the problem: every AI-generated component without guardrails is a brand violation waiting to happen. We don't yet know the shape of the solution. Skills, evaluators, and auditor agents are early patterns. The design system that figures out the right combination will have solved a problem every organization building with AI will face within the next 18 months.
Where Systems Stand Today
Both systems score 2/10. Fluent’s TypeScript types provide some machine-readable constraints — valid prop values, slot types, the _unstable API convention. Material’s documentation is excellent for humans but not yet shaped for agent consumption. Neither ships skills, evaluators, or audit patterns. The W3C Design Tokens spec is a starting point for the appearance layer — if tokens can be exported as structured data, that’s the beginning. But tokens describe appearance, not behavior. The behavior schema, the skills, and the audit loop are all open territory.
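To make the "tokens as structured data" point concrete, here is a small token export written roughly in the draft W3C Design Tokens format (leaves carry `$value` and `$type`), plus a flattener into CSS custom properties. The token names are invented; note that nothing in this structure can express behavior, only appearance.

```typescript
// Token export in (roughly) the draft W3C Design Tokens format:
// each leaf node carries a $value and a $type. Token names here are
// illustrative, not Fluent's or Material's.
const tokens = {
  color: {
    brand: { $value: "#0f6cbd", $type: "color" },
  },
  spacing: {
    m: { $value: "12px", $type: "dimension" },
  },
};

// Flatten nested token groups into CSS custom properties.
function flatten(node: object, prefix = "--"): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, val] of Object.entries(node)) {
    if (val && typeof val === "object" && "$value" in val) {
      out[`${prefix}${key}`] = (val as { $value: string }).$value;
    } else if (val && typeof val === "object") {
      Object.assign(out, flatten(val, `${prefix}${key}-`));
    }
  }
  return out;
}

const cssVars = flatten(tokens);
// cssVars: { "--color-brand": "#0f6cbd", "--spacing-m": "12px" }
```

This is the appearance layer the spec covers; "dialogs must trap focus" has no `$type`, which is exactly why the behavior schema remains open territory.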
What Pushes a Score Up
Excellent documentation for humans is a 2. TypeScript types that provide some machine-readable constraints are a 3. A 7 means the design system can be consumed as a behavioral schema by generative tools — a contract that an LLM can follow when it generates a component. If the design system can’t be a prompt, it can’t be a guardrail.
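A minimal sketch of what "the design system as a prompt" could mean: a machine-readable component contract serialized into constraint text that an LLM is prompted with. The contract shape and field names are assumptions, not any system's real schema.

```typescript
// Hypothetical behavioral contract for one component. A 7/10 system
// would ship contracts like this for every component.
const dialogContract = {
  component: "Dialog",
  requiredChildren: ["DialogTitle"],
  behaviors: { trapFocus: true, dismissOnEscape: true },
  forbidden: ["auto-dismiss without a timeout token"],
};

// Serialize the contract into prompt text an LLM can follow verbatim.
function contractToPrompt(c: typeof dialogContract): string {
  return [
    `When generating a ${c.component}:`,
    ...c.requiredChildren.map((ch) => `- always include ${ch}`),
    ...Object.entries(c.behaviors)
      .filter(([, on]) => on)
      .map(([name]) => `- enforce ${name}`),
    ...c.forbidden.map((f) => `- never ${f}`),
  ].join("\n");
}

const prompt = contractToPrompt(dialogContract);
```

The same contract can feed a prompt, an evaluator, and an auditor, which is the property that separates a behavioral schema from documentation.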
Where this is going. This page is a working summary — the shift, the current state, the scoring rubric. The full deep dive expands each section with code-level evidence, specific component proposals, and mockups. Trust Expression is the first dimension getting the full treatment; the rest follow as they earn it.
If you’re building against this shift — or you see something the summary is missing — write back. The scorecard is debatable by design.