Overview
UI Verification covers two parallel modes against a live web app:
- Visual verification: checks whether the page matches its design specification. Translates design claims into CSS rule checks and runs them against the live DOM via the verify_* tools. Deterministic; the browser's computed styles are the source of truth.
- Flow verification: checks whether user journeys complete correctly. Executes Gherkin scenarios from .feature files via Nova Act's act() (actions) and act_get() (assertions). Non-deterministic; results vary run-to-run with network timing and live UI shifts.
Both modes share the skill, the MCP server, and the browser session. A single run can produce both kinds of output, combined into one report. Or either mode can run alone; the other section is omitted from the combined report.
Each run produces:
- Structured artifacts — per-category JSON for visual; per-flow execution data for flows
- Annotated screenshots — red bounding boxes highlighting visual failures on the page
- Verification report — markdown combining a visual summary, a flow summary (per-flow status table), and links to per-flow detail reports
The verify_* tools are deterministic — no vision model, browser's computed styles are the source of truth. The compile and audit passes for visual are LLM-driven (best-effort), reconciling design intent against the app's actual structure. Flow steps are interpreted by Nova Act each run; flow runs are inherently non-deterministic.
Reconciliation inputs
Visual verification. The 5 compiled category files (.ui-verification/specs/*.md) are reconciled from up to three inputs:
| Input | What it provides | Required? |
|---|---|---|
| design.md | Free-form design intent: tokens, prose, design language, component definitions | Required |
| The running app | Live DOM as observed in the browser session: real selectors, real computed values | Required for verification (informs selectors and validates rules) |
| App source code | Components, theme/tokens, CSS files (implementation truth) | Optional. When accessible, makes selectors deterministic and divergence classification more precise |
Flow verification. Flows are authored or generated as .feature files at <output_dir>/.ui-verification/flows/:
| Input | What it provides | Required? |
|---|---|---|
| .feature files | Gherkin scenarios with metadata header (flow ID, type, app URL, optional auth and cleanup) | Required for flow verification |
| The running app | Target of the scenarios; Nova Act executes against the live URL | Required |
| Auth credentials | Provided when a flow's metadata declares a required login | Conditional |
When source code is accessible (the user is a developer working on their own app), the agent should detect it and use it during compile, audit, and spec generation. Source can live at the project root OR one level deeper (e.g. output_dir/<package-name>/package.json, a workspace layout where the verifier opens at the workspace root and one or more app packages live as subdirectories). Check both locations. When source is not available (verifying an external site, black-box check), the skill operates in DOM-only mode: selectors are best-effort, and the audit relies on design.md + DOM only.
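The two-location check described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual detection logic: the function names are hypothetical, and using `package.json` as the only marker file is an assumption taken from the example path in the text.

```python
from pathlib import Path


def find_source_roots(output_dir: str, marker: str = "package.json") -> list[Path]:
    """Return directories that look like app source roots.

    Checks the project root itself and each immediate subdirectory
    (the workspace layout). `marker` is an assumed manifest filename.
    """
    root = Path(output_dir)
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    return [d for d in [root, *subdirs] if (d / marker).is_file()]


def detect_mode(output_dir: str) -> str:
    """Pick source-aware mode when any source root exists, else DOM-only."""
    return "source-aware" if find_source_roots(output_dir) else "dom-only"
```

Checking only one level down keeps the probe cheap; a deeper layout would fall back to DOM-only mode under this sketch.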
The 5 category files grow as the app develops:
- design.md grows via Scribe (user expressing design intent in chat → design.md edits)
- 5 mds grow as the Compiler discovers more verifiable surfaces in the app and source
Both layers can grow, but only via their own author paths. The 5 mds are NEVER edited to record observations or freeze divergent live-site values — see hard rules below.
Three names — don't conflate them
| Name | What it identifies |
|---|---|
| ui-verification | This skill (the agent's playbook) |
| | The MCP server providing browser + verify_* tools |
| .ui-verification/ | The artifact directory at the project root (compiled specs, assertion JSON, reports) |
These are unrelated despite words overlapping. The skill does NOT live inside the artifact dir; the artifact dir is at the project root, not inside the skill.
Capabilities
| Capability | Tool | Source File | What It Checks |
|---|---|---|---|
| Visual Style | verify_visual_style | specs/visual-style.md | Colors, typography, spacing, radii, shadows |
| Components | verify_components | specs/component-rules.md | Component presence, variants, props |
| Accessibility | verify_accessibility | specs/accessibility.md | Aria roles, landmarks, heading hierarchy |
| Project Rules | verify_project_rules | specs/project-rules.md | Layout structure, spacing system, conventions |
| Platform Conventions | verify_platform_conventions | specs/platform-conventions.md | Navigation patterns, page structure |
| User Flows | act() + act_get() | flows/<flow-name>.feature | End-to-end user journeys, functional correctness |
Visual rules can be route-scoped: each category file may contain a default section and route-specific sections. See references/spec_authoring.md. Flow scenarios target the URL declared in their metadata header; route scoping is per-flow rather than per-rule.
Available MCP Tools (19 total)
Session Management
start_browse(url, intent, browser_mode)
— open a URL, get a . Use for verification.
session_close(session_id)
— terminate browser session
- — list active sessions
Verification (visual only; all require session_id and rules JSON; ALWAYS pass output_dir = absolute path to project root)
verify_visual_style(session_id, rules, output_dir)
verify_components(session_id, rules, output_dir)
verify_accessibility(session_id, rules, output_dir)
verify_project_rules(session_id, rules, output_dir)
verify_platform_conventions(session_id, rules, output_dir)
These are the visual-mode verification tools. For flow verification, act() and act_get() (below) are the primary drivers; verify_* doesn't apply to Gherkin steps.
If output_dir is omitted, the server writes assertions to a temp dir and downstream report/annotation steps can't find them.
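As a rough illustration of the output_dir requirement, the arguments for a verify_* call could be assembled with a guard like the one below. The helper name is hypothetical, and passing the rules as a JSON string is an assumption; the real MCP client may accept a structured value instead.

```python
import json
from pathlib import Path


def build_verify_args(session_id: str, rules: list[dict], project_root: str) -> dict:
    """Assemble arguments for a verify_* call, refusing a relative output_dir."""
    root = Path(project_root)
    if not root.is_absolute():
        raise ValueError(f"output_dir must be an absolute path, got {project_root!r}")
    return {
        "session_id": session_id,
        "rules": json.dumps(rules),  # assumption: rules travel as a JSON string
        "output_dir": str(root),
    }
```

Failing early on a relative path avoids the silent temp-dir fallback, where assertions are written but later steps cannot locate them.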
Browser Interaction (all require session_id)
- navigate(session_id, url): go to URL
- click(session_id, selector): click element
- scroll(session_id, direction, selector?): scroll up/down
- hover(session_id, selector): hover element
- press_key(session_id, key): keyboard input
- type_text(session_id, selector, text, clear_first?): type into input
Content & Capture (all require session_id)
- evaluate_js(session_id, script): run JavaScript in page context
- get_page_content(session_id, format?): page content in the requested format
- screenshot(session_id, destination?): capture viewport to file path
Natural Language (all require session_id)
- act(): instruct browser actions (scroll, click, navigate, fill forms). For flow verification, this is the primary driver of action steps. For visual verification, do NOT use it for CSS checks; use the verify_* tools instead.
- act_get(session_id, prompt, schema?): structured data extraction or state verification. For flow verification, this is the primary driver of assertion steps; supplement with deterministic checks where possible. For visual verification, do NOT use it for CSS checks: perception and reasoning over the page are the agent's job, and CSS verdicts come from the verify_* tools.
Artifact Structure
<project_root>/
    visual/design.md ← visual source spec (or .ui-verification/design.md)
    .ui-verification/
        .integrity.json ← compile-state ledger (visual only, see spec_sync.md)
        specs/ ← compiled visual category files (INPUT to verify_*)
            visual-style.md (clean markdown; integrity tracked in .integrity.json)
            component-rules.md
            accessibility.md
            project-rules.md
            platform-conventions.md
        flows/ ← flow .feature files (INPUT to act() / act_get())
            <flow-name>.feature
        sessions/ ← per-session output (MCP-owned)
            <session_id>/
                <category>_assertions.json (visual assertion JSON, write-once)
        reports/ ← per-run output (skill-owned)
            <YYYYMMDD-HHmmssZ>/ ← UTC run-timestamp (a run can span multiple sessions)
                report.md ← combined visual + flow summary
                screenshots/ ← visual annotated failures
                flow-reports/ ← per-flow reports
                    <flow-name>.report.md
                sessions.json ← manifest of session IDs in this run
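For illustration, the per-run paths in this layout could be derived as below. This is a sketch under assumptions: the function names are hypothetical, and placing sessions.json inside the run directory is inferred from the listing rather than stated outright.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def run_timestamp(now: Optional[datetime] = None) -> str:
    """Format a UTC run timestamp in the YYYYMMDD-HHmmssZ shape."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%SZ")


def run_paths(project_root: str, stamp: str) -> dict[str, Path]:
    """Map the per-run outputs to their locations under reports/<stamp>/."""
    run_dir = Path(project_root) / ".ui-verification" / "reports" / stamp
    return {
        "report": run_dir / "report.md",
        "screenshots": run_dir / "screenshots",
        "flow_reports": run_dir / "flow-reports",
        # Assumption: the session manifest lives inside the run directory.
        "sessions_manifest": run_dir / "sessions.json",
    }
```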
Hard rules every run obeys
Default mode is "both" unless user narrows scope
When the user says "verify [url]", "run verification on [url]", or any unqualified verification request, the run MUST include BOTH visual and flow verification. Do NOT default to visual-only. Only narrow to one mode when the user explicitly requests it ("check styles only", "run flows only") or when the disambiguation table clearly matches a single-mode pattern.
If no .feature files exist, generate them (see references/flow_generation.md). If no design.md exists, generate it (see references/spec_generation.md). Missing artifacts trigger generation, not scope narrowing.
Audit when the integrity ledger triggers
Before calling any verify_* tool, check the integrity ledger (see § "The integrity ledger covers the clean case" below for the trigger conditions). When the audit runs, reconcile each in-scope rule against the inputs (design.md, app source if accessible, running app DOM). This is a best-effort LLM check, not a substring match, because the Compiler is itself LLM-driven and rules can legitimately encode information that isn't a literal substring of design.md.
For each rule ({Name, Selector, Property, Constraint, Scope}), answer three questions:
- Intent traceable — does the rule's claim (what's being asserted: a token value, a property/value pair, an element's presence) correspond to something stated or implied by design.md, OR a component definition / theme token in source code, OR an idiom present in the running app?
- Constraint reconciles — does the constraint value match what design.md assigns to this element/property combination, OR what source code's theme/token files assign, OR what the running app's component renders at rest? Constraints lifted from the live site WITHOUT a design.md or source backing are contamination.
- Selector plausible — does the selector target the element that design.md (or source) describes? Selectors can come from the app (more specific than design.md alone could specify), but the target must match the described element.
Classify each rule as:
- PASS — all three questions reconcile against at least one input
- ORPHAN — the rule's claim has no source. design.md doesn't make this assertion; source doesn't define this assignment; the only "evidence" is what the live site happens to render. This is the contamination case.
- DIVERGENT — the claim IS in design.md (token defined, component referenced) but the rule's constraint contradicts design.md's assignment. E.g. design.md says component X uses token Y, but the rule asserts component X has the value of token Z.
Skip ORPHAN and DIVERGENT rules in the current run; surface them in the report's Audit Findings section (see references/verification_report.md). Verifying them would either pass (silently confirming contamination) or fail (without the right reason). Continue verifying the PASS rules. The user resolves contamination on their own time with three options: drop the rule, upstream the claim into design.md and recompile, or recompile from scratch.
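The classification and skip behavior above might be sketched like this. The three answers come from the LLM audit pass; the code only shows how they could combine. Treating an implausible selector as DIVERGENT is an assumption of this sketch, since the text does not assign that case to a class, and all names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    ORPHAN = "ORPHAN"
    DIVERGENT = "DIVERGENT"


@dataclass
class AuditAnswers:
    intent_traceable: bool       # claim appears in design.md, source, or app idiom
    constraint_reconciles: bool  # value matches design.md or source assignment
    selector_plausible: bool     # selector targets the described element


def classify(answers: AuditAnswers) -> Verdict:
    """Combine the three audit answers into one verdict."""
    if not answers.intent_traceable:
        return Verdict.ORPHAN  # no backing input: the contamination case
    if not answers.constraint_reconciles:
        return Verdict.DIVERGENT
    if not answers.selector_plausible:
        return Verdict.DIVERGENT  # assumption: the source leaves this case open
    return Verdict.PASS


def partition(audited: dict[str, AuditAnswers]) -> tuple[list[str], dict[str, Verdict]]:
    """Split rule names into those to verify and those to report as findings."""
    verdicts = {name: classify(a) for name, a in audited.items()}
    to_verify = [n for n, v in verdicts.items() if v is Verdict.PASS]
    findings = {n: v for n, v in verdicts.items() if v is not Verdict.PASS}
    return to_verify, findings
```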
When to write the integrity ledger. Single rule: write it when the category files on disk equal what the Compiler would emit from the current design.md right now.
- Compile finished cleanly, no skipped rules → write.
- Selector repair or constraint syntax fix completed → write (those repairs ARE what the Compiler would emit now that the original was known to fail).
- Audit skipped any rules (ORPHAN/DIVERGENT) → don't write. Those rules are still in the file but they're NOT what a fresh compile would emit. Leave the ledger missing/stale so the next run re-audits.
- Verify-only run, files unchanged → no-op; don't touch the existing ledger.
The origin of selectors (DOM observation in heuristic mode, source code in source-aware mode) does NOT determine ledger eligibility. As long as rules' claims and constraints trace to design.md (the audit verifies this), the ledger reflects a valid Compiler-approved state. See Compilation step 7 for the full case table.
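The four ledger cases listed above reduce to a small decision, sketched here with hypothetical names. It covers only the cases in the list; the full case table referenced above may contain more.

```python
def ledger_action(files_changed: bool, skipped_rules: int) -> str:
    """Decide what to do with the integrity ledger at the end of a run.

    files_changed: a compile, selector repair, or constraint syntax fix
                   rewrote at least one category file this run.
    skipped_rules: number of ORPHAN/DIVERGENT rules the audit skipped.
    """
    if skipped_rules > 0:
        # Files still hold rules a fresh compile would not emit.
        return "leave-stale"
    if files_changed:
        return "write"
    # Verify-only run with unchanged files.
    return "no-op"
```

Checking skipped rules first means a repair in the same run never masks unresolved contamination.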
The integrity ledger (.integrity.json) covers the clean case. If the ledger says all hashes match (design.md and every category file), the file state is provably what the Compiler last wrote, and no audit is needed. See references/spec_sync.md § Integrity Ledger.
Audit runs when:
- Any category file's hash mismatches the ledger (file edited outside Compiler — prior buggy run, hand-edit, partial-write)
- Ledger is missing (no integrity baseline, run conservatively)
- User explicitly requests re-audit (manual correctness check; hashes can be stale even when valid if a Compiler bug wrote bad rules and updated its own hash)
Skip-audit when hashes match is a real efficiency improvement for repeat runs. But periodic manual re-audit ("re-audit visual-style") is recommended after any large compile or after suspicious changes.
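A minimal sketch of the trigger check follows. The ledger schema is not specified in this document, so the flat `{filename: sha256-hex}` JSON shape used here is an assumption, as are the function names.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def audit_needed(ledger_path: Path, tracked: list[Path], user_requested: bool = False) -> bool:
    """Return True when any audit trigger fires."""
    if user_requested or not ledger_path.is_file():
        return True  # explicit re-audit, or no integrity baseline
    # Assumed ledger shape: {"visual-style.md": "<sha256 hex>", ...}
    ledger = json.loads(ledger_path.read_text())
    for f in tracked:
        if not f.is_file() or ledger.get(f.name) != sha256_of(f):
            return True  # missing file or hash mismatch
    return False
```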
Audit cost. This is one LLM reasoning pass per scoped category file (or per rule batch — agent's choice). Not free, but bounded: proportional to the rules being verified, no MCP calls. The same kind of reasoning the Compiler used to write the rules; the audit just checks "would I write this rule if I compiled fresh now?"
Assertion JSON is immutable
Files at <output_dir>/.ui-verification/sessions/<session_id>/*_assertions.json are write-once OUTPUT of verify_*. NEVER edit them. No exceptions.
The JSON records what verify_* saw against the live DOM. Don't rewrite values, change pass/fail, add scope, "annotate" findings, or add commentary. If a field seems missing (e.g. scope), the report layer joins it in from the source it came from (e.g. the category file); assertion JSON itself stays exactly as the MCP server wrote it.
If you find yourself opening assertion JSON to fix something, stop: that's the report's job. The agent reads the JSON; the JSON does not change after verify_* writes it.
The 5 category files are reflections of design.md
The 5 compiled .ui-verification/specs/*.md files are derived from design.md. They are NOT a scratch pad, working memory, or place to record observations.
Verification mode (design.md exists):
Edit a category file ONLY when:
- design.md changed (or chat became a design.md edit) → recompile the affected rules
- Selector repair: an existing rule's selector returned "selector not found" and you found a working replacement (selector update only — name/property/constraint stay)
Do NOT edit category files to:
- Capture an observation about the live site (that's the report's job)
- Add a rule that "documents a divergence" with a constraint that matches the divergent live-site value (this silently encodes site bugs as truth and prevents future detection)
- Make a failing rule pass by relaxing the constraint
- Record findings, notes, or context
For partial / scoped verification, pick existing rules from the right category files — don't author new ones unless they're traceable back to a design.md claim that was missed during the prior compile (which is a Compiler bug to surface, not a routine action).
Generation mode (cold-compile from a live site, no design.md yet):
The above rule is RELAXED during generation, because the 5 mds are being seeded for the first time. Generation observes the running app and writes both design.md and the 5 mds in one pass. The constraints in the 5 mds at end-of-generation match the observed DOM values; that is the reverse-engineering contract, not contamination.
The "no recording observations" guard kicks in after generation completes and the user has reviewed design.md. From that point forward, the verification-mode rules above apply: edits go through design.md + recompile, never directly to the 5 mds.
See references/spec_generation.md § Phase 5 for the generation-mode rules. Source code, when accessible during generation, informs names (token names, component names) but NOT values; the DOM is authoritative for values. There is no "source vs DOM divergence" during generation: the DOM is the cascade-resolved outcome of all source CSS, and any apparent disagreement is between one source file the agent read and the same source compiled by the browser.
Each run is independent
Do NOT read prior assertion JSON or report.md from earlier sessions/ or reports/ directories. The only state carried across runs is the compiled specs/*.md files plus the .integrity.json ledger (re-compile is skipped if all ledger hashes match current files). Prior assertions and reports are historical artifacts; they don't inform the current run.
If you find yourself reading a prior session's assertions to "compare," stop — that's cross-session warm-start, which is deferred. Run fresh, write a fresh report.
Flow files at .ui-verification/flows/ are the only flow input
The .feature files at <output_dir>/.ui-verification/flows/ are the only input to flow verification. Never compose flows ad-hoc from chat input mid-run; never modify .feature files mid-run. If the user wants to change a scenario, the change goes through the Scribe before the next run.
Flow runs are non-deterministic
Do NOT carry forward prior flow session results across runs. Nova Act re-interprets steps each run, network timing varies, the live UI shifts. Carrying forward "passed" verdicts would mask real flakiness or environmental drift. Every flow runs every time. Flow-side regressions surface via the per-flow status table in the combined report, not a warm-start mechanism.
Every run produces a report
This rule has no exceptions. Whether the user asks to verify a whole site or a single line of design.md, the run isn't done until:
- Rules persist on disk — every rule passed to verify_* at verification time (step 6) must already exist in a category file under the right section. (Compile-time selector validation is a separate use of verify_*; see verification.md step 4.)
- Scope is joined at report-time, not stamped onto assertions: the report reads BOTH the assertion JSON (for verdicts) and the category file (for scope) and joins them on rule name. Assertion JSON stays exactly as the MCP server wrote it. See references/verification_report.md for the join.
- A report is written: <output_dir>/.ui-verification/reports/<run-timestamp>/report.md, with the failure table and any annotated screenshots. See references/verification_report.md for format. Even an all-pass run produces a report.
- The user-facing summary links the report, not the assertion JSON. The JSON is intermediate output; the report is the deliverable.
A "quick check" of one or two claims is still a verification run. The same four rules apply.
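The report-time join can be pictured with the sketch below. The field names (`name` on assertions, `Name` and `Scope` on parsed rules) are assumptions for illustration; the actual assertion schema is defined by the MCP server and the join is specified in references/verification_report.md.

```python
def join_scope(assertions: list[dict], rules: list[dict]) -> list[dict]:
    """Build report rows by joining verdicts with scope on rule name.

    Returns new dicts; the assertion records themselves are never mutated.
    """
    scope_by_name = {rule["Name"]: rule.get("Scope") for rule in rules}
    return [
        {**assertion, "scope": scope_by_name.get(assertion["name"])}
        for assertion in assertions
    ]
```

Because each row is a fresh dict, the loaded assertion data stays byte-for-byte what the server wrote, which matches the immutability rule.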
Workflow
For visual verification tasks, load references/verification.md. For flow verification tasks, load references/flow_verification.md. Both reference docs have a complete decision flow for their mode.
| User intent | Reference |
|---|---|
| Verify a live site against a design spec (visual) | references/verification.md |
| Run user flows against a live site | references/flow_verification.md |
| Generate spec from live site (no design.md exists) | references/spec_generation.md |
| Generate flows from a live site (no .feature files exist) | references/flow_generation.md |
| Compile design.md → category files; sync chat edits | |
| Sync user intent → files | |
| Set up MCP server + browser session | |
| Write/edit design spec files | references/spec_authoring.md |
| Write .feature files | references/flow_authoring.md |
| Generate verification report (visual + flow) | references/verification_report.md |
| Annotate failures visually on the page | references/annotate_failures.md |
| Constraint syntax reference | references/constraint_reference.md |
| Per-category translation patterns | references/verify_visual_style.md, references/verify_components.md, references/verify_accessibility.md, references/verify_project_rules.md, references/verify_platform_conventions.md |
| Cross-session warm-start (deferred, not in scope) | |
All references live at references/ relative to this SKILL.md file. The absolute path depends on where the skill is installed:
- Global install: ~/.<agent>/skills/ui-verification/references/<name>.md
- Workspace install: <project_root>/.<agent>/skills/ui-verification/references/<name>.md
To resolve references, use the directory containing this SKILL.md as the base, NOT the workspace root. If your skill loader's progressive disclosure hasn't surfaced them mid-session, read them directly with the Read tool using the appropriate absolute path; never search the filesystem for them.
Disambiguation: visual vs flow vs both
Match the user's request to the right mode:
| Phrase pattern | Mode | Action |
|---|---|---|
| "verify design", "check styles", "match the spec", "is it on-brand" | Visual only | Load references/verification.md |
| "run flows", "test the user journey", "verify login works" | Flow only | Load references/flow_verification.md |
| "verify [url]", "run verification on [url]" with no further qualifier | Both | Visual first, then flow, into one combined report |
| User names a specific .feature file or flow ID | Flow only | Load references/flow_verification.md and run only that flow |
| User selects text from design.md or names a category | Visual only | Load references/verification.md and use partial-selection flow |
When in doubt, ask the user once: "Run visual verification, flow verification, or both?" Don't guess at scope when the request is genuinely ambiguous.
Where this skill lives
This skill is at <output_dir>/.<agent>/skills/ui-verification/ for workspace-local installs, OR at ~/.<agent>/skills/ui-verification/ for global installs; wherever the skill loader picked it up from is its installed location.
Do NOT search the filesystem for it. Activation is the runtime's job; if you've reached this SKILL.md, the runtime already knows where you are.
Resolving the references directory
The references/ folder is always co-located with this SKILL.md file, not with the workspace or output directory. Use the path that the runtime used to load this file as the base:
| Install type | SKILL.md location | References at |
|---|---|---|
| Global | ~/.<agent>/skills/ui-verification/SKILL.md | ~/.<agent>/skills/ui-verification/references/ |
| Workspace | <project>/.<agent>/skills/ui-verification/SKILL.md | <project>/.<agent>/skills/ui-verification/references/ |
When reading a reference, construct the absolute path from the skill's install location. Example for a global install:
~/.<agent>/skills/ui-verification/references/verification.md
~/.<agent>/skills/ui-verification/references/spec_sync.md
Do NOT assume references are at <output_dir>/.<agent>/skills/ui-verification/references/ when the skill was loaded from the global location; the workspace may not have a copy.
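Path construction from the skill's install location could look like the following sketch. The function name is hypothetical, and it assumes the runtime exposes the path it used to load SKILL.md.

```python
from pathlib import Path


def resolve_reference(skill_md_path: str, name: str) -> Path:
    """Build the path to references/<name>.md next to the loaded SKILL.md."""
    skill_dir = Path(skill_md_path).expanduser().resolve().parent
    return skill_dir / "references" / f"{name}.md"
```

Anchoring on the SKILL.md path rather than the workspace root gives the same answer for global and workspace installs, with no filesystem search.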
If you're a fresh agent on a new turn and you don't immediately have a tool from the MCP server available, the MCP server may still be starting. Wait for the runtime to surface it on the next user turn rather than searching the filesystem to "find" the skill yourself. The skill is already loaded; the tools are not always synchronously available with skill activation.
Don't search for tool implementations
Never search for the MCP server source code, the constraint engine source, or any other tool implementation. The behavior of verify_*, the constraint syntax, the selector matching algorithm: all of this is documented in references/constraint_reference.md and the per-category deep-dives. If a constraint or property behaves unexpectedly during a run, read the reference, not the implementation. The references are the agent-facing documentation of record; reaching for the source to spelunk the engine is a sign the reference needs an update, which the user can address. In-session, work from documented behavior.