To review a Meticulous test run, follow the workflow below step by step, using the CLI commands as described.
Before starting, run the skill that ensures the Meticulous CLI is up to date, unless it has already run earlier in this conversation, in which case skip it.
Prerequisites
You need a testRunId for a completed Meticulous test run. Resolve it from whatever you were given:
- A test-run ID: a 20+ character alphanumeric string, as opposed to a short PR number. Use it directly and skip the rest of this section.
- A PR number: a short integer. Resolve it against the local repo by passing it to gh pr checks below.
- Nothing: omit the argument; gh pr checks then resolves the PR associated with the current branch in the local repo.
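The three cases above can be told apart mechanically. The following is a minimal sketch, not part of the Meticulous CLI: the function name classify_ref is hypothetical, and it assumes that PR numbers are all digits and shorter than 20 characters.

```shell
# Classify a reference as a test-run ID, a PR number, or nothing.
# Hypothetical helper. Assumption: PR numbers are all digits and shorter
# than 20 characters; test-run IDs are 20+ alphanumeric characters.
classify_ref() {
  ref="${1:-}"
  case "$ref" in
    "")
      echo "none" ;;                 # no argument: use the current branch's PR
    *[!0-9A-Za-z]*)
      echo "unknown" ;;              # contains a non-alphanumeric character
    *[!0-9]*)
      # Alphanumeric with at least one letter: only long strings qualify as IDs.
      if [ "${#ref}" -ge 20 ]; then echo "test-run-id"; else echo "unknown"; fi ;;
    *)
      # Digits only: short means PR number, long is treated as a test-run ID.
      if [ "${#ref}" -lt 20 ]; then echo "pr-number"; else echo "test-run-id"; fi ;;
  esac
}
```

An "unknown" result is a signal to ask the user rather than guess.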
To infer the testRunId from a PR's GitHub checks (pass the PR number when you have one, omit it to use the current branch's PR):
gh pr checks <pr-number> --json name,link,state,bucket \
--jq '.[] | select(.name | startswith("Meticulous Tests")) | select(.bucket != "pending") | select((.link // "") | contains("/test-runs/")) | {name, state, testRunId: (.link | capture("/test-runs/(?<id>[^/?#]+)").id)}'
The testRunId is the final path segment of the check's details URL (app.meticulous.ai/.../test-runs/<id>).
- Only completed runs: select(.bucket != "pending") drops checks that are still queued or in progress, so a half-finished check can't hand you a testRunId for an incomplete run. This keeps both passing and failing runs: a fail bucket is a completed run with visual diffs, which is exactly what you're usually here to review. Don't narrow this to passing checks only.
- Check not finished yet: the contains("/test-runs/") filter further excludes checks whose link is still null or empty (the // "" guard keeps jq from erroring on those rows). If the command returns nothing, wait and re-run, or ask the user.
- New commit pushed: no extra steps. gh pr checks always reflects the PR's current head, so the same command returns the latest run's testRunId.
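If you only have a check's details URL, the same extraction that the jq capture performs can be done on a single string. This is a minimal sketch: the function name test_run_id_from_url is hypothetical, and the query string in the usage line is made up for illustration.

```shell
# Extract the test-run ID from a details URL, mirroring the jq pattern
# /test-runs/(?<id>[^/?#]+). Hypothetical helper, not a Meticulous command.
# Prints nothing when the URL has no /test-runs/ segment.
test_run_id_from_url() {
  printf '%s\n' "$1" | sed -n 's|.*/test-runs/\([^/?#][^/?#]*\).*|\1|p'
}

# Usage (illustrative URL; the ?tab=diffs suffix is an assumption):
# test_run_id_from_url "https://app.meticulous.ai/projects/Foo/Bar/test-runs/abc123?tab=diffs"
```

Empty output corresponds to the case where the check has not linked a run yet, so wait and re-run rather than continuing.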
Assess visual frontend changes
Get an overview of all diffs, then visually inspect representative screenshots to cover all changes. The domDiffIds from Step 1 tell you which screenshots share the same structural DOM change, so pick one representative per unique diff ID to cover all changes efficiently. For each representative, always look at the screenshot images first (Step 2): the diff image is the most informative way to understand what actually changed. Use the DOM diff (Step 3) for additional structural detail, and the timeline (Step 4) only when a diff is unexpected and not explained by the DOM or images. The final report should cover all significant visual changes: each visual change deserves its own explanation.
Step 1 -- Get the replay diff summary
meticulous agent test-run-diffs --testRunId <testRunId>
Output format: TSV on stdout, metadata on stderr.
stdout columns:
replayDiffId screenshotName index total outcome mismatch domDiffIds
Example output:
CqctwLpPC7 after-event-0 1 5 diff 0.00234 1;3
RRMGQft7PD after-event-174 3 8 diff 0.01050 1;2
CLkCJ8WLrJ after-event-8 2 4 diff 0.00100 none
Ct8HwmJNzM end-state 5 5 flake 0.00010 error
Ab3xKLmN9Q after-event-12 3 6 missing-base 0.00000 n/a
Each row represents a screenshot where a visual pixel difference was detected between the base (before) and head (after) replay. Rows with outcome diff are confirmed visual differences; other outcomes (such as flake, no-diff, missing-base, and missing-head) are informational.
- outcome: diff (visual pixel difference), or one of the informational outcomes
- mismatch (0-1, 5 decimal places) is the pixel mismatch fraction
- domDiffIds is a semicolon-separated ordered list of diff IDs, one per independent DOM change in the screenshot. Each ID groups structurally identical DOM changes across screenshots (same ID = same structural change). Example: 1;3 means two independent DOM changes with IDs 1 and 3. Special values: none means no DOM changes were found (either computed successfully with no differences, or a matching screenshot where no diff is expected), so the visual difference is purely pixel-level (e.g. anti-aliasing, rendering differences) and you must inspect the screenshot images to understand the change. n/a means the DOM diff is not applicable (e.g. error/warning outcomes where no comparison is possible). error means the DOM diff was attempted but failed (e.g. metadata unavailable or could not be retrieved).
stderr shows: total counts, unique diff counts, and timing breakdown. Proceed to Steps 2-3 for rows with outcome diff.
Use domDiffIds to identify which subset of diffs to inspect. Screenshots sharing the same ID contain the same structural DOM change, so pick one representative per unique ID for efficient coverage.
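Picking representatives can be automated on the saved TSV. This is a minimal sketch under stated assumptions: the function name pick_representatives is hypothetical, the input is the stdout of Step 1 with tab-separated columns in the order listed above, and rows whose domDiffIds is none are always kept because they can only be understood from their images.

```shell
# Read test-run-diffs TSV on stdin; print replayDiffId, screenshotName, and
# domDiffIds for one representative row per not-yet-seen DOM diff ID.
# Hypothetical helper. Only rows with outcome "diff" are considered.
pick_representatives() {
  awk -F'\t' '
    $5 != "diff" { next }                  # skip informational outcomes
    {
      keep = 0
      n = split($7, ids, ";")
      for (i = 1; i <= n; i++) {
        id = ids[i]
        if (id !~ /^[0-9]+$/) {
          keep = 1                         # none / error / n/a: inspect images
        } else if (!(id in seen)) {
          seen[id] = 1                     # first screenshot showing this ID
          keep = 1
        }
      }
      if (keep) print $1 "\t" $2 "\t" $7
    }
  '
}

# Usage: meticulous agent test-run-diffs --testRunId <testRunId> | pick_representatives
```

A row is kept as soon as it introduces at least one new ID, so a screenshot with IDs 1;2 is still selected after another row has already covered ID 1.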
Step 2 -- Get screenshot images
For each representative screenshot:
meticulous agent image-files --replayDiffId <replayDiffId> --screenshotName <screenshotName>
This downloads the screenshot images to ~/.meticulous/agent-images/ and prints the local file paths.
Output format:
outcome: <outcome>
screenshot: <path> # present for missing-base/missing-head
before: <path> # present for diff/no-diff
after: <path>
diffImage: <path>
Open the before, after, and diffImage files to visually inspect the change. The diffImage is usually the most informative: it highlights exactly which pixels changed. Always inspect the images to understand the actual visual impact of a change, even when the DOM diff is clear.
Alternative: use
instead of
to get URLs to the images rather than downloading them locally.
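To hand a specific image to a viewer, the key: path lines of the image-files output can be parsed. This is a minimal sketch: the function name image_path is hypothetical, and it assumes each line is exactly "key: path" with optional leading whitespace (the # annotations in the format listing above are treated as documentation, not real output).

```shell
# Read image-files output on stdin and print the value for one key
# (outcome, screenshot, before, after, or diffImage).
# Hypothetical helper; prints nothing if the key is absent.
image_path() {
  awk -v key="$1" '
    { sub(/^[ \t]+/, "") }                 # tolerate indented lines
    index($0, key ": ") == 1 {             # line starts with "key: "
      print substr($0, length(key) + 3)    # everything after "key: "
      exit
    }
  '
}

# Usage:
# meticulous agent image-files --replayDiffId <replayDiffId> --screenshotName <screenshotName> | image_path diffImage
```

For missing-base and missing-head outcomes only the screenshot key is present, so an empty result for diffImage is expected there.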
Step 3 -- Inspect the DOM diff (for structural detail)
meticulous agent dom-diff --replayDiffId <replayDiffId> --screenshotName <screenshotName>
Optional: pass
to control how many context lines surround each hunk (default 3). Use
for no context, or
for a single unified diff with full file context.
Output format: Unified diff (+/- format) with leading indentation stripped. Diff blocks are separated by [diff 0], [diff 1], etc. headers. Example:
[diff 0]
<span class="text-zinc-400">#7687</span>
-<span class="min-w-0 flex-1 truncate transition-colors">Use divergence-aware comparison</span>
+<span class="min-w-0 flex-1 truncate transition-colors" data-tooltip-id=":r1h:">Use divergence-aware comparison</span>
[diff 1]
<span class="inline-flex items-center rounded-lg bg-zinc-800">Temporal Workflow</span></a>
+<a href="/projects/Foo/Bar/test-runs/abc123"><span class="inline-flex items-center rounded-lg bg-zinc-800">Original: abc123</span></a>
To view a single diff block, add
(maps to the position in the
list from Step 1).
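When you have already saved the full dom-diff output, a single block can also be cut out locally by its header. This is a minimal sketch: the function name diff_block is hypothetical, and it assumes each header sits alone on its own line exactly as in the example above.

```shell
# Read dom-diff output on stdin and print only the lines of block N,
# where N is the number in the "[diff N]" header.
# Hypothetical helper; header lines themselves are not printed.
diff_block() {
  awk -v want="[diff $1]" '
    /^\[diff [0-9]+\]$/ { show = ($0 == want); next }   # block boundary
    show { print }
  '
}

# Usage:
# meticulous agent dom-diff --replayDiffId <replayDiffId> --screenshotName <screenshotName> | diff_block 1
```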
Step 4 -- Get the replay timeline (optional, for diagnosing unexpected diffs)
If a diff is unexpected and the images/DOM don't make it obvious why it happened:
meticulous agent timeline-diff --replayDiffId <replayDiffId>
Output format: TSV on stdout, replay IDs on stderr.
stdout columns:
diff timeMs event description
- diff column: marks each event as identical, removed, added, or changed
- event column: the event type
- description: concise one-line summary of the event
Look for anomalies such as failed network requests, unexpected redirects, or timing-related differences that could explain a visual change.
Decision guide
For each representative screenshot, classify the visual change as intended or unintended based on the diff image and DOM diff:
- Intended: The visual change is a desired outcome of the task you're working on. Confirm and move on.
- Unintended: The change was not a goal of the task. This includes both changes that are clearly unrelated to your code, and — often more importantly — side effects of your code changes that weren't meant to happen. A change being explainable by your code does not make it intended; if the task didn't call for that visual change, it's unintended.
For unintended changes:
- If the change is a side effect of your code, attempt to fix it so the code achieves the intended result without the unwanted visual change, then re-run the test.
- Use the timeline (Step 4) to check for failed network requests, redirects, or other anomalies that could explain diffs unrelated to your code.
- If you can confidently explain the cause (e.g. a flaky timestamp, a non-deterministic element), note the explanation.
- If you cannot explain or fix it, flag it to the user.
Final report
After investigating all diffs and attempting fixes for any fixable issues, produce a summary that covers all significant visual changes. The number of explanation points should be at least as many as the number of unique domDiffIds you inspected — each visual change deserves its own explanation.
- Intended changes: For each distinct visual change that is a desired outcome of the task, describe what changed visually (based on the diff image) and why it's intended.
- Unintended changes (if any): For each, include:
- A representative replayDiffId / screenshotName
- What the visual change looks like (e.g. "new badge element added", "layout shift in header")
- Whether it's a side effect of your code or unrelated, and your best assessment of the cause
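As a sanity check before writing the report, the minimum number of explanation points can be estimated from the Step 1 output. This is a minimal sketch: the function name count_unique_dom_diff_ids is hypothetical, it counts every unique numeric ID on rows with outcome diff (an upper bound on the IDs you inspected), and it ignores the special values none, n/a, and error.

```shell
# Read test-run-diffs TSV on stdin and print the number of unique numeric
# domDiffIds across rows with outcome "diff".
# Hypothetical helper; pixel-only changes (none) still need their own notes.
count_unique_dom_diff_ids() {
  awk -F'\t' '
    $5 == "diff" {
      n = split($7, ids, ";")
      for (i = 1; i <= n; i++)
        if (ids[i] ~ /^[0-9]+$/ && !(ids[i] in seen)) {
          seen[ids[i]] = 1
          count++
        }
    }
    END { print count + 0 }                # print 0 when nothing matched
  '
}

# Usage: meticulous agent test-run-diffs --testRunId <testRunId> | count_unique_dom_diff_ids
```

If the report has fewer explanation points than this number, some structural change has probably been left unexplained.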