Triaging visual review runs
Visual Review is PostHog's screenshot-regression product: CI captures Storybook and Playwright screenshots,
diffs them against committed baseline hashes, and gates the PR until a human approves the visible changes.
A PR with visual changes carries a GitHub status check that stays red until each diffed snapshot is
approved or tolerated in the VR UI.
This skill teaches an agent how to answer the questions a human reviewer would actually ask, by chaining
the read-only VR MCP tools — instead of reaching for
and tab-hopping to the VR web UI.
When this skill applies
Trigger this skill on any of:
- A PR number, branch name, or commit SHA paired with words like visual review, VR, snapshot, screenshot,
storybook diff, playwright snapshot, baseline, approve, tolerated, quarantine.
- Questions about why a PR is blocked, what visually changed, or whether a diff is real.
- "Is my run done?" / "What's left to review?" / "Has this story flaked recently?"
- A failing GitHub check or a PR comment mentioning visual review.
When the user asks for the rendered diff image itself, the VR web UI is faster; direct them there.
This skill is for everything around the diff: status, scope, history, triage.
Tools
All read-only. None of these require write scopes; approval/toleration still happens in the web UI.
| Tool | Purpose |
|---|---|
| posthog:visual-review-runs-list | List runs, filter by / / / . Start here. |
| posthog:visual-review-runs-retrieve | Full detail for a single run (status, summary counts, supersession). |
| posthog:visual-review-runs-snapshots-list | Per-snapshot results inside a run: identifier, , diff %, classification, baseline + current artifact URLs. |
| posthog:visual-review-runs-snapshot-history-list | A single story's last N runs across master/PRs (the flake check). |
| posthog:visual-review-runs-counts-retrieve | Aggregate counts for queue triage (how many runs in , etc.). |
| posthog:visual-review-runs-tolerated-hashes-list | Hashes the team has explicitly accepted as "known flake / acceptable variation". |
| posthog:visual-review-repos-list | Repos (one per GitHub repo). Usually only one matters; useful for filtering. |
| posthog:visual-review-repos-retrieve | Repo metadata: baseline file paths, PR-comment configuration. |
Vocabulary cheat sheet
These appear in tool output and matter for interpretation:
- Run : (open, awaiting human), (zero diffs), (CI still uploading),
(a newer run on the same PR has superseded this one — check ).
- Run : (component snapshots) or (full-page e2e snapshots).
- Snapshot : , (real diff), (no baseline yet), .
- Snapshot : (matches a known-tolerated hash, no action needed),
(under the noise floor), (byte-identical), (real diff requiring review).
- Snapshot : or .
- Run :
total / changed / new / removed / unchanged / unresolved / tolerated_matched
—
is what's actually blocking review.
Workflows
"What's the VR status of this PR?"
The single most common job. Map a PR number to its run state in two calls.
- posthog:visual-review-runs-list { pr_number: <n>, limit: 5 }: sort by desc, take the latest non-stale one.
- If the run has or , drill in: posthog:visual-review-runs-snapshots-list { id: <run_id> } and report the snapshots.

Report back: PR number, run UUID, , summary counts, and the deep link so the user can click straight to the diff viewer.
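
As a minimal sketch of the "take the latest non-stale one" step, assuming the list response has already been parsed into dictionaries: the field names `created_at` and `superseded_by` are hypothetical stand-ins chosen for illustration, not confirmed parts of the tool output, so adjust them to whatever the response actually contains.

```python
from datetime import datetime
from typing import Dict, List, Optional


def latest_active_run(runs: List[Dict]) -> Optional[Dict]:
    """Return the newest run that has not been superseded, or None if there is none."""
    # Hypothetical field: a truthy `superseded_by` marks a stale run.
    active = [run for run in runs if not run.get("superseded_by")]
    if not active:
        return None
    # Hypothetical field: `created_at` holds an ISO 8601 timestamp.
    return max(active, key=lambda run: datetime.fromisoformat(run["created_at"]))
```

Picking the maximum over the filtered list avoids relying on the order in which the tool returns runs.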
"Is the diff real or unrelated?"
The most useful judgment a code-aware agent can add. Combine three signals: scope match, flake history,
and the actual rendered images. The agent should look at the screenshots — not just describe metadata.
- Scope check: git diff master...HEAD --stat (or against the PR's base branch) → list of touched paths.
  Cross-reference with posthog:visual-review-runs-snapshots-list { id } filtered to → story identifiers.
  Stories are namespaced like <area>-<scene>--<story>--<theme>; e.g.
  scenes-app-settings-user--settings-user-profile--dark maps to frontend/src/scenes/settings/user/...
  (use this to translate a story id into a likely source path).
- Visual inspection: for each snapshot, the tool result contains current_artifact.download_url and
  baseline_artifact.download_url. These are pre-signed S3 URLs to PNG files; pull them and look:

  ```bash
  # -f makes curl exit non-zero on an HTTP error (e.g. an expired URL) instead of saving the error body as a PNG
  curl -fsS -o /tmp/vr-baseline.png "<baseline_artifact.download_url>"
  curl -fsS -o /tmp/vr-current.png "<current_artifact.download_url>"
  ```

  Then open both files with the Read tool (it renders images visually) and compare. Things to call out:
- The actual visible delta (text changed, button moved, layout shift, color drift, missing element).
- Whether the change is consistent with the diff_pixel_count and diff_percentage in the metadata
(e.g. 54% diff but the images look near-identical → screenshot framing changed, not the UI).
- Whether the baseline and current have different dimensions ( / fields). Mismatched
dimensions usually mean the story rendered to a different viewport or didn't fully render before
screenshot — a flake signal, not a regression.
- Flake history: run the flake check below for any story that looks suspect.
- Verdict: combine all three signals.
- Scope plausible + visible regression matches the code change → real diff, recommend approval.
- Scope mismatch + dimensions mismatch + frequent prior changes → flake, recommend tolerating the hash.
- Scope plausible + visible regression looks unintended → push a fix; do not approve.
Always include a one-line description of what you saw in the images — the user uses this to decide whether to
trust your verdict without opening the VR UI themselves.
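
Two of the signals above can be checked mechanically before looking at the images. The sketch below is illustrative only: `likely_source_dir` generalises from the single story-id example given above and assumes every hyphen in the scene name is a directory separator (which will be wrong for directories that themselves contain hyphens), and `png_dimensions` reads width and height straight from the PNG header so that a baseline/current size mismatch can be spotted without an image library. Both function names are hypothetical helpers, not part of any VR tool.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def likely_source_dir(story_id: str) -> Optional[str]:
    """Guess a source directory from an id shaped like <area>-<scene>--<story>--<theme>."""
    parts = story_id.split("--")
    if len(parts) != 3:
        return None  # not in the three-part shape described above
    prefix = parts[0]
    marker = "scenes-app-"  # the only prefix the example above demonstrates
    if not prefix.startswith(marker):
        return None
    # Assumption: each hyphen in the scene name maps to a directory separator.
    return "frontend/src/scenes/" + prefix[len(marker):].replace("-", "/")


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # IHDR stores width then height as big-endian 32-bit unsigned integers.
    return struct.unpack(">II", data[16:24])
```

For the files downloaded earlier, reading the first 24 bytes of /tmp/vr-baseline.png and /tmp/vr-current.png and comparing the two results is enough to flag a dimension mismatch.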
Flake check: "Has this story been changing?"
Once you have a suspect snapshot identifier:
posthog:visual-review-runs-snapshot-history-list { id: <snapshot_id> } → returns prior outcomes for the same story.
Verdicts:
- Mostly and this run's diff is the outlier → likely a real regression caused by this PR.
- Frequent across unrelated branches/master → flaky story; recommend tolerating the hash via the UI.
- Recent or large-jump dimension change → baseline likely stale; recommend re-baselining on master.
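
The first two verdicts can be approximated with a simple ratio over the story's history. This is a sketch under stated assumptions: the caller has already reduced each prior run to a boolean ("did this story differ from its baseline in that run?"), and the two thresholds are arbitrary illustrative defaults rather than values used by Visual Review. It does not cover the stale-baseline case, which needs the dimension data.

```python
from typing import Sequence


def classify_history(
    prior_changed: Sequence[bool],
    flaky_ratio: float = 0.3,   # assumed threshold, tune per team
    stable_ratio: float = 0.1,  # assumed threshold, tune per team
) -> str:
    """Classify a story from how often it differed in prior runs."""
    if not prior_changed:
        return "no_history"
    ratio = sum(prior_changed) / len(prior_changed)
    if ratio >= flaky_ratio:
        return "likely_flaky"        # changes often, regardless of branch
    if ratio <= stable_ratio:
        return "likely_regression"   # normally stable, so this diff is the outlier
    return "inconclusive"
```

An "inconclusive" result is a cue to fall back on the scope check and the images rather than to pick a side.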
Triaging the queue
When the user is doing housekeeping rather than asking about a specific PR:
- posthog:visual-review-runs-counts-retrieve → total queue size.
- posthog:visual-review-runs-list { review_state: needs_review, limit: 50 } (paginate if needed).
- Group by author or to surface clusters (e.g., "12 PRs blocked on the same shared
component change" usually means a single underlying root cause to address).
- Prefer surfacing runs whose over runs that are only — means no baseline
yet, which is usually trivial to approve; is the real review work.
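
The grouping and ordering described above could look roughly like this once the runs are in memory. It is a sketch, not the tool's behaviour: the `summary` dictionary with a `changed` count mirrors the summary count names listed in the vocabulary section, while `author` and the overall record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prioritize(runs: List[Dict]) -> List[Dict]:
    """Order runs so those with real diffs come first, largest diff count first."""
    def sort_key(run: Dict) -> Tuple[bool, int]:
        changed = run.get("summary", {}).get("changed", 0)
        # False sorts before True, so runs with changed > 0 lead the list.
        return (changed == 0, -changed)
    return sorted(runs, key=sort_key)


def cluster_by(runs: List[Dict], field: str) -> List[Tuple[str, List[Dict]]]:
    """Group runs by a field (e.g. a hypothetical `author`), biggest cluster first."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for run in runs:
        groups[run.get(field, "unknown")].append(run)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
```

Large clusters are the ones worth reporting first, since they often point at a single shared cause.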
Output expectations
For PR-status questions, lead with the verdict in one line, then 2-4 bullets of supporting context. Always
include the deep link to the run; humans need to see the rendered images to make the call, while
the agent can only describe the metadata.
For triage / aggregate questions, a short table beats prose. Group by what the user is going to act on.
What NOT to do
- Do not approve or tolerate snapshots from this skill — those endpoints are intentionally not exposed as
MCP tools yet. Direct the user to the run's .
- Do not assume the failing GitHub check on a PR is unrelated to VR — if a check is red on
a PR you're working on, that's the trigger to run this skill.
- Do not declare a verdict from metadata alone when . Pull the baseline and current PNGs
and look at them; metadata can only say "something changed", not whether the change is intended.