Agent Output Audit
Independent verification of AI-implemented work. The skill that asks:
"Did the implementing agent actually do what it says it did?" rather than
"Would a real user succeed at this product?" (that's
).
Required Reading Router
Match your task to the row. Read the listed files in full before producing output. They are not appendices — they are load-bearing. Inline content in this SKILL.md is a pointer, not a substitute.
| Task | MUST read |
|---|---|
| Discovering install/lint/test/build/start commands (Step 1) | `references/project-signals.md` |
| Deciding E2E support and classifying coverage (Step 1) | `references/e2e-coverage.md` |
| Building the audit scope checklist (Step 2) | |
| Holding independent-evaluator stance on AI tasks (Step 3) | `references/independent-evaluator-protocol.md` |
| Scanning test diffs for AI hygiene red flags (Step 4) | `references/ai-implementation-audit.md` |
| Diagnosing a test that passed on retry without a code change | `references/flaky-triage.md` |
Reference Index
- `references/project-signals.md`: Heuristics for picking install/lint/test/build/start commands across ecosystems when the repo lacks an umbrella gate.
- `references/e2e-coverage.md`: Taxonomy for / / / and how to detect harness support.
- — Audit checklist by category: contract discovery, baseline, task audit, AI hygiene, flaky detection, quality gates.
- `references/ai-implementation-audit.md`: Red Flag scanners (RF-1..RF-6), Requirement→Test mapping, verdict matrix for completed tasks.
- `references/independent-evaluator-protocol.md`: What counts (and doesn't count) as evidence; transcript classification ( / / / ).
- `references/flaky-triage.md`: Taxonomy, diagnosis protocol, and quarantine workflow for retry-passes-without-code-change failures.
Required Inputs
- audit-output-path (optional): Directory where audit artifacts (bugs, audit report, evidence) are stored. When provided, create the directory if it does not exist and use it for all audit outputs. When omitted, fall back to repository conventions or `/tmp/agent-output-audit-<slug>`.
Procedures
Step 1: Discover the Repository Verification Contract
- Read root instructions, repository docs, and CI/build files before running commands.
- Execute `python3 scripts/discover-project-contract.py --root .` to surface candidate install, verify, build, test, lint, and start commands, plus E2E signals.
- STOP. Read `references/project-signals.md` in full before picking commands when discovery surfaces more than one plausible gate or the repo mixes ecosystems.
- STOP. Read `references/e2e-coverage.md` in full before classifying any flow.
- Prefer repository-defined umbrella commands such as , , or CI entrypoints over language-default commands.
- Resolve the audit artifact directory. If the user provided an argument, use it. Otherwise use repository conventions, falling back to `/tmp/agent-output-audit-<slug>`. Create the subdirectory; store all bugs and reports under `<audit-output-path>/audit/`.
- Detect Compozy mode. If exists, record the slug and switch into Compozy-aware audit:
- Read (read-only — never write to it; owns mutation per the cy-codex-loop contract).
- Read (deliverable source of truth) and (task roster) when present.
- List every and capture its frontmatter value (allowed: , , ). When frontmatter disagrees with , treat frontmatter as the source of truth.
- Note the canonical memory slot `.compozy/tasks/<slug>/memory/qa-execution.md`. Step 5 writes audit notes there before any status flips.
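The directory resolution described in this step can be sketched as follows. This is a minimal illustration, not part of the skill's scripts: the function name `resolve_audit_dir` and the `repo_convention` parameter are hypothetical, and how a repository convention is detected is left to the caller.

```python
from pathlib import Path
from typing import Optional


def resolve_audit_dir(
    slug: str,
    audit_output_path: Optional[str] = None,
    repo_convention: Optional[str] = None,  # hypothetical: a path the repo docs prescribe
    create: bool = True,
) -> Path:
    """Pick the audit artifact root, then return its audit/ subdirectory."""
    if audit_output_path:
        # An explicit user argument always wins.
        root = Path(audit_output_path)
    elif repo_convention:
        root = Path(repo_convention)
    else:
        # Documented fallback location.
        root = Path("/tmp") / f"agent-output-audit-{slug}"
    audit_dir = root / "audit"
    if create:
        audit_dir.mkdir(parents=True, exist_ok=True)
    return audit_dir
```

Calling `resolve_audit_dir("my-slug")` with no other arguments yields the `/tmp/agent-output-audit-my-slug/audit` fallback.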
Step 2: Run the Baseline Verification Gate
- Install dependencies with the repository-preferred command.
- Run the canonical verification gate once before any audit work. Execute in fastest-first order: lint and type-check, then build, then unit tests, then integration tests.
- If the E2E command is separate from the umbrella gate, decide whether to run it now or after runtime prerequisites are ready, then record that plan explicitly.
- If the baseline fails, read the first failing output carefully and determine whether it is pre-existing or introduced by current work before moving on.
4a. Flaky-failure protocol. When a baseline command fails, before classifying it as pre-existing or new, run the failing test in isolation 3-5 times on the same SHA. If it passes at least once without code changes, classify as , record in under (test name, attempts, retry outcome, suspected category), and do NOT promote to PASS via retry. STOP. Read `references/flaky-triage.md` in full before assigning a suspected category or proposing a quarantine.
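The rerun rule in 4a can be sketched as below. The labels `"flaky"` and `"consistent-failure"` are placeholder strings chosen for this example, not the taxonomy from `references/flaky-triage.md`, and both function names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def rerun_in_isolation(command: Sequence[str], attempts: int = 3) -> List[bool]:
    """Run the same command several times on the current checkout; True means exit code 0."""
    if not 3 <= attempts <= 5:
        raise ValueError("the protocol calls for 3-5 attempts")
    return [
        subprocess.run(command, capture_output=True).returncode == 0
        for _ in range(attempts)
    ]


def classify_reruns(outcomes: Sequence[bool]) -> str:
    """Label a test that already failed once in the baseline, given its isolated reruns."""
    if any(outcomes):
        # Passed at least once with no code change: flaky, never promoted to PASS.
        return "flaky"
    # Failed every time: still to be sorted into pre-existing vs. newly introduced.
    return "consistent-failure"
```

A caller would pass the single-test invocation for the repository's runner as `command`, then record the attempts and outcomes alongside the label.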
Step 3: Audit Task Implementations (Compozy mode and any AI-implemented tasks)
Skip this step only when no task, phase, PRD, tech spec, or implementation-plan artifacts exist.
- STOP. Read `references/independent-evaluator-protocol.md` in full before forming any task verdict. Tripwire summary: never accept the implementing agent's transcript, success message, or memory note as evidence. In Compozy mode, read the implementing agent's `.compozy/tasks/<slug>/memory/<phase>.md` artifacts and classify anomalies ( / / / ) in the section of before judging the task.
- Read each and its body. Summarize each task into a Task Implementation Matrix (column names mirror cy-codex-loop frontmatter):
- (e.g., `.compozy/tasks/<slug>/task_07.md`)
- — literal frontmatter value
- , , , — mirrored from frontmatter
- — linked section in when present
- Requirements, subtasks, checklist items, success criteria, dependent files
- — files, modules, routes, commands, migrations, seeds, tests
- — commands executed, exit codes, output summaries
- — | | | | (distinct from )
- — red flag IDs that fired in Step 4 with verdict
- — | | |
- — BUG IDs
- Do not treat a task , checked checkbox, memory note, or prior agent summary as proof. Verify every completed or claimed-complete task against actual files, public behavior, automated tests, and acceptance criteria.
- Classify each task with :
- : every material requirement and success criterion has implementation and fresh verification evidence.
- : implementation exists but one or more non-critical requirements, tests, or evidence are missing.
- : claimed behavior does not work or a critical requirement is absent.
- : the source has in frontmatter but the QA verdict is or .
- : audit cannot continue because a concrete prerequisite is missing.
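One cheap inconsistency that this step guards against, a task declared `completed` while its body still has unchecked items, can be detected mechanically. The sketch below assumes task files open with a `---` delimited frontmatter block containing a plain `status:` line and use GitHub-style `- [ ]` checkboxes; the function names are hypothetical. A clean result is not proof of completion, since checked boxes still need verification against files, behavior, and tests.

```python
import re
from typing import Optional

FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
STATUS_RE = re.compile(r"^status:\s*(\S+)\s*$", re.MULTILINE)
UNCHECKED_RE = re.compile(r"^\s*[-*]\s+\[ \]", re.MULTILINE)


def declared_status(task_text: str) -> Optional[str]:
    """Return the literal frontmatter status, or None when it is absent."""
    match = FRONTMATTER_RE.match(task_text)
    if not match:
        return None
    status = STATUS_RE.search(match.group(1))
    return status.group(1) if status else None


def unchecked_items(task_text: str) -> int:
    """Count unchecked checkboxes in the body (frontmatter excluded)."""
    body = FRONTMATTER_RE.sub("", task_text, count=1)
    return len(UNCHECKED_RE.findall(body))


def needs_scrutiny(task_text: str) -> bool:
    """True when a task claims completion but still lists unchecked work."""
    return declared_status(task_text) == "completed" and unchecked_items(task_text) > 0
```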
Step 4: AI Test-Hygiene Scan (RF-1..RF-6)
- STOP. Read `references/ai-implementation-audit.md` in full before scanning the test diff of any task with `declared_status: completed`. That file owns the Red Flag scanners (RF-1..RF-6), the Requirement→Test mapping rules, and the verdict matrix.
- Run the scans against the diff since the task baseline (`git log --follow <test_file>`, `git diff <baseline_sha>..HEAD`).
- Emit verdict automatically when scanners detect:
- Weakened assertions on P0/P1 Success Criterion (RF-2).
- / / / inserted in the diff (RF-1).
- Mocks inserted in tests whose corresponding TC declared as Integration/E2E (RF-3).
- Snapshot drift on P0/P1 with no requirement-change justification (RF-4).
- Record findings in the Task Implementation Matrix column and in the per-task block of .
- Apply the Requirement → Test mapping table from `references/ai-implementation-audit.md`. For every Success Criterion in (frontmatter or body) and every linked bullet in , find the corresponding test by name, reference, or assertion content. Mark each criterion / / . A checked item or without a row is an audit failure.
Step 5: Reopen, File Bugs, Write Memory
- Mark incomplete completed tasks as in the matrix.
- In Compozy mode, write audit notes to `.compozy/tasks/<slug>/memory/qa-execution.md` using the canonical sections required by cy-codex-loop: , , , , , . This file must be written before any frontmatter is flipped (memory-precedes-status invariant).
- Edit the offending frontmatter back to (or if salvageable). Never write to — cy-codex-loop's owns mutation; frontmatter wins because the next iteration reconciles from it.
- File under `<audit-output-path>/audit/issues/` using . Include:
- The task path under .
- The failed Success Criterion under .
- The original strict assertion (when RF-2 fired) under .
- The red flag ID and verdict under notes.
- The transcript anomaly classification (when applicable) under .
- When the missing work is a bounded root-cause fix inside the audit scope, you may implement it, add regression coverage, and rerun the task proof. Otherwise reopen the task — do not silently pass it.
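The memory-precedes-status invariant can be enforced with a small guard, sketched below. The function name `flip_status` is hypothetical, the target status value is supplied by the caller, and the sketch assumes the first `status:` line in the task file is the frontmatter one.

```python
import re
from pathlib import Path

STATUS_LINE_RE = re.compile(r"^status:.*$", re.MULTILINE)


def flip_status(task_file: Path, memory_file: Path, new_status: str) -> None:
    """Rewrite the task's status line, but only after the audit memory note exists."""
    if not memory_file.is_file() or not memory_file.read_text().strip():
        raise RuntimeError("write the audit memory note before flipping task status")
    text = task_file.read_text()
    # A callable replacement avoids backslash interpretation in new_status.
    updated, count = STATUS_LINE_RE.subn(lambda _: f"status: {new_status}", text, count=1)
    if count == 0:
        raise ValueError(f"no status line found in {task_file}")
    task_file.write_text(updated)
```

The guard touches only the task file's frontmatter, which matches the rule that other state files are read-only for the audit.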
Step 6: Quality Gates Verdict
- Re-run the canonical verification gate from scratch after the last code change made during the audit.
- Compile the Quality Gates section of . Each gate is / / :
- Flaky rate <2% in canonical suite.
- Zero from AI test-hygiene audit on P0/P1 tasks.
- Zero / issues open.
- Coverage delta ≥ baseline (no regression).
- Zero unresolved on P0 flows.
- A on any gate blocks an unconditional PASS verdict for the run.
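Two of the numeric gates above can be evaluated as in this sketch. It assumes the flaky rate is flaky events divided by total tests in the canonical suite, which the text does not define precisely; the function and key names are hypothetical. Gate states follow the PASS/FAIL/N/A convention used in the audit report.

```python
from typing import Dict, Optional


def gate(passed: Optional[bool]) -> str:
    """Map a check result to a gate state; None means the gate does not apply."""
    if passed is None:
        return "N/A"
    return "PASS" if passed else "FAIL"


def quality_gates(
    flaky_events: int,
    total_tests: int,
    coverage_now: Optional[float],
    coverage_baseline: Optional[float],
) -> Dict[str, str]:
    """Evaluate the flaky-rate (<2%) and coverage-delta (no regression) gates."""
    flaky_ok = None if total_tests == 0 else (flaky_events / total_tests) < 0.02
    coverage_ok = (
        None
        if coverage_now is None or coverage_baseline is None
        else coverage_now >= coverage_baseline
    )
    return {"flaky_rate": gate(flaky_ok), "coverage_delta": gate(coverage_ok)}


def unconditional_pass_allowed(gates: Dict[str, str]) -> bool:
    """Any FAIL gate blocks an unconditional PASS verdict for the run."""
    return "FAIL" not in gates.values()
```

Note that a rate of exactly 2% fails the gate, because the threshold is strict.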
Step 7: Write the Audit Report
- Summarize the audit using `assets/audit-report-template.md` and write the report to `<audit-output-path>/audit/audit-report.md`.
- Mandatory sections:
- Claim / Command / Exit code / Verdict per command executed in Step 2 and Step 6.
- AUTOMATED COVERAGE — support detected, harness, canonical command, required flows with classification, specs added or updated.
- TASK IMPLEMENTATION AUDIT — Compozy slug, plan sources, matrix totals, per-task verdicts, reopened/fixed/blocked tasks, links to bugs.
- SUITE HEALTH SNAPSHOT — flaky rate, flaky events list, mutation score (when harness exists), coverage delta vs baseline, blocked count, manual-only count, AI audit findings count.
- QUALITY GATES — PASS/FAIL/N/A per gate.
- ISSUES FILED — total, by severity, with annotations.
- When running in a Compozy slug, the final PASS feeds cy-codex-loop's precondition for Phase E — do not call ; cy-codex-loop owns that mutation.
- Report blocked scenarios, missing credentials, or environment gaps with the exact command or prerequisite that stopped execution.
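The Claim / Command / Exit code / Verdict record required per command could be captured at execution time, as in this sketch. The `CommandEvidence` type and both helpers are hypothetical, and the row layout is an assumption; the real layout is whatever `assets/audit-report-template.md` prescribes.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class CommandEvidence:
    claim: str
    command: List[str]
    exit_code: int
    verdict: str


def run_with_evidence(claim: str, command: List[str]) -> CommandEvidence:
    """Execute a command fresh and derive the verdict from its real exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    verdict = "PASS" if result.returncode == 0 else "FAIL"
    return CommandEvidence(claim, command, result.returncode, verdict)


def to_markdown_row(e: CommandEvidence) -> str:
    """Render one evidence record as a Markdown table row."""
    return f"| {e.claim} | {' '.join(e.command)} | {e.exit_code} | {e.verdict} |"
```

Deriving the verdict from the exit code of a fresh run, rather than from a transcript, keeps the report aligned with the independent-evaluator stance.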
Error Handling
- If command discovery returns multiple plausible gates, prefer the broadest repository-defined command and explain the tie-breaker.
- If E2E support signals are weak or contradictory, prefer explicit config files and runnable commands before claiming the repository supports E2E.
- If no canonical verify command exists, read `references/project-signals.md`, choose the broadest safe install, lint, test, and build commands for the detected ecosystem, and state that assumption explicitly.
- If a required live dependency is unavailable, validate every local boundary that does not require the missing dependency and report the blocked live validation separately.
- If a failure appears unrelated to the audited tasks, prove that with a clean reproduction before excluding it from the audit scope.
- If the repository has an E2E harness but credentials, runtime services, or test data prevent execution, keep the affected flow classified as and report the exact prerequisite that is missing.
- If files are marked but contain unchecked subtasks, missing deliverables, or unverified criteria, do not call the audit a pass. Write first, then edit frontmatter back to or , and file per Step 5. Never write to .
- If a test fails and passes on retry without a code change, do not promote to PASS. Register as per `references/flaky-triage.md`, record the event in the Suite Health Snapshot, and treat any unresolved on a P0 flow as a blocker for the final verdict.
- If the AI test-hygiene scan (Step 4) detects weakened assertions, skipped tests, or mocks hiding integration in a task with `declared_status: completed`, do not call the audit a pass. Apply the verdict matrix in `references/ai-implementation-audit.md`, file with Type , and flip frontmatter per Step 5.
Companion: qa-execution
validates that the implementing AI agent did what it claimed.
validates that a real human user can succeed at the product. They are complementary, not redundant:
- Run to certify that `task_NN.md status: completed` reflects real work.
- Run to certify that the product, taken as a whole, is acceptable to end users.
A Compozy slug typically wants both: audit the task implementations, then exercise the resulting product through user-flow QA. They share no output directory, no bug taxonomy, and no procedures — keep them separate.