Agent Output Audit

Independent verification of AI-implemented work. The skill that asks: "Did the implementing agent actually do what
task_NN.md
says it did?" — not "Would a real user succeed at this product?" (that's

qa-execution

Required Reading Router

Match your task to the row. Read the listed files in full before producing output. They are not appendices — they are load-bearing. Inline content in this SKILL.md is a pointer, not a substitute.

Task	MUST read
Discovering install/lint/test/build/start commands (Step 1)	`references/project-signals.md`
Deciding E2E support and classifying coverage (Step 1)	`references/e2e-coverage.md`
Building the audit scope checklist (Step 2)	`references/checklist.md`
Holding independent-evaluator stance on AI tasks (Step 3)	`references/independent-evaluator-protocol.md`
Scanning test diffs for AI hygiene red flags (Step 4)	`references/ai-implementation-audit.md`
Diagnosing a test that passed on retry without a code change	`references/flaky-triage.md`

Reference Index

```
references/project-signals.md
```
— Heuristics for picking install/lint/test/build/start commands across ecosystems when the repo lacks an umbrella gate.

references/e2e-coverage.md

— Taxonomy for

existing-e2e

needs-e2e

manual-only

blocked

and how to detect harness support.

```
references/checklist.md
```
— Audit checklist by category: contract discovery, baseline, task audit, AI hygiene, flaky detection, quality gates.
```
references/ai-implementation-audit.md
```
— Red Flag scanners (RF-1..RF-6), Requirement→Test mapping, verdict matrix for completed tasks.

references/independent-evaluator-protocol.md

— What counts (and doesn't count) as evidence; transcript classification (

genuine-failure

grader-bug

ambiguous-task

bypass-exploit

```
references/flaky-triage.md
```
— Taxonomy, diagnosis protocol, and quarantine workflow for retry-passes-without-code-change failures.

Required Inputs

audit-output-path (optional): Directory where audit artifacts (bugs, audit report, evidence) are stored. When provided, create the directory if it does not exist and use it for all audit outputs. When omitted, fall back to repository conventions or
```
/tmp/agent-output-audit-<slug>
```
.

Procedures

Step 1: Discover the Repository Verification Contract

Read root instructions, repository docs, and CI/build files before running commands.
Execute
```
python3 scripts/discover-project-contract.py --root .
```
to surface candidate install, verify, build, test, lint, start commands, and E2E signals.
STOP. Read
references/project-signals.md
in full before picking commands when discovery surfaces more than one plausible gate or the repo mixes ecosystems.
STOP. Read
references/e2e-coverage.md
in full before classifying any flow.
Prefer repository-defined umbrella commands such as
```
make verify
```
,
```
just verify
```
, or CI entrypoints over language-default commands.
Resolve the audit artifact directory. If the user provided an
```
audit-output-path
```
argument, use it. Otherwise use repository conventions, falling back to
```
/tmp/agent-output-audit-<slug>
```
. Create the
```
audit/
```
subdirectory; store all bugs and reports under
```
<audit-output-path>/audit/
```
.
Detect Compozy mode. If
```
.compozy/tasks/<slug>/
```
exists, record the slug and switch into Compozy-aware audit:
- Read
```
state.yaml
```
  (read-only — never write to it;
  scripts/update-state.py
  owns mutation per the cy-codex-loop contract).
- Read
```
_techspec.md
```
  (deliverable source of truth) and
```
_tasks.md
```
  (task roster) when present.
- List every
```
task_NN.md
```
  and capture its frontmatter
```
status:
```
  value (allowed:
```
pending
```
  ,
```
in_progress
```
  ,
```
completed
```
  ). When
```
task_NN.md
```
  frontmatter disagrees with
```
state.yaml
```
  , treat frontmatter as the source of truth.
- Note the canonical memory slot
```
.compozy/tasks/<slug>/memory/qa-execution.md
```
  — Step 4 writes audit notes there before any status flips.

Step 2: Run the Baseline Verification Gate

Install dependencies with the repository-preferred command.
Run the canonical verification gate once before any audit work. Execute in fastest-first order: lint and type-check, then build, then unit tests, then integration tests.
If the E2E command is separate from the umbrella gate, decide whether to run it now or after runtime prerequisites are ready, then record that plan explicitly.
If the baseline fails, read the first failing output carefully and determine whether it is pre-existing or introduced by current work before moving on. 4a. Flaky-failure protocol. When a baseline command fails, before classifying as pre-existing or new, run the failing test in isolation 3-5 times on the same SHA. If it passes at least once without code changes, classify as
```
flaky-suspect
```
, record in
```
audit-report.md
```
under
```
SUITE HEALTH SNAPSHOT
```
(test name, attempts, retry outcome, suspected category), and do NOT promote to PASS via retry. STOP. Read
references/flaky-triage.md
in full before assigning a suspected category or proposing a quarantine.

Step 3: Audit Task Implementations (Compozy mode and any AI-implemented tasks)

Skip this step only when no task, phase, PRD, tech spec, or implementation-plan artifacts exist.

STOP. Read
references/independent-evaluator-protocol.md
in full before forming any task verdict. Tripwire summary: never accept the implementing agent's transcript, success message, or memory note as evidence. In Compozy mode, read the implementing agent's
```
.compozy/tasks/<slug>/memory/<phase>.md
```
artifacts and classify anomalies (
```
genuine-failure
```
/
```
grader-bug
```
/
```
ambiguous-task
```
/
```
bypass-exploit
```
) in the
```
Errors / Corrections
```
section of
```
memory/qa-execution.md
```
before judging the task.
Read each
```
task_NN.md
```
and its body. Summarize each task into a Task Implementation Matrix (column names mirror cy-codex-loop frontmatter):
- ```
task_path
```
  (e.g.,
```
.compozy/tasks/<slug>/task_07.md
```
  )
- ```
declared_status
```
  — literal frontmatter
```
status:
```
  value
- ```
title
```
  ,
```
type
```
  ,
```
complexity
```
  ,
```
dependencies
```
  — mirrored from frontmatter
- ```
techspec_deliverable
```
  — linked section in
```
_techspec.md
```
  when present
- Requirements, subtasks, checklist items, success criteria, dependent files
- ```
implementation_evidence
```
  — files, modules, routes, commands, migrations, seeds, tests
- ```
verification_evidence
```
  — commands executed, exit codes, output summaries
- ```
qa_verdict
```
  —
```
PASS
```
  |
```
PARTIAL
```
  |
```
FAIL
```
  |
```
REOPEN
```
  |
```
BLOCKED
```
  (distinct from
```
declared_status
```
  )
- ```
ai_audit_findings
```
  — red flag IDs that fired in Step 4 with verdict
- ```
action
```
  —
```
none
```
  |
```
fixed
```
  |
```
reopened-frontmatter
```
  |
```
BUG-NNN.md filed
```
- ```
linked_bugs
```
  — BUG IDs
Do not treat a task
```
declared_status
```
, checked checkbox, memory note, or prior agent summary as proof. Verify every completed or claimed-complete task against actual files, public behavior, automated tests, and acceptance criteria.
Classify each task with
```
qa_verdict
```
:
- ```
PASS
```
  : every material requirement and success criterion has implementation and fresh verification evidence.
- ```
PARTIAL
```
  : implementation exists but one or more non-critical requirements, tests, or evidence are missing.
- ```
FAIL
```
  : claimed behavior does not work or a critical requirement is absent.
- ```
REOPEN
```
  : the source
```
task_NN.md
```
  has
```
status: completed
```
  in frontmatter but the QA verdict is
```
PARTIAL
```
  or
```
FAIL
```
  .
- ```
BLOCKED
```
  : audit cannot continue because a concrete prerequisite is missing.

Step 4: AI Test-Hygiene Scan (RF-1..RF-6)

STOP. Read
references/ai-implementation-audit.md
in full before scanning the test diff of any task with
declared_status: completed
. That file owns the Red Flag scanners (RF-1..RF-6), the Requirement→Test mapping rules, and the verdict matrix.
Run the scans against the diff since the task baseline (
```
git log --follow <test_file>
```
,
```
git diff <baseline_sha>..HEAD
```
).
Emit verdict
```
FAIL
```
automatically when scanners detect:
- Weakened assertions on P0/P1 Success Criterion (RF-2).
- ```
.skip
```
  /
```
.only
```
  /
```
xit
```
  /
```
t.Skip
```
  inserted in the diff (RF-1).
- Mocks inserted in tests whose corresponding TC declared
```
External Dependencies
```
  as Integration/E2E (RF-3).
- Snapshot drift on P0/P1 with no requirement-change justification (RF-4).
Record findings in the Task Implementation Matrix column
```
ai_audit_findings
```
and in the per-task block of
```
audit-report.md
```
.
Apply the Requirement → Test mapping table from
```
references/ai-implementation-audit.md
```
. For every Success Criterion in
```
task_NN.md
```
(frontmatter or body) and every linked bullet in
```
_techspec.md
```
, find the corresponding test by name, reference, or assertion content. Mark each criterion
```
covers
```
/
```
weak
```
/
```
missing
```
. A checked item or
```
status: completed
```
without a
```
covers
```
row is an audit failure.

Step 5: Reopen, File Bugs, Write Memory

Mark incomplete completed tasks as
```
REOPEN
```
in the matrix.
In Compozy mode, write audit notes to
```
.compozy/tasks/<slug>/memory/qa-execution.md
```
using the canonical sections required by cy-codex-loop:
```
Objective Snapshot
```
,
```
Important Decisions
```
,
```
Learnings
```
,
```
Files / Surfaces
```
,
```
Errors / Corrections
```
,
```
Ready for Next Run
```
. This file must be written before any
```
task_NN.md
```
frontmatter is flipped (memory-precedes-status invariant).
Edit the offending
```
task_NN.md
```
frontmatter
```
status:
```
back to
```
pending
```
(or
```
in_progress
```
if salvageable). Never write to
state.yaml
— cy-codex-loop's
```
update-state.py
```
owns mutation; frontmatter wins because the next iteration reconciles from it.
File
```
BUG-<num>.md
```
under
```
<audit-output-path>/audit/issues/
```
using
```
assets/issue-template.md
```
. Include:
- The task path under
```
Reopens task:
```
  .
- The failed Success Criterion under
```
Summary:
```
  .
- The original strict assertion (when RF-2 fired) under
```
Root cause:
```
  .
- The red flag ID and verdict under
```
Automation Follow-up:
```
  notes.
- The transcript anomaly classification (when applicable) under
```
Related:
```
  .
When the missing work is a bounded root-cause fix inside the audit scope, you may implement it, add regression coverage, and rerun the task proof. Otherwise reopen the task — do not silently pass it.

Step 6: Quality Gates Verdict

Re-run the canonical verification gate from scratch after the last code change made during the audit.
Compile the Quality Gates section of
```
audit-report.md
```
. Each gate is
```
PASS
```
/
```
FAIL
```
/
```
N/A
```
:
- Flaky rate <2% in canonical suite.
- Zero
```
FAIL
```
  from AI test-hygiene audit on P0/P1 tasks.
- Zero
```
Critical
```
  /
```
High
```
  issues open.
- Coverage delta ≥ baseline (no regression).
- Zero unresolved
```
flaky-suspect
```
  on P0 flows.
A
```
FAIL
```
on any gate blocks an unconditional PASS verdict for the run.

Step 7: Write the Audit Report

Summarize the audit using

assets/audit-report-template.md

and write the report to

<audit-output-path>/audit/audit-report.md

Mandatory sections:
- Claim / Command / Exit code / Verdict per command executed in Step 2 and Step 6.
- AUTOMATED COVERAGE — support detected, harness, canonical command, required flows with classification, specs added or updated.
- TASK IMPLEMENTATION AUDIT — Compozy slug, plan sources, matrix totals, per-task verdicts, reopened/fixed/blocked tasks, links to bugs.
- SUITE HEALTH SNAPSHOT — flaky rate, flaky events list, mutation score (when harness exists), coverage delta vs baseline, blocked count, manual-only count, AI audit findings count.
- QUALITY GATES — PASS/FAIL/N/A per gate.
- ISSUES FILED — total, by severity, with
```
Reopens task:
```
  annotations.
When running in a Compozy slug, the final
```
audit-report.md
```
PASS feeds cy-codex-loop's
```
verify.last_status=PASS
```
precondition for Phase E — do not call
update-state.py
; cy-codex-loop owns that mutation.
Report blocked scenarios, missing credentials, or environment gaps with the exact command or prerequisite that stopped execution.

Error Handling

If command discovery returns multiple plausible gates, prefer the broadest repository-defined command and explain the tie-breaker.
If E2E support signals are weak or contradictory, prefer explicit config files and runnable commands before claiming the repository supports E2E.
If no canonical verify command exists, read
```
references/project-signals.md
```
, choose the broadest safe install, lint, test, and build commands for the detected ecosystem, and state that assumption explicitly.
If a required live dependency is unavailable, validate every local boundary that does not require the missing dependency and report the blocked live validation separately.
If a failure appears unrelated to the audited tasks, prove that with a clean reproduction before excluding it from the audit scope.
If the repository has an E2E harness but credentials, runtime services, or test data prevent execution, keep the affected flow classified as
```
blocked
```
and report the exact prerequisite that is missing.
If
```
task_NN.md
```
files are marked
```
status: completed
```
but contain unchecked subtasks, missing deliverables, or unverified criteria, do not call the audit a pass. Write
```
memory/qa-execution.md
```
first, then edit frontmatter
```
status:
```
back to
```
pending
```
or
```
in_progress
```
, and file
```
BUG-<num>.md
```
per Step 5. Never write to
```
state.yaml
```
.
If a test fails and passes on retry without a code change, do not promote to PASS. Register as
```
flaky-suspect
```
per
```
references/flaky-triage.md
```
, record the event in the Suite Health Snapshot, and treat any unresolved
```
flaky-suspect
```
on a P0 flow as a blocker for the final verdict.
If the AI test-hygiene scan (Step 4) detects weakened assertions, skipped tests, or mocks hiding integration in a task with
```
declared_status: completed
```
, do not call the audit a pass. Apply the verdict matrix in
```
references/ai-implementation-audit.md
```
, file
```
BUG-<num>.md
```
with Type
```
Functional
```
, and flip frontmatter
```
status:
```
per Step 5.

Companion: qa-execution

agent-output-audit

validates that the implementing AI agent did what it claimed.

qa-execution

validates that a real human user can succeed at the product. They are complementary, not redundant:

Run

agent-output-audit

to certify that

task_NN.md status: completed

reflects real work.

Run
```
qa-execution
```
to certify that the product, taken as a whole, is acceptable to end users.

A Compozy slug typically wants both: audit the task implementations, then exercise the resulting product through user-flow QA. They share no output directory, no bug taxonomy, and no procedures — keep them separate.

agent-output-audit

NPX Install

Tags

SKILL.md Content