agent-output-audit

Agent Output Audit

Independent verification of AI-implemented work. The skill that asks: "Did the implementing agent actually do what `task_NN.md` says it did?", not "Would a real user succeed at this product?" (that is the job of `qa-execution`).

Required Reading Router

Match your task to the row. Read the listed files in full before producing output. They are not appendices — they are load-bearing. Inline content in this SKILL.md is a pointer, not a substitute.

| Task | MUST read |
| --- | --- |
| Discovering install/lint/test/build/start commands (Step 1) | `references/project-signals.md` |
| Deciding E2E support and classifying coverage (Step 1) | `references/e2e-coverage.md` |
| Building the audit scope checklist (Step 2) | `references/checklist.md` |
| Holding independent-evaluator stance on AI tasks (Step 3) | `references/independent-evaluator-protocol.md` |
| Scanning test diffs for AI hygiene red flags (Step 4) | `references/ai-implementation-audit.md` |
| Diagnosing a test that passed on retry without a code change | `references/flaky-triage.md` |

Reference Index

- `references/project-signals.md`: Heuristics for picking install/lint/test/build/start commands across ecosystems when the repo lacks an umbrella gate.
- `references/e2e-coverage.md`: Taxonomy for `existing-e2e` / `needs-e2e` / `manual-only` / `blocked` and how to detect harness support.
- `references/checklist.md`: Audit checklist by category (contract discovery, baseline, task audit, AI hygiene, flaky detection, quality gates).
- `references/ai-implementation-audit.md`: Red Flag scanners (RF-1..RF-6), Requirement→Test mapping, verdict matrix for completed tasks.
- `references/independent-evaluator-protocol.md`: What counts (and doesn't count) as evidence; transcript classification (`genuine-failure` / `grader-bug` / `ambiguous-task` / `bypass-exploit`).
- `references/flaky-triage.md`: Taxonomy, diagnosis protocol, and quarantine workflow for retry-passes-without-code-change failures.

Required Inputs

`audit-output-path` (optional): Directory where audit artifacts (bugs, audit report, evidence) are stored. When provided, create the directory if it does not exist and use it for all audit outputs. When omitted, fall back to repository conventions or `/tmp/agent-output-audit-<slug>`.
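
The fallback order for `audit-output-path` can be sketched in Python as follows. This is a minimal illustration, not part of the skill: the function name `resolve_audit_output_path` and the `repo_convention` parameter are hypothetical, and how a "repository convention" is actually discovered is left to the caller.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def resolve_audit_output_path(
    provided: Optional[str],
    slug: str,
    repo_convention: Optional[str] = None,
) -> Path:
    """Pick the audit artifact directory using the documented fallback order."""
    if provided:
        path = Path(provided)
        # When the input is provided, create the directory if it is missing.
        path.mkdir(parents=True, exist_ok=True)
        return path
    if repo_convention:
        # Hypothetical: a directory the repository already uses for audit output.
        return Path(repo_convention)
    # Last resort named in the text: /tmp/agent-output-audit-<slug>.
    return Path("/tmp") / f"agent-output-audit-{slug}"
```

Calling `resolve_audit_output_path(None, "demo")` returns the `/tmp/agent-output-audit-demo` path without creating it; only an explicitly provided path is created eagerly in this sketch.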

Procedures

Step 1: Discover the Repository Verification Contract

1. Read root instructions, repository docs, and CI/build files before running commands.
2. Execute `python3 scripts/discover-project-contract.py --root .` to surface candidate install, verify, build, test, lint, start commands, and E2E signals.
3. STOP. Read `references/project-signals.md` in full before picking commands when discovery surfaces more than one plausible gate or the repo mixes ecosystems.
4. STOP. Read `references/e2e-coverage.md` in full before classifying any flow.
5. Prefer repository-defined umbrella commands such as `make verify`, `just verify`, or CI entrypoints over language-default commands.
6. Resolve the audit artifact directory. If the user provided an `audit-output-path` argument, use it. Otherwise use repository conventions, falling back to `/tmp/agent-output-audit-<slug>`. Create the `audit/` subdirectory; store all bugs and reports under `<audit-output-path>/audit/`.
7. Detect Compozy mode. If `.compozy/tasks/<slug>/` exists, record the slug and switch into Compozy-aware audit:
   - Read `state.yaml` (read-only: never write to it; `scripts/update-state.py` owns mutation per the cy-codex-loop contract).
   - Read `_techspec.md` (deliverable source of truth) and `_tasks.md` (task roster) when present.
   - List every `task_NN.md` and capture its frontmatter `status:` value (allowed: `pending`, `in_progress`, `completed`). When `task_NN.md` frontmatter disagrees with `state.yaml`, treat frontmatter as the source of truth.
   - Note the canonical memory slot `.compozy/tasks/<slug>/memory/qa-execution.md`. Step 5 writes audit notes there before any status flips.
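
Capturing each task's declared status could look roughly like the sketch below. It assumes frontmatter is a leading block delimited by `---` lines and that task files use two-digit numbers (`task_07.md`); neither detail is confirmed by this document, and the helper names are hypothetical. It only reads files, in keeping with the read-only rule above.

```python
from __future__ import annotations

import re
from pathlib import Path
from typing import Optional

ALLOWED_STATUSES = {"pending", "in_progress", "completed"}


def read_frontmatter_status(task_file: Path) -> Optional[str]:
    """Return the `status:` value from a leading `---` frontmatter block, if any."""
    lines = task_file.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore any `status:` text in the body
        match = re.match(r"^status:\s*(\S+)\s*$", line)
        if match:
            return match.group(1).strip("\"'")
    return None


def collect_task_statuses(slug_dir: Path) -> dict[str, Optional[str]]:
    """Map each task_NN.md in a slug directory to its declared status."""
    return {
        path.name: read_frontmatter_status(path)
        for path in sorted(slug_dir.glob("task_[0-9][0-9].md"))
    }
```

Values outside `ALLOWED_STATUSES`, or a `None` result, are worth recording as anomalies rather than silently normalizing them.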

Step 2: Run the Baseline Verification Gate

1. Install dependencies with the repository-preferred command.
2. Run the canonical verification gate once before any audit work. Execute in fastest-first order: lint and type-check, then build, then unit tests, then integration tests.
3. If the E2E command is separate from the umbrella gate, decide whether to run it now or after runtime prerequisites are ready, then record that plan explicitly.
4. If the baseline fails, read the first failing output carefully and determine whether it is pre-existing or introduced by current work before moving on.
   - 4a. Flaky-failure protocol. When a baseline command fails, before classifying as pre-existing or new, run the failing test in isolation 3-5 times on the same SHA. If it passes at least once without code changes, classify as `flaky-suspect`, record in `audit-report.md` under `SUITE HEALTH SNAPSHOT` (test name, attempts, retry outcome, suspected category), and do NOT promote to PASS via retry. STOP. Read `references/flaky-triage.md` in full before assigning a suspected category or proposing a quarantine.
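
The retry rule in 4a can be sketched as below. This is an illustration under assumptions: the `RetryOutcome` type and the `consistent-failure` label are hypothetical names (the text only defines `flaky-suspect`), and the command passed in is whatever isolates the failing test in the repository at hand.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass
class RetryOutcome:
    attempts: int
    passes: int
    classification: str  # "flaky-suspect" or the hypothetical "consistent-failure"


def classify_retries(exit_codes: list[int]) -> RetryOutcome:
    """A test that already failed once and then passes at least once is a flaky suspect."""
    passes = sum(1 for code in exit_codes if code == 0)
    label = "flaky-suspect" if passes >= 1 else "consistent-failure"
    return RetryOutcome(attempts=len(exit_codes), passes=passes, classification=label)


def rerun_in_isolation(command: list[str], attempts: int = 5) -> RetryOutcome:
    """Re-run one failing test command 3-5 times on the same checkout."""
    if not 3 <= attempts <= 5:
        raise ValueError("the protocol calls for 3-5 isolated attempts")
    codes = [
        subprocess.run(command, capture_output=True).returncode
        for _ in range(attempts)
    ]
    return classify_retries(codes)
```

Note that a `flaky-suspect` result is something to record, not a pass: the sketch deliberately returns no PASS label.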

Step 3: Audit Task Implementations (Compozy mode and any AI-implemented tasks)

Skip this step only when no task, phase, PRD, tech spec, or implementation-plan artifacts exist.

STOP. Read `references/independent-evaluator-protocol.md` in full before forming any task verdict. Tripwire summary: never accept the implementing agent's transcript, success message, or memory note as evidence. In Compozy mode, read the implementing agent's `.compozy/tasks/<slug>/memory/<phase>.md` artifacts and classify anomalies (`genuine-failure` / `grader-bug` / `ambiguous-task` / `bypass-exploit`) in the `Errors / Corrections` section of `memory/qa-execution.md` before judging the task.

1. Read each `task_NN.md` and its body. Summarize each task into a Task Implementation Matrix (column names mirror cy-codex-loop frontmatter):
   - `task_path` (e.g., `.compozy/tasks/<slug>/task_07.md`)
   - `declared_status`: literal frontmatter `status:` value
   - `title`, `type`, `complexity`, `dependencies`: mirrored from frontmatter
   - `techspec_deliverable`: linked section in `_techspec.md` when present
   - Requirements, subtasks, checklist items, success criteria, dependent files
   - `implementation_evidence`: files, modules, routes, commands, migrations, seeds, tests
   - `verification_evidence`: commands executed, exit codes, output summaries
   - `qa_verdict`: `PASS` | `PARTIAL` | `FAIL` | `REOPEN` | `BLOCKED` (distinct from `declared_status`)
   - `ai_audit_findings`: red flag IDs that fired in Step 4 with verdict
   - `action`: `none` | `fixed` | `reopened-frontmatter` | `BUG-NNN.md filed`
   - `linked_bugs`: BUG IDs
2. Do not treat a task `declared_status`, checked checkbox, memory note, or prior agent summary as proof. Verify every completed or claimed-complete task against actual files, public behavior, automated tests, and acceptance criteria.
3. Classify each task with `qa_verdict`:
   - `PASS`: every material requirement and success criterion has implementation and fresh verification evidence.
   - `PARTIAL`: implementation exists but one or more non-critical requirements, tests, or evidence are missing.
   - `FAIL`: claimed behavior does not work or a critical requirement is absent.
   - `REOPEN`: the source `task_NN.md` has `status: completed` in frontmatter but the QA verdict is `PARTIAL` or `FAIL`.
   - `BLOCKED`: audit cannot continue because a concrete prerequisite is missing.
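
One way to read the verdict definitions above is as a small decision function. The sketch below is an interpretation, not the skill's own logic: the boolean inputs are hypothetical simplifications of the evidence review, and checking `BLOCKED` first is an assumed precedence that the text does not state.

```python
def qa_verdict(
    declared_status: str,
    blocked: bool,
    critical_gap: bool,
    noncritical_gap: bool,
) -> str:
    """Derive a qa_verdict from simplified audit findings for one task."""
    if blocked:
        # Assumed precedence: a missing prerequisite stops the audit outright.
        return "BLOCKED"
    if critical_gap:
        base = "FAIL"
    elif noncritical_gap:
        base = "PARTIAL"
    else:
        return "PASS"
    # A task declared completed that lands on PARTIAL or FAIL is reopened.
    return "REOPEN" if declared_status == "completed" else base
```

For example, `qa_verdict("completed", False, False, True)` yields `REOPEN`, while the same findings on an `in_progress` task yield `PARTIAL`.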

Step 4: AI Test-Hygiene Scan (RF-1..RF-6)

STOP. Read `references/ai-implementation-audit.md` in full before scanning the test diff of any task with `declared_status: completed`. That file owns the Red Flag scanners (RF-1..RF-6), the Requirement→Test mapping rules, and the verdict matrix.

1. Run the scans against the diff since the task baseline (`git log --follow <test_file>`, `git diff <baseline_sha>..HEAD`).
2. Emit verdict `FAIL` automatically when scanners detect:
   - Weakened assertions on P0/P1 Success Criterion (RF-2).
   - `.skip` / `.only` / `xit` / `t.Skip` inserted in the diff (RF-1).
   - Mocks inserted in tests whose corresponding TC declared `External Dependencies` as Integration/E2E (RF-3).
   - Snapshot drift on P0/P1 with no requirement-change justification (RF-4).
3. Record findings in the Task Implementation Matrix column `ai_audit_findings` and in the per-task block of `audit-report.md`.
4. Apply the Requirement → Test mapping table from `references/ai-implementation-audit.md`. For every Success Criterion in `task_NN.md` (frontmatter or body) and every linked bullet in `_techspec.md`, find the corresponding test by name, reference, or assertion content. Mark each criterion `covers` / `weak` / `missing`. A checked item or `status: completed` without a `covers` row is an audit failure.
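
As a rough illustration of the RF-1 check, the sketch below scans the added lines of a unified diff (such as the output of `git diff <baseline_sha>..HEAD`) for the skip markers listed above. It is a simple heuristic written for this example; the authoritative scanners live in `references/ai-implementation-audit.md`, and the function name and regular expression here are assumptions.

```python
import re

# Markers named for RF-1: .skip, .only, xit, t.Skip
SKIP_MARKERS = re.compile(r"\.skip\b|\.only\b|\bxit\b|\bt\.Skip\b")


def find_inserted_skip_markers(diff_text: str) -> list:
    """Return (diff_line_number, marker) pairs for markers on added lines."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the new-file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            match = SKIP_MARKERS.search(line)
            if match:
                findings.append((number, match.group(0)))
    return findings
```

A non-empty result is a candidate RF-1 finding to confirm by hand: a pattern match alone cannot tell a newly skipped test from, say, a line that was merely moved.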

Step 5: Reopen, File Bugs, Write Memory

1. Mark tasks that are declared `completed` but found incomplete as `REOPEN` in the matrix.
2. In Compozy mode, write audit notes to `.compozy/tasks/<slug>/memory/qa-execution.md` using the canonical sections required by cy-codex-loop: `Objective Snapshot`, `Important Decisions`, `Learnings`, `Files / Surfaces`, `Errors / Corrections`, `Ready for Next Run`. This file must be written before any `task_NN.md` frontmatter is flipped (memory-precedes-status invariant).
3. Edit the offending `task_NN.md` frontmatter `status:` back to `pending` (or `in_progress` if salvageable). Never write to `state.yaml`: cy-codex-loop's `update-state.py` owns mutation, and frontmatter wins because the next iteration reconciles from it.
4. File `BUG-<num>.md` under `<audit-output-path>/audit/issues/` using `assets/issue-template.md`. Include:
   - The task path under `Reopens task:`.
   - The failed Success Criterion under `Summary:`.
   - The original strict assertion (when RF-2 fired) under `Root cause:`.
   - The red flag ID and verdict under `Automation Follow-up:` notes.
   - The transcript anomaly classification (when applicable) under `Related:`.
5. When the missing work is a bounded root-cause fix inside the audit scope, you may implement it, add regression coverage, and rerun the task proof. Otherwise reopen the task; do not silently pass it.
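
The memory-precedes-status invariant can be enforced mechanically. The following sketch refuses to flip a task's frontmatter until the memory file exists and is non-empty. The function name `reopen_task` and the `---`-delimited frontmatter layout are assumptions for illustration; the sketch never touches `state.yaml`.

```python
from pathlib import Path


def reopen_task(task_file: Path, memory_file: Path, new_status: str = "pending") -> None:
    """Flip a task's frontmatter status, but only after audit notes are written."""
    if new_status not in {"pending", "in_progress"}:
        raise ValueError("a reopened task goes back to pending or in_progress")
    if not memory_file.is_file() or not memory_file.read_text(encoding="utf-8").strip():
        raise RuntimeError("write memory/qa-execution.md before flipping status")

    lines = task_file.read_text(encoding="utf-8").splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        raise ValueError("task file has no frontmatter block")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            break  # end of frontmatter; never edit the body
        if lines[index].startswith("status:"):
            ending = "\n" if lines[index].endswith("\n") else ""
            lines[index] = f"status: {new_status}{ending}"
            task_file.write_text("".join(lines), encoding="utf-8")
            return
    raise ValueError("no status: line found in frontmatter")
```

Raising instead of writing the memory file automatically keeps the ordering explicit: the auditor has to produce real notes first.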

Step 6: Quality Gates Verdict

1. Re-run the canonical verification gate from scratch after the last code change made during the audit.
2. Compile the Quality Gates section of `audit-report.md`. Each gate is `PASS` / `FAIL` / `N/A`:
   - Flaky rate <2% in canonical suite.
   - Zero `FAIL` from AI test-hygiene audit on P0/P1 tasks.
   - Zero `Critical` / `High` issues open.
   - Coverage delta ≥ baseline (no regression).
   - Zero unresolved `flaky-suspect` on P0 flows.
3. A `FAIL` on any gate blocks an unconditional PASS verdict for the run.
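
The gate arithmetic is simple enough to sketch. The example below evaluates the flaky-rate gate and the "any FAIL blocks an unconditional PASS" rule. Function names are hypothetical, and treating an empty suite as `N/A` is an assumption rather than something the text specifies.

```python
from __future__ import annotations

GATE_RESULTS = {"PASS", "FAIL", "N/A"}


def flaky_rate_gate(flaky_tests: int, total_tests: int) -> str:
    """PASS when the flaky rate is strictly below 2% of the canonical suite."""
    if total_tests <= 0:
        return "N/A"  # assumption: no suite means the gate does not apply
    # Integer comparison avoids floating-point edge cases at exactly 2%.
    return "PASS" if flaky_tests * 100 < 2 * total_tests else "FAIL"


def allows_unconditional_pass(gates: dict[str, str]) -> bool:
    """A single FAIL on any gate blocks an unconditional PASS for the run."""
    unknown = set(gates.values()) - GATE_RESULTS
    if unknown:
        raise ValueError(f"unexpected gate results: {sorted(unknown)}")
    return "FAIL" not in gates.values()
```

With 2 flaky tests out of 100, the rate is exactly 2%, which is not below the threshold, so the gate fails.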

Step 7: Write the Audit Report

1. Summarize the audit using `assets/audit-report-template.md` and write the report to `<audit-output-path>/audit/audit-report.md`.
2. Mandatory sections:
   - Claim / Command / Exit code / Verdict per command executed in Step 2 and Step 6.
   - AUTOMATED COVERAGE: support detected, harness, canonical command, required flows with classification, specs added or updated.
   - TASK IMPLEMENTATION AUDIT: Compozy slug, plan sources, matrix totals, per-task verdicts, reopened/fixed/blocked tasks, links to bugs.
   - SUITE HEALTH SNAPSHOT: flaky rate, flaky events list, mutation score (when harness exists), coverage delta vs baseline, blocked count, manual-only count, AI audit findings count.
   - QUALITY GATES: PASS/FAIL/N/A per gate.
   - ISSUES FILED: total, by severity, with `Reopens task:` annotations.
3. When running in a Compozy slug, the final `audit-report.md` PASS feeds cy-codex-loop's `verify.last_status=PASS` precondition for Phase E. Do not call `update-state.py`; cy-codex-loop owns that mutation.
4. Report blocked scenarios, missing credentials, or environment gaps with the exact command or prerequisite that stopped execution.

Error Handling

- If command discovery returns multiple plausible gates, prefer the broadest repository-defined command and explain the tie-breaker.
- If E2E support signals are weak or contradictory, prefer explicit config files and runnable commands before claiming the repository supports E2E.
- If no canonical verify command exists, read `references/project-signals.md`, choose the broadest safe install, lint, test, and build commands for the detected ecosystem, and state that assumption explicitly.
- If a required live dependency is unavailable, validate every local boundary that does not require the missing dependency and report the blocked live validation separately.
- If a failure appears unrelated to the audited tasks, prove that with a clean reproduction before excluding it from the audit scope.
- If the repository has an E2E harness but credentials, runtime services, or test data prevent execution, keep the affected flow classified as `blocked` and report the exact prerequisite that is missing.
- If `task_NN.md` files are marked `status: completed` but contain unchecked subtasks, missing deliverables, or unverified criteria, do not call the audit a pass. Write `memory/qa-execution.md` first, then edit frontmatter `status:` back to `pending` or `in_progress`, and file `BUG-<num>.md` per Step 5. Never write to `state.yaml`.
- If a test fails and passes on retry without a code change, do not promote to PASS. Register as `flaky-suspect` per `references/flaky-triage.md`, record the event in the Suite Health Snapshot, and treat any unresolved `flaky-suspect` on a P0 flow as a blocker for the final verdict.
- If the AI test-hygiene scan (Step 4) detects weakened assertions, skipped tests, or mocks hiding integration in a task with `declared_status: completed`, do not call the audit a pass. Apply the verdict matrix in `references/ai-implementation-audit.md`, file `BUG-<num>.md` with Type `Functional`, and flip frontmatter `status:` per Step 5.

Companion: qa-execution

`agent-output-audit` validates that the implementing AI agent did what it claimed. `qa-execution` validates that a real human user can succeed at the product. They are complementary, not redundant:

- Run `agent-output-audit` to certify that `task_NN.md` `status: completed` reflects real work.
- Run `qa-execution` to certify that the product, taken as a whole, is acceptable to end users.

A Compozy slug typically wants both: audit the task implementations, then exercise the resulting product through user-flow QA. They share no output directory, no bug taxonomy, and no procedures — keep them separate.
