printing-press-output-review

printing-press-output-review (internal)

Review the sampled outputs from a printed CLI for plausibility bugs that dogfood, verify, and the rule-based `scorecard --live-check` rules can't catch. Wave B policy: all findings surface as warnings, never errors.

This skill is internal-only (`user-invocable: false`). It is invoked by its parents: the main printing-press skill at shipcheck Phase 4.85, and the polish skill during its diagnostic loop. Running it standalone would produce floating findings text with no ship verdict, no fixes applied, and no publish offer; the actionable wrappers are `/printing-press` and `/printing-press-polish`. The skill carries `context: fork` so the reviewer agent's diagnostic chatter stays isolated from the calling skill's context.

Input

The caller passes `$CLI_DIR` as the argument: an absolute path to the printed CLI's working directory.

What this catches

Bugs that rule-based checks miss, typically surfaced by 5 minutes of hands-on testing but slipping past dogfood, verify, and `scorecard --live-check` rules:
  • Substring-match results that coincidentally contain the query but don't match semantically (e.g., a query matches a substring of a larger unrelated term)
  • Aggregation commands silently dropping sources when only some of the requested N come back
  • Ranking or sort commands returning top-N results that aren't plausibly the best for the query (broken weights, extractor fallbacks)
  • URLs in output pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks
  • Format bugs the rule-based layer doesn't catch (mojibake, inconsistent pluralization, truncated/wrapped cell content)
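The first bullet's failure mode is easy to reproduce with a two-line sketch (illustrative values only, not output from any real printed CLI): a naive containment check passes even though the result is semantically unrelated to the query.

```bash
# The query token "butter" is a substring of "Buttermilk", so a naive
# case-insensitive containment check reports a match for a result that
# a human would immediately reject.
query="butter"
result_title="Buttermilk Pancakes"
if printf '%s' "$result_title" | grep -qi "$query"; then
  echo "mechanical check: PASS (token found)"
fi
```

This is exactly the gap between a token-presence rule and the semantic judgment this skill delegates to the reviewer agent.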

Procedure

Step 1: Gather sample data

```bash
printing-press scorecard --dir "$CLI_DIR" --live-check --json > /tmp/output-review-livecheck.json 2>&1 || true
```

If the scorecard call fails or `/tmp/output-review-livecheck.json` is empty, return the SKIP result (Step 3) without dispatching the reviewer.
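A minimal sketch of that skip guard, assuming only POSIX shell (the real skill's logic may differ, and grep here is a crude stand-in for a JSON parser):

```bash
# Skip the reviewer dispatch when the live-check file is missing, empty,
# or contains no features data. Prints nothing when the file is usable.
f=/tmp/output-review-livecheck.json
if [ ! -s "$f" ] || ! grep -q '"features"' "$f"; then
  echo "status: SKIP"
  echo "reason: no usable live-check data"
fi
```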

Step 2: Dispatch the reviewer agent

Use the Agent tool (general-purpose) with this prompt contract:

Review the sampled outputs from the shipped CLI at `$CLI_DIR`. You have these ground-truth sources:
  • Sampled command output: read `/tmp/output-review-livecheck.json` and inspect the `live_check.features[]` array. Each entry has the command, example invocation, actual stdout (in `output_sample`, bounded to ~4 KiB), the pass/fail reason, and a `warnings` array (populated by rule-based checks like the raw-HTML-entity detector).
  • Review only `status: pass` entries. Entries with `status: fail` either crashed, timed out, or had placeholder args (`<id>`, `<url>`) that never produced real output; their sample is empty and there's nothing for you to judge. Phase 5 dogfood handles test-coverage and exit-code concerns.
  • `$CLI_DIR/research.json`: `novel_features` (planned behavior per feature) and `novel_features_built` (verified built commands).
  • The CLI binary at `$CLI_DIR/<cli-name>-pp-cli`: you may invoke additional commands to gather more output when a finding needs verification.
For each of these checks, report findings under 50 words each. Only report issues a human user would notice in 5 minutes of hands-on testing, not every edge case a thorough QA pass might find:
  1. Output semantically matches query intent. For sampled novel features with a query argument, judge relevance beyond what the mechanical query-token check in live-check already enforced. A feature that passed live-check's `outputMentionsQuery` test still contains some query token somewhere, but "buttermilk" appearing as a substring of "butter" results, or "brownies" returning a chili recipe because the extractor fell back to adjacent content, both slip past the mechanical check. Only flag when a human user would look at the top results and say "this isn't what I asked for." Skip this check when the example has no query argument.
  2. No obvious format bugs. Does the output contain raw HTML entities, mojibake (question marks or replacement chars in titles), or malformed URLs (pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks)? Rule-based live-check catches numeric entities; this layer catches the broader class.
  3. Aggregation commands show all requested sources. For commands with a `--source`/`--site`/`--region` CSV flag: if the user requested N sources, does the output show N, or does stderr explain the missing ones? Silent drops of failed sources are a top failure mode for fan-out commands.
  4. Result ordering/ranking makes sense. For commands that claim to rank or sort, does the top result look plausibly best given the query? Watch for broken score weights, off-by-one sort bugs, and silent fallback to recency when relevance computation fails.
Return a list of findings. For each: check name, severity (`warning` in Wave B; `error` reserved for Wave C), one-line description, one-sentence fix suggestion. If the CLI passes all four checks, return "PASS — no findings."
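To make the first ground-truth source concrete, here is an illustrative filter over synthetic data (the real file nests entries under `live_check.features[]`, and a real reviewer would use a JSON parser rather than grep; the demo filename and values are made up):

```bash
# Synthetic stand-in for the live-check sample file, compacted to one
# feature per line so grep can approximate the selection.
cat > /tmp/livecheck-demo.json <<'EOF'
{"command":"search","status":"pass","output_sample":"Buttermilk Pancakes"}
{"command":"fetch","status":"fail","output_sample":""}
EOF
# Keep only the pass entries the reviewer is asked to judge;
# fail entries have empty samples and nothing to review.
grep '"status":"pass"' /tmp/livecheck-demo.json
```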

Step 3: Emit the structured result block

End the skill response with a `---OUTPUT-REVIEW-RESULT---` block the parent parses.

On clean pass:

```
---OUTPUT-REVIEW-RESULT---
status: PASS
findings: []
---END-OUTPUT-REVIEW-RESULT---
```

On warnings:

```
---OUTPUT-REVIEW-RESULT---
status: WARN
findings:
- check: <check-name>
  severity: warning
  description: <one-line>
  suggestion: <one-sentence>
- ...
---END-OUTPUT-REVIEW-RESULT---
```

On reviewer failure (timeout, agent-budget exhaustion, missing live-check data):

```
---OUTPUT-REVIEW-RESULT---
status: SKIP
reason: <one-line description>
findings: []
---END-OUTPUT-REVIEW-RESULT---
```
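How a parent might consume this block can be sketched in shell (an assumption about caller behavior; the actual printing-press parser is not shown in this document, and the sample file below is synthetic):

```bash
# Build a fake skill response containing the result block.
cat > /tmp/sample-response.txt <<'EOF'
...reviewer chatter...
---OUTPUT-REVIEW-RESULT---
status: WARN
findings:
- check: query-intent
  severity: warning
  description: top result unrelated to query
  suggestion: tighten the relevance filter
---END-OUTPUT-REVIEW-RESULT---
EOF
# Extract everything between the start and end markers (inclusive),
# then pull the status line out of the extracted block.
block=$(sed -n '/^---OUTPUT-REVIEW-RESULT---$/,/^---END-OUTPUT-REVIEW-RESULT---$/p' /tmp/sample-response.txt)
status=$(printf '%s\n' "$block" | sed -n 's/^status: *//p')
echo "parsed status: $status"   # prints "parsed status: WARN"
```

Because the markers are on lines of their own, a line-oriented range match is enough; the reviewer's chatter before the block is ignored.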

Wave B policy (current)

  • All findings surface as `warning`, never `error`. Shipcheck proceeds regardless.
  • The caller logs findings to the run's artifact directory (e.g., `manuscripts/<api>/<run>/proofs/phase-4.85-findings.md`) and surfaces them to the user. Findings are not persisted to `scorecard.json`; that path is reserved for Wave C.
  • The user decides case by case whether to fix before shipping.

Non-interactive contract (CI, cron, batch regeneration):
  • If stdout is not a TTY, callers follow fail-open-with-log: findings are recorded and shipcheck proceeds without prompting.
  • `status: SKIP` (reviewer crash, timeout, missing data) is informational; shipcheck does not block on it.
  • There is no `--auto-approve-warnings` flag yet. The Wave B policy is already "warnings don't block," so the flag would have nothing to gate.

Wave C (a separate future PR) will flip `error`-severity findings to blocking once calibration data across the library shows a false-positive rate below 10%.
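The fail-open rule for non-interactive callers reduces to a single TTY test; this is a sketch of the caller's side, not code from printing-press itself:

```bash
# Prompt only when stdout is attached to a terminal; in CI, cron, or a
# pipeline, record findings and let shipcheck proceed.
if [ -t 1 ]; then
  echo "interactive: would prompt the user about findings"
else
  echo "non-interactive: logging findings and proceeding"
fi
```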

Why agentic vs template-only

Output-plausibility questions are not pattern-matchable against source. Rule-based live-check rules cover what regexes can (numeric HTML entities, query-token absence). Everything else — "are these substitution results plausibly correct for the query?", "does the top search result look related?" — is an LLM-shaped question. The token cost is bounded (once per run, not per command) and the catch rate against the bug classes that motivated this phase justifies the dispatch.

Known blind spots

  • Can't verify numeric accuracy (prices, ratings, rankings vs ground truth). If the CLI says a recipe has 4.8 stars and it actually has 4.2, this skill won't catch it.
  • Can't detect data-freshness issues (recipe published 2019 vs 2024). These need live comparison against authoritative sources.
  • Can't judge subjective preferences ("is this the best recipe for chocolate chip cookies?").
  • Sampled outputs only: covers the commands in `live_check.features[]`. Full command-tree coverage belongs in Phase 5 dogfood.
  • Non-English output: the reviewer's query-intent check assumes English-language query/output. For non-English CLIs, calibrate the prompt separately.