# printing-press-output-review (internal)
Review the sampled outputs from a printed CLI for plausibility bugs that dogfood, verify, and the rule-based `scorecard --live-check` rules can't catch. Wave B policy: all findings surface as warnings, never errors.

This skill is internal-only (`user-invocable: false`). It's invoked by parents: the main printing-press skill at shipcheck Phase 4.85, and the polish skill during its diagnostic loop. Running it standalone would produce floating findings text with no ship verdict, no fixes applied, and no publish offer; the actionable wrappers are `/printing-press` and `/printing-press-polish`. The skill carries `context: fork` so the reviewer agent's diagnostic chatter stays isolated from the calling skill's context.

## Input
The caller passes `$CLI_DIR` as the argument: an absolute path to the printed CLI's working directory.

## What this catches
Bugs that rule-based checks miss: typically surfaced by 5 minutes of hands-on testing, but slipping past dogfood, verify, and the `scorecard --live-check` rules.

- Substring-match results that coincidentally contain the query but don't match semantically (e.g., a query matches a substring of a larger, unrelated term)
- Aggregation commands silently dropping sources when only some of the requested N come back
- Ranking or sort commands returning top-N results that aren't plausibly the best for the query (broken weights, extractor fallbacks)
- URLs in output pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks
- Format bugs the rule-based layer doesn't catch (mojibake, inconsistent pluralization, truncated/wrapped cell content)
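The first bullet's failure mode reduces to a naive token check. A minimal sketch, not the CLI's actual matcher: the query string does appear in the output, so a mechanical check passes, yet a human would reject the result.

```shell
# Naive token check (illustrative): "butter" is a substring of
# "Buttermilk", so the mechanical check passes even though the
# result is not semantically about butter.
query="butter"
result="Buttermilk Pancakes (4.6 stars)"
if printf '%s\n' "$result" | grep -qi "$query"; then
  echo "mechanical check: pass"
fi
```

This is exactly the class of match the reviewer agent is asked to judge beyond the mechanical layer.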
## Procedure

### Step 1: Gather sample data
```bash
printing-press scorecard --dir "$CLI_DIR" --live-check --json > /tmp/output-review-livecheck.json 2>&1 || true
```

If the scorecard call fails or `/tmp/output-review-livecheck.json` is empty, return the SKIP result (Step 3) without dispatching the reviewer.
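A minimal guard sketch for that empty-or-missing case; the temp file stands in for the scorecard output, and an empty file models a call that failed or wrote nothing:

```shell
# Skip the reviewer when live-check produced no usable sample data.
sample="$(mktemp)"    # stands in for /tmp/output-review-livecheck.json
if [ ! -s "$sample" ]; then
  status=SKIP
  reason="live-check sample missing or empty"
fi
echo "status: ${status:-CONTINUE}"
rm -f "$sample"
```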
### Step 2: Dispatch the reviewer agent
Use the Agent tool (general-purpose) with this prompt contract:
Review the sampled outputs from the shipped CLI at `$CLI_DIR`. You have these ground-truth sources:

- Sampled command output: read `/tmp/output-review-livecheck.json` and inspect the `live_check.features[]` array. Each entry has the command, example invocation, actual stdout (in `output_sample`, bounded to ~4 KiB), the pass/fail reason, and a `warnings` array (populated by rule-based checks like the raw-HTML-entity detector).
- Review only `status: pass` entries. Entries with `status: fail` either crashed, timed out, or had placeholder args (`<id>`, `<url>`) that never produced real output; their sample is empty and there's nothing for you to judge. Phase 5 dogfood handles test-coverage and exit-code concerns.
- `$CLI_DIR/research.json`: `novel_features` (planned behavior per feature) and `novel_features_built` (verified built commands).
- The CLI binary at `$CLI_DIR/<cli-name>-pp-cli` — you may invoke additional commands to gather more output when a finding needs verification.

For each of these checks, report findings under 50 words each. Only report issues a human user would notice in 5 minutes of hands-on testing, not every edge case a thorough QA pass might find:

- Output semantically matches query intent. For sampled novel features with a query argument, judge relevance beyond what the mechanical query-token check in live-check already enforced. A feature that passed live-check's `outputMentionsQuery` test still contains some query token somewhere — but "butter" matching only as a substring of "buttermilk" results, or "brownies" returning a chili recipe because the extractor fell back to adjacent content, both slip past the mechanical check. Only flag when a human user would look at the top results and say "this isn't what I asked for." Skip this check when the example has no query argument.
- No obvious format bugs. Does the output contain raw HTML entities, mojibake (question marks or replacement chars in titles), or malformed URLs (pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks)? Rule-based live-check catches numeric entities; this layer catches the broader class.
- Aggregation commands show all requested sources. For commands with a `--source`/`--site`/`--region` CSV flag: if the user requested N sources, does output show N, or does stderr explain the missing ones? Silent drops of failed sources are a top failure mode for fan-out commands.
- Result ordering/ranking makes sense. For commands that claim to rank or sort, does the top result look plausibly best given the query? Watch for broken score weights, off-by-one sort bugs, and silent fallback to recency when relevance computation fails.

Return a list of findings. For each: check name, severity (`warning` in Wave B; `error` reserved for Wave C), one-line description, one-sentence fix suggestion. If the CLI passes all four checks, return "PASS — no findings."
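The aggregation check reduces to a count comparison. A sketch under assumed shapes: the `--source` CSV value and the per-source output lines here are illustrative, not the CLI's real format.

```shell
# Compare requested vs delivered sources for a fan-out command.
requested="seriouseats,bonappetit,epicurious"
delivered_output="seriouseats: 3 results
bonappetit: 2 results"

# One requested source per CSV token; one delivered source per summary line.
want=$(printf '%s\n' "$requested" | tr ',' '\n' | wc -l)
got=$(printf '%s\n' "$delivered_output" | grep -c ': .* results')
if [ "$got" -lt "$want" ]; then
  echo "warning: requested $want sources, output shows $got (silent drop?)" >&2
fi
```

A passing CLI would either show all three sources or explain the missing one on stderr.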
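For orientation, a hypothetical `live_check.features[]` entry shaped after the field descriptions above. Only `status`, `output_sample`, and `warnings` are named by this skill; the other key spellings, the CLI name, and all values are assumptions, not the verified schema:

```json
{
  "command": "search",
  "example": "recipes-pp-cli search \"buttermilk pancakes\"",
  "status": "pass",
  "reason": "exit 0, output mentions query token",
  "output_sample": "1. Buttermilk Pancakes (4.6 stars)\n   https://example.com/recipes/buttermilk-pancakes\n...",
  "warnings": []
}
```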
### Step 3: Emit the structured result block

End the skill response with a `---OUTPUT-REVIEW-RESULT---` block the parent parses.

On clean pass:

```
---OUTPUT-REVIEW-RESULT---
status: PASS
findings: []
---END-OUTPUT-REVIEW-RESULT---
```

On warnings:

```
---OUTPUT-REVIEW-RESULT---
status: WARN
findings:
  - check: <check-name>
    severity: warning
    description: <one-line>
    suggestion: <one-sentence>
  - ...
---END-OUTPUT-REVIEW-RESULT---
```

On reviewer failure (timeout, agent-budget exhaustion, missing live-check data):

```
---OUTPUT-REVIEW-RESULT---
status: SKIP
reason: <one-line description>
findings: []
---END-OUTPUT-REVIEW-RESULT---
```
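A sketch of how a parent could pull the block out of a captured response; the `sed` range uses the exact markers above, and the sample response is fabricated for illustration:

```shell
# Extract the structured result block from a skill response transcript.
response="$(mktemp)"
cat > "$response" <<'EOF'
...diagnostic chatter the parent ignores...
---OUTPUT-REVIEW-RESULT---
status: PASS
findings: []
---END-OUTPUT-REVIEW-RESULT---
EOF
# Print only the lines between the start and end markers, inclusive.
block="$(sed -n '/^---OUTPUT-REVIEW-RESULT---$/,/^---END-OUTPUT-REVIEW-RESULT---$/p' "$response")"
echo "$block"
rm -f "$response"
```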
## Wave B policy (current)
- All findings surface as `warning`, never `error`. Shipcheck proceeds regardless.
- The caller logs findings to the run's artifact directory (e.g., `manuscripts/<api>/<run>/proofs/phase-4.85-findings.md`) and surfaces them to the user. Findings are not persisted to `scorecard.json`; that path is reserved for Wave C.
- The user decides case by case whether to fix before shipping.

Non-interactive contract (CI, cron, batch regeneration):

- If stdout is not a TTY, callers follow fail-open-with-log: findings are recorded and shipcheck proceeds without prompting.
- `status: SKIP` (reviewer crash, timeout, missing data) is informational; shipcheck does not block on it.
- No `--auto-approve-warnings` flag yet. The policy is already "warnings don't block" in Wave B, so the flag would have no effect to gate.

Wave C (a separate future PR) will flip `error`-severity findings to blocking once calibration data across the library shows a false-positive rate below 10%.
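The fail-open branch of the non-interactive contract can be sketched as follows; the wiring is assumed, not the shipped caller code:

```shell
# Fail-open-with-log: in non-interactive contexts, record findings and
# continue; only prompt when stdout is attached to a terminal.
findings_log="$(mktemp)"   # stands in for the run's artifact directory
printf 'check: output-semantics\nseverity: warning\n' > "$findings_log"
if [ -t 1 ]; then
  mode=prompt      # interactive: surface findings and ask the user
else
  mode=fail-open   # CI/cron: log and proceed without prompting
fi
echo "mode: $mode"
rm -f "$findings_log"
```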
## Why agentic vs template-only
Output-plausibility questions are not pattern-matchable against source. Rule-based live-check rules cover what regexes can (numeric HTML entities, query-token absence). Everything else — "are these substitution results plausibly correct for the query?", "does the top search result look related?" — is an LLM-shaped question. The token cost is bounded (once per run, not per command) and the catch rate against the bug classes that motivated this phase justifies the dispatch.
## Known blind spots
- Can't verify numeric accuracy (prices, ratings, rankings vs ground-truth). If the CLI says a recipe has 4.8 stars and it actually has 4.2, this skill won't catch it.
- Can't detect data-freshness issues (recipe published 2019 vs 2024). These need live comparison against authoritative sources.
- Can't judge subjective preferences ("is this the best recipe for chocolate chip cookies?").
- Sampled outputs only: covers the commands in `live_check.features[]`. Full command-tree coverage belongs in Phase 5 dogfood.
- Non-English output: the reviewer's query-intent check assumes English-language query/output. For non-English CLIs, calibrate the prompt separately.