result-diagnosis
Result Diagnosis
Diagnose what an experiment result means for the project. This skill is for decision-making after results exist, especially when they are negative, surprising, unstable, or hard to interpret.
Use this skill when:
- a method does not improve over baseline
- results vary strongly across seeds
- a metric improves but another metric worsens
- a baseline unexpectedly wins
- a plot or table looks suspicious
- a result may be caused by an implementation bug, metric bug, data issue, or unfair comparison
- early experiments suggest revising the algorithm or paper claim
- the user asks "what does this result mean?" or "what should we do next?"
Do not use this skill to write a polished report. Pair it with `experiment-report-writer` after the diagnosis is clear.
Pair this skill with:
- `research-project-memory` when the diagnosis should update claims, evidence, risks, actions, or worktree status
- `experiment-report-writer` when results need a shareable report
- `algorithm-design-planner` when the diagnosis points to method revision
- `experiment-design-planner` when the diagnosis requires a new controlled experiment
- `run-experiment` when the next step is a rerun, sanity check, or ablation
- `conference-writing-adapter` when the right action is to narrow or reframe paper claims
Skill Directory Layout

    <installed-skill-dir>/
    ├── SKILL.md
    └── references/
        ├── diagnosis-taxonomy.md
        ├── evidence-audit.md
        ├── next-decision-rules.md
        ├── report-template.md
        └── triage-protocol.md

Progressive Loading
- Always read `references/diagnosis-taxonomy.md`, `references/triage-protocol.md`, and `references/next-decision-rules.md`.
- Read `references/evidence-audit.md` when inspecting logs, configs, metrics, plots, runs, or code state.
- Use `references/report-template.md` for full diagnosis reports.
- If a result depends on current SOTA, benchmark conventions, or recent baseline performance, verify current sources with web search or user-provided papers.
Core Principles
- Diagnose before optimizing.
- Separate observed result from interpretation.
- Prefer simple sanity checks before expensive reruns.
- Treat negative results as information: they may kill a claim, not the whole project.
- Do not blame the algorithm before checking implementation, data, metric, baseline, and selection rules.
- Do not blame implementation forever when repeated controlled evidence falsifies the claim.
- Every diagnosis should end with a decision: debug, rerun, ablate, revise method, narrow claim, write, park, or kill.
- Record uncertainty explicitly.
Step 1 - Define the Result and Expected Behavior
Extract:
- experiment question and linked claim
- method and baseline
- dataset/split
- metrics and expected direction
- observed result
- number of seeds/repeats
- configs, commit, logs, tables, and figures
- what result was expected and why
- whether this result affects paper claims or only internal debugging
Rewrite vague input into:

    Expected [method] to improve [metric/diagnostic] over [baseline] on [setting], but observed [result] under [controls].

If expected behavior was never defined, route back to `experiment-design-planner`.
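The rewrite template in this step can be filled in mechanically once the fields are extracted. Below is a minimal sketch, assuming a hypothetical `ResultStatement` record whose fields mirror the bracketed slots; the class name, field names, and example values are illustrative and not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class ResultStatement:
    """Hypothetical record; fields mirror the bracketed slots in the Step 1 template."""
    method: str
    metric: str
    baseline: str
    setting: str
    observed: str
    controls: str

    def render(self) -> str:
        """Fill the Step 1 template with the extracted fields."""
        return (
            f"Expected {self.method} to improve {self.metric} over "
            f"{self.baseline} on {self.setting}, but observed "
            f"{self.observed} under {self.controls}."
        )

    def missing_fields(self) -> list[str]:
        """Names of slots left empty, so gaps are flagged rather than guessed."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Placeholder values for illustration only.
statement = ResultStatement(
    method="method A", metric="accuracy", baseline="baseline B",
    setting="dataset C", observed="no change", controls="",
)
print(statement.missing_fields())  # ['controls']
```

An empty slot such as `controls` here is a cue to ask for the missing information before moving on to classification.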
Step 2 - Classify the Symptom
Read `references/diagnosis-taxonomy.md`.
Classify the primary symptom:
- no improvement
- regression
- instability or high variance
- metric conflict
- suspiciously large gain
- baseline unexpectedly strong
- diagnostic/performance mismatch
- training failure or divergence
- reproducibility failure
- plot/table inconsistency
- result contradicts paper story
Then classify likely diagnosis categories:
- implementation bug
- metric/evaluation bug
- data/split/preprocessing issue
- unfair baseline or tuning issue
- seed variance or insufficient repeats
- optimization/hyperparameter issue
- method mechanism failure
- scale/regime mismatch
- claim/evidence mismatch
- expected negative result
Step 3 - Gather Evidence
Read `references/evidence-audit.md`.
Prefer primary artifacts:
- config diffs
- run commands
- git commit
- logs and stderr
- metric files
- checkpoints
- seeds
- dataset versions and split hashes
- plots and tables
- previous baseline runs
- implementation changes
Mark missing evidence rather than guessing.
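Config diffs are usually the cheapest artifact to check first. The sketch below shows one way to compare a baseline config against a method config, assuming both have already been loaded into nested dictionaries; `flatten`, `config_diff`, and the `<missing>` marker are hypothetical helpers, not part of the skill.

```python
from typing import Any

_MISSING = "<missing>"  # marks a key present in only one config


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"optim": {"lr": 1}} -> {"optim.lr": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_diff(baseline: dict[str, Any], method: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (baseline_value, method_value)} for every key that differs."""
    a, b = flatten(baseline), flatten(method)
    return {
        key: (a.get(key, _MISSING), b.get(key, _MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    }
```

Any difference beyond the intended change (for example a different learning rate or batch size) is a candidate fairness issue to carry into triage.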
Step 4 - Run Triage
Read `references/triage-protocol.md`.
Use this order:
- Reproducibility and provenance: correct commit, config, data, seed, output path.
- Metric and evaluation: metric direction, aggregation, split, leakage, postprocessing.
- Baseline fairness: same budget, tuning, checkpoint rule, data, sampler, and code path.
- Implementation sanity: feature flag, tensor shapes, gradient flow, loss scale, train/eval mode.
- Statistical stability: seeds, variance, confidence intervals, outliers.
- Mechanism diagnostic: whether the intended mechanism changed.
- Claim alignment: whether the result supports, weakens, or falsifies the paper claim.
Stop early only when a blocking bug or invalid comparison is found.
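For the statistical stability check, a rough first pass is to compare the gap between method and baseline against the seed-to-seed noise. The sketch below assumes independent per-seed scores and uses a threshold of two standard errors; the function names and the `z = 2.0` default are illustrative assumptions, not rules from `references/triage-protocol.md`.

```python
import math
import statistics


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_report(method: list[float], baseline: list[float], z: float = 2.0) -> dict:
    """Compare per-seed scores and flag gaps smaller than z standard errors.

    The z threshold is an illustrative assumption, not a rule from the skill.
    """
    m_mean, m_std = summarize(method)
    b_mean, b_std = summarize(baseline)
    gap = m_mean - b_mean
    # Standard error of the difference between two independent means.
    stderr = math.sqrt(m_std**2 / len(method) + b_std**2 / len(baseline))
    return {
        "gap": gap,
        "stderr": stderr,
        "within_noise": abs(gap) <= z * stderr,
    }
```

With only a handful of seeds this is a coarse screen rather than a significance test. If method and baseline share seeds, a paired comparison of per-seed differences is usually more informative.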
Step 5 - Build Competing Explanations
For each plausible explanation, state:
- evidence for it
- evidence against it
- cheapest test that would distinguish it
- decision if true
At minimum consider:
- bug
- bad metric
- weak experiment design
- baseline too strong or under-tuned
- hyperparameter issue
- mechanism false
- claim too broad
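One way to keep competing explanations comparable is to record the same four items for each and then run the cheapest distinguishing tests first. The sketch below is a hypothetical structure; the `Explanation` class, the `test_cost_hours` estimate, and `order_by_test_cost` are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    """One candidate explanation with the four items Step 5 asks for."""
    name: str
    evidence_for: list[str] = field(default_factory=list)
    evidence_against: list[str] = field(default_factory=list)
    cheapest_test: str = ""
    test_cost_hours: float = 0.0  # hypothetical cost estimate, not part of the skill
    decision_if_true: str = ""


def order_by_test_cost(explanations: list[Explanation]) -> list[Explanation]:
    """Return explanations that have a distinguishing test, cheapest test first."""
    testable = [e for e in explanations if e.cheapest_test]
    return sorted(testable, key=lambda e: e.test_cost_hours)
```

Explanations with no distinguishing test are left out of the queue, which makes it visible that they still need a test before they can be confirmed or ruled out.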
Step 6 - Choose Next Decision
Read `references/next-decision-rules.md`.
Choose one primary decision:
- `debug`: result is not trustworthy until a bug or provenance issue is resolved
- `rerun`: result is plausible but underpowered or missing controls
- `ablate`: result needs mechanism isolation
- `revise-method`: mechanism likely needs design change
- `narrow-claim`: evidence supports a smaller or different claim
- `write`: evidence is trustworthy enough to report
- `park`: result is inconclusive and not worth immediate compute
- `kill`: claim or direction is falsified under fair controls
Do not pick `write` if basic provenance or fairness is unresolved.
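The guard on `write` can be enforced mechanically. The sketch below covers only the rule stated in this step, not the fuller rules in `references/next-decision-rules.md`; `check_decision` and its boolean flags are hypothetical names.

```python
DECISIONS = (
    "debug", "rerun", "ablate", "revise-method",
    "narrow-claim", "write", "park", "kill",
)


def check_decision(decision: str, provenance_resolved: bool, fairness_resolved: bool) -> str:
    """Validate a chosen primary decision against the guard stated in Step 6."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "write" and not (provenance_resolved and fairness_resolved):
        raise ValueError("cannot pick 'write' while provenance or fairness is unresolved")
    return decision
```

Raising an error, rather than silently downgrading to `debug`, keeps the choice of fallback decision with the person doing the diagnosis.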
Step 7 - Write the Diagnosis
Use `references/report-template.md` for full reports.
If saving to a project and no path is given, use:

    docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md

Required output:

    # Result Diagnosis: [Short Name]
    ## Result Snapshot
    ## Expected vs Observed
    ## Symptom Classification
    ## Evidence Checked
    ## Competing Explanations
    ## Most Likely Diagnosis
    ## Decision
    ## Next Checks or Actions
    ## Claim Impact
    ## Project Memory Writeback

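The default save path from this step can be built programmatically. The sketch below assumes a lowercase, hyphen-separated slug for `<short-name>`; the skill does not define a slug rule, and `diagnosis_path` is a hypothetical helper.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def diagnosis_path(short_name: str, day: Optional[date] = None) -> Path:
    """Build docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md."""
    # Slug rule (lowercase, hyphen-separated) is an assumption, not defined by the skill.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    stamp = (day or date.today()).isoformat()
    return Path("docs/diagnosis") / f"result_diagnosis_{stamp}_{slug}.md"
```

Passing `day` explicitly keeps the file name reproducible when a diagnosis is written up on a later date than the run it describes.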
Step 8 - Write Back to Project Memory
If the project uses `research-project-memory`, update:
- `memory/evidence-board.md`: observed result, limitations, and source paths
- `memory/claim-board.md`: claims supported, weakened, revised, unsupported, or cut
- `memory/risk-board.md`: bugs, metric risks, baseline risks, mechanism risks, or claim risks
- `memory/action-board.md`: debug, rerun, ablation, method revision, writing, park, or kill actions
- `memory/decision-log.md`: durable decisions such as killing a claim, changing method, or narrowing scope
- worktree `.agent/worktree-status.md`: latest result and exit condition if a branch/worktree is involved
Use `observed` for verified results and `inferred` for explanations. Mark stale claims explicitly.
Final Sanity Check
Before finalizing:
- observed result and interpretation are separated
- provenance and config are checked or listed as missing
- metric direction and aggregation are clear
- baseline fairness is addressed
- implementation sanity checks are considered
- seed variance and repeats are considered
- mechanism diagnostic is checked when relevant
- result is mapped to a concrete decision
- paper claim impact is explicit
- project memory is updated when present