result-diagnosis

Result Diagnosis

Diagnose what an experiment result means for the project. This skill is for decision-making after results exist, especially when they are negative, surprising, unstable, or hard to interpret.

Use this skill when:

a method does not improve over baseline
results vary strongly across seeds
one metric improves while another worsens
a baseline unexpectedly wins
a plot or table looks suspicious
a result may be caused by an implementation bug, metric bug, data issue, or unfair comparison
early experiments suggest revising the algorithm or paper claim
the user asks "what does this result mean?" or "what should we do next?"

Do not use this skill to write a polished report. Pair it with `experiment-report-writer` after the diagnosis is clear.

Pair this skill with:

`research-project-memory` when the diagnosis should update claims, evidence, risks, actions, or worktree status
`experiment-report-writer` when results need a shareable report
`algorithm-design-planner` when the diagnosis points to method revision
`experiment-design-planner` when the diagnosis requires a new controlled experiment
`run-experiment` when the next step is a rerun, sanity check, or ablation
`conference-writing-adapter` when the right action is to narrow or reframe paper claims

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── diagnosis-taxonomy.md
    ├── evidence-audit.md
    ├── next-decision-rules.md
    ├── report-template.md
    └── triage-protocol.md

Progressive Loading

Always read `references/diagnosis-taxonomy.md`, `references/triage-protocol.md`, and `references/next-decision-rules.md`.
Read `references/evidence-audit.md` when inspecting logs, configs, metrics, plots, runs, or code state.
Use `references/report-template.md` for full diagnosis reports.
If a result depends on current SOTA, benchmark conventions, or recent baseline performance, verify current sources with web search or user-provided papers.

Core Principles

Diagnose before optimizing.
Separate observed result from interpretation.
Prefer simple sanity checks before expensive reruns.
Treat negative results as information: they may kill a claim, not the whole project.
Do not blame the algorithm before checking implementation, data, metric, baseline, and selection rules.
Do not blame implementation forever when repeated controlled evidence falsifies the claim.
Every diagnosis should end with a decision: debug, rerun, ablate, revise method, narrow claim, write, park, or kill.
Record uncertainty explicitly.

Step 1 - Define the Result and Expected Behavior

Extract:

experiment question and linked claim
method and baseline
dataset/split
metrics and expected direction
observed result
number of seeds/repeats
configs, commit, logs, tables, and figures
what result was expected and why
whether this result affects paper claims or only internal debugging

Rewrite vague input into:

Expected [method] to improve [metric/diagnostic] over [baseline] on [setting], but observed [result] under [controls].

If expected behavior was never defined, route back to `experiment-design-planner`.
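
The sentence template above can be kept as a small structured record so that no slot is silently left empty. This is a minimal sketch: the `ResultStatement` class, its field names, and the example values are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultStatement:
    """Structured form of the Step 1 sentence; field names are illustrative."""
    method: str
    metric: str
    baseline: str
    setting: str
    observed: str
    controls: str

    def render(self) -> str:
        # Mirrors the template: Expected ... over ... on ..., but observed ... under ...
        return (
            f"Expected {self.method} to improve {self.metric} over "
            f"{self.baseline} on {self.setting}, but observed "
            f"{self.observed} under {self.controls}."
        )


# Hypothetical example values.
statement = ResultStatement(
    method="method-A",
    metric="accuracy",
    baseline="baseline-B",
    setting="the held-out split",
    observed="no change",
    controls="3 seeds and a matched budget",
)
print(statement.render())
```

Because every field is required, constructing the record fails loudly when the expected behavior was never defined, which is the cue to go back to experiment design.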

Step 2 - Classify the Symptom

Read `references/diagnosis-taxonomy.md`.

Classify the primary symptom:

no improvement
regression
instability or high variance
metric conflict
suspiciously large gain
baseline unexpectedly strong
diagnostic/performance mismatch
training failure or divergence
reproducibility failure
plot/table inconsistency
result contradicts paper story

Then classify likely diagnosis categories:

implementation bug
metric/evaluation bug
data/split/preprocessing issue
unfair baseline or tuning issue
seed variance or insufficient repeats
optimization/hyperparameter issue
method mechanism failure
scale/regime mismatch
claim/evidence mismatch
expected negative result
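
One cheap way to tell "seed variance or insufficient repeats" apart from a real effect is to look at per-seed differences between method and baseline. The sketch below assumes both runs share the same seeds and that higher scores are better. The function name and the two-standard-error band are illustrative choices, and the band is a rough screen rather than a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev


def paired_seed_summary(method_scores, baseline_scores):
    """Summarize per-seed differences (method minus baseline).

    Assumes both lists are ordered by the same seeds and that a higher
    score is better. The 2-standard-error band is a rough screen only.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must cover the same seeds")
    if len(method_scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    return {
        "mean_diff": mean_diff,
        "std_err": std_err,
        "within_noise": abs(mean_diff) <= 2 * std_err,
    }


# Hypothetical scores for three shared seeds.
noisy = paired_seed_summary([0.71, 0.69, 0.74], [0.70, 0.72, 0.70])
clear = paired_seed_summary([0.80, 0.81, 0.82], [0.70, 0.70, 0.70])
```

If `within_noise` is true, more repeats are usually a cheaper next step than a method change.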

Step 3 - Gather Evidence

Read `references/evidence-audit.md`.

Prefer primary artifacts:

config diffs
run commands
git commit
logs and stderr
metric files
checkpoints
seeds
dataset versions and split hashes
plots and tables
previous baseline runs
implementation changes

Mark missing evidence rather than guessing.
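
A small checklist helper can make "mark missing rather than guess" mechanical. This is a sketch under assumptions: the artifact names follow the list above, while `audit_evidence`, the `"MISSING"` marker, and the example inputs are hypothetical.

```python
REQUIRED_ARTIFACTS = (
    "config diff", "run command", "git commit", "logs and stderr",
    "metric files", "checkpoints", "seeds",
    "dataset version and split hash", "plots and tables",
    "previous baseline runs", "implementation changes",
)


def audit_evidence(found):
    """Map each artifact to its source path or note, or to 'MISSING'.

    `found` is a dict from artifact name to a path or note. Empty or
    absent entries are reported as missing instead of being filled in.
    """
    return {name: (found.get(name) or "MISSING") for name in REQUIRED_ARTIFACTS}


# Hypothetical inputs: only two artifacts were actually located.
report = audit_evidence({"git commit": "abc1234", "seeds": "0,1,2", "checkpoints": ""})
missing = [name for name, value in report.items() if value == "MISSING"]
print(missing)
```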

Step 4 - Run Triage

Read `references/triage-protocol.md`.

Use this order:

1. Reproducibility and provenance: correct commit, config, data, seed, output path.
2. Metric and evaluation: metric direction, aggregation, split, leakage, postprocessing.
3. Baseline fairness: same budget, tuning, checkpoint rule, data, sampler, and code path.
4. Implementation sanity: feature flag, tensor shapes, gradient flow, loss scale, train/eval mode.
5. Statistical stability: seeds, variance, confidence intervals, outliers.
6. Mechanism diagnostic: whether the intended mechanism changed.
7. Claim alignment: whether the result supports, weakens, or falsifies the paper claim.

Stop early only when a blocking bug or invalid comparison is found.
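
The fixed order and the stop-early rule can be sketched as a short runner. The stage names follow the list above; the `Status` values, `run_triage`, and the example checks are assumptions made for illustration, and real checks would inspect actual artifacts.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    CONCERN = "concern"      # worth recording, keep going
    BLOCKING = "blocking"    # bug or invalid comparison: stop here


TRIAGE_ORDER = (
    "reproducibility and provenance",
    "metric and evaluation",
    "baseline fairness",
    "implementation sanity",
    "statistical stability",
    "mechanism diagnostic",
    "claim alignment",
)


def run_triage(checks):
    """Run checks in the fixed order; stop only at a BLOCKING status.

    `checks` maps a stage name to a zero-argument callable returning a
    Status. Stages without a check are recorded as None (not checked).
    """
    results = {}
    for stage in TRIAGE_ORDER:
        check = checks.get(stage)
        results[stage] = check() if check else None
        if results[stage] is Status.BLOCKING:
            break
    return results


# Hypothetical checks: the baseline comparison turns out to be invalid.
results = run_triage({
    "reproducibility and provenance": lambda: Status.PASS,
    "metric and evaluation": lambda: Status.CONCERN,
    "baseline fairness": lambda: Status.BLOCKING,
    "implementation sanity": lambda: Status.PASS,
})
```

A `CONCERN` does not stop the run, so later stages are still checked; only a blocking finding ends triage early.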

Step 5 - Build Competing Explanations

For each plausible explanation, state:

evidence for it
evidence against it
cheapest test that would distinguish it
decision if true

At minimum consider:

bug
bad metric
weak experiment design
baseline too strong or under-tuned
hyperparameter issue
mechanism false
claim too broad
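
Keeping each explanation as a record makes it easy to run the cheapest distinguishing test first. A minimal sketch: the `Explanation` fields mirror the four items above, while `test_cost_hours` and the example entries are hypothetical, user-supplied estimates.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    name: str
    cheapest_test: str
    test_cost_hours: float          # rough, user-supplied estimate
    decision_if_true: str
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)


def order_by_test_cost(explanations):
    """Return explanations sorted so the cheapest distinguishing test comes first."""
    return sorted(explanations, key=lambda e: e.test_cost_hours)


# Hypothetical candidates for a "no improvement" symptom.
candidates = [
    Explanation("mechanism false", "ablate the new component", 6.0, "revise-method"),
    Explanation("metric bug", "recompute the metric on saved predictions", 0.5, "debug"),
    Explanation("seed variance", "add two more seeds", 3.0, "rerun"),
]
ordered = order_by_test_cost(candidates)
print([e.name for e in ordered])
```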

Step 6 - Choose Next Decision

Read `references/next-decision-rules.md`.

Choose one primary decision:

`debug`: result is not trustworthy until a bug or provenance issue is resolved
`rerun`: result is plausible but underpowered or missing controls
`ablate`: result needs mechanism isolation
`revise-method`: mechanism likely needs design change
`narrow-claim`: evidence supports a smaller or different claim
`write`: evidence is trustworthy enough to report
`park`: result is inconclusive and not worth immediate compute
`kill`: claim or direction is falsified under fair controls

Do not pick `write` if basic provenance or fairness is unresolved.
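
The guard on `write` can be expressed as a small validation step. This is a sketch, not the skill's own rule engine: the decision names come from the list above, while `validate_decision` and its two boolean flags are hypothetical.

```python
DECISIONS = {
    "debug", "rerun", "ablate", "revise-method",
    "narrow-claim", "write", "park", "kill",
}


def validate_decision(decision, provenance_resolved, fairness_resolved):
    """Reject unknown decisions and block `write` on unresolved basics."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if decision == "write" and not (provenance_resolved and fairness_resolved):
        raise ValueError("`write` requires resolved provenance and baseline fairness")
    return decision


# Any other decision is allowed while provenance is still open.
print(validate_decision("rerun", provenance_resolved=False, fairness_resolved=False))
```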

Step 7 - Write the Diagnosis

Use `references/report-template.md` for full reports.

If saving to a project and no path is given, use:

docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md
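
The default path can be built with a short helper. A minimal sketch: the path pattern is the one given above, but the function name and the slug rule (lowercase, other characters collapsed to `-`) are assumptions.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def default_diagnosis_path(short_name, day=None):
    """Build docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md."""
    day = day or date.today()
    # Slug rule is an assumption: lowercase, runs of other characters become "-".
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain a letter or digit")
    filename = f"result_diagnosis_{day.isoformat()}_{slug}.md"
    return PurePosixPath("docs/diagnosis") / filename


print(default_diagnosis_path("Seed Variance!", date(2025, 1, 31)))
```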

Required output:

Result Diagnosis: [Short Name]

Result Snapshot
Expected vs Observed
Symptom Classification
Evidence Checked
Competing Explanations
Most Likely Diagnosis
Decision
Next Checks or Actions
Claim Impact
Project Memory Writeback

Step 8 - Write Back to Project Memory

If the project uses `research-project-memory`, update:

`memory/evidence-board.md`: observed result, limitations, and source paths
`memory/claim-board.md`: claims supported, weakened, revised, unsupported, or cut
`memory/risk-board.md`: bugs, metric risks, baseline risks, mechanism risks, or claim risks
`memory/action-board.md`: debug, rerun, ablation, method revision, writing, park, or kill actions
`memory/decision-log.md`: durable decisions such as killing a claim, changing method, or narrowing scope
worktree `.agent/worktree-status.md`: latest result and exit condition if a branch/worktree is involved

Use `observed` for verified results and `inferred` for explanations. Mark stale claims explicitly.
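
Tagging each writeback line at creation time keeps verified results and explanations from blurring together. The sketch below is illustrative only: the bracketed line layout, the `format_memory_entry` name, and the rule that `observed` entries must carry a source path are assumptions, so follow whatever format the project's memory files already use.

```python
def format_memory_entry(kind, text, source=None):
    """Format one writeback line tagged as `observed` or `inferred`.

    The line layout is an assumption for illustration, not a format
    defined by research-project-memory.
    """
    if kind not in ("observed", "inferred"):
        raise ValueError("kind must be 'observed' or 'inferred'")
    if kind == "observed" and not source:
        raise ValueError("observed entries need a source path")
    suffix = f" (source: {source})" if source else ""
    return f"- [{kind}] {text}{suffix}"


# Hypothetical entries for an evidence board.
print(format_memory_entry("observed", "no gain over baseline across 3 seeds",
                          "runs/exp1/metrics.json"))
print(format_memory_entry("inferred", "baseline is likely under-tuned"))
```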

Final Sanity Check

Before finalizing:

observed result and interpretation are separated
provenance and config are checked or listed as missing
metric direction and aggregation are clear
baseline fairness is addressed
implementation sanity checks are considered
seed variance and repeats are considered
mechanism diagnostic is checked when relevant
result is mapped to a concrete decision
paper claim impact is explicit
project memory is updated when present
