# Result Diagnosis
Diagnose what an experiment result means for the project. This skill is for decision-making after results exist, especially when they are negative, surprising, unstable, or hard to interpret.
Use this skill when:
- a method does not improve over baseline
- results vary strongly across seeds
- a metric improves but another metric worsens
- a baseline unexpectedly wins
- a plot or table looks suspicious
- a result may be caused by an implementation bug, metric bug, data issue, or unfair comparison
- early experiments suggest revising the algorithm or paper claim
- the user asks "what does this result mean?" or "what should we do next?"
Do not use this skill to write a polished report; pair it with a report-writing skill after the diagnosis is clear.
Pair this skill with:
- a project-memory skill when the diagnosis should update claims, evidence, risks, actions, or worktree status
- a report-writing skill when results need a shareable report
- a method-revision skill when the diagnosis points to method revision
- experiment-design-planner when the diagnosis requires a new controlled experiment
- an experiment-execution skill when the next step is a rerun, sanity check, or ablation
- conference-writing-adapter when the right action is to narrow or reframe paper claims
## Skill Directory Layout

```text
<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── diagnosis-taxonomy.md
    ├── evidence-audit.md
    ├── next-decision-rules.md
    ├── report-template.md
    └── triage-protocol.md
```
## Progressive Loading
- Always read `references/diagnosis-taxonomy.md`, `references/triage-protocol.md`, and `references/next-decision-rules.md`.
- Read `references/evidence-audit.md` when inspecting logs, configs, metrics, plots, runs, or code state.
- Use `references/report-template.md` for full diagnosis reports.
- If a result depends on current SOTA, benchmark conventions, or recent baseline performance, verify current sources with web search or user-provided papers.
## Core Principles
- Diagnose before optimizing.
- Separate observed result from interpretation.
- Prefer simple sanity checks before expensive reruns.
- Treat negative results as information: they may kill a claim, not the whole project.
- Do not blame the algorithm before checking implementation, data, metric, baseline, and selection rules.
- Do not keep blaming the implementation once repeated, controlled evidence falsifies the claim.
- Every diagnosis should end with a decision: debug, rerun, ablate, revise method, narrow claim, write, park, or kill.
- Record uncertainty explicitly.
## Step 1 - Define the Result and Expected Behavior
Extract:
- experiment question and linked claim
- method and baseline
- dataset/split
- metrics and expected direction
- observed result
- number of seeds/repeats
- configs, commit, logs, tables, and figures
- what result was expected and why
- whether this result affects paper claims or only internal debugging
Rewrite vague input into:

```text
Expected [method] to improve [metric/diagnostic] over [baseline] on [setting], but observed [result] under [controls].
```
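For instance, a hypothetical filled-in version (method, metric, and numbers are all illustrative):

```text
Expected cosine-warmup fine-tuning to improve F1 over the frozen-encoder baseline on the low-data split, but observed a 0.4-point F1 drop under a matched tuning budget and 5 seeds.
```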
If expected behavior was never defined, route back to experiment-design-planner.
## Step 2 - Classify the Symptom
Read `references/diagnosis-taxonomy.md`.
Classify the primary symptom:
- no improvement
- regression
- instability or high variance
- metric conflict
- suspiciously large gain
- baseline unexpectedly strong
- diagnostic/performance mismatch
- training failure or divergence
- reproducibility failure
- plot/table inconsistency
- result contradicts paper story
Then classify likely diagnosis categories:
- implementation bug
- metric/evaluation bug
- data/split/preprocessing issue
- unfair baseline or tuning issue
- seed variance or insufficient repeats
- optimization/hyperparameter issue
- method mechanism failure
- scale/regime mismatch
- claim/evidence mismatch
- expected negative result
## Step 3 - Gather Evidence
Read `references/evidence-audit.md`.
Prefer primary artifacts:
- config diffs
- run commands
- git commit
- logs and stderr
- metric files
- checkpoints
- seeds
- dataset versions and split hashes
- plots and tables
- previous baseline runs
- implementation changes
Mark missing evidence rather than guessing.
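Part of this evidence pass can be automated. A minimal sketch, assuming each run directory holds `config.json`, `metrics.json`, and `stderr.log` (file names are illustrative, not a required layout):

```python
import hashlib
import pathlib

def audit_run(run_dir: str) -> dict:
    """Hash the primary artifacts of a run; mark anything missing."""
    root = pathlib.Path(run_dir)
    evidence = {}
    for name in ["config.json", "metrics.json", "stderr.log"]:
        path = root / name
        if path.exists():
            # A short content hash lets two runs be compared for provenance.
            evidence[name] = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
        else:
            evidence[name] = "MISSING"  # record the gap instead of guessing
    return evidence
```

Entries marked `MISSING` go straight into the diagnosis as absent evidence rather than being silently skipped.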
## Step 4 - Run Triage
Read `references/triage-protocol.md`.
Use this order:
- Reproducibility and provenance: correct commit, config, data, seed, output path.
- Metric and evaluation: metric direction, aggregation, split, leakage, postprocessing.
- Baseline fairness: same budget, tuning, checkpoint rule, data, sampler, and code path.
- Implementation sanity: feature flag, tensor shapes, gradient flow, loss scale, train/eval mode.
- Statistical stability: seeds, variance, confidence intervals, outliers.
- Mechanism diagnostic: whether the intended mechanism changed.
- Claim alignment: whether the result supports, weakens, or falsifies the paper claim.
Stop early only when a blocking bug or invalid comparison is found.
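For the statistical-stability check, a rough normal-approximation interval over per-seed scores is often enough to separate noise from signal. A sketch with illustrative numbers:

```python
import statistics

def seed_summary(scores):
    """Mean, sample std dev, and approximate 95% interval across seeds."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    half = 1.96 * sd / len(scores) ** 0.5
    return mean, sd, (mean - half, mean + half)

method = [71.2, 70.8, 69.9, 71.5, 70.1]    # illustrative per-seed scores
baseline = [70.6, 70.9, 70.2, 70.4, 70.7]
_, _, m_ci = seed_summary(method)
_, _, b_ci = seed_summary(baseline)
# Heavily overlapping intervals suggest the gap may be seed noise.
print(m_ci, b_ci)
```

With fewer than about five seeds the normal approximation is optimistic; treat the interval as a triage signal, not a significance test.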
## Step 5 - Build Competing Explanations
For each plausible explanation, state:
- evidence for it
- evidence against it
- cheapest test that would distinguish it
- decision if true
At minimum consider:
- bug
- bad metric
- weak experiment design
- baseline too strong or under-tuned
- hyperparameter issue
- mechanism false
- claim too broad
## Step 6 - Choose Next Decision
Read `references/next-decision-rules.md`.
Choose one primary decision:
- debug: result is not trustworthy until a bug or provenance issue is resolved
- rerun: result is plausible but underpowered or missing controls
- ablate: result needs mechanism isolation
- revise method: mechanism likely needs design change
- narrow claim: evidence supports a smaller or different claim
- write: evidence is trustworthy enough to report
- park: result is inconclusive and not worth immediate compute
- kill: claim or direction is falsified under fair controls
Do not pick write if basic provenance or fairness is unresolved.
## Step 7 - Write the Diagnosis
Use `references/report-template.md` for full reports.
If saving to a project and no path is given, use:

```text
docs/diagnosis/result_diagnosis_YYYY-MM-DD_<short-name>.md
```
Required output:

```markdown
# Result Diagnosis: [Short Name]
## Result Snapshot
## Expected vs Observed
## Symptom Classification
## Evidence Checked
## Competing Explanations
## Most Likely Diagnosis
## Decision
## Next Checks or Actions
## Claim Impact
## Project Memory Writeback
```
## Step 8 - Write Back to Project Memory
If the project uses a memory system, update:
- evidence: observed result, limitations, and source paths
- claims: claims supported, weakened, revised, unsupported, or cut
- risks: bugs, metric risks, baseline risks, mechanism risks, or claim risks
- actions: debug, rerun, ablation, method revision, writing, park, or kill actions
- decisions: durable decisions such as killing a claim, changing method, or narrowing scope
- worktree `.agent/worktree-status.md`: latest result and exit condition if a branch/worktree is involved
Record verified results separately from explanatory hypotheses, and mark stale claims explicitly.
## Final Sanity Check
Before finalizing:
- observed result and interpretation are separated
- provenance and config are checked or listed as missing
- metric direction and aggregation are clear
- baseline fairness is addressed
- implementation sanity checks are considered
- seed variance and repeats are considered
- mechanism diagnostic is checked when relevant
- result is mapped to a concrete decision
- paper claim impact is explicit
- project memory is updated when present