Audit experiment integrity before claiming results. Uses cross-model review (GPT-5.4) to check for fake ground truth, score-normalization fraud, phantom results, and insufficient scope. Use when the user says "审计实验" (audit the experiment), "check experiment integrity", "audit results", or "实验诚实度" (experiment honesty), or after experiments complete, before claims are written.
npx skill4agent add wanshuiyin/auto-claude-code-research-in-sleep experiment-audit

References: shared-references/reviewer-independence.md, shared-references/experiment-integrity.md
Reviewer: oracle-pro (via codex; routing: shared-references/reviewer-routing.md)

Scan the project directory for:
1. Evaluation scripts: *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files: *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths: look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker: EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims: NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files: *.yaml, *.toml, *.json configs with metric definitions

mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "xhigh"}
sandbox: read-only
cwd: [project directory]
prompt: |
You are an experiment integrity auditor. Read ALL files listed below
and check for the following fraud patterns.
Files to read:
- Evaluation scripts: [list paths]
- Result files: [list paths]
- Experiment tracker: [list paths]
- Paper claims: [list paths]
- Config files: [list paths]
## Audit Checklist
### A. Ground Truth Provenance
For each evaluation script:
1. Where does "ground truth" / "reference" / "target" come from?
2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
3. If derived: is it explicitly labeled as proxy evaluation?
4. Are official eval scripts used when available for this benchmark?
FAIL if: GT is derived from model outputs without explicit proxy labeling.
### B. Score Normalization
For each metric computation:
1. Is any metric divided by max/min/mean of the model's OWN output?
2. Are raw scores reported alongside any normalized scores?
3. Are any scores suspiciously close to 1.0 or 100%?
FAIL if: Normalization denominator comes from prediction statistics.
### C. Result File Existence
For each claim in the paper/narrative:
1. Does the referenced result file actually exist?
2. Does the claimed metric key exist in that file?
3. Does the claimed NUMBER match what's in the file?
4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
FAIL if: Claimed results reference nonexistent files or mismatched numbers.
### D. Dead Code Detection
For each metric function defined in eval scripts:
1. Is it actually CALLED in any evaluation pipeline?
2. Does its output appear in any result file?
WARN if: Metric functions exist but are never called.
### E. Scope Assessment
1. How many scenes/datasets/configurations were actually tested?
2. How many seeds/runs per configuration?
3. Does the paper use words like "comprehensive", "extensive", "robust"?
4. Is the actual scope sufficient for those claims?
WARN if: Scope language exceeds actual evidence.
### F. Evaluation Type Classification
Classify each evaluation as:
- real_gt: uses dataset-provided ground truth
- synthetic_proxy: uses model-generated reference
- self_supervised_proxy: no GT by design
- simulation_only: simulated environment
- human_eval: human judges
## Output Format
For each check (A-F), report:
- Status: PASS | WARN | FAIL
- Evidence: exact file:line references
- Details: what specifically was found
Overall verdict: PASS | WARN | FAIL
Be thorough. Read every eval script line by line.

Output file: EXPERIMENT_AUDIT.md

# Experiment Audit Report
**Date**: [today]
**Auditor**: GPT-5.4 xhigh (cross-model, read-only)
**Project**: [project name]
## Overall Verdict: [PASS | WARN | FAIL]
## Integrity Status: [pass | warn | fail]
## Checks
### A. Ground Truth Provenance: [PASS|WARN|FAIL]
[details + file:line evidence]
### B. Score Normalization: [PASS|WARN|FAIL]
[details]
### C. Result File Existence: [PASS|WARN|FAIL]
[details]
### D. Dead Code Detection: [PASS|WARN|FAIL]
[details]
### E. Scope Assessment: [PASS|WARN|FAIL]
[details]
### F. Evaluation Type: [real_gt | synthetic_proxy | ...]
[classification + evidence]
## Action Items
- [specific fixes if WARN or FAIL]
## Claim Impact
- Claim 1: [supported | needs qualifier | unsupported]
- Claim 2: ...

Output file: EXPERIMENT_AUDIT.json

{
"date": "2026-04-10",
"auditor": "gpt-5.4-xhigh",
"overall_verdict": "warn",
"integrity_status": "warn",
"checks": {
"gt_provenance": {"status": "pass", "details": "..."},
"score_normalization": {"status": "warn", "details": "..."},
"result_existence": {"status": "pass", "details": "..."},
"dead_code": {"status": "pass", "details": "..."},
"scope": {"status": "warn", "details": "..."},
"eval_type": "real_gt"
},
"claims": [
{"id": "C1", "impact": "supported"},
{"id": "C2", "impact": "needs_qualifier"}
]
}

Console summary:

🔬 Experiment Audit Complete
GT Provenance: ✅ PASS — real dataset GT used
Score Normalization: ⚠️ WARN — boundary metric uses self-reference
Result Existence: ✅ PASS — all files exist, numbers match
Dead Code: ✅ PASS — all metric functions called
Scope: ⚠️ WARN — 2 scenes, paper says "comprehensive"
Overall: ⚠️ WARN
See EXPERIMENT_AUDIT.md for details.

Integration with /experiment-bridge and /auto-review-loop:

/experiment-bridge → results ready
↓
/experiment-audit (automatic, advisory)
├── PASS → continue normally
├── WARN → print ⚠️ warning, continue, tag claims as [INTEGRITY: WARN]
└── FAIL → print 🔴 alert, continue, tag claims as [INTEGRITY CONCERN]
↓
/auto-review-loop → proceeds with integrity tags visible to reviewer

if EXPERIMENT_AUDIT.json exists:
    read integrity_status
    attach to verdict: {claim_supported: "yes", integrity_status: "warn"}
    if integrity_status == "fail":
        downgrade verdict display: "yes [INTEGRITY CONCERN]"
else:
    verdict as normal, integrity_status = "unavailable"
    mark as "provisional — no integrity audit"

if EXPERIMENT_AUDIT.json exists AND integrity_status == "fail":
    add footnote to affected claims: "Note: integrity audit flagged concerns with this evaluation"

Tools: mcp__codex__codex, mcp__codex__codex-reply
Tracing: shared-references/review-tracing.md, tools/save_trace.sh
Traces saved to: .aris/traces/<skill>/<date>_run<NN>/ (trace:full)
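The integration pseudocode above can be sketched in Python. The `EXPERIMENT_AUDIT.json` filename and `integrity_status` field follow the JSON template earlier in this document; the function name and its return convention are illustrative, not part of the skill:

```python
import json
from pathlib import Path

def integrity_tag(audit_path="EXPERIMENT_AUDIT.json"):
    """Map the audit file (if any) to (integrity_status, verdict display suffix)."""
    path = Path(audit_path)
    if not path.exists():
        # No audit ran: the verdict stands but is marked provisional.
        return "unavailable", 'provisional — no integrity audit'
    status = json.loads(path.read_text()).get("integrity_status", "unavailable")
    if status == "fail":
        # Advisory, not blocking: only the displayed verdict is downgraded.
        return status, "[INTEGRITY CONCERN]"
    if status == "warn":
        return status, "[INTEGRITY: WARN]"
    return status, ""
```

Note that every branch returns rather than raising, matching the advisory design: the audit annotates claims but never halts the review loop.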
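Check C in the audit checklist (result-file existence, metric key, number match) is mechanical enough to automate outside the reviewer. A minimal sketch, assuming claims arrive as a (file, key, value) triple and result files are flat JSON; the function name and tolerance parameter are illustrative:

```python
import json
from pathlib import Path

def verify_claim(result_file, metric_key, claimed_value, tol=1e-6):
    """Check C: the result file exists, the metric key exists, the number matches."""
    path = Path(result_file)
    if not path.exists():
        return "FAIL", f"{result_file} does not exist"
    results = json.loads(path.read_text())
    if metric_key not in results:
        return "FAIL", f"key '{metric_key}' not in {result_file}"
    actual = results[metric_key]
    if abs(actual - claimed_value) > tol:
        return "FAIL", f"claimed {claimed_value}, file contains {actual}"
    return "PASS", "claim matches result file"
```

A numeric tolerance rather than exact equality avoids false FAILs from rounding in the paper text.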