Scan project directory for:
1. Evaluation scripts: *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files: *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths: look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker: EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims: NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files: *.yaml, *.toml, *.json configs with metric definitions
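A minimal sketch of this scan using only the Python stdlib (directory names and glob patterns taken from the list above; the traversal details and return shape are assumptions of this sketch, not the skill's actual implementation):

```python
from pathlib import Path

# Patterns from the scan checklist above; paper-claim and config globs
# (NARRATIVE_REPORT.md, paper/sections/*.tex, *.yaml, ...) follow the same pattern.
EVAL_PATTERNS = ("*eval*.py", "*metric*.py", "*test*.py", "*benchmark*.py")
RESULT_DIRS = ("results", "outputs", "logs")
TRACKERS = ("EXPERIMENT_TRACKER.md", "EXPERIMENT_LOG.md")

def scan(root: str) -> dict:
    """Collect candidate files for the integrity audit."""
    root_path = Path(root)
    found = {"eval_scripts": [], "result_files": [], "trackers": []}
    for pattern in EVAL_PATTERNS:
        found["eval_scripts"] += sorted(str(p) for p in root_path.rglob(pattern))
    for d in RESULT_DIRS:
        sub = root_path / d
        if sub.is_dir():
            for ext in ("*.json", "*.csv"):
                found["result_files"] += sorted(str(p) for p in sub.rglob(ext))
    for name in TRACKERS:
        if (root_path / name).is_file():
            found["trackers"].append(str(root_path / name))
    return found
```

The collected paths are what gets substituted into the `[list paths]` placeholders of the auditor prompt below.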
mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "xhigh"}
sandbox: read-only
cwd: [project directory]
prompt: |
You are an experiment integrity auditor. Read ALL files listed below
and check for the following fraud patterns.
Files to read:
- Evaluation scripts: [list paths]
- Result files: [list paths]
- Experiment tracker: [list paths]
- Paper claims: [list paths]
- Config files: [list paths]
## Audit Checklist
### A. Ground Truth Provenance
For each evaluation script:
1. Where does "ground truth" / "reference" / "target" come from?
2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
3. If derived: is it explicitly labeled as proxy evaluation?
4. Are official eval scripts used when available for this benchmark?
FAIL if: GT is derived from model outputs without explicit proxy labeling.
### B. Score Normalization
For each metric computation:
1. Is any metric divided by max/min/mean of the model's OWN output?
2. Are raw scores reported alongside any normalized scores?
3. Are any scores suspiciously close to 1.0 or 100%?
FAIL if: Normalization denominator comes from prediction statistics.
### C. Result File Existence
For each claim in the paper/narrative:
1. Does the referenced result file actually exist?
2. Does the claimed metric key exist in that file?
3. Does the claimed NUMBER match what's in the file?
4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
FAIL if: Claimed results reference nonexistent files or mismatched numbers.
### D. Dead Code Detection
For each metric function defined in eval scripts:
1. Is it actually CALLED in any evaluation pipeline?
2. Does its output appear in any result file?
WARN if: Metric functions exist but are never called.
### E. Scope Assessment
1. How many scenes/datasets/configurations were actually tested?
2. How many seeds/runs per configuration?
3. Does the paper use words like "comprehensive", "extensive", "robust"?
4. Is the actual scope sufficient for those claims?
WARN if: Scope language exceeds actual evidence.
### F. Evaluation Type Classification
Classify each evaluation as:
- real_gt: uses dataset-provided ground truth
- synthetic_proxy: uses model-generated reference
- self_supervised_proxy: no GT by design
- simulation_only: simulated environment
- human_eval: human judges
## Output Format
For each check (A-F), report:
- Status: PASS | WARN | FAIL
- Evidence: exact file:line references
- Details: what specifically was found
Overall verdict: PASS | WARN | FAIL
Be thorough. Read every eval script line by line.

Write the audit report to `EXPERIMENT_AUDIT.md`.
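The self-normalization pattern in check B lends itself to mechanical screening. A minimal AST sketch, assuming prediction arrays live in variables whose names contain `pred` (that naming heuristic is an assumption of this sketch, not part of the audit prompt):

```python
import ast

SELF_STATS = {"max", "min", "mean"}

def _mentions_pred(node: ast.AST) -> bool:
    # Heuristic: any variable whose name contains "pred" counts as model output
    return any(isinstance(n, ast.Name) and "pred" in n.id for n in ast.walk(node))

def find_self_normalization(source: str) -> list[int]:
    """Line numbers where a value is divided by max/min/mean of a pred* variable,
    e.g. `score / preds.max()` or `err / np.mean(pred_errors)`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div):
            denom = node.right
            if (isinstance(denom, ast.Call)
                    and isinstance(denom.func, ast.Attribute)
                    and denom.func.attr in SELF_STATS
                    and _mentions_pred(denom)):
                hits.append(node.lineno)
    return hits
```

A hit is not automatically fraud (dividing by `gt.max()` is fine; dividing by `preds.max()` is the FAIL condition), so flagged lines still need the manual provenance check from section A.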
Also write `EXPERIMENT_AUDIT.json` for machine consumption:
```json
{
"date": "2026-04-10",
"auditor": "gpt-5.4-xhigh",
"overall_verdict": "warn",
"integrity_status": "warn",
"checks": {
"gt_provenance": {"status": "pass", "details": "..."},
"score_normalization": {"status": "warn", "details": "..."},
"result_existence": {"status": "pass", "details": "..."},
"dead_code": {"status": "pass", "details": "..."},
"scope": {"status": "warn", "details": "..."},
"eval_type": "real_gt"
},
"claims": [
{"id": "C1", "impact": "supported"},
{"id": "C2", "impact": "needs_qualifier"}
]
}
```
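Downstream consumers can validate the payload before trusting it. A minimal sketch against the field names in the example above (the strictness rules are assumptions of this sketch):

```python
import json

ALLOWED_STATUS = {"pass", "warn", "fail"}
REQUIRED = ("date", "auditor", "overall_verdict", "integrity_status", "checks", "claims")

def validate_audit(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the payload looks well-formed."""
    audit = json.loads(raw)
    problems = [f"missing field: {f}" for f in REQUIRED if f not in audit]
    for field in ("overall_verdict", "integrity_status"):
        if audit.get(field) not in ALLOWED_STATUS:
            problems.append(f"{field} must be one of {sorted(ALLOWED_STATUS)}")
    for name, check in audit.get("checks", {}).items():
        if name == "eval_type":
            continue  # plain string classification, not a status object
        if not isinstance(check, dict) or check.get("status") not in ALLOWED_STATUS:
            problems.append(f"check '{name}': bad or missing status")
    return problems
```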
🔬 Experiment Audit Complete
GT Provenance: ✅ PASS — real dataset GT used
Score Normalization: ⚠️ WARN — boundary metric uses self-reference
Result Existence: ✅ PASS — all files exist, numbers match
Dead Code: ✅ PASS — all metric functions called
Scope: ⚠️ WARN — 2 scenes, paper says "comprehensive"
Overall: ⚠️ WARN
See EXPERIMENT_AUDIT.md for details.

Integration with /experiment-bridge and /auto-review-loop:

/experiment-bridge → results ready
↓
/experiment-audit (automatic, advisory)
├── PASS → continue normally
├── WARN → print ⚠️ warning, continue, tag claims as [INTEGRITY: WARN]
└── FAIL → print 🔴 alert, continue, tag claims as [INTEGRITY CONCERN]
↓
/auto-review-loop → proceeds with integrity tags visible to reviewer
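The advisory routing above (never block, only annotate) reduces to a tiny mapping; a minimal sketch using the tag strings from the diagram:

```python
AUDIT_TAGS = {
    "pass": "",                      # continue normally, no tag
    "warn": " [INTEGRITY: WARN]",    # continue, but tag claims
    "fail": " [INTEGRITY CONCERN]",  # continue, alert the reviewer via tag
}

def tag_claims(claims: list[str], audit_status: str) -> list[str]:
    """Annotate claim texts with the audit outcome; the pipeline itself never stops."""
    return [claim + AUDIT_TAGS[audit_status] for claim in claims]
```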
if EXPERIMENT_AUDIT.json exists:
    read integrity_status
    attach to verdict: {claim_supported: "yes", integrity_status: "warn"}
    if integrity_status == "fail":
        downgrade verdict display: "yes [INTEGRITY CONCERN]"
else:
    verdict as normal, integrity_status = "unavailable"
    mark as "provisional — no integrity audit"

if EXPERIMENT_AUDIT.json exists AND integrity_status == "fail":
    add footnote to affected claims: "Note: integrity audit flagged concerns with this evaluation"
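The reviewer-side pseudocode above can be sketched as a small Python helper (file name and keys as in `EXPERIMENT_AUDIT.json`; the `display` and `note` fields are assumptions of this sketch):

```python
import json
from pathlib import Path

def attach_integrity(verdict: dict, audit_path: str = "EXPERIMENT_AUDIT.json") -> dict:
    """Annotate a reviewer verdict with the audit's integrity_status.
    Advisory only: the verdict is tagged or marked provisional, never blocked."""
    path = Path(audit_path)
    if path.is_file():
        status = json.loads(path.read_text()).get("integrity_status", "unavailable")
        verdict["integrity_status"] = status
        if status == "fail":
            # Downgrade the displayed verdict, keep the underlying judgment
            verdict["display"] = f"{verdict['claim_supported']} [INTEGRITY CONCERN]"
    else:
        verdict["integrity_status"] = "unavailable"
        verdict["note"] = "provisional — no integrity audit"
    return verdict
```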