codex-readiness-unit-test
LLM Codex Readiness Unit Test
Instruction-first, in-session "readiness" check for evaluating AGENTS/PLANS documentation quality without any external APIs or SDKs. All checks run against the current working directory (cwd), with no monorepo discovery. Each run writes to .codex-readiness-unit-test/<timestamp>/ and updates .codex-readiness-unit-test/latest.json. Keep execution deterministic (filesystem scanning + local command execution only). All LLM evaluation happens in-session and must output strict JSON via the provided references.
Quick Start
- Collect evidence:
python skills/codex-readiness-unit-test/bin/collect_evidence.py
- Run deterministic checks:
python skills/codex-readiness-unit-test/bin/deterministic_rules.py
- Run LLM checks using references in references/ and store the results in .codex-readiness-unit-test/<timestamp>/llm_results.json
- If execute mode is requested, build a plan, get confirmation, then run:
python skills/codex-readiness-unit-test/bin/run_plan.py --plan .codex-readiness-unit-test/<timestamp>/plan.json
- Generate the report:
python skills/codex-readiness-unit-test/bin/scoring.py --mode read-only|execute
Outputs (per run, under .codex-readiness-unit-test/<timestamp>/):
- report.json
- report.html
- summary.json
- logs/* (execute mode)
Runbook
This skill produces a deterministic evidence file plus an in-session LLM evaluation, then compiles a JSON report and HTML scorecard. It requires no OpenAI API key and makes no external HTTP calls.
Minimal Inputs
- mode: read-only or execute (required)
- soft_timeout_seconds: optional (default 600)
Modes (Read-only vs Execute)
- Read-only: Collect evidence, run deterministic rules, and run LLM checks #3–#5. No commands are executed, check #6 is marked NOT_RUN, and no execution logs/summary are produced.
- Execute: Everything in read-only, plus a confirmed plan.json is executed via run_plan.py. This enables check #6 and produces execution logs + execution_summary.json for scoring.
Always ask the user which mode to run (read-only vs. execute) before proceeding.
Check Types
- Deterministic: filesystem-only checks (#1 AGENTS.md exists, #2 PLANS.md exists, #3 AGENTS.md <= 300 lines, #4 config.toml exists at repo root, repo .codex/, or user .codex/)
- LLM: in-session Codex evaluation (#3 project context, #4 commands, #5 loops; commands may live in AGENTS or referenced skills)
- Hybrid: deterministic execution + LLM rationale (#6 execution)
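The filesystem-only checks listed above can be sketched roughly as follows. This is an illustration, not the contents of deterministic_rules.py: the function name, the result keys, and the choice to report a missing file as FAIL are all assumptions, and the user-level config location is assumed to be a .codex/ directory under the home directory passed in.

```python
from pathlib import Path


def run_deterministic_checks(cwd: Path, home: Path) -> dict[str, str]:
    """Return PASS/FAIL per filesystem-only check (hypothetical key names)."""
    agents = cwd / "AGENTS.md"
    agents_exists = agents.is_file()
    # The line limit is only meaningful when AGENTS.md exists.
    within_limit = (
        agents_exists
        and len(agents.read_text(encoding="utf-8").splitlines()) <= 300
    )
    # config.toml may live at the repo root, in repo .codex/, or in user .codex/.
    config_candidates = [
        cwd / "config.toml",
        cwd / ".codex" / "config.toml",
        home / ".codex" / "config.toml",
    ]
    outcomes = {
        "agents_md_exists": agents_exists,
        "plans_md_exists": (cwd / "PLANS.md").is_file(),
        "agents_md_max_300_lines": within_limit,
        "config_toml_exists": any(p.is_file() for p in config_candidates),
    }
    return {name: "PASS" if ok else "FAIL" for name, ok in outcomes.items()}
```

Calling run_deterministic_checks(Path.cwd(), Path.home()) would mirror the "current working directory only" rule, since nothing outside the two given directories is scanned.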
Skill references are discovered from AGENTS.md via $SkillName or .codex/skills/<name> patterns; their SKILL.md files are added to evidence for the LLM checks.
All checks run relative to the current working directory and are defined in skills/codex-readiness-unit-test/references/checks/checks.json, weighted equally by default. Each run writes outputs to .codex-readiness-unit-test/<timestamp>/ and updates .codex-readiness-unit-test/latest.json.
The helper scripts read .codex-readiness-unit-test/latest.json by default to locate the latest run directory.
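As a rough illustration of the skill-reference discovery described above, the two patterns could be matched with regular expressions. The exact patterns the evidence collector uses are not given here, so both expressions and the function name below are assumptions.

```python
import re

# Hypothetical patterns: "$SkillName" tokens and ".codex/skills/<name>" paths.
SKILL_TOKEN = re.compile(r"\$([A-Za-z][\w-]*)")
SKILL_PATH = re.compile(r"\.codex/skills/([\w-]+)")


def find_skill_references(agents_text: str) -> list[str]:
    """Return referenced skill names, de-duplicated in first-seen order."""
    names = SKILL_TOKEN.findall(agents_text) + SKILL_PATH.findall(agents_text)
    # dict.fromkeys drops repeats while preserving insertion order.
    return list(dict.fromkeys(names))
```

A pattern this loose would also pick up shell variables such as $HOME, so a real collector would presumably confirm that a matching SKILL.md exists before adding it to evidence.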
Strict JSON + Retry Loop (Required)
For each LLM/HYBRID check:
- Run the specialized prompt expecting strict JSON.
- If JSON is invalid or missing keys, run skills/codex-readiness-unit-test/references/json_fix.md with the raw output.
- Retry up to 2 additional attempts (max 3 total).
- If still invalid: mark the check as WARN with rationale: "Invalid JSON from evaluator after retries".
The JSON schema is:
```json
{
  "status": "PASS|WARN|FAIL|NOT_RUN",
  "rationale": "string",
  "evidence_quotes": [{"path":"...","quote":"..."}],
  "recommendations": ["..."],
  "confidence": 0.0
}
```
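The retry loop above can be sketched as below. The two callables stand in for the in-session steps (running the specialized prompt, and running json_fix.md against the raw output); their names and signatures are hypothetical, and checking the status value against the four allowed strings is an added assumption beyond "invalid or missing keys".

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"status", "rationale", "evidence_quotes", "recommendations", "confidence"}
VALID_STATUSES = ("PASS", "WARN", "FAIL", "NOT_RUN")
MAX_ATTEMPTS = 3  # first attempt plus up to 2 retries


def parse_result(raw: str) -> Optional[dict]:
    """Return the parsed result, or None if it is not valid strict JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    if data["status"] not in VALID_STATUSES:  # assumption: status is validated too
        return None
    return data


def evaluate_with_retries(run_prompt: Callable[[], str],
                          run_fix: Callable[[str], str]) -> dict:
    """Run the prompt once, then pass invalid output to the fixer up to 2 times."""
    raw = run_prompt()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = parse_result(raw)
        if result is not None:
            return result
        if attempt < MAX_ATTEMPTS:
            raw = run_fix(raw)
    # All three attempts failed: fall back to the documented WARN result.
    return {
        "status": "WARN",
        "rationale": "Invalid JSON from evaluator after retries",
        "evidence_quotes": [],
        "recommendations": [],
        "confidence": 0.0,
    }
```

The fallback fills the remaining schema keys with empty values so that downstream scoring can treat it like any other result; the source only specifies the status and rationale.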
Single Confirmation (Required)
Combine the command summary and execute plan into one concise confirmation step. Present:
- The extracted build/test/dev loop commands (human-readable, labeled).
- The planned execute details (cwd, ordered commands, soft timeout policy, env).
Ask for a single confirmation to proceed. Do not paste raw JSON, full evidence, or the full plan.json. If declined, mark execute-required checks as NOT_RUN.
Required Files
- .codex-readiness-unit-test/<timestamp>/evidence.json (from collect_evidence.py)
- .codex-readiness-unit-test/<timestamp>/deterministic_results.json (from deterministic_rules.py)
- .codex-readiness-unit-test/<timestamp>/llm_results.json (from in-session references)
- .codex-readiness-unit-test/<timestamp>/execution_summary.json (execute mode only)
- .codex-readiness-unit-test/<timestamp>/report.json and .codex-readiness-unit-test/<timestamp>/report.html (from scoring.py)
- .codex-readiness-unit-test/<timestamp>/summary.json (structured pass/fail summary from scoring.py)
- .codex-readiness-unit-test/latest.json (stable pointer to the latest run directory)
Prompt Mapping
- #3 project_context_specified → skills/codex-readiness-unit-test/references/project_context.md
- #4 build_test_commands_exist → skills/codex-readiness-unit-test/references/commands.md
- #5 dev_build_test_loops_documented → skills/codex-readiness-unit-test/references/loop_quality.md
- #6 dev_build_test_loop_execution → skills/codex-readiness-unit-test/references/execution_explanation.md
plan.json schema (execute mode)
```json
{
  "project_dir": "relative/or/absolute/path (optional)",
  "cwd": "optional/absolute/path (defaults to current directory)",
  "commands": [
    {"label": "setup", "cmd": "npm install"},
    {"label": "build", "cmd": "npm run build"},
    {"label": "test", "cmd": "npm test"}
  ],
  "env": {
    "EXAMPLE": "value"
  }
}
```
Place plan.json inside the run directory (e.g., .codex-readiness-unit-test/<timestamp>/plan.json).
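Before asking for confirmation it may help to sanity-check a plan against the schema above. The sketch below is not part of run_plan.py; the function name and the specific validation rules (a non-empty commands list, string label and cmd, string-valued env) are assumptions inferred from the example.

```python
import json
from pathlib import Path


def load_plan(plan_path: Path) -> dict:
    """Load plan.json and raise ValueError if it does not match the example shape."""
    plan = json.loads(plan_path.read_text(encoding="utf-8"))
    if not isinstance(plan, dict):
        raise ValueError("plan.json must contain a JSON object")
    commands = plan.get("commands")
    if not isinstance(commands, list) or not commands:
        raise ValueError("plan.json needs a non-empty 'commands' list")
    for index, entry in enumerate(commands):
        for key in ("label", "cmd"):
            value = entry.get(key) if isinstance(entry, dict) else None
            if not isinstance(value, str) or not value:
                raise ValueError(f"commands[{index}] needs a non-empty string '{key}'")
    env = plan.get("env", {})
    if not isinstance(env, dict) or not all(isinstance(v, str) for v in env.values()):
        raise ValueError("'env' must map names to string values")
    return plan
```

The ordered labels from the returned plan (setup, build, test in the example) are what the single confirmation step would present to the user.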
llm_results.json schema
```json
{
  "project_context_specified": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.7},
  "build_test_commands_exist": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.7},
  "dev_build_test_loops_documented": {"status":"WARN","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.6},
  "dev_build_test_loop_execution": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.6}
}
```
Scoring Rules
- PASS = 100% of weight
- WARN = 50% of weight
- FAIL/NOT_RUN = 0%
- Overall status: FAIL if any FAIL; else WARN if any WARN or NOT_RUN; else PASS.
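The rules above translate into a small calculation. The sketch below assumes equal weights unless others are supplied and reports the score as a percentage of total weight; that normalization and the function name are assumptions, not taken from scoring.py.

```python
from typing import Optional

# Fraction of a check's weight earned per status.
WEIGHT_FRACTION = {"PASS": 1.0, "WARN": 0.5, "FAIL": 0.0, "NOT_RUN": 0.0}


def score(statuses: dict[str, str],
          weights: Optional[dict[str, float]] = None) -> tuple[float, str]:
    """Return (percent of weight earned, overall status) for a set of checks."""
    if weights is None:
        weights = {name: 1.0 for name in statuses}  # equal weighting by default
    total = sum(weights[name] for name in statuses)
    earned = sum(weights[name] * WEIGHT_FRACTION[status]
                 for name, status in statuses.items())
    percent = 100.0 * earned / total if total else 0.0

    seen = set(statuses.values())
    if "FAIL" in seen:
        overall = "FAIL"
    elif seen & {"WARN", "NOT_RUN"}:
        overall = "WARN"
    else:
        overall = "PASS"
    return percent, overall
```

For example, two PASS checks, one WARN, and one NOT_RUN under equal weights earn 2.5 of 4 units, which is 62.5%, and the overall status is WARN because nothing failed.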
Safety + Timeouts
- Denylisted commands are not executed and marked FAIL.
- Soft timeout defaults to 600s; hard cap defaults to 3x soft timeout.
- Execution logs are written to .codex-readiness-unit-test/<timestamp>/logs/.
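One way to picture the denylist and the two timeout tiers is the sketch below. It is not run_plan.py: the denylist entries are hypothetical, the command is split with shlex rather than passed to a shell, and recording a soft-timeout overrun as a flag (rather than changing the status) is an assumption, since the soft timeout policy is not spelled out here.

```python
import shlex
import subprocess
import time

DENYLIST = ("sudo", "shutdown", "reboot")  # hypothetical entries
HARD_CAP_MULTIPLIER = 3  # hard cap defaults to 3x the soft timeout


def run_command(cmd: str, soft_timeout: float = 600.0) -> dict:
    """Run one plan command, refusing denylisted programs and enforcing the hard cap."""
    argv = shlex.split(cmd)
    if not argv:
        return {"status": "FAIL", "executed": False, "reason": "empty command"}
    if argv[0] in DENYLIST:
        # Denylisted commands are never executed and are marked FAIL.
        return {"status": "FAIL", "executed": False, "reason": "denylisted"}

    hard_cap = HARD_CAP_MULTIPLIER * soft_timeout
    started = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=hard_cap)
    except subprocess.TimeoutExpired:
        return {"status": "FAIL", "executed": True, "reason": "hard cap exceeded"}
    except FileNotFoundError:
        return {"status": "FAIL", "executed": False, "reason": "command not found"}
    elapsed = time.monotonic() - started

    return {
        "status": "PASS" if proc.returncode == 0 else "FAIL",
        "executed": True,
        "exceeded_soft_timeout": elapsed > soft_timeout,
        "stdout": proc.stdout,
    }
```

With the default soft timeout of 600 seconds the hard cap works out to 1800 seconds, after which subprocess.run kills the child process and raises TimeoutExpired.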