codex-readiness-unit-test

LLM Codex Readiness Unit Test

Instruction-first, in-session "readiness" check for evaluating AGENTS/PLANS documentation quality without any external APIs or SDKs. All checks run against the current working directory (cwd), with no monorepo discovery. Each run writes to .codex-readiness-unit-test/<timestamp>/ and updates .codex-readiness-unit-test/latest.json. Keep execution deterministic (filesystem scanning + local command execution only). All LLM evaluation happens in-session and must output strict JSON via the provided references.

Quick Start

  1. Collect evidence:
    • python skills/codex-readiness-unit-test/bin/collect_evidence.py
  2. Run deterministic checks:
    • python skills/codex-readiness-unit-test/bin/deterministic_rules.py
  3. Run LLM checks using references in references/ and store .codex-readiness-unit-test/<timestamp>/llm_results.json.
  4. If execute mode is requested, build a plan, get confirmation, run:
    • python skills/codex-readiness-unit-test/bin/run_plan.py --plan .codex-readiness-unit-test/<timestamp>/plan.json
  5. Generate the report:
    • python skills/codex-readiness-unit-test/bin/scoring.py --mode read-only|execute

Outputs (per run, under .codex-readiness-unit-test/<timestamp>/):
  • report.json
  • report.html
  • summary.json
  • logs/* (execute mode)
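
The steps above can be chained from a small driver. The sketch below only builds the argument lists in Quick Start order; the helper name quick_start_commands is hypothetical, and it assumes the scripts are invoked exactly as listed above. Step 3 (the LLM checks) happens in-session, so it has no command.

```python
from typing import List, Optional

BIN = "skills/codex-readiness-unit-test/bin"


def quick_start_commands(mode: str, plan_path: Optional[str] = None) -> List[List[str]]:
    """Return the script invocations for one run, in Quick Start order."""
    if mode not in ("read-only", "execute"):
        raise ValueError(f"mode must be 'read-only' or 'execute', got {mode!r}")
    commands = [
        ["python", f"{BIN}/collect_evidence.py"],
        ["python", f"{BIN}/deterministic_rules.py"],
    ]
    # Step 3 (LLM checks) runs in-session, so there is nothing to invoke here.
    if mode == "execute":
        if plan_path is None:
            raise ValueError("execute mode needs the path to a confirmed plan.json")
        commands.append(["python", f"{BIN}/run_plan.py", "--plan", plan_path])
    commands.append(["python", f"{BIN}/scoring.py", "--mode", mode])
    return commands
```

Each returned list could be passed to subprocess.run one at a time, stopping at the first failure.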

Runbook

This skill produces a deterministic evidence file plus an in-session LLM evaluation, then compiles a JSON report and HTML scorecard. It requires no OpenAI API key and makes no external HTTP calls.

Minimal Inputs

  • mode: read-only or execute (required)
  • soft_timeout_seconds: optional (default 600)

Modes (Read-only vs Execute)

  • Read-only: Collect evidence, run deterministic rules, and run LLM checks #3–#5. No commands are executed, check #6 is marked NOT_RUN, and no execution logs/summary are produced.
  • Execute: Everything in read-only, plus a confirmed plan.json executed via run_plan.py. This enables check #6 and produces execution logs + execution_summary.json for scoring.

Always ask the user which mode to run (read-only vs. execute) before proceeding.

Check Types

  • Deterministic: filesystem-only checks (#1 AGENTS.md exists, #2 PLANS.md exists, #3 AGENTS.md <= 300 lines, #4 config.toml exists at repo root, repo .codex/, or user .codex/)
  • LLM: in-session Codex evaluation (#3 project context, #4 commands, #5 loops; commands may live in AGENTS or referenced skills)
  • Hybrid: deterministic execution + LLM rationale (#6 execution)

Skill references are discovered from AGENTS.md via $SkillName or .codex/skills/<name> patterns; their SKILL.md files are added to evidence for the LLM checks.

All checks run relative to the current working directory and are defined in skills/codex-readiness-unit-test/references/checks/checks.json, weighted equally by default. Each run writes outputs to .codex-readiness-unit-test/<timestamp>/ and updates .codex-readiness-unit-test/latest.json. The helper scripts read .codex-readiness-unit-test/latest.json by default to locate the latest run directory.
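
As a rough illustration of the deterministic checks and the skill-reference discovery described above, here is a minimal sketch. It is not the code in deterministic_rules.py or collect_evidence.py: the result keys, the function names, and the exact regular expression are assumptions made for the example.

```python
import re
from pathlib import Path

MAX_AGENTS_LINES = 300
# Assumed patterns: "$SkillName" tokens and ".codex/skills/<name>" paths.
SKILL_REF = re.compile(r"\$([A-Za-z][\w-]*)|\.codex/skills/([\w-]+)")


def deterministic_checks(cwd: Path, home: Path) -> dict:
    """Filesystem-only checks relative to cwd (result keys are hypothetical)."""
    agents = cwd / "AGENTS.md"
    agents_ok = agents.is_file()
    line_count = len(agents.read_text(encoding="utf-8").splitlines()) if agents_ok else 0
    config_candidates = [
        cwd / "config.toml",              # repo root
        cwd / ".codex" / "config.toml",   # repo .codex/
        home / ".codex" / "config.toml",  # user .codex/
    ]
    return {
        "agents_md_exists": agents_ok,
        "plans_md_exists": (cwd / "PLANS.md").is_file(),
        "agents_md_within_limit": agents_ok and line_count <= MAX_AGENTS_LINES,
        "config_toml_exists": any(path.is_file() for path in config_candidates),
    }


def find_skill_refs(agents_text: str) -> list:
    """Return referenced skill names in first-seen order, without duplicates."""
    names = []
    for dollar_name, path_name in SKILL_REF.findall(agents_text):
        name = dollar_name or path_name
        if name not in names:
            names.append(name)
    return names
```

Calling deterministic_checks(Path.cwd(), Path.home()) mirrors the "current working directory only" rule: nothing above cwd is scanned except the user-level .codex/ directory.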

Strict JSON + Retry Loop (Required)

For each LLM/HYBRID check:
  1. Run the specialized prompt expecting strict JSON.
  2. If JSON is invalid or missing keys, run skills/codex-readiness-unit-test/references/json_fix.md with the raw output.
  3. Retry up to 2 additional attempts (max 3 total).
  4. If still invalid: mark the check as WARN with rationale: "Invalid JSON from evaluator after retries".

The JSON schema is:

```json
{
  "status": "PASS|WARN|FAIL|NOT_RUN",
  "rationale": "string",
  "evidence_quotes": [{"path":"...","quote":"..."}],
  "recommendations": ["..."],
  "confidence": 0.0
}
```
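
The retry loop above can be sketched as follows. The two callables are placeholders: run_prompt stands in for the in-session evaluation and fix_json for applying json_fix.md to the raw output. Both names are assumptions for illustration, as is the choice to fill the fallback result with empty lists and a confidence of 0.0.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"status", "rationale", "evidence_quotes", "recommendations", "confidence"}
VALID_STATUSES = ("PASS", "WARN", "FAIL", "NOT_RUN")
MAX_ATTEMPTS = 3  # one initial attempt plus up to two retries


def parse_result(raw: str) -> Optional[dict]:
    """Return the parsed result, or None if it is not valid strict JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    if data["status"] not in VALID_STATUSES:
        return None
    return data


def evaluate_with_retries(run_prompt: Callable[[], str], fix_json: Callable[[str], str]) -> dict:
    """Run one check, repairing invalid output up to two times."""
    raw = run_prompt()
    for attempt in range(MAX_ATTEMPTS):
        result = parse_result(raw)
        if result is not None:
            return result
        if attempt < MAX_ATTEMPTS - 1:
            raw = fix_json(raw)  # hypothetical stand-in for the json_fix.md prompt
    return {
        "status": "WARN",
        "rationale": "Invalid JSON from evaluator after retries",
        "evidence_quotes": [],
        "recommendations": [],
        "confidence": 0.0,
    }
```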

Single Confirmation (Required)

Combine the command summary and execute plan into one concise confirmation step. Present:
  • The extracted build/test/dev loop commands (human-readable, labeled).
  • The planned execute details (cwd, ordered commands, soft timeout policy, env).

Ask for a single confirmation to proceed. Do not paste raw JSON, full evidence, or the full plan.json. If declined, mark execute-required checks as NOT_RUN.
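
One way to render that single confirmation is to turn the plan into a short, labeled summary instead of echoing JSON. The sketch below is illustrative only: the function name, the layout, and the decision to list env variable names without their values are assumptions, not behavior defined by the skill.

```python
def confirmation_summary(plan: dict, soft_timeout_seconds: int = 600) -> str:
    """Build a human-readable confirmation prompt from a plan dictionary."""
    lines = ["Planned execution:", f"  cwd: {plan.get('cwd', '.')}", "  commands (in order):"]
    for index, command in enumerate(plan["commands"], start=1):
        lines.append(f"    {index}. [{command['label']}] {command['cmd']}")
    lines.append(f"  soft timeout: {soft_timeout_seconds}s")
    # Show variable names only, so values never end up in the prompt.
    env_names = ", ".join(sorted(plan.get("env", {}))) or "(none)"
    lines.append(f"  env: {env_names}")
    lines.append("Proceed? (yes/no)")
    return "\n".join(lines)
```

If the user answers no, the caller skips run_plan.py and records NOT_RUN for the execute-required checks.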

Required Files

  • .codex-readiness-unit-test/<timestamp>/evidence.json (from collect_evidence.py)
  • .codex-readiness-unit-test/<timestamp>/deterministic_results.json (from deterministic_rules.py)
  • .codex-readiness-unit-test/<timestamp>/llm_results.json (from in-session references)
  • .codex-readiness-unit-test/<timestamp>/execution_summary.json (execute mode only)
  • .codex-readiness-unit-test/<timestamp>/report.json and .codex-readiness-unit-test/<timestamp>/report.html (from scoring.py)
  • .codex-readiness-unit-test/<timestamp>/summary.json (structured pass/fail summary from scoring.py)
  • .codex-readiness-unit-test/latest.json (stable pointer to the latest run directory)

Prompt Mapping

  • #3 project_context_specified: skills/codex-readiness-unit-test/references/project_context.md
  • #4 build_test_commands_exist: skills/codex-readiness-unit-test/references/commands.md
  • #5 dev_build_test_loops_documented: skills/codex-readiness-unit-test/references/loop_quality.md
  • #6 dev_build_test_loop_execution: skills/codex-readiness-unit-test/references/execution_explanation.md

plan.json schema (execute mode)

```json
{
  "project_dir": "relative/or/absolute/path (optional)",
  "cwd": "optional/absolute/path (defaults to current directory)",
  "commands": [
    {"label": "setup", "cmd": "npm install"},
    {"label": "build", "cmd": "npm run build"},
    {"label": "test", "cmd": "npm test"}
  ],
  "env": {
    "EXAMPLE": "value"
  }
}
```

Place plan.json inside the run directory (e.g., .codex-readiness-unit-test/<timestamp>/plan.json).
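
Before asking for confirmation, it can help to sanity-check the plan. The sketch below infers its rules from the example above (a non-empty commands list of label/cmd strings, optional string paths, a string-to-string env map); run_plan.py may enforce different or additional rules, and validate_plan is a hypothetical helper name.

```python
def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan looks usable."""
    if not isinstance(plan, dict):
        return ["plan must be a JSON object"]
    problems = []
    commands = plan.get("commands")
    if not isinstance(commands, list) or not commands:
        problems.append("commands must be a non-empty list")
        commands = []
    for index, command in enumerate(commands):
        if not isinstance(command, dict):
            problems.append(f"commands[{index}] must be an object")
            continue
        for key in ("label", "cmd"):
            value = command.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"commands[{index}].{key} must be a non-empty string")
    for key in ("project_dir", "cwd"):  # both are optional
        if key in plan and not isinstance(plan[key], str):
            problems.append(f"{key} must be a string path")
    env = plan.get("env", {})
    if not isinstance(env, dict) or not all(
        isinstance(name, str) and isinstance(value, str) for name, value in env.items()
    ):
        problems.append("env must map string names to string values")
    return problems
```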

llm_results.json schema

```json
{
  "project_context_specified": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.7},
  "build_test_commands_exist": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.7},
  "dev_build_test_loops_documented": {"status":"WARN","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.6},
  "dev_build_test_loop_execution": {"status":"PASS","rationale":"...","evidence_quotes":[],"recommendations":[],"confidence":0.6}
}
```
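
A possible way to assemble this file from the per-check results is sketched below. It applies the rule that check #6 is NOT_RUN in read-only mode; treating a missing result as NOT_RUN is an assumption of this example, and the helper names are hypothetical.

```python
LLM_CHECK_IDS = (
    "project_context_specified",
    "build_test_commands_exist",
    "dev_build_test_loops_documented",
    "dev_build_test_loop_execution",
)
EXECUTE_ONLY = {"dev_build_test_loop_execution"}  # check #6


def not_run(reason: str) -> dict:
    """Build a schema-shaped NOT_RUN result."""
    return {
        "status": "NOT_RUN",
        "rationale": reason,
        "evidence_quotes": [],
        "recommendations": [],
        "confidence": 0.0,
    }


def assemble_llm_results(results: dict, mode: str) -> dict:
    """Return a dict with one entry per LLM/HYBRID check, ready to dump as JSON."""
    assembled = {}
    for check_id in LLM_CHECK_IDS:
        if mode == "read-only" and check_id in EXECUTE_ONLY:
            assembled[check_id] = not_run("Read-only mode: no commands were executed")
        elif check_id in results:
            assembled[check_id] = results[check_id]
        else:
            # Assumption: a check with no recorded result counts as NOT_RUN.
            assembled[check_id] = not_run("No result recorded for this check")
    return assembled
```

The returned dictionary can be written with json.dump to .codex-readiness-unit-test/<timestamp>/llm_results.json.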

Scoring Rules

  • PASS = 100% of weight
  • WARN = 50% of weight
  • FAIL/NOT_RUN = 0%
  • Overall status: FAIL if any FAIL; else WARN if any WARN or NOT_RUN; else PASS.
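
These rules translate directly into a few lines of code. In the sketch below, the per-status credit and the overall-status precedence come from the rules above; normalizing the result to a percentage of total weight, and the function name score, are assumptions of this example rather than a description of scoring.py.

```python
from typing import Dict, Optional, Tuple

STATUS_CREDIT = {"PASS": 1.0, "WARN": 0.5, "FAIL": 0.0, "NOT_RUN": 0.0}


def score(statuses: Dict[str, str], weights: Optional[Dict[str, float]] = None) -> Tuple[float, str]:
    """Return (percent of total weight earned, overall status)."""
    if weights is None:
        weights = {check_id: 1.0 for check_id in statuses}  # equal weights by default
    total = sum(weights[check_id] for check_id in statuses)
    earned = sum(weights[check_id] * STATUS_CREDIT[status] for check_id, status in statuses.items())
    percent = 100.0 * earned / total if total else 0.0

    seen = set(statuses.values())
    if "FAIL" in seen:
        overall = "FAIL"
    elif seen & {"WARN", "NOT_RUN"}:
        overall = "WARN"
    else:
        overall = "PASS"
    return percent, overall
```

For example, four equally weighted checks with two PASS, one WARN, and one NOT_RUN earn 2.5 of 4.0 weight, so the score is 62.5 with an overall status of WARN.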

Safety + Timeouts

  • Denylisted commands are not executed and are marked FAIL.
  • Soft timeout defaults to 600s; hard cap defaults to 3x soft timeout.
  • Execution logs are written to .codex-readiness-unit-test/<timestamp>/logs/.
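
A minimal sketch of how these safeguards could fit together is shown below. The denylist contents are hypothetical (the real list lives in the skill's scripts and may differ), the first-token check is deliberately naive, and mapping a non-zero exit code to FAIL is an assumption. Only the hard cap is enforced by killing the process; exceeding the soft timeout is merely flagged.

```python
import shlex
import subprocess
import time

DENYLIST = {"rm", "sudo", "shutdown"}  # hypothetical entries for illustration
HARD_CAP_MULTIPLIER = 3


def hard_cap(soft_timeout_seconds: int = 600) -> int:
    """Hard cap defaults to 3x the soft timeout."""
    return soft_timeout_seconds * HARD_CAP_MULTIPLIER


def is_denylisted(cmd: str) -> bool:
    """Naive check: compare the first token of the command to the denylist."""
    try:
        tokens = shlex.split(cmd)
    except ValueError:
        return True  # unparseable commands are treated as unsafe in this sketch
    return bool(tokens) and tokens[0] in DENYLIST


def run_command(cmd: str, soft_timeout_seconds: int = 600) -> dict:
    """Run one plan command, refusing denylisted ones without executing them."""
    if is_denylisted(cmd):
        return {"status": "FAIL", "executed": False, "reason": "denylisted command"}
    started = time.monotonic()
    try:
        completed = subprocess.run(
            cmd, shell=True, capture_output=True, text=True,
            timeout=hard_cap(soft_timeout_seconds),
        )
    except subprocess.TimeoutExpired:
        return {"status": "FAIL", "executed": True, "reason": "hard cap exceeded"}
    elapsed = time.monotonic() - started
    return {
        "status": "PASS" if completed.returncode == 0 else "FAIL",
        "executed": True,
        "exit_code": completed.returncode,
        "soft_timeout_exceeded": elapsed > soft_timeout_seconds,
    }
```

A production denylist would need to look past the first token, since a chained command such as "cd build && rm -rf ." slips through this check.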