create-context-tests
How many tests
One test per key metric in `## Key Metrics Reference` is the floor. Then add tests for: time scoping (especially "last 8 weeks" / "last 30 days"), CTE / multi-step queries, edge cases (NULLs, empty windows), and ambiguous wording ("our users", "active") to validate naming-convention rules.
Two authoring rules: apply to every test
Rule 1 — Prompts read like real chat. Vague, short, no table/column/method hints. The test verifies the agent reaches the right answer from a real-user input.
| Bad | Good |
|---|---|
| "What was the churn rate from …" | "How's churn looking this quarter?" |
| "Compute MRR as SUM(…" | "What's our MRR?" |
Rule 2 — Output column names encode format / unit, not source. A column name communicates how to interpret the value.
| Bad | Good |
|---|---|
| |
| |
| |
Naming patterns: `<metric>_float_0_1` or `<metric>_percentage_0_100` for rates; `<metric>_<currency>_<unit>` for money; `<thing>_count`; `<thing>_date_yyyy_mm_dd`. See `templates/test.yaml`.
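As a rough illustration of Rule 2, the naming patterns can be checked mechanically. The sketch below is not part of the nao tooling: `column_kind` is a hypothetical helper, and the currency codes it recognizes (`usd`, `eur`, `gbp`) are assumptions chosen only for the example.

```bash
# Hypothetical helper: classify an output column name by the naming patterns.
# Prints the inferred kind, or returns non-zero when no pattern matches.
column_kind() {
  case "$1" in
    ?*_float_0_1)        echo "rate_0_1" ;;
    ?*_percentage_0_100) echo "rate_0_100" ;;
    ?*_date_yyyy_mm_dd)  echo "date" ;;
    ?*_count)            echo "count" ;;
    # Assumption: only these three currency codes are treated as money.
    ?*_usd_?*|?*_eur_?*|?*_gbp_?*) echo "money" ;;
    *)                   return 1 ;;
  esac
}

# Example: column_kind churn_rate_float_0_1   prints rate_0_1
# Example: column_kind total_revenue          prints nothing and returns 1
```

A name that matches no pattern is a hint that the column says where the value came from rather than how to read it.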
Steps
- Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft one new test per metric.
- Save flat under `tests/` (no subfolders), one YAML file per test. Use `templates/test.yaml`.
- Have the user validate: confirm prompts match their team's phrasing and SQL matches their definition of truth.
- Run `nao test`. Prerequisites:
  - `cd` into the project directory (where `nao_config.yaml` lives).
  - Start `nao chat &` in the background (the test runner reuses the chat server).
  - LLM configured in `nao_config.yaml`.
  - First run prompts for login credentials. Let the user type them; don't script around it.
  - If you see `AI_APICallError: Not Found` at `https://api.anthropic.com/messages` (no `/v1/`), run `unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY` first (parent agent CLI is leaking env vars). See `setup-context` for the full note.

  ```bash
  nao test -m <model_id> -t 10  # -t = parallelism
  ```
- Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
- Diagnose failures (optional): read `tests/outputs/` for each failure, identify the rule gap, propose the smallest fix, then route to `write-context-rules` (or `audit-context` for systemic issues). Re-run between fixes so impact is attributable.
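The prerequisites for the run step can be checked before invoking `nao test`. The following is a minimal sketch, not something shipped with nao: `preflight` is a hypothetical function, and it assumes that `tests/outputs/` is the only subfolder allowed under `tests/` because the runner writes there.

```bash
# Hypothetical pre-flight check mirroring the prerequisites for `nao test`.
# Usage: preflight [project_dir]   (defaults to the current directory)
preflight() {
  dir="${1:-.}"
  status=0

  # nao_config.yaml is expected at the project root.
  if [ ! -f "$dir/nao_config.yaml" ]; then
    echo "missing nao_config.yaml in $dir" >&2
    status=1
  fi

  # Tests are saved flat: flag any subfolder of tests/ other than outputs/
  # (assumption: outputs/ is written by the runner, not by test authors).
  for sub in "$dir"/tests/*/; do
    [ -d "$sub" ] || continue
    case "$sub" in
      */tests/outputs/) ;;
      *) echo "unexpected subfolder: $sub" >&2; status=1 ;;
    esac
  done

  # Leaked env vars are only a warning; they matter if the run fails
  # with AI_APICallError: Not Found.
  if [ -n "${ANTHROPIC_BASE_URL:-}" ] || [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "note: ANTHROPIC_BASE_URL or ANTHROPIC_API_KEY is set in this shell" >&2
  fi

  return "$status"
}
```

Running `preflight && nao test -m <model_id> -t 10` keeps a misconfigured project from producing a misleading baseline.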
Guardrails
- Tests' SQL must execute as-is: no `<placeholder>` in `FROM`. Use real table / column names.
- Never leak the answer in `prompt` or output column names (Rules 1 + 2).
- One test per metric is the floor; coverage tests come after.
- Apply one context fix at a time between runs.
- If a test contradicts `RULES.md`, stop and ask which is correct; it's a bug in one or the other.
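The first guardrail lends itself to a quick scan before a run. This is a heuristic sketch under stated assumptions: `has_placeholder` is a hypothetical helper, and it treats any bare identifier wrapped in angle brackets (such as `<table_name>`) as an unfilled placeholder, which could in principle misfire on unusual SQL.

```bash
# Hypothetical helper: succeeds when the file still contains an unfilled
# <placeholder> token, i.e. an identifier wrapped in angle brackets.
has_placeholder() {
  grep -Eq '<[A-Za-z_][A-Za-z0-9_]*>' "$1"
}

# Example scan over a flat tests/ directory:
#   for f in tests/*.yaml; do
#     if has_placeholder "$f"; then echo "placeholder left in $f"; fi
#   done
```

Comparison operators written with surrounding spaces (`a < 3 AND b > 1`) do not match the pattern, so ordinary filters are left alone.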
Templates
- `templates/test.yaml`: single-test format.