create-context-tests

`nao test` runs each natural-language prompt through the agent, executes both the agent's SQL and the test's expected SQL against the warehouse, and diffs the result data row-by-row. A test passes only if the actual data matches — same rows, same values. The suite is the reliability benchmark; every change to `RULES.md` is measured against it. Reference: docs.getnao.io/nao-agent/context-engineering/evaluation.

How many tests

One test per key metric in `## Key Metrics Reference` is the floor. Then add tests for: time scoping (especially "last 8 weeks" / "last 30 days"), CTE / multi-step queries, edge cases (NULLs, empty windows), and ambiguous wording ("our users", "active") to validate naming-convention rules.
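A time-scoping test might look like the following sketch. The field names are assumed (the real schema lives in `templates/test.yaml`), and the table and column names are illustrative:

```yaml
# Hypothetical test file: tests/signups_last_30_days.yaml
# (field names assumed; see templates/test.yaml for the actual format)
prompt: "How many signups did we get in the last 30 days?"
expected_sql: |
  SELECT COUNT(*) AS signup_count
  FROM dim_users
  WHERE signup_at >= CURRENT_DATE - INTERVAL '30 days'
```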

Two authoring rules — apply to every test

Rule 1 — Prompts read like real chat. Vague, short, no table/column/method hints. The test verifies the agent reaches the right answer from a real-user input.

| Bad | Good |
| --- | --- |
| "What was the churn rate from `fct_subscriptions` in Q1?" | "How's churn looking this quarter?" |
| "Compute MRR as SUM(`mrr_amount`) where status='active'" | "What's our MRR?" |
Rule 2 — Output column names encode format / unit, not source. A column name communicates how to interpret the value.

| Bad | Good |
| --- | --- |
| `churn_rate_from_fct_subscriptions` | `churn_rate_float_0_1` |
| `mrr_amount_fct_stripe_mrr` | `mrr_usd_dollars` |
| `signup_at_dim_users` | `signup_date_yyyy_mm_dd` |
Naming patterns: `<metric>_float_0_1` or `<metric>_percentage_0_100` for rates; `<metric>_<currency>_<unit>` for money; `<thing>_count` for counts; `<thing>_date_yyyy_mm_dd` for dates. See `templates/test.yaml`.
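A single test that follows both rules might look like this sketch. The field names are assumed (check `templates/test.yaml` for the real schema); the table and column names echo the examples above:

```yaml
# Hypothetical test file: tests/mrr_current.yaml
# (field names assumed; see templates/test.yaml for the actual format)
prompt: "What's our MRR?"          # Rule 1: vague, chat-like, no table or method hints
expected_sql: |
  SELECT SUM(mrr_amount) AS mrr_usd_dollars  -- Rule 2: name encodes currency and unit
  FROM fct_stripe_mrr
  WHERE status = 'active'
```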

Steps

1. Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite the SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft new tests, one per metric.
2. Save flat under `tests/` (no subfolders), one YAML file per test. Use `templates/test.yaml`.
3. Have the user validate — confirm prompts match their team's phrasing and SQL matches their definition of truth.
4. Run `nao test`. Prerequisites:
   • `cd` into the project directory (where `nao_config.yaml` lives).
   • Start `nao chat &` in the background (the test runner reuses the chat server).
   • An LLM configured in `nao_config.yaml`.
   • The first run prompts for login credentials — let the user type them; don't script around it.
   • If you see `AI_APICallError: Not Found` at `https://api.anthropic.com/messages` (no `/v1/`), run `unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY` first (the parent agent CLI is leaking env vars). See `setup-context` for the full note.

   ```bash
   nao test -m <model_id> -t 10   # -t = parallelism
   ```
5. Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
6. Diagnose failures (optional): read `tests/outputs/` for each failure, identify the rule gap, propose the smallest fix, then route to `write-context-rules` (or `audit-context` for systemic issues). Re-run between fixes so impact is attributable.
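The prerequisite checks in step 4 can be sketched as a small pre-flight script. The guard logic here is illustrative; only the env-var names and `nao_config.yaml` come from this doc:

```shell
# Pre-flight sketch before `nao test` (illustrative guard logic).

# Clear Anthropic env vars a parent agent CLI may have leaked; they point
# nao at the wrong API base URL (the one missing /v1/).
unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY

# Confirm we're in the project root: nao_config.yaml must be present.
if [ ! -f nao_config.yaml ]; then
  echo "nao_config.yaml not found -- cd into the project directory first"
fi
```

After these checks pass, start `nao chat &` in the background and run `nao test -m <model_id> -t 10`.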

Guardrails

  • Tests' SQL must execute as-is — no `<placeholder>` in `FROM`. Use real table / column names.
  • Never leak the answer in `prompt` or output column names (Rules 1 + 2).
  • One test per metric is the floor; coverage tests come after.
  • Apply one context fix at a time between runs.
  • If a test contradicts `RULES.md`, stop and ask which is correct — it's a bug in one or the other.

Templates

  • `templates/test.yaml` — single-test format.