create-context-tests

`nao test` runs each natural-language prompt through the agent, executes both the agent's SQL and the test's expected SQL against the warehouse, and diffs the result data row-by-row. A test passes only if the actual data matches — same rows, same values. The suite is the reliability benchmark; every change to `RULES.md` is measured against it. Reference: docs.getnao.io/nao-agent/context-engineering/evaluation.

How many tests

One test per key metric in `## Key Metrics Reference` is the floor. Then add tests for: time scoping (especially "last 8 weeks" / "last 30 days"), CTE / multi-step queries, edge cases (NULLs, empty windows), and ambiguous wording ("our users", "active") to validate naming-convention rules.
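A time-scoping test might look like the following sketch. The field names are assumed (the real schema lives in `templates/test.yaml`), and the table and column names are illustrative:

```yaml
# Hypothetical test file: tests/signups_last_30_days.yaml
# (field names assumed; see templates/test.yaml for the actual format)
prompt: "How many signups did we get in the last 30 days?"
expected_sql: |
  SELECT COUNT(*) AS signup_count
  FROM dim_users
  WHERE signup_at >= CURRENT_DATE - INTERVAL '30 days'
```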

Two authoring rules — apply to every test

Rule 1 — Prompts read like real chat. Vague, short, no table/column/method hints. The test verifies the agent reaches the right answer from a real-user input.

| Bad | Good |
| --- | --- |
| "What was the churn rate from `fct_subscriptions` in Q1?" | "How's churn looking this quarter?" |
| "Compute MRR as SUM(`mrr_amount`) where status='active'" | "What's our MRR?" |
Rule 2 — Output column names encode format / unit, not source. A column name communicates how to interpret the value.

| Bad | Good |
| --- | --- |
| `churn_rate_from_fct_subscriptions` | `churn_rate_float_0_1` |
| `mrr_amount_fct_stripe_mrr` | `mrr_usd_dollars` |
| `signup_at_dim_users` | `signup_date_yyyy_mm_dd` |
Naming patterns: `<metric>_float_0_1` or `<metric>_percentage_0_100` for rates; `<metric>_<currency>_<unit>` for money; `<thing>_count` for counts; `<thing>_date_yyyy_mm_dd` for dates. See `templates/test.yaml`.
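A single test that follows both rules might look like this sketch. The field names are assumed (check `templates/test.yaml` for the real schema); the table and column names echo the examples above:

```yaml
# Hypothetical test file: tests/mrr_current.yaml
# (field names assumed; see templates/test.yaml for the actual format)
prompt: "What's our MRR?"          # Rule 1: vague, chat-like, no table or method hints
expected_sql: |
  SELECT SUM(mrr_amount) AS mrr_usd_dollars  -- Rule 2: name encodes currency and unit
  FROM fct_stripe_mrr
  WHERE status = 'active'
```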

Steps

1. Ask once: does the user have trusted source-of-truth queries (Looker, dashboards, prior benchmarks)? If yes, transform each into a test (rewrite the SELECT to apply Rule 2; reverse-engineer a Rule 1 prompt). For metrics without a trusted query, draft new tests, one per metric.
2. Save flat under `tests/` (no subfolders), one YAML file per test. Use `templates/test.yaml`.
3. Have the user validate — confirm prompts match their team's phrasing and SQL matches their definition of truth.
4. Run `nao test`. Prerequisites:
   • `cd` into the project directory (where `nao_config.yaml` lives).
   • Start `nao chat &` in the background (the test runner reuses the chat server).
   • An LLM configured in `nao_config.yaml`.
   • The first run prompts for login credentials — let the user type them; don't script around it.
   • If you see `AI_APICallError: Not Found` at `https://api.anthropic.com/messages` (no `/v1/`), run `unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY` first (the parent agent CLI is leaking env vars). See `setup-context` for the full note.

   ```bash
   nao test -m <model_id> -t 10   # -t = parallelism
   ```
5. Recap results: pass rate, token cost, wall-clock time. Cite this as the baseline.
6. Diagnose failures (optional): read `tests/outputs/` for each failure, identify the rule gap, propose the smallest fix, then route to `write-context-rules` (or `audit-context` for systemic issues). Re-run between fixes so impact is attributable.
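The prerequisite checks in step 4 can be sketched as a small pre-flight script. The guard logic here is illustrative; only the env-var names and `nao_config.yaml` come from this doc:

```shell
# Pre-flight sketch before `nao test` (illustrative guard logic).

# Clear Anthropic env vars a parent agent CLI may have leaked; they point
# nao at the wrong API base URL (the one missing /v1/).
unset ANTHROPIC_BASE_URL ANTHROPIC_API_KEY

# Confirm we're in the project root: nao_config.yaml must be present.
if [ ! -f nao_config.yaml ]; then
  echo "nao_config.yaml not found -- cd into the project directory first"
fi
```

After these checks pass, start `nao chat &` in the background and run `nao test -m <model_id> -t 10`.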

Guardrails

  • Tests' SQL must execute as-is — no `<placeholder>` in `FROM`. Use real table / column names.
  • Never leak the answer in `prompt` or output column names (Rules 1 + 2).
  • One test per metric is the floor; coverage tests come after.
  • Apply one context fix at a time between runs.
  • If a test contradicts `RULES.md`, stop and ask which is correct — it's a bug in one or the other.

Templates

  • `templates/test.yaml` — single-test format.