omni-ai-eval

Omni Eval

Run evals against Omni's AI query generation APIs — submit test prompts, capture the generated query JSON, compare it against expected results, and score accuracy across dimensions.

Tip: Use `omni-ai-optimizer` to improve scores after identifying failures, and `omni-model-explorer` to discover available topics and fields for building eval cases.

Prerequisites

Verify the Omni CLI is installed — if not, ask the user to install it

```bash
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."
```

Show available profiles and select the appropriate one

```bash
omni config show
```

If multiple profiles exist, ask the user which to use, then switch:

```bash
omni config use <profile-name>
```

You also need a **model ID** and an **eval set** — a file of test cases with prompts and expected query structures. See the [Eval Design Guide](https://docs.omni.co/ai/eval-design-guide) for best practices on building eval sets.

Discovering Commands

```bash
omni ai --help    # AI operations (generate-query, jobs, pick-topic)
```

Tip: Use `-o json` to force structured output for programmatic parsing, or `-o human` for readable tables. The default is `auto` (human in a TTY, JSON when piped).

Eval Input Format

Each eval case pairs a natural language prompt with the expected query structure. JSONL (one JSON object per line) works well for bulk runs:

```jsonl
{"id": "rev-by-month", "prompt": "Show me revenue by month", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["order_items.created_at[month]", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}]}, "tags": ["time-series"]}
{"id": "top-customers", "prompt": "Top 10 customers by spend", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["users.name", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.total_revenue", "sort_descending": true}]}, "tags": ["top-n"]}
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique identifier for the eval case |
| `prompt` | Yes | Natural language question to send to AI |
| `modelId` | Yes | Target model UUID |
| `expected` | Yes | Object with `topic`, `fields`, `filters`, `sorts` |
| `branchId` | No | Branch to test against |
| `currentTopicName` | No | Constrain to a specific topic |
| `tags` | No | Array of tags for filtering/grouping results |
Note: JSONL is shown here, but any structured format works — CSV, JSON arrays, YAML — as long as you can iterate over cases and extract these fields.
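The required/optional split in the table above can be enforced by a small loader before a run starts. A minimal sketch (the `EvalCase` type and `parseEvalCase` helper are hypothetical names, not part of the Omni CLI):

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  modelId: string;
  expected: { topic: string; fields: string[]; filters?: object; sorts?: object[] };
  branchId?: string;
  currentTopicName?: string;
  tags?: string[];
}

// Parse one JSONL line and fail fast on missing required fields,
// so a malformed case surfaces before any API calls are made.
function parseEvalCase(line: string): EvalCase {
  const c = JSON.parse(line);
  for (const key of ["id", "prompt", "modelId", "expected"]) {
    if (!(key in c)) throw new Error(`eval case missing required field: ${key}`);
  }
  if (!c.expected.topic || !Array.isArray(c.expected.fields)) {
    throw new Error(`eval case ${c.id}: expected.topic and expected.fields are required`);
  }
  return c as EvalCase;
}
```

Validating up front keeps a long batch run from dying halfway through on a typo in a case file.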

Running Evals: Fast Path (Generate Query API)

The synchronous generate-query endpoint is the fastest way to eval query generation. Pass `--run-query false` to get only the generated query JSON without executing it against the database.

Single Eval Call

```bash
omni ai generate-query your-model-id "Show me revenue by month" --run-query false
```

Response Structure

```json
{
  "query": {
    "fields": ["order_items.created_at[month]", "order_items.total_revenue"],
    "table": "order_items",
    "filters": {},
    "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}],
    "limit": 500
  },
  "topic": "order_items",
  "error": null
}
```

Request Parameters

| Arg/Flag | Required | Description |
| --- | --- | --- |
| `<model-id>` | Yes | UUID of the Omni model (positional arg) |
| `<prompt>` | Yes | Natural language question (positional arg) |
| `--run-query` | No | Set `false` to skip query execution (faster, default `true`) |
| `--branch-id` | No | Branch UUID for branch-specific testing |
| `--current-topic-name` | No | Constrain topic selection to a specific topic |

Batch Loop (bash)

```bash
while IFS= read -r line; do
  id=$(echo "$line" | jq -r '.id')
  prompt=$(echo "$line" | jq -r '.prompt')
  model_id=$(echo "$line" | jq -r '.modelId')
  branch_id=$(echo "$line" | jq -r '.branchId // empty')

  branch_flag=""
  if [ -n "$branch_id" ]; then
    branch_flag="--branch-id $branch_id"
  fi

  result=$(omni ai generate-query "$model_id" "$prompt" --run-query false $branch_flag --compact)

  echo "{\"id\": \"$id\", \"generated\": $result}" >> eval_results.jsonl
done < eval_cases.jsonl
```

Running Evals: Agentic Path (AI Jobs API)

Use the async AI Jobs API when you want to test the full agentic workflow — multi-step analysis, tool use, and topic selection as Blobby would actually behave in production.

Submit a Job

```bash
omni ai job-submit your-model-id "Show me revenue by month"
```

Response:

```json
{
  "jobId": "job-uuid",
  "conversationId": "conv-uuid",
  "omniChatUrl": "https://yourorg.omniapp.co/chat/..."
}
```

Poll for Completion

```bash
omni ai job-status <jobId>
```

Status progression: `QUEUED` → `EXECUTING` → `COMPLETE` (or `FAILED`). Poll with backoff (e.g., 2s, 4s, 8s) until the `state` is terminal.
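The backoff loop above can be sketched generically. Here `getStatus` is a stand-in for shelling out to `omni ai job-status <jobId>` and reading its `state` field; the delays and terminal states come from the progression described in the text:

```typescript
type JobState = "QUEUED" | "EXECUTING" | "COMPLETE" | "FAILED";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Poll until the job reaches a terminal state, doubling the delay
// after each non-terminal poll: 2s, 4s, 8s, ...
async function pollUntilTerminal(
  getStatus: () => Promise<JobState>,
  baseDelayMs = 2000,
  maxAttempts = 10,
): Promise<JobState> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await getStatus();
    if (state === "COMPLETE" || state === "FAILED") return state;
    await sleep(baseDelayMs * 2 ** attempt);
  }
  throw new Error("job did not reach a terminal state in time");
}
```

Capping `maxAttempts` keeps a stuck job from hanging a batch run indefinitely.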

Get Result

```bash
omni ai job-result <jobId>
```

The result contains an `actions` array. Look for actions with `type: "generate_query"` to extract the query JSON:

```json
{
  "actions": [
    {
      "type": "generate_query",
      "message": "Querying revenue by month...",
      "result": {
        "queryName": "Revenue by Month",
        "query": { "fields": [...], "table": "...", "filters": {...} },
        "status": "success",
        "totalRowCount": 12
      }
    }
  ],
  "topic": "order_items",
  "resultSummary": "Here are the monthly revenue figures..."
}
```

When to Use Which Path

| Criterion | Generate Query (Fast) | AI Jobs (Agentic) |
| --- | --- | --- |
| Speed | Synchronous, fast | Async, slower |
| Volume | High-volume runs | Lower volume |
| Scope | Query generation only | Full agent workflow |
| Use case | Field/filter accuracy | End-to-end behavior |
| Multi-step | Single query | May generate multiple queries |

Testing Topic Selection

Eval topic selection independently with the pick-topic endpoint:

```bash
omni ai pick-topic your-model-id "How many users signed up last month?"
```

Response:

```json
{
  "topicId": "users"
}
```

This lets you score topic selection accuracy as a separate dimension — useful when topic selection is a known weak point.

Scoring: Structural Query Comparison

Compare the generated query JSON against the expected query across four dimensions:

| Dimension | Comparison Method | Scoring |
| --- | --- | --- |
| `topic` | Exact string match | pass/fail |
| `fields` | Set comparison (order-independent) | pass/fail + similarity score |
| `filters` | Key-value match (key present + value match) | pass/fail per filter key |
| `sorts` | Ordered array comparison | pass/fail |

Example Comparison Logic (TypeScript)

```typescript
function scoreEval(expected: any, generated: any) {
  // Topic: exact match
  const topicPass = generated.topic === expected.topic;

  // Fields: set comparison (order-independent)
  const expectedFields = new Set<string>(expected.fields);
  const generatedFields = new Set<string>(generated.query.fields);
  const missing = [...expectedFields].filter(f => !generatedFields.has(f));
  const extra = [...generatedFields].filter(f => !expectedFields.has(f));
  const fieldsPass = missing.length === 0 && extra.length === 0;

  // Filters: key-value match (stringify both sides so object-valued
  // filters compare by content rather than by reference)
  const expectedFilters = expected.filters || {};
  const generatedFilters = generated.query.filters || {};
  const missingKeys = Object.keys(expectedFilters).filter(k => !(k in generatedFilters));
  const wrongValues = Object.keys(expectedFilters)
    .filter(k => k in generatedFilters &&
      JSON.stringify(generatedFilters[k]) !== JSON.stringify(expectedFilters[k]));
  const filtersPass = missingKeys.length === 0 && wrongValues.length === 0;

  // Sorts: ordered comparison
  const sortsPass = JSON.stringify(expected.sorts || []) ===
    JSON.stringify(generated.query.sorts || []);

  return {
    topic: topicPass,
    fields: { pass: fieldsPass, missing, extra },
    filters: { pass: filtersPass, missingKeys, wrongValues },
    sorts: sortsPass,
    allPass: topicPass && fieldsPass && filtersPass && sortsPass,
  };
}
```

Aggregate Scoring

Compute pass rates across all eval cases:

```
Eval Results: 47/50 passed (94.0%)
  Topic:   49/50 (98.0%)
  Fields:  47/50 (94.0%)
  Filters: 48/50 (96.0%)
  Sorts:   50/50 (100.0%)
```

Per-dimension rates help pinpoint where accuracy is weakest — if topic accuracy is high but filter accuracy is low, focus `ai_context` improvements on filter-related guidance.
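The tally above can be produced with a short aggregation. A minimal sketch, assuming results shaped like the `scoreEval()` return value shown earlier:

```typescript
interface ScoredResult {
  topic: boolean;
  fields: { pass: boolean };
  filters: { pass: boolean };
  sorts: boolean;
  allPass: boolean;
}

// Compute overall and per-dimension pass rates (as percentages)
// from a list of scored eval results.
function aggregate(results: ScoredResult[]) {
  const rate = (n: number) => (100 * n) / results.length;
  const count = (pred: (r: ScoredResult) => boolean) => results.filter(pred).length;
  return {
    overall: rate(count((r) => r.allPass)),
    topic: rate(count((r) => r.topic)),
    fields: rate(count((r) => r.fields.pass)),
    filters: rate(count((r) => r.filters.pass)),
    sorts: rate(count((r) => r.sorts)),
  };
}
```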

A/B Comparison

Run the same eval suite with one variable changed to measure impact. This is the core workflow for understanding whether a change improves or degrades AI accuracy.

Common Variables to Compare

- **Model branches** — pass different `--branch-id` values to test context changes on a branch before merging
- **Topic scope** — `--current-topic-name "orders"` vs omitted (auto-select)
- **Model context changes** — `ai_context`, `sample_queries`, field descriptions (apply via `omni-model-builder` on a branch, then eval against that branch)
- **Prompt wording** — same expected query, different prompt text
- **AI configuration** — model type, thinking level, or other AI parameters

Workflow

1. Run eval suite with configuration A → save as `results_a.jsonl`
2. Run eval suite with configuration B → save as `results_b.jsonl`
3. Score both result sets
4. Compare side-by-side, checking for regressions
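The comparison step can be sketched as a diff over pass/fail keyed by case id. The `diffRuns` helper below is hypothetical, not a CLI command:

```typescript
// Given pass/fail per case id for runs A and B, list regressions
// (passed in A, failed in B) and improvements (the reverse).
function diffRuns(a: Map<string, boolean>, b: Map<string, boolean>) {
  const regressions: string[] = [];
  const improvements: string[] = [];
  for (const [id, passedA] of a) {
    const passedB = b.get(id);
    if (passedB === undefined) continue; // case missing from run B
    if (passedA && !passedB) regressions.push(id);
    if (!passedA && passedB) improvements.push(id);
  }
  return { regressions, improvements };
}
```

Keying by case id (rather than comparing aggregate rates) is what surfaces regressions that a net improvement would otherwise hide.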

Example Comparison Output

```
A/B Comparison: main vs branch/new-context
                      A (main)    B (new-context)    Delta
Overall pass rate:    88.0%       94.0%              +6.0%
Topic accuracy:       96.0%       98.0%              +2.0%
Field accuracy:       90.0%       94.0%              +4.0%
Filter accuracy:      88.0%       96.0%              +8.0%

Regressions (passed in A, failed in B):
  - rev-by-quarter: fields missing order_items.total_revenue

Improvements (failed in A, passed in B):
  - customer-count: topic now correctly selects users
  - top-products: filters now include status=complete
```

Important: Always check for regressions, not just overall improvement. A net improvement that breaks previously-correct cases may indicate an `ai_context` conflict.

Snapshotting Model State

Before running evals, snapshot the model definition so results are reproducible:

Save model YAML

```bash
omni models yaml-get <modelId> --compact > model_snapshot_$(date +%Y%m%d).json
```

Validate model integrity

```bash
omni models validate <modelId>
```

Version your eval set alongside model snapshots so you can trace which model state produced which scores.

Known Issues & Gotchas

- **Filter comparison can be complex** — Omni supports rich filter expressions (`"last 7 days"`, `"between 10 and 100"`, `"not null"`). The structural comparison above uses exact string match on filter values. If the AI produces semantically equivalent but syntactically different expressions, you may see false failures. Consider normalizing common patterns or using a Jaccard threshold.
- **AI Jobs are async** — poll with exponential backoff. Don't hammer the status endpoint.
- **Rate limiting** — for high-volume eval runs, add a small delay between calls or batch requests.
- **`limit` field may vary** — the AI may choose different limits than expected. Consider excluding `limit` from strict comparison if it's not critical to your eval.
- **`table` vs `topic`** — the generate-query response returns `topic` as a top-level field and `table` inside the query object. These usually match but aren't always identical. Compare against the top-level `topic`.
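One way to implement the Jaccard threshold mentioned in the first gotcha is to score two sets of strings (e.g. field lists, or normalized filter values) by overlap. A sketch; the 0.8 default threshold is an arbitrary starting point, not a recommended value:

```typescript
// Jaccard similarity: |intersection| / |union| over two string sets.
// Returns 1 for two empty sets (nothing expected, nothing produced).
function jaccard(a: string[], b: string[]): number {
  const sa = new Set(a);
  const sb = new Set(b);
  const intersection = [...sa].filter((x) => sb.has(x)).length;
  const union = new Set([...sa, ...sb]).size;
  return union === 0 ? 1 : intersection / union;
}

const similarEnough = (a: string[], b: string[], threshold = 0.8) =>
  jaccard(a, b) >= threshold;
```

A threshold pass can be reported alongside the strict pass/fail rather than replacing it, so near-misses are visible without masking exact failures.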

Related Skills

- `omni-query` — run golden queries to validate expected results
- `omni-model-explorer` — discover topics and fields for building eval cases
- `omni-ai-optimizer` — improve AI accuracy based on eval findings
- `omni-model-builder` — apply context changes on branches before A/B testing