# omni-ai-eval
Run evals against Omni's AI query generation APIs — submit test prompts, capture the generated query JSON, compare it against expected results, and score accuracy across dimensions.
Tip: Use `omni-ai-optimizer` to improve scores after identifying failures, and `omni-model-explorer` to discover available topics and fields for building eval cases.
## Prerequisites
```bash
# Verify the Omni CLI is installed — if not, ask the user to install it
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."

# Show available profiles and select the appropriate one
omni config show
```
If multiple profiles exist, ask the user which to use, then switch:
```bash
omni config use <profile-name>
```

You also need a **model ID** and an **eval set** — a file of test cases with prompts and expected query structures. See the [Eval Design Guide](https://docs.omni.co/ai/eval-design-guide) for best practices on building eval sets.

## Discovering Commands
```bash
omni ai --help    # AI operations (generate-query, jobs, pick-topic)
```

Tip: Use `-o json` to force structured output for programmatic parsing, or `-o human` for readable tables. The default is `auto` (human in a TTY, JSON when piped).
## Eval Input Format
Each eval case pairs a natural language prompt with the expected query structure. JSONL (one JSON object per line) works well for bulk runs:
```jsonl
{"id": "rev-by-month", "prompt": "Show me revenue by month", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["order_items.created_at[month]", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}]}, "tags": ["time-series"]}
{"id": "top-customers", "prompt": "Top 10 customers by spend", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["users.name", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.total_revenue", "sort_descending": true}]}, "tags": ["top-n"]}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the eval case |
| `prompt` | Yes | Natural language question to send to AI |
| `modelId` | Yes | Target model UUID |
| `expected` | Yes | Object with the expected `topic`, `fields`, `filters`, and `sorts` |
| `branchId` | No | Branch to test against |
| `currentTopicName` | No | Constrain to a specific topic |
| `tags` | No | Array of tags for filtering/grouping results |
Note: JSONL is shown here, but any structured format works — CSV, JSON arrays, YAML — as long as you can iterate over cases and extract these fields.
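Before a bulk run it can help to load and sanity-check the eval set. A minimal TypeScript sketch (the `EvalCase` type and `loadEvalCases` helper are illustrative, not part of the Omni CLI; field names follow the table above):

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  modelId: string;
  expected: { topic: string; fields: string[]; filters?: Record<string, unknown>; sorts?: unknown[] };
  branchId?: string;
  tags?: string[];
}

// Parse a JSONL string into eval cases, failing fast on malformed or incomplete lines.
function loadEvalCases(jsonl: string): EvalCase[] {
  return jsonl
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map((line, i) => {
      const c = JSON.parse(line);
      for (const key of ["id", "prompt", "modelId", "expected"]) {
        if (!(key in c)) throw new Error(`line ${i + 1}: missing required field "${key}"`);
      }
      return c as EvalCase;
    });
}
```

Failing before any API calls are made keeps a long eval run from dying halfway through on a malformed case.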
## Running Evals: Fast Path (Generate Query API)
The synchronous generate-query endpoint is the fastest way to eval query generation. Pass `--run-query false` to get only the generated query JSON without executing it against the database.

### Single Eval Call
```bash
omni ai generate-query your-model-id "Show me revenue by month" --run-query false
```

### Response Structure
```json
{
  "query": {
    "fields": ["order_items.created_at[month]", "order_items.total_revenue"],
    "table": "order_items",
    "filters": {},
    "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}],
    "limit": 500
  },
  "topic": "order_items",
  "error": null
}
```

### Request Parameters
| Arg/Flag | Required | Description |
|---|---|---|
| `<model-id>` | Yes | UUID of the Omni model (positional arg) |
| `<prompt>` | Yes | Natural language question (positional arg) |
| `--run-query` | No | Set to `false` to return the query JSON without executing it |
| `--branch-id` | No | Branch UUID for branch-specific testing |
| `--current-topic-name` | No | Constrain topic selection to a specific topic |
### Batch Loop (bash)
```bash
while IFS= read -r line; do
  id=$(echo "$line" | jq -r '.id')
  prompt=$(echo "$line" | jq -r '.prompt')
  model_id=$(echo "$line" | jq -r '.modelId')
  branch_id=$(echo "$line" | jq -r '.branchId // empty')
  branch_flag=""
  if [ -n "$branch_id" ]; then
    branch_flag="--branch-id $branch_id"
  fi
  result=$(omni ai generate-query "$model_id" "$prompt" --run-query false $branch_flag --compact)
  echo "{\"id\": \"$id\", \"generated\": $result}" >> eval_results.jsonl
done < eval_cases.jsonl
```

## Running Evals: Agentic Path (AI Jobs API)
Use the async AI Jobs API when you want to test the full agentic workflow — multi-step analysis, tool use, and topic selection as Blobby would actually behave in production.
### Submit a Job
```bash
omni ai job-submit your-model-id "Show me revenue by month"
```

Response:

```json
{
  "jobId": "job-uuid",
  "conversationId": "conv-uuid",
  "omniChatUrl": "https://yourorg.omniapp.co/chat/..."
}
```

### Poll for Completion
```bash
omni ai job-status <jobId>
```

Status progression: `QUEUED` → `EXECUTING` → `COMPLETE` (or `FAILED`). Poll with backoff (e.g., 2s, 4s, 8s) until the `state` is terminal.
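The backoff loop can be sketched generically. In this sketch, `fetchStatus` is a stand-in you would implement by shelling out to `omni ai job-status <jobId>` and reading the `state` field (the helper name and signature are illustrative, not part of the CLI):

```typescript
type JobState = "QUEUED" | "EXECUTING" | "COMPLETE" | "FAILED";

// Poll a status function with exponential backoff until a terminal state is reached.
async function pollUntilTerminal(
  fetchStatus: () => Promise<JobState>,
  initialDelayMs = 2000,
  maxAttempts = 8,
): Promise<JobState> {
  let delay = initialDelayMs;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await fetchStatus();
    if (state === "COMPLETE" || state === "FAILED") return state;
    await new Promise(resolve => setTimeout(resolve, delay));
    delay *= 2; // 2s, 4s, 8s, ...
  }
  throw new Error("job did not reach a terminal state in time");
}
```

Capping attempts keeps a stuck job from polling forever; failing loudly is better than an eval run that hangs.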
### Get Result
```bash
omni ai job-result <jobId>
```

The result contains an `actions` array. Look for actions with `type: "generate_query"` to extract the query JSON:

```json
{
  "actions": [
    {
      "type": "generate_query",
      "message": "Querying revenue by month...",
      "result": {
        "queryName": "Revenue by Month",
        "query": { "fields": [...], "table": "...", "filters": {...} },
        "status": "success",
        "totalRowCount": 12
      }
    }
  ],
  "topic": "order_items",
  "resultSummary": "Here are the monthly revenue figures..."
}
```
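A small helper can pull the generated queries out of a result shaped like the above (the `extractGeneratedQueries` name and the minimal `JobAction` type are illustrative):

```typescript
interface JobAction {
  type: string;
  result?: { query?: unknown; status?: string };
}

// Pull every generated query out of a job result's actions array.
// Agentic runs may emit multiple generate_query actions, so this returns an array.
function extractGeneratedQueries(jobResult: { actions: JobAction[] }): unknown[] {
  return jobResult.actions
    .filter(a => a.type === "generate_query" && a.result?.query !== undefined)
    .map(a => a.result!.query);
}
```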
## When to Use Which Path
| Criterion | Generate Query (Fast) | AI Jobs (Agentic) |
|---|---|---|
| Speed | Synchronous, fast | Async, slower |
| Volume | High-volume runs | Lower volume |
| Scope | Query generation only | Full agent workflow |
| Use case | Field/filter accuracy | End-to-end behavior |
| Multi-step | Single query | May generate multiple queries |
## Testing Topic Selection
Eval topic selection independently with the pick-topic endpoint:
```bash
omni ai pick-topic your-model-id "How many users signed up last month?"
```

Response:

```json
{
  "topicId": "users"
}
```

This lets you score topic selection accuracy as a separate dimension — useful when topic selection is a known weak point.
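Scored over a suite, topic selection is a simple hit rate. A sketch, where `pickedTopics` is assumed to be built from the `topicId` values returned by pick-topic (the helper name and shapes are illustrative):

```typescript
// Fraction of cases whose picked topic matches the expected topic.
function topicAccuracy(
  cases: { id: string; expectedTopic: string }[],
  pickedTopics: Record<string, string>, // case id -> topicId returned by pick-topic
): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c => pickedTopics[c.id] === c.expectedTopic).length;
  return hits / cases.length;
}
```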
## Scoring: Structural Query Comparison
Compare the generated query JSON against the expected query across four dimensions:
| Dimension | Comparison Method | Scoring |
|---|---|---|
| Topic | Exact string match | pass/fail |
| Fields | Set comparison (order-independent) | pass/fail + similarity score |
| Filters | Key-value match (key present + value match) | pass/fail per filter key |
| Sorts | Ordered array comparison | pass/fail |
### Example Comparison Logic (TypeScript)
```typescript
function scoreEval(expected: any, generated: any) {
  // Topic: exact match
  const topicPass = generated.topic === expected.topic;

  // Fields: set comparison (order-independent)
  const expectedFields = new Set(expected.fields);
  const generatedFields = new Set(generated.query.fields);
  const missing = [...expectedFields].filter(f => !generatedFields.has(f));
  const extra = [...generatedFields].filter(f => !expectedFields.has(f));
  const fieldsPass = missing.length === 0 && extra.length === 0;

  // Filters: key-value match
  const expectedFilters = expected.filters || {};
  const generatedFilters = generated.query.filters || {};
  const missingKeys = Object.keys(expectedFilters).filter(k => !(k in generatedFilters));
  const wrongValues = Object.keys(expectedFilters)
    .filter(k => k in generatedFilters && generatedFilters[k] !== expectedFilters[k]);
  const filtersPass = missingKeys.length === 0 && wrongValues.length === 0;

  // Sorts: ordered comparison
  const sortsPass = JSON.stringify(expected.sorts || []) ===
    JSON.stringify(generated.query.sorts || []);

  return {
    topic: topicPass,
    fields: { pass: fieldsPass, missing, extra },
    filters: { pass: filtersPass, missingKeys, wrongValues },
    sorts: sortsPass,
    allPass: topicPass && fieldsPass && filtersPass && sortsPass,
  };
}
```

### Aggregate Scoring
Compute pass rates across all eval cases:
```
Eval Results: 47/50 passed (94.0%)
  Topic:   49/50 (98.0%)
  Fields:  47/50 (94.0%)
  Filters: 48/50 (96.0%)
  Sorts:   50/50 (100.0%)
```

Per-dimension rates help pinpoint where accuracy is weakest — if topic accuracy is high but filter accuracy is low, focus improvements on filter-related `ai_context` guidance.
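Given per-case results shaped like the output of `scoreEval` above, the aggregate numbers are per-dimension pass counts. A minimal sketch (the `aggregate` helper is illustrative):

```typescript
interface CaseScore {
  topic: boolean;
  fields: { pass: boolean };
  filters: { pass: boolean };
  sorts: boolean;
  allPass: boolean;
}

// Compute overall and per-dimension pass rates (as percentages) across all scored cases.
function aggregate(scores: CaseScore[]) {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  const rate = (passed: number) => (100 * passed) / scores.length;
  return {
    overall: rate(scores.filter(s => s.allPass).length),
    topic: rate(scores.filter(s => s.topic).length),
    fields: rate(scores.filter(s => s.fields.pass).length),
    filters: rate(scores.filter(s => s.filters.pass).length),
    sorts: rate(scores.filter(s => s.sorts).length),
  };
}
```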
## A/B Comparison
Run the same eval suite with one variable changed to measure impact. This is the core workflow for understanding whether a change improves or degrades AI accuracy.
### Common Variables to Compare
- Model branches — pass different `--branch-id` values to test context changes on a branch before merging
- Topic scope — `--current-topic-name "orders"` vs omitted (auto-select)
- Model context changes — `ai_context`, `sample_queries`, field descriptions (apply via `omni-model-builder` on a branch, then eval against that branch)
- Prompt wording — same expected query, different prompt text
- AI configuration — model type, thinking level, or other AI parameters
### Workflow
- Run eval suite with configuration A → save as `results_a.jsonl`
- Run eval suite with configuration B → save as `results_b.jsonl`
- Score both result sets
- Compare side-by-side, checking for regressions
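The final side-by-side check can be sketched as a diff over per-case pass/fail maps (the `diffRuns` helper and `PassMap` shape are illustrative):

```typescript
type PassMap = Record<string, boolean>; // case id -> allPass

// Compare two scored runs: cases that regressed (pass in A, fail in B)
// and cases that improved (fail in A, pass in B).
function diffRuns(a: PassMap, b: PassMap) {
  const ids = Object.keys(a).filter(id => id in b);
  return {
    regressions: ids.filter(id => a[id] && !b[id]),
    improvements: ids.filter(id => !a[id] && b[id]),
  };
}
```

Restricting the diff to ids present in both runs avoids counting added or removed cases as regressions.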
### Example Comparison Output
```
A/B Comparison: main vs branch/new-context
                    A (main)   B (new-context)   Delta
Overall pass rate:  88.0%      94.0%             +6.0%
Topic accuracy:     96.0%      98.0%             +2.0%
Field accuracy:     90.0%      94.0%             +4.0%
Filter accuracy:    88.0%      96.0%             +8.0%

Regressions (passed in A, failed in B):
  - rev-by-quarter: fields missing order_items.total_revenue

Improvements (failed in A, passed in B):
  - customer-count: topic now correctly selects users
  - top-products: filters now include status=complete
```

Important: Always check for regressions, not just overall improvement. A net improvement that breaks previously-correct cases may indicate an `ai_context` conflict.
## Snapshotting Model State
Before running evals, snapshot the model definition so results are reproducible:
```bash
# Save model YAML
omni models yaml-get <modelId> --compact > model_snapshot_$(date +%Y%m%d).json

# Validate model integrity
omni models validate <modelId>
```

Version your eval set alongside model snapshots so you can trace which model state produced which scores.

## Known Issues & Gotchas
- Filter comparison can be complex — Omni supports rich filter expressions (`"last 7 days"`, `"between 10 and 100"`, `"not null"`). The structural comparison above uses exact string match on filter values. If the AI produces semantically equivalent but syntactically different expressions, you may see false failures. Consider normalizing common patterns or using a Jaccard threshold.
- AI Jobs are async — poll with exponential backoff. Don't hammer the status endpoint.
- Rate limiting — for high-volume eval runs, add a small delay between calls or batch requests.
- `limit` field may vary — the AI may choose different limits than expected. Consider excluding `limit` from strict comparison if it's not critical to your eval.
- `table` vs `topic` — the generate-query response returns `topic` as a top-level field and `table` inside the query object. These usually match but aren't always identical. Compare against the top-level `topic`.
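The Jaccard-threshold idea from the first bullet can be sketched as token-set similarity over filter expressions. This is a loose heuristic under the assumption that near-equivalent expressions share most tokens, not an Omni feature (the helper names and the 0.5 threshold are illustrative):

```typescript
// Token-set Jaccard similarity between two filter expressions,
// e.g. "last 7 days" vs "past 7 days" -> 0.5.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1;
  const intersection = [...ta].filter(t => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return intersection / union;
}

// Accept filter values as equivalent when similarity clears a threshold,
// instead of requiring an exact string match.
function filterValuesMatch(expected: string, generated: string, threshold = 0.5): boolean {
  return expected === generated || jaccard(expected, generated) >= threshold;
}
```

Tune the threshold against known-good pairs from your own eval failures before trusting it.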
## Docs Reference
## Related Skills
- `omni-query` — run golden queries to validate expected results
- `omni-model-explorer` — discover topics and fields for building eval cases
- `omni-ai-optimizer` — improve AI accuracy based on eval findings
- `omni-model-builder` — apply context changes on branches before A/B testing
- omni-query——运行标准查询以验证预期结果
- omni-model-explorer——发现用于构建评估案例的主题和字段
- omni-ai-optimizer——根据评估结果提升AI准确性
- omni-model-builder——在A/B测试前在分支上应用上下文变更