# omni-ai-eval
Run evals against Omni's AI query generation APIs — submit test prompts, capture the generated query JSON, compare it against expected results, and score accuracy across dimensions.
Tip: Use `omni-ai-optimizer` to improve scores after identifying failures, and `omni-model-explorer` to discover available topics and fields for building eval cases.
## Prerequisites
```bash
# Verify the Omni CLI is installed — if not, ask the user to install it
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."

# Show available profiles and select the appropriate one
omni config show
```
If multiple profiles exist, ask the user which to use, then switch:
```bash
omni config use <profile-name>
```

You also need a **model ID** and an **eval set** — a file of test cases with prompts and expected query structures. See the [Eval Design Guide](https://docs.omni.co/ai/eval-design-guide) for best practices on building eval sets.

## Discovering Commands
```bash
omni ai --help    # AI operations (generate-query, jobs, pick-topic)
```

Tip: Use `-o json` to force structured output for programmatic parsing, or `-o human` for readable tables. The default is `auto` (human in a TTY, JSON when piped).
## Eval Input Format
Each eval case pairs a natural language prompt with the expected query structure. JSONL (one JSON object per line) works well for bulk runs:
```jsonl
{"id": "rev-by-month", "prompt": "Show me revenue by month", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["order_items.created_at[month]", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}]}, "tags": ["time-series"]}
{"id": "top-customers", "prompt": "Top 10 customers by spend", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["users.name", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.total_revenue", "sort_descending": true}]}, "tags": ["top-n"]}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the eval case |
| `prompt` | Yes | Natural language question to send to AI |
| `modelId` | Yes | Target model UUID |
| `expected` | Yes | Object with the expected `topic`, `fields`, `filters`, and `sorts` |
| `branchId` | No | Branch to test against |
| `currentTopicName` | No | Constrain to a specific topic |
| `tags` | No | Array of tags for filtering/grouping results |
Note: JSONL is shown here, but any structured format works — CSV, JSON arrays, YAML — as long as you can iterate over cases and extract these fields.
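Before a bulk run it can help to load and sanity-check the eval set. A minimal TypeScript sketch (the `EvalCase` type and `loadEvalCases` helper are illustrative, not part of the Omni CLI; field names follow the table above):

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  modelId: string;
  expected: { topic: string; fields: string[]; filters?: Record<string, unknown>; sorts?: unknown[] };
  branchId?: string;
  tags?: string[];
}

// Parse a JSONL string into eval cases, failing fast on malformed or incomplete lines.
function loadEvalCases(jsonl: string): EvalCase[] {
  return jsonl
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map((line, i) => {
      const c = JSON.parse(line);
      for (const key of ["id", "prompt", "modelId", "expected"]) {
        if (!(key in c)) throw new Error(`line ${i + 1}: missing required field "${key}"`);
      }
      return c as EvalCase;
    });
}
```

Failing before any API calls are made keeps a long eval run from dying halfway through on a malformed case.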
## Running Evals: Fast Path (Generate Query API)
The synchronous generate-query endpoint is the fastest way to eval query generation. Pass `--run-query false` to get only the generated query JSON without executing it against the database.

### Single Eval Call
```bash
omni ai generate-query your-model-id "Show me revenue by month" --run-query false
```

### Response Structure
```json
{
  "query": {
    "fields": ["order_items.created_at[month]", "order_items.total_revenue"],
    "table": "order_items",
    "filters": {},
    "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}],
    "limit": 500
  },
  "topic": "order_items",
  "error": null
}
```

### Request Parameters
| Arg/Flag | Required | Description |
|---|---|---|
| `<model-id>` | Yes | UUID of the Omni model (positional arg) |
| `<prompt>` | Yes | Natural language question (positional arg) |
| `--run-query` | No | Set to `false` to return the query JSON without executing it |
| `--branch-id` | No | Branch UUID for branch-specific testing |
| `--current-topic-name` | No | Constrain topic selection to a specific topic |
### Batch Loop (bash)
```bash
while IFS= read -r line; do
  id=$(echo "$line" | jq -r '.id')
  prompt=$(echo "$line" | jq -r '.prompt')
  model_id=$(echo "$line" | jq -r '.modelId')
  branch_id=$(echo "$line" | jq -r '.branchId // empty')
  branch_flag=""
  if [ -n "$branch_id" ]; then
    branch_flag="--branch-id $branch_id"
  fi
  result=$(omni ai generate-query "$model_id" "$prompt" --run-query false $branch_flag --compact)
  echo "{\"id\": \"$id\", \"generated\": $result}" >> eval_results.jsonl
done < eval_cases.jsonl
```

## Running Evals: Agentic Path (AI Jobs API)
Use the async AI Jobs API when you want to test the full agentic workflow — multi-step analysis, tool use, and topic selection as Blobby would actually behave in production.
### Submit a Job
```bash
omni ai job-submit your-model-id "Show me revenue by month"
```

Response:

```json
{
  "jobId": "job-uuid",
  "conversationId": "conv-uuid",
  "omniChatUrl": "https://yourorg.omniapp.co/chat/..."
}
```

### Poll for Completion
```bash
omni ai job-status <jobId>
```

Status progression: `QUEUED` → `EXECUTING` → `COMPLETE` (or `FAILED`). Poll with backoff (e.g., 2s, 4s, 8s) until the `state` is terminal.
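The backoff loop can be sketched generically. In this sketch, `fetchStatus` is a stand-in you would implement by shelling out to `omni ai job-status <jobId>` and reading the `state` field (the helper name and signature are illustrative, not part of the CLI):

```typescript
type JobState = "QUEUED" | "EXECUTING" | "COMPLETE" | "FAILED";

// Poll a status function with exponential backoff until a terminal state is reached.
async function pollUntilTerminal(
  fetchStatus: () => Promise<JobState>,
  initialDelayMs = 2000,
  maxAttempts = 8,
): Promise<JobState> {
  let delay = initialDelayMs;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await fetchStatus();
    if (state === "COMPLETE" || state === "FAILED") return state;
    await new Promise(resolve => setTimeout(resolve, delay));
    delay *= 2; // 2s, 4s, 8s, ...
  }
  throw new Error("job did not reach a terminal state in time");
}
```

Capping attempts keeps a stuck job from polling forever; failing loudly is better than an eval run that hangs.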
### Get Result
```bash
omni ai job-result <jobId>
```

The result contains an `actions` array. Look for actions with `type: "generate_query"` to extract the query JSON:

```json
{
  "actions": [
    {
      "type": "generate_query",
      "message": "Querying revenue by month...",
      "result": {
        "queryName": "Revenue by Month",
        "query": { "fields": [...], "table": "...", "filters": {...} },
        "status": "success",
        "totalRowCount": 12
      }
    }
  ],
  "topic": "order_items",
  "resultSummary": "Here are the monthly revenue figures..."
}
```
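A small helper can pull the generated queries out of a result shaped like the above (the `extractGeneratedQueries` name and the minimal `JobAction` type are illustrative):

```typescript
interface JobAction {
  type: string;
  result?: { query?: unknown; status?: string };
}

// Pull every generated query out of a job result's actions array.
// Agentic runs may emit multiple generate_query actions, so this returns an array.
function extractGeneratedQueries(jobResult: { actions: JobAction[] }): unknown[] {
  return jobResult.actions
    .filter(a => a.type === "generate_query" && a.result?.query !== undefined)
    .map(a => a.result!.query);
}
```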
## When to Use Which Path
| Criterion | Generate Query (Fast) | AI Jobs (Agentic) |
|---|---|---|
| Speed | Synchronous, fast | Async, slower |
| Volume | High-volume runs | Lower volume |
| Scope | Query generation only | Full agent workflow |
| Use case | Field/filter accuracy | End-to-end behavior |
| Multi-step | Single query | May generate multiple queries |
## Testing Topic Selection
Eval topic selection independently with the pick-topic endpoint:
```bash
omni ai pick-topic your-model-id "How many users signed up last month?"
```

Response:

```json
{
  "topicId": "users"
}
```

This lets you score topic selection accuracy as a separate dimension — useful when topic selection is a known weak point.
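Scored over a suite, topic selection is a simple hit rate. A sketch, where `pickedTopics` is assumed to be built from the `topicId` values returned by pick-topic (the helper name and shapes are illustrative):

```typescript
// Fraction of cases whose picked topic matches the expected topic.
function topicAccuracy(
  cases: { id: string; expectedTopic: string }[],
  pickedTopics: Record<string, string>, // case id -> topicId returned by pick-topic
): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c => pickedTopics[c.id] === c.expectedTopic).length;
  return hits / cases.length;
}
```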
## Scoring: Structural Query Comparison
Compare the generated query JSON against the expected query across four dimensions:
| Dimension | Comparison Method | Scoring |
|---|---|---|
| Topic | Exact string match | pass/fail |
| Fields | Set comparison (order-independent) | pass/fail + similarity score |
| Filters | Key-value match (key present + value match) | pass/fail per filter key |
| Sorts | Ordered array comparison | pass/fail |
### Example Comparison Logic (TypeScript)
```typescript
function scoreEval(expected: any, generated: any) {
  // Topic: exact match
  const topicPass = generated.topic === expected.topic;

  // Fields: set comparison (order-independent)
  const expectedFields = new Set(expected.fields);
  const generatedFields = new Set(generated.query.fields);
  const missing = [...expectedFields].filter(f => !generatedFields.has(f));
  const extra = [...generatedFields].filter(f => !expectedFields.has(f));
  const fieldsPass = missing.length === 0 && extra.length === 0;

  // Filters: key-value match
  const expectedFilters = expected.filters || {};
  const generatedFilters = generated.query.filters || {};
  const missingKeys = Object.keys(expectedFilters).filter(k => !(k in generatedFilters));
  const wrongValues = Object.keys(expectedFilters)
    .filter(k => k in generatedFilters && generatedFilters[k] !== expectedFilters[k]);
  const filtersPass = missingKeys.length === 0 && wrongValues.length === 0;

  // Sorts: ordered comparison
  const sortsPass = JSON.stringify(expected.sorts || []) ===
    JSON.stringify(generated.query.sorts || []);

  return {
    topic: topicPass,
    fields: { pass: fieldsPass, missing, extra },
    filters: { pass: filtersPass, missingKeys, wrongValues },
    sorts: sortsPass,
    allPass: topicPass && fieldsPass && filtersPass && sortsPass,
  };
}
```

### Aggregate Scoring
Compute pass rates across all eval cases:
```
Eval Results: 47/50 passed (94.0%)
  Topic:   49/50 (98.0%)
  Fields:  47/50 (94.0%)
  Filters: 48/50 (96.0%)
  Sorts:   50/50 (100.0%)
```

Per-dimension rates help pinpoint where accuracy is weakest — if topic accuracy is high but filter accuracy is low, focus improvements on filter-related `ai_context` guidance.
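Given per-case results shaped like the output of `scoreEval` above, the aggregate numbers are per-dimension pass counts. A minimal sketch (the `aggregate` helper is illustrative):

```typescript
interface CaseScore {
  topic: boolean;
  fields: { pass: boolean };
  filters: { pass: boolean };
  sorts: boolean;
  allPass: boolean;
}

// Compute overall and per-dimension pass rates (as percentages) across all scored cases.
function aggregate(scores: CaseScore[]) {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  const rate = (passed: number) => (100 * passed) / scores.length;
  return {
    overall: rate(scores.filter(s => s.allPass).length),
    topic: rate(scores.filter(s => s.topic).length),
    fields: rate(scores.filter(s => s.fields.pass).length),
    filters: rate(scores.filter(s => s.filters.pass).length),
    sorts: rate(scores.filter(s => s.sorts).length),
  };
}
```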
## A/B Comparison
Run the same eval suite with one variable changed to measure impact. This is the core workflow for understanding whether a change improves or degrades AI accuracy.
### Common Variables to Compare
- Model branches — pass different `--branch-id` values to test context changes on a branch before merging
- Topic scope — `--current-topic-name "orders"` vs omitted (auto-select)
- Model context changes — `ai_context`, `sample_queries`, field descriptions (apply via `omni-model-builder` on a branch, then eval against that branch)
- Prompt wording — same expected query, different prompt text
- AI configuration — model type, thinking level, or other AI parameters
### Workflow
- Run eval suite with configuration A → save as `results_a.jsonl`
- Run eval suite with configuration B → save as `results_b.jsonl`
- Score both result sets
- Compare side-by-side, checking for regressions
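The final side-by-side check can be sketched as a diff over per-case pass/fail maps (the `diffRuns` helper and `PassMap` shape are illustrative):

```typescript
type PassMap = Record<string, boolean>; // case id -> allPass

// Compare two scored runs: cases that regressed (pass in A, fail in B)
// and cases that improved (fail in A, pass in B).
function diffRuns(a: PassMap, b: PassMap) {
  const ids = Object.keys(a).filter(id => id in b);
  return {
    regressions: ids.filter(id => a[id] && !b[id]),
    improvements: ids.filter(id => !a[id] && b[id]),
  };
}
```

Restricting the diff to ids present in both runs avoids counting added or removed cases as regressions.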
### Example Comparison Output
```
A/B Comparison: main vs branch/new-context
                    A (main)   B (new-context)   Delta
Overall pass rate:  88.0%      94.0%             +6.0%
Topic accuracy:     96.0%      98.0%             +2.0%
Field accuracy:     90.0%      94.0%             +4.0%
Filter accuracy:    88.0%      96.0%             +8.0%

Regressions (passed in A, failed in B):
  - rev-by-quarter: fields missing order_items.total_revenue

Improvements (failed in A, passed in B):
  - customer-count: topic now correctly selects users
  - top-products: filters now include status=complete
```

Important: Always check for regressions, not just overall improvement. A net improvement that breaks previously-correct cases may indicate an `ai_context` conflict.
## Snapshotting Model State
Before running evals, snapshot the model definition so results are reproducible:
```bash
# Save model YAML
omni models yaml-get <modelId> --compact > model_snapshot_$(date +%Y%m%d).json

# Validate model integrity
omni models validate <modelId>
```

Version your eval set alongside model snapshots so you can trace which model state produced which scores.

## Known Issues & Gotchas
- Filter comparison can be complex — Omni supports rich filter expressions (`"last 7 days"`, `"between 10 and 100"`, `"not null"`). The structural comparison above uses exact string match on filter values. If the AI produces semantically equivalent but syntactically different expressions, you may see false failures. Consider normalizing common patterns or using a Jaccard threshold.
- AI Jobs are async — poll with exponential backoff. Don't hammer the status endpoint.
- Rate limiting — for high-volume eval runs, add a small delay between calls or batch requests.
- `limit` field may vary — the AI may choose different limits than expected. Consider excluding `limit` from strict comparison if it's not critical to your eval.
- `table` vs `topic` — the generate-query response returns `topic` as a top-level field and `table` inside the query object. These usually match but aren't always identical. Compare against the top-level `topic`.
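The Jaccard-threshold idea from the first bullet can be sketched as token-set similarity over filter expressions. This is a loose heuristic under the assumption that near-equivalent expressions share most tokens, not an Omni feature (the helper names and the 0.5 threshold are illustrative):

```typescript
// Token-set Jaccard similarity between two filter expressions,
// e.g. "last 7 days" vs "past 7 days" -> 0.5.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1;
  const intersection = [...ta].filter(t => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return intersection / union;
}

// Accept filter values as equivalent when similarity clears a threshold,
// instead of requiring an exact string match.
function filterValuesMatch(expected: string, generated: string, threshold = 0.5): boolean {
  return expected === generated || jaccard(expected, generated) >= threshold;
}
```

Tune the threshold against known-good pairs from your own eval failures before trusting it.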
## Docs Reference
## Related Skills
- `omni-query` — run golden queries to validate expected results
- `omni-model-explorer` — discover topics and fields for building eval cases
- `omni-ai-optimizer` — improve AI accuracy based on eval findings
- `omni-model-builder` — apply context changes on branches before A/B testing
- omni-query——运行标准查询以验证预期结果
- omni-model-explorer——发现用于构建评估案例的主题和字段
- omni-ai-optimizer——根据评估结果提升AI准确性
- omni-model-builder——在A/B测试前在分支上应用上下文变更