exploring-llm-evaluations
Exploring LLM evaluations
PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types,
both first-class:

- `hog` — deterministic Hog code that returns `true`/`false` (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
- `llm_judge` — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.

Results from both types land in ClickHouse as `$ai_evaluation` events with the same
schema, so the read/query/summary workflows are identical regardless of evaluator type —
the only thing that changes is whether `$ai_evaluation_reasoning` was written by Hog
code or by an LLM.

This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or
LLM judge), run them on specific generations, query individual results, and get an
AI-generated summary of pass/fail/N/A patterns across many runs.
Tools
| Tool | Purpose |
|---|---|
| `llma-evaluation-list` | List/search evaluation configs (filter by name, enabled flag) |
| `llma-evaluation-get` | Get a single evaluation config by UUID |
| `llma-evaluation-create` | Create a new evaluation |
| `llma-evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| | Soft-delete an evaluation |
| `llma-evaluation-run` | Run an evaluation against a specific `$ai_generation` |
| `llma-evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `llma-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `query-llm-trace` | Drill into the underlying generation that an evaluation scored |

All `llma-evaluation-*` tools are defined in `products/llm_analytics/mcp/tools.yaml`.

Event schema
Every run of an evaluation emits an `$ai_evaluation` event. Key properties:

| Property | Meaning |
|---|---|
| `$ai_evaluation_id` | UUID of the evaluation config |
| | Human-readable name |
| `$ai_target_event_id` | UUID of the `$ai_generation` being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | Boolean outcome of the run (`true` = pass, `false` = fail) |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator judged the generation not applicable |

When `$ai_evaluation_applicable = false`, the run counts as N/A regardless of `$ai_evaluation_result`.
For evaluations that don't support N/A, this property may be `null` — treat null as "applicable".
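The bucketing rule described above can be sketched as a small helper. This is an illustration of the documented rule, not the backend implementation:

```python
def bucket_run(result, applicable):
    """Bucket one $ai_evaluation run as 'pass', 'fail', or 'na'.

    Mirrors the documented rule: $ai_evaluation_applicable = false wins
    over $ai_evaluation_result, and null (None) is treated as "applicable".
    """
    if applicable is False:
        # Explicit false: the run counts as N/A regardless of the result.
        return "na"
    # None (null) or true: the result decides pass vs fail.
    return "pass" if result else "fail"

assert bucket_run(True, None) == "pass"   # null applicable -> treated as applicable
assert bucket_run(True, False) == "na"    # applicable = false wins over the result
assert bucket_run(False, True) == "fail"
```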
Workflow: investigate why an evaluation is failing

Works the same way for `hog` and `llm_judge` evaluations — the differences only matter
when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).

Step 1 — Find the evaluation

```json
posthog:llma-evaluation-list
{ "search": "hallucination", "enabled": true }
```

Look at the returned `id`, `name`, `evaluation_type`, and either:

- `evaluation_config.prompt` for an `llm_judge`
- `evaluation_config.source` for a `hog` evaluator

The Hog source is the ground truth for why a `hog` evaluator passes or fails — read it
before assuming the failure is in the generation.
Step 2 — Get the AI-generated summary

```json
posthog:llma-evaluation-summary-create
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```

Returns:

- `overall_assessment` — natural-language summary
- `fail_patterns` — grouped patterns with `title`, `description`, `frequency`, and `example_generation_ids`
- `pass_patterns` and `na_patterns` — same shape, populated when `filter` includes them
- `recommendations` — actionable next steps
- `statistics` — `total_analyzed`, `pass_count`, `fail_count`, `na_count`

The endpoint analyses the most recent ~250 runs (`EVALUATION_SUMMARY_MAX_RUNS`).
Results are cached for one hour per `(evaluation_id, filter, set_of_generation_ids)`.
Pass `force_refresh: true` to recompute.

Compare filters in two calls to spot what's distinctive about failures vs passes:

```json
posthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "pass" }
```

Then diff the `pass_patterns` against the `fail_patterns` from Step 2.
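One way to do that diff is to compare pattern titles across the two responses. A minimal sketch, assuming only the documented pattern shape (`title`, `frequency`); overlap suggests the evaluator is ambiguous, since similar content lands on both sides:

```python
def overlapping_titles(pass_summary, fail_summary):
    """Return pattern titles that appear in both pass_patterns and fail_patterns."""
    pass_titles = {p["title"].lower() for p in pass_summary.get("pass_patterns", [])}
    fail_titles = {p["title"].lower() for p in fail_summary.get("fail_patterns", [])}
    return sorted(pass_titles & fail_titles)

# Hypothetical summary payloads for illustration:
pass_summary = {"pass_patterns": [{"title": "Short factual answers", "frequency": 12}]}
fail_summary = {"fail_patterns": [{"title": "Short factual answers", "frequency": 9},
                                  {"title": "Multi-turn follow-ups", "frequency": 4}]}
assert overlapping_titles(pass_summary, fail_summary) == ["short factual answers"]
```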
Step 3 — Drill into example failing runs

Each pattern surfaces `example_generation_ids`. Pull the underlying trace for the most
representative example:

```json
posthog:query-llm-trace
{ "traceId": "<trace_id>", "dateRange": {"date_from": "-30d"} }
```

(If you only have a generation ID, query for it via `execute-sql` first to find the
parent trace ID — see below.)
Step 4 — Verify the pattern with raw SQL

The summary is LLM-generated and should be verified. Use `execute-sql` to count and
spot-check:

```sql
posthog:execute-sql
SELECT
    properties.$ai_target_event_id AS generation_id,
    properties.$ai_trace_id AS trace_id,
    properties.$ai_evaluation_reasoning AS reasoning,
    timestamp
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND properties.$ai_evaluation_result = false
    AND (
        properties.$ai_evaluation_applicable IS NULL
        OR properties.$ai_evaluation_applicable != false
    )
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 25
```

The N/A guard (`IS NULL OR != false`) is important — it matches the same logic the
backend uses to bucket runs.

Workflow: run an evaluation against a specific generation
Use this when the user pastes a trace/generation URL and asks "what would evaluation X
say about this?".

```json
posthog:llma-evaluation-run
{
  "evaluationId": "<eval_uuid>",
  "target_event_id": "<generation_event_uuid>",
  "timestamp": "2026-04-01T19:39:20Z",
  "event": "$ai_generation"
}
```

The `timestamp` is required for an efficient ClickHouse lookup of the target event.
Pass `distinct_id` if you have it — it speeds up the lookup further.

Workflow: build and test a new evaluator
Hog evaluator (deterministic, code-based)

Reach for this first when the criterion is rule-based — it's cheaper, faster, and
reproducible. Prototype with `llma-evaluation-test-hog` (no save):

```json
posthog:llma-evaluation-test-hog
{
  "source": "return event.properties.$ai_output_choices[1].content contains 'sorry';",
  "sample_count": 5,
  "allows_na": false
}
```

The handler returns the boolean result for each of the most recent N `$ai_generation`
events. Iterate on the source until it behaves as expected, then promote it via
`llma-evaluation-create`:

```json
posthog:llma-evaluation-create
{
  "name": "Output is valid JSON",
  "description": "Fails when the assistant message can't be parsed as JSON",
  "evaluation_type": "hog",
  "evaluation_config": {
    "source": "let raw := event.properties.$ai_output_choices[1].content; try { jsonParseStr(raw); return true; } catch { return false; }"
  },
  "output_type": "boolean",
  "enabled": true
}
```

Hog evaluators have full access to the event and its properties — common patterns
include schema validation, length/token limits, regex matches, and tool-call shape
checks. Because they're deterministic, results are reproducible across reruns and
trivially diff-able.
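For reference, the "Output is valid JSON" Hog source above implements roughly the following check. This Python sketch is for illustration only (it assumes Hog's `[1]` addresses the first output choice, since Hog arrays are 1-indexed):

```python
import json

def output_is_valid_json(event: dict) -> bool:
    """Sketch of the 'Output is valid JSON' evaluator: take the first
    output choice's content and check that it parses as JSON."""
    try:
        raw = event["properties"]["$ai_output_choices"][0]["content"]
        json.loads(raw)
        return True
    except (KeyError, IndexError, TypeError, ValueError):
        return False

good = {"properties": {"$ai_output_choices": [{"content": '{"ok": true}'}]}}
bad = {"properties": {"$ai_output_choices": [{"content": "not json"}]}}
assert output_is_valid_json(good) is True
assert output_is_valid_json(bad) is False
```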
LLM-judge evaluator (subjective, prompt-based)

Use this when the criterion is fuzzy and a code rule would be brittle (tone, factuality,
helpfulness, on-topic-ness). There's no equivalent of `llma-evaluation-test-hog` for LLM
judges — the typical loop is to create the evaluator with `enabled: false`, run it
manually against a handful of representative generations via `llma-evaluation-run`, inspect
the results, refine the prompt with `llma-evaluation-update`, and then flip `enabled: true`
when you're satisfied:

```json
posthog:llma-evaluation-create
{
  "name": "Response stays on-topic",
  "description": "LLM judge — fails if the assistant changes topic from the user's question",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "You are evaluating whether the assistant's reply stays on-topic relative to the user's most recent question. Return true if it does, false if the assistant changed the subject. Return N/A if the user did not actually ask a question."
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "openai",
    "model": "gpt-5-mini"
  },
  "enabled": false
}
```

Then dry-run against a known-good and a known-bad generation:

```json
posthog:llma-evaluation-run
{
  "evaluationId": "<new_eval_uuid>",
  "target_event_id": "<generation_uuid>",
  "timestamp": "2026-04-01T19:39:20Z"
}
```

LLM judges require organisation AI data processing approval. Hog evaluators do not.
Workflow: manage the evaluation lifecycle

| Action | Tool |
|---|---|
| Add a Hog evaluator | `llma-evaluation-create` |
| Add an LLM-judge evaluator | `llma-evaluation-create` |
| Tweak the source or prompt | `llma-evaluation-update` |
| Toggle N/A handling | `llma-evaluation-update` |
| Disable temporarily | `llma-evaluation-update` |
| Remove | |

Creating an `llm_judge` evaluator requires `is_ai_data_processing_approved` at the organisation level.

When to use Hog vs LLM judge
Reach for Hog by default. Switch to LLM judge only when the criterion can't be
expressed as code.
| Use Hog when… | Use LLM judge when… |
|---|---|
| The check is structural (JSON parses, schema matches) | The check is about meaning (on-topic, helpful, factual) |
| You need a deterministic, reproducible result | A small amount of judgement variability is acceptable |
| The criterion is cheap to compute | The criterion requires reading and understanding text |
| You can't get AI data processing approval | You have approval and the criterion is genuinely fuzzy |
| You need to enforce a hard limit (length, cost, etc.) | You need to rate a quality dimension |
| You want sub-millisecond evaluation | A few hundred milliseconds + LLM cost are acceptable |
A common pattern is to layer them: a Hog evaluator gates obvious format/length
violations cheaply, and an LLM-judge evaluator only fires on the generations that pass
the Hog gate (via `conditions`).
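The layered pattern amounts to a two-stage decision. A minimal sketch, where `hog_gate` and `llm_judge` are hypothetical stand-ins for the two evaluator types:

```python
def evaluate_layered(generation, hog_gate, llm_judge):
    """Two-stage evaluation: a cheap deterministic gate runs first; the
    expensive LLM judge fires only for generations that pass the gate."""
    if not hog_gate(generation):
        return {"result": False, "source": "hog"}  # cheap fail, no LLM call spent
    return {"result": llm_judge(generation), "source": "llm_judge"}

# Hypothetical gate: enforce a hard length limit before spending an LLM call.
hog_gate = lambda g: len(g["output"]) <= 2000
# Stand-in for a real judge call (would be an LLM request in practice).
llm_judge = lambda g: "sorry" not in g["output"]

assert evaluate_layered({"output": "fine answer"}, hog_gate, llm_judge) == \
    {"result": True, "source": "llm_judge"}
assert evaluate_layered({"output": "x" * 3000}, hog_gate, llm_judge) == \
    {"result": False, "source": "hog"}
```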
conditionsInvestigation patterns
排查模式
The summarisation tool works the same way regardless of whether the evaluator is
or — it analyses the resulting events, not the evaluator
itself. The fix path differs (edit Hog source vs. edit prompt) but the diagnosis is
identical.
hogllm_judge$ai_evaluation总结工具的工作方式与评估器是还是无关——它分析的是生成的事件,而非评估器本身。修复路径不同(编辑Hog代码 vs 编辑提示词),但诊断流程一致。
hogllm_judge$ai_evaluation"Why is evaluation X suddenly failing more?"
"为什么评估X突然失败次数变多了?"
- `llma-evaluation-list` — confirm the evaluation is still enabled and unchanged (compare `evaluation_config.source` or `evaluation_config.prompt` to the version you expect)
- `llma-evaluation-summary-create` with `filter: "fail"` — get the dominant failure patterns and example IDs
- SQL count of fails per day to confirm the regression window:

  ```sql
  SELECT toDate(timestamp) AS day, count() AS fails
  FROM events
  WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND properties.$ai_evaluation_result = false
    AND timestamp >= now() - INTERVAL 30 DAY
  GROUP BY day
  ORDER BY day
  ```

- Drill into a representative trace per pattern via `query-llm-trace`
"Are passes and fails caused by the same root content?"
- Generate two summaries: one with `filter: "pass"`, one with `filter: "fail"`
- If `pass_patterns` and `fail_patterns` describe similar content:
  - For an `llm_judge`: the prompt or rubric is probably ambiguous — reword `evaluation_config.prompt` and use `llma-evaluation-update`
  - For a `hog` evaluator: the rule is probably under- or over-matching — read the source via `llma-evaluation-get`, narrow the predicate, and retest with `llma-evaluation-test-hog` before pushing the fix via `llma-evaluation-update`
"Did a Hog evaluator regress after a code change?"
Hog evaluators are reproducible — if the source hasn't changed, identical inputs should
yield identical outputs. When fail rates jump for a Hog evaluator:

- `llma-evaluation-get` — note the current source and `updated_at`
- Spot-check the latest failing runs with the SQL query from Step 4 above
- Re-run the source against those exact generations using `llma-evaluation-test-hog` with a modified `conditions` filter that targets them
- If the test results match the live results, the change is in the generations, not the evaluator (a model upgrade, prompt change upstream, etc.) — investigate the producer
- If they diverge, the evaluator was edited; check the edit history of the source field via the activity log
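The match/diverge comparison above is a set operation over run results. A sketch, where both inputs are hypothetical maps of generation ID to boolean result (live `$ai_evaluation` events vs. a fresh `llma-evaluation-test-hog` run over the same generations):

```python
def diverging_runs(live_results, test_results):
    """Compare live evaluation results against a fresh dry-run over the
    same generation IDs.

    Empty output: the evaluator behaves the same now as it did live, so the
    change is in the generations. Non-empty: the evaluator's behaviour changed
    between the live runs and now.
    """
    return {
        gen_id: (live, test_results[gen_id])
        for gen_id, live in live_results.items()
        if gen_id in test_results and test_results[gen_id] != live
    }

live = {"gen-1": False, "gen-2": False, "gen-3": True}
test = {"gen-1": False, "gen-2": True, "gen-3": True}
assert diverging_runs(live, test) == {"gen-2": (False, True)}
```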
"What kinds of generations does this evaluator skip as N/A?"
```json
posthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "na" }
```

Inspect `na_patterns` to see whether the N/A logic is doing the right thing. If a
pattern in `na_patterns` looks like something that should have been scored:

- For an `llm_judge`: the applicability instruction in the prompt is too broad — narrow it
- For a `hog` evaluator with `output_config.allows_na: true`: the source is returning `null` (or whatever the N/A signal is) too eagerly — tighten the precondition

"Score this single generation right now"

Use `llma-evaluation-run` with the generation ID and timestamp from the trace. Good for
spot checks or for wiring evaluations into a larger agent workflow.

Constructing UI links
- Evaluations list: `https://app.posthog.com/llm-analytics/evaluations`
- Single evaluation: `https://app.posthog.com/llm-analytics/evaluations/<evaluation_id>`
- Underlying generation/trace: see the `exploring-llm-traces` skill's URL conventions

Always surface the relevant link so the user can verify in the UI.
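Building these links is plain string interpolation over the documented paths; a minimal sketch:

```python
BASE = "https://app.posthog.com/llm-analytics"

def evaluations_list_url() -> str:
    """Link to the evaluations list page."""
    return f"{BASE}/evaluations"

def evaluation_url(evaluation_id: str) -> str:
    """Link to a single evaluation config page."""
    return f"{BASE}/evaluations/{evaluation_id}"

assert evaluation_url("<evaluation_id>").endswith("/evaluations/<evaluation_id>")
```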
Tips
- The summary tool is rate-limited (burst, sustained, daily) and caches results for one hour — repeated calls with the same `(evaluation_id, filter)` are cheap; use `force_refresh: true` only when you genuinely need fresh analysis
- Pass `generation_ids: [...]` to scope a summary to a specific cohort of runs (max 250)
- The `statistics` block in the summary response is computed from raw data, not the LLM — trust those counts even if a pattern's `frequency` field is qualitative
- For rich filtering not supported by `llma-evaluation-list` (e.g. by author or model configuration), fall back to `execute-sql` against the `evaluations` Postgres table or the `$ai_evaluation` ClickHouse events
- When showing failure patterns to the user, always include 1-2 example trace links so they can validate the pattern visually
- `llma-evaluation-*` tools use `evaluation:read` for read tools and `evaluation:write` for mutating tools; `llma-evaluation-summary-create` uses `llm_analytics:write`
- Hog evaluators are reproducible — if you suspect a regression, `llma-evaluation-test-hog` with the suspect source against the failing generations is the fastest way to bisect whether the change is in the evaluator or in the producer of the generations
- LLM-judge evaluators are non-deterministic across reruns; expect 1-5% noise even with a fixed prompt and model. If you're chasing a small regression in fail rate, prefer Hog or pin a deterministic provider/seed in the `model_configuration`