# Exploring LLM evaluations

PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types, both first-class:

- `hog` — deterministic Hog code that returns `true`/`false` (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
- `llm_judge` — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.

Results from both types land in ClickHouse as `$ai_evaluation` events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether `$ai_evaluation_reasoning` was written by Hog code or by an LLM.

This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or LLM judge), run them on specific generations, query individual results, and get an AI-generated summary of pass/fail/N/A patterns across many runs.
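
Because both evaluator types emit the same event shape, one query can survey pass rates across every evaluator at once. A minimal HogQL sketch (run via `posthog:execute-sql`):

```sql
-- Pass rate per evaluator over the last 7 days.
-- (N/A handling is refined under "Event schema" below.)
SELECT
    properties.$ai_evaluation_name AS evaluation,
    count() AS runs,
    countIf(properties.$ai_evaluation_result = true) AS passes
FROM events
WHERE event = '$ai_evaluation'
    AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY evaluation
ORDER BY runs DESC
```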

## Tools

| Tool | Purpose |
| --- | --- |
| `posthog:llma-evaluation-list` | List/search evaluation configs (filter by name, enabled flag) |
| `posthog:llma-evaluation-get` | Get a single evaluation config by UUID |
| `posthog:llma-evaluation-create` | Create a new `llm_judge` or `hog` evaluation |
| `posthog:llma-evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| `posthog:llma-evaluation-delete` | Soft-delete an evaluation |
| `posthog:llma-evaluation-run` | Run an evaluation against a specific `$ai_generation` event |
| `posthog:llma-evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `posthog:llma-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `posthog:execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `posthog:query-llm-trace` | Drill into the underlying generation that an evaluation scored |

All `llma-evaluation-*` tools are defined in `products/llm_analytics/mcp/tools.yaml`.

## Event schema

Every run of an evaluation emits an `$ai_evaluation` event. Key properties:

| Property | Meaning |
| --- | --- |
| `$ai_evaluation_id` | UUID of the evaluation config |
| `$ai_evaluation_name` | Human-readable name |
| `$ai_target_event_id` | UUID of the `$ai_generation` event being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | `true` = pass, `false` = fail |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator decided the generation is N/A |

When `$ai_evaluation_applicable = false`, the run counts as N/A regardless of `$ai_evaluation_result`. For evaluations that don't support N/A, this property may be `null` — treat null as "applicable".
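
These semantics translate directly into HogQL. A minimal sketch that buckets recent runs for one evaluation into pass/fail/N/A with the same guard logic the backend uses (`<evaluation_uuid>` is a placeholder; run via `posthog:execute-sql`):

```sql
SELECT
    countIf(properties.$ai_evaluation_applicable = false) AS na_count,
    countIf(properties.$ai_evaluation_result = true
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS pass_count,
    countIf(properties.$ai_evaluation_result = false
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS fail_count
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
```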

## Workflow: investigate why an evaluation is failing

Works the same way for `llm_judge` and `hog` evaluations — the differences only matter when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).

### Step 1 — Find the evaluation

`posthog:llma-evaluation-list`

```json
{ "search": "hallucination", "enabled": true }
```

Look at the returned `id`, `name`, `evaluation_type`, and either:

- `evaluation_config.prompt` for an `llm_judge`
- `evaluation_config.source` for a `hog` evaluator

The Hog source is the ground truth for why a hog evaluator passes or fails — read it before assuming the failure is in the generation.

### Step 2 — Get the AI-generated summary

`posthog:llma-evaluation-summary-create`

```json
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```

Returns:

- `overall_assessment` — natural-language summary
- `fail_patterns` — grouped patterns with `title`, `description`, `frequency`, and `example_generation_ids`
- `pass_patterns` and `na_patterns` — same shape, populated when `filter` includes them
- `recommendations` — actionable next steps
- `statistics` — `total_analyzed`, `pass_count`, `fail_count`, `na_count`

The endpoint analyses the most recent ~250 runs (`EVALUATION_SUMMARY_MAX_RUNS`). Results are cached for one hour per `(evaluation_id, filter, set_of_generation_ids)`. Pass `force_refresh: true` to recompute.

Compare filters in two calls to spot what's distinctive about failures vs passes:

`posthog:llma-evaluation-summary-create`

```json
{ "evaluation_id": "<uuid>", "filter": "pass" }
```

Then diff the `pass_patterns` against the `fail_patterns` from the first call.

### Step 3 — Drill into example failing runs

Each pattern surfaces `example_generation_ids`. Pull the underlying trace for the most representative example:

`posthog:query-llm-trace`

```json
{ "traceId": "<trace_id>", "dateRange": {"date_from": "-30d"} }
```

(If you only have a generation ID, query for it via `execute-sql` first to find the parent trace ID — see the sketch below.)
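
A minimal lookup sketch for that, assuming the HogQL `events` table exposes the event `uuid` (`<generation_event_uuid>` is a placeholder):

```sql
-- Map a generation event UUID to its parent trace ID.
SELECT properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_generation'
    AND uuid = '<generation_event_uuid>'
LIMIT 1
```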

### Step 4 — Verify the pattern with raw SQL

The summary is LLM-generated and should be verified. Use `execute-sql` to count and spot-check:

`posthog:execute-sql`

```sql
SELECT
    properties.$ai_target_event_id AS generation_id,
    properties.$ai_trace_id AS trace_id,
    properties.$ai_evaluation_reasoning AS reasoning,
    timestamp
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND properties.$ai_evaluation_result = false
    AND (
        properties.$ai_evaluation_applicable IS NULL
        OR properties.$ai_evaluation_applicable != false
    )
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 25
```

The N/A guard (`IS NULL OR != false`) is important — it matches the same logic the backend uses to bucket runs.

## Workflow: run an evaluation against a specific generation

Use this when the user pastes a trace/generation URL and asks "what would evaluation X say about this?".

`posthog:llma-evaluation-run`

```json
{
  "evaluationId": "<eval_uuid>",
  "target_event_id": "<generation_event_uuid>",
  "timestamp": "2026-04-01T19:39:20Z",
  "event": "$ai_generation"
}
```

The `timestamp` is required for an efficient ClickHouse lookup of the target event. Pass `distinct_id` if you have it — it speeds up the lookup further.
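
If you only have the generation UUID, both fields can be fetched first. A sketch, again assuming the HogQL `events` table exposes `uuid`:

```sql
-- Fetch the timestamp and distinct_id needed by llma-evaluation-run.
SELECT timestamp, distinct_id
FROM events
WHERE event = '$ai_generation'
    AND uuid = '<generation_event_uuid>'
LIMIT 1
```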

## Workflow: build and test a new evaluator

### Hog evaluator (deterministic, code-based)

Reach for this first when the criterion is rule-based — it's cheaper, faster, and reproducible. Prototype with `llma-evaluation-test-hog` (no save):

`posthog:llma-evaluation-test-hog`

```json
{
  "source": "return event.properties.$ai_output_choices[1].content contains 'sorry';",
  "sample_count": 5,
  "allows_na": false
}
```

The handler returns the boolean result for each of the most recent N `$ai_generation` events. Iterate on the source until it behaves as expected, then promote it via `llma-evaluation-create`:

`posthog:llma-evaluation-create`

```json
{
  "name": "Output is valid JSON",
  "description": "Fails when the assistant message can't be parsed as JSON",
  "evaluation_type": "hog",
  "evaluation_config": {
    "source": "let raw := event.properties.$ai_output_choices[1].content; try { jsonParseStr(raw); return true; } catch { return false; }"
  },
  "output_type": "boolean",
  "enabled": true
}
```

Hog evaluators have full access to the event and its properties — common patterns include schema validation, length/token limits, regex matches, and tool-call shape checks. Because they're deterministic, results are reproducible across reruns and trivially diff-able.
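
Before hard-coding a threshold (length, latency, cost) into a Hog rule, it can help to calibrate against live data first. A rough HogQL sketch for a length limit, assuming the output payload is readable as a string property:

```sql
-- Byte-length distribution of recent generation outputs,
-- to pick a sensible cutoff for a length-limit Hog evaluator.
SELECT
    quantile(0.5)(length(toString(properties.$ai_output_choices))) AS p50_bytes,
    quantile(0.95)(length(toString(properties.$ai_output_choices))) AS p95_bytes,
    quantile(0.99)(length(toString(properties.$ai_output_choices))) AS p99_bytes
FROM events
WHERE event = '$ai_generation'
    AND timestamp >= now() - INTERVAL 7 DAY
```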

### LLM-judge evaluator (subjective, prompt-based)

Use this when the criterion is fuzzy and a code rule would be brittle (tone, factuality, helpfulness, on-topic-ness). There's no equivalent of `llma-evaluation-test-hog` for LLM judges — the typical loop is to create the evaluator with `enabled: false`, run it manually against a handful of representative generations via `llma-evaluation-run`, inspect the results, refine the prompt with `llma-evaluation-update`, and then flip `enabled: true` when you're satisfied:

`posthog:llma-evaluation-create`

```json
{
  "name": "Response stays on-topic",
  "description": "LLM judge — fails if the assistant changes topic from the user's question",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "You are evaluating whether the assistant's reply stays on-topic relative to the user's most recent question. Return true if it does, false if the assistant changed the subject. Return N/A if the user did not actually ask a question."
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "openai",
    "model": "gpt-5-mini"
  },
  "enabled": false
}
```

Then dry-run against a known-good and a known-bad generation:

`posthog:llma-evaluation-run`

```json
{
  "evaluationId": "<new_eval_uuid>",
  "target_event_id": "<generation_uuid>",
  "timestamp": "2026-04-01T19:39:20Z"
}
```

LLM judges require organisation-level AI data processing approval. Hog evaluators do not.
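
Judge verdicts can also vary between reruns (see Tips), so before trusting small fail-rate movements it is worth measuring how often repeated runs on the same generation disagree. A sketch via `posthog:execute-sql` (`<evaluation_uuid>` is a placeholder):

```sql
-- Generations where repeated runs of the same judge disagreed.
SELECT
    properties.$ai_target_event_id AS generation_id,
    count() AS runs,
    uniqExact(properties.$ai_evaluation_result) AS distinct_verdicts
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY generation_id
HAVING runs > 1 AND distinct_verdicts > 1
ORDER BY runs DESC
```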

## Workflow: manage the evaluation lifecycle

| Action | Tool |
| --- | --- |
| Add a Hog evaluator | `llma-evaluation-create` with `evaluation_type: "hog"` and `evaluation_config.source` |
| Add an LLM-judge evaluator | `llma-evaluation-create` with `evaluation_type: "llm_judge"`, `evaluation_config.prompt`, and a `model_configuration` |
| Tweak the source or prompt | `llma-evaluation-update` (edits `evaluation_config.source` for Hog, `evaluation_config.prompt` for LLM judge) |
| Toggle N/A handling | `llma-evaluation-update` with `output_config.allows_na` |
| Disable temporarily | `llma-evaluation-update` with `enabled: false` |
| Remove | `llma-evaluation-delete` (soft-delete via PATCH `{deleted: true}`) |

`llm_judge` evaluations require AI data processing approval at the org level (`is_ai_data_processing_approved`). The same gate applies to `llma-evaluation-summary-create`. Hog evaluations do not require this gate — they run as plain code on the ingestion pipeline.

## When to use Hog vs LLM judge

Reach for Hog by default. Switch to LLM judge only when the criterion can't be expressed as code.

| Use Hog when… | Use LLM judge when… |
| --- | --- |
| The check is structural (JSON parses, schema matches) | The check is about meaning (on-topic, helpful, factual) |
| You need a deterministic, reproducible result | A small amount of judgement variability is acceptable |
| The criterion is cheap to compute | The criterion requires reading and understanding text |
| You can't get AI data processing approval | You have approval and the criterion is genuinely fuzzy |
| You need to enforce a hard limit (length, cost, etc.) | You need to rate a quality dimension |
| You want sub-millisecond evaluation | A few hundred milliseconds + LLM cost are acceptable |

A common pattern is to layer them: a Hog evaluator gates obvious format/length violations cheaply, and an LLM-judge evaluator only fires on the generations that pass the Hog gate (via `conditions`).
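
To sanity-check a layered setup (or any two evaluators scoring the same traffic), you can measure how often their verdicts agree per generation. A HogQL sketch joining on the target event ID (both evaluation IDs are placeholders):

```sql
SELECT
    countIf(hog.result = judge.result) AS agree,
    countIf(hog.result != judge.result) AS disagree
FROM (
    SELECT properties.$ai_target_event_id AS generation_id,
           properties.$ai_evaluation_result AS result
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '<hog_eval_uuid>'
) AS hog
INNER JOIN (
    SELECT properties.$ai_target_event_id AS generation_id,
           properties.$ai_evaluation_result AS result
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '<judge_eval_uuid>'
) AS judge
ON hog.generation_id = judge.generation_id
```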

## Investigation patterns

The summarisation tool works the same way regardless of whether the evaluator is `hog` or `llm_judge` — it analyses the resulting `$ai_evaluation` events, not the evaluator itself. The fix path differs (edit Hog source vs. edit prompt) but the diagnosis is identical.

"Why is evaluation X suddenly failing more?"

"为什么评估X突然失败次数变多了?"

1. `llma-evaluation-list` — confirm the evaluation is still enabled and unchanged (compare `evaluation_config.source` or `evaluation_config.prompt` to the version you expect)
2. `llma-evaluation-summary-create` with `filter: "fail"` — get the dominant failure patterns and example IDs
3. SQL count of fails per day to confirm the regression window (a rate variant follows this list):

   ```sql
   SELECT toDate(timestamp) AS day, count() AS fails
   FROM events
   WHERE event = '$ai_evaluation'
       AND properties.$ai_evaluation_id = '<uuid>'
       AND properties.$ai_evaluation_result = false
       AND timestamp >= now() - INTERVAL 30 DAY
   GROUP BY day
   ORDER BY day
   ```

4. Drill into a representative trace per pattern via `query-llm-trace`
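
Raw fail counts can rise just because traffic rose. A sketch of step 3 that tracks the fail rate instead, reusing the N/A guard from Step 4:

```sql
SELECT
    toDate(timestamp) AS day,
    count() AS runs,
    countIf(properties.$ai_evaluation_result = false
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS fails,
    round(fails / runs, 3) AS fail_rate
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```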

"Are passes and fails caused by the same root content?"

"通过和失败是否由相同的核心内容导致?"

1. Generate two summaries: one with `filter: "pass"`, one with `filter: "fail"` (the raw-reasoning query after this list is a useful cross-check)
2. If `pass_patterns` and `fail_patterns` describe similar content:
   - For an `llm_judge`: the prompt or rubric is probably ambiguous — reword `evaluation_config.prompt` and use `llma-evaluation-update`
   - For a `hog` evaluator: the rule is probably under- or over-matching — read the source via `llma-evaluation-get`, narrow the predicate, and retest with `llma-evaluation-test-hog` before pushing the fix via `llma-evaluation-update`
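
To eyeball the ambiguity directly, pull the evaluator's own reasoning for recent passes and fails side by side (a sketch via `posthog:execute-sql`):

```sql
SELECT
    properties.$ai_evaluation_result AS result,
    properties.$ai_evaluation_reasoning AS reasoning,
    properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY result, timestamp DESC
LIMIT 40
```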

"Did a Hog evaluator regression after a code change?"

"Hog评估器在代码变更后出现回归了吗?"

Hog evaluators are reproducible — if the source hasn't changed, identical inputs should yield identical outputs. When fail rates jump for a Hog evaluator:

1. `llma-evaluation-get` — note the current source and `updated_at`
2. Spot-check the latest failing runs with the SQL query from Step 4 above
3. Re-run the source against those exact generations using `llma-evaluation-test-hog` with a modified `conditions` filter that targets them
4. If the test results match the live results, the change is in the generations, not the evaluator (a model upgrade, prompt change upstream, etc.) — investigate the producer
5. If they diverge, the evaluator was edited; check the edit history of the source field via the activity log

"What kinds of generations does this evaluator skip as N/A?"

"此评估器会跳过哪些类型的生成结果作为N/A?"

`posthog:llma-evaluation-summary-create`

```json
{ "evaluation_id": "<uuid>", "filter": "na" }
```

Inspect `na_patterns` to see whether the N/A logic is doing the right thing. If a pattern in `na_patterns` looks like something that should have been scored:

- For an `llm_judge`: the applicability instruction in the prompt is too broad — narrow it
- For a `hog` evaluator with `output_config.allows_na: true`: the source is returning `null` (or whatever the N/A signal is) too eagerly — tighten the precondition
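
To watch whether the N/A share is drifting over time, a small sketch via `posthog:execute-sql`:

```sql
SELECT
    toDate(timestamp) AS day,
    count() AS runs,
    countIf(properties.$ai_evaluation_applicable = false) AS na_runs,
    round(na_runs / runs, 3) AS na_rate
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```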

"Score this single generation right now"

"立即为这个单一生成结果打分"

Call `llma-evaluation-run` with the trace's generation ID and timestamp. Useful for spot-checking or for wiring evaluations into a larger agent loop.

## Constructing UI links

- Evaluations list: `https://app.posthog.com/llm-analytics/evaluations`
- Single evaluation: `https://app.posthog.com/llm-analytics/evaluations/<evaluation_id>`
- Underlying generation/trace: see the `exploring-llm-traces` skill's URL conventions

Always surface the relevant link so the user can verify in the UI.

## Tips

- The summary tool is rate-limited (burst, sustained, daily) and caches results for one hour — repeated calls with the same `(evaluation_id, filter)` are cheap; use `force_refresh: true` only when you genuinely need fresh analysis
- Pass `generation_ids: [...]` to scope a summary to a specific cohort of runs (max 250)
- The `statistics` block in the summary response is computed from raw data, not the LLM — trust those counts even if a pattern's `frequency` field is qualitative
- For rich filtering not supported by `llma-evaluation-list` (e.g. by author or model configuration), fall back to `execute-sql` against the `evaluations` Postgres table or the `$ai_evaluation` ClickHouse events (see the sketch after this list)
- When showing failure patterns to the user, always include 1-2 example trace links so they can validate the pattern visually
- `llma-evaluation-*` tools use `evaluation:read` for read tools and `evaluation:write` for mutating tools; `llma-evaluation-summary-create` uses `llm_analytics:write`
- Hog evaluators are reproducible — if you suspect a regression, `llma-evaluation-test-hog` with the suspect source against the failing generations is the fastest way to bisect whether the change is in the evaluator or in the producer of the generations
- LLM-judge evaluators are non-deterministic across reruns; expect 1-5% noise even with a fixed prompt and model. If you're chasing a small regression in fail rate, prefer Hog or pin a deterministic provider/seed in the `model_configuration`
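
As one example of the events-side fallback mentioned above, this sketch lists the evaluators that have actually run recently, by volume (via `posthog:execute-sql`):

```sql
SELECT
    properties.$ai_evaluation_id AS evaluation_id,
    any(properties.$ai_evaluation_name) AS name,
    count() AS runs_last_30d
FROM events
WHERE event = '$ai_evaluation'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY evaluation_id
ORDER BY runs_last_30d DESC
```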