# Exploring LLM evaluations

PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types, both first-class:

- `hog` — deterministic Hog code that returns `true`/`false` (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
- `llm_judge` — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.

Results from both types land in ClickHouse as `$ai_evaluation` events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether `$ai_evaluation_reasoning` was written by Hog code or by an LLM.

This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or LLM judge), run them on specific generations, query individual results, and get an AI-generated summary of pass/fail/N/A patterns across many runs.
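
Because both evaluator types emit the same event shape, one query can survey pass rates across every evaluator at once. A minimal HogQL sketch (run via `posthog:execute-sql`):

```sql
-- Pass rate per evaluator over the last 7 days.
-- (N/A handling is refined under "Event schema" below.)
SELECT
    properties.$ai_evaluation_name AS evaluation,
    count() AS runs,
    countIf(properties.$ai_evaluation_result = true) AS passes
FROM events
WHERE event = '$ai_evaluation'
    AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY evaluation
ORDER BY runs DESC
```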

## Tools

| Tool | Purpose |
| --- | --- |
| `posthog:llma-evaluation-list` | List/search evaluation configs (filter by name, enabled flag) |
| `posthog:llma-evaluation-get` | Get a single evaluation config by UUID |
| `posthog:llma-evaluation-create` | Create a new `llm_judge` or `hog` evaluation |
| `posthog:llma-evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| `posthog:llma-evaluation-delete` | Soft-delete an evaluation |
| `posthog:llma-evaluation-run` | Run an evaluation against a specific `$ai_generation` event |
| `posthog:llma-evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `posthog:llma-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `posthog:execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `posthog:query-llm-trace` | Drill into the underlying generation that an evaluation scored |

All `llma-evaluation-*` tools are defined in `products/llm_analytics/mcp/tools.yaml`.

## Event schema

Every run of an evaluation emits an `$ai_evaluation` event. Key properties:

| Property | Meaning |
| --- | --- |
| `$ai_evaluation_id` | UUID of the evaluation config |
| `$ai_evaluation_name` | Human-readable name |
| `$ai_target_event_id` | UUID of the `$ai_generation` event being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | `true` = pass, `false` = fail |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator decided the generation is N/A |

When `$ai_evaluation_applicable = false`, the run counts as N/A regardless of `$ai_evaluation_result`. For evaluations that don't support N/A, this property may be `null` — treat null as "applicable".
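
These semantics translate directly into HogQL. A minimal sketch that buckets recent runs for one evaluation into pass/fail/N/A with the same guard logic the backend uses (`<evaluation_uuid>` is a placeholder; run via `posthog:execute-sql`):

```sql
SELECT
    countIf(properties.$ai_evaluation_applicable = false) AS na_count,
    countIf(properties.$ai_evaluation_result = true
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS pass_count,
    countIf(properties.$ai_evaluation_result = false
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS fail_count
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
```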

## Workflow: investigate why an evaluation is failing

Works the same way for `llm_judge` and `hog` evaluations — the differences only matter when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).

### Step 1 — Find the evaluation

`posthog:llma-evaluation-list`

```json
{ "search": "hallucination", "enabled": true }
```

Look at the returned `id`, `name`, `evaluation_type`, and either:

- `evaluation_config.prompt` for an `llm_judge`
- `evaluation_config.source` for a `hog` evaluator

The Hog source is the ground truth for why a hog evaluator passes or fails — read it before assuming the failure is in the generation.

### Step 2 — Get the AI-generated summary

`posthog:llma-evaluation-summary-create`

```json
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```

Returns:

- `overall_assessment` — natural-language summary
- `fail_patterns` — grouped patterns with `title`, `description`, `frequency`, and `example_generation_ids`
- `pass_patterns` and `na_patterns` — same shape, populated when `filter` includes them
- `recommendations` — actionable next steps
- `statistics` — `total_analyzed`, `pass_count`, `fail_count`, `na_count`

The endpoint analyses the most recent ~250 runs (`EVALUATION_SUMMARY_MAX_RUNS`). Results are cached for one hour per `(evaluation_id, filter, set_of_generation_ids)`. Pass `force_refresh: true` to recompute.

Compare filters in two calls to spot what's distinctive about failures vs passes:

`posthog:llma-evaluation-summary-create`

```json
{ "evaluation_id": "<uuid>", "filter": "pass" }
```

Then diff the `pass_patterns` against the `fail_patterns` from the first call.

### Step 3 — Drill into example failing runs

Each pattern surfaces `example_generation_ids`. Pull the underlying trace for the most representative example:

`posthog:query-llm-trace`

```json
{ "traceId": "<trace_id>", "dateRange": {"date_from": "-30d"} }
```

(If you only have a generation ID, query for it via `execute-sql` first to find the parent trace ID — see the sketch below.)
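
A minimal lookup sketch for that, assuming the HogQL `events` table exposes the event `uuid` (`<generation_event_uuid>` is a placeholder):

```sql
-- Map a generation event UUID to its parent trace ID.
SELECT properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_generation'
    AND uuid = '<generation_event_uuid>'
LIMIT 1
```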

### Step 4 — Verify the pattern with raw SQL

The summary is LLM-generated and should be verified. Use `execute-sql` to count and spot-check:

`posthog:execute-sql`

```sql
SELECT
    properties.$ai_target_event_id AS generation_id,
    properties.$ai_trace_id AS trace_id,
    properties.$ai_evaluation_reasoning AS reasoning,
    timestamp
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND properties.$ai_evaluation_result = false
    AND (
        properties.$ai_evaluation_applicable IS NULL
        OR properties.$ai_evaluation_applicable != false
    )
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 25
```

The N/A guard (`IS NULL OR != false`) is important — it matches the same logic the backend uses to bucket runs.

## Workflow: run an evaluation against a specific generation

Use this when the user pastes a trace/generation URL and asks "what would evaluation X say about this?".

`posthog:llma-evaluation-run`

```json
{
  "evaluationId": "<eval_uuid>",
  "target_event_id": "<generation_event_uuid>",
  "timestamp": "2026-04-01T19:39:20Z",
  "event": "$ai_generation"
}
```

The `timestamp` is required for an efficient ClickHouse lookup of the target event. Pass `distinct_id` if you have it — it speeds up the lookup further.
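
If you only have the generation UUID, both fields can be fetched first. A sketch, again assuming the HogQL `events` table exposes `uuid`:

```sql
-- Fetch the timestamp and distinct_id needed by llma-evaluation-run.
SELECT timestamp, distinct_id
FROM events
WHERE event = '$ai_generation'
    AND uuid = '<generation_event_uuid>'
LIMIT 1
```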

## Workflow: build and test a new evaluator

### Hog evaluator (deterministic, code-based)

Reach for this first when the criterion is rule-based — it's cheaper, faster, and reproducible. Prototype with `llma-evaluation-test-hog` (no save):

`posthog:llma-evaluation-test-hog`

```json
{
  "source": "return event.properties.$ai_output_choices[1].content contains 'sorry';",
  "sample_count": 5,
  "allows_na": false
}
```

The handler returns the boolean result for each of the most recent N `$ai_generation` events. Iterate on the source until it behaves as expected, then promote it via `llma-evaluation-create`:

`posthog:llma-evaluation-create`

```json
{
  "name": "Output is valid JSON",
  "description": "Fails when the assistant message can't be parsed as JSON",
  "evaluation_type": "hog",
  "evaluation_config": {
    "source": "let raw := event.properties.$ai_output_choices[1].content; try { jsonParseStr(raw); return true; } catch { return false; }"
  },
  "output_type": "boolean",
  "enabled": true
}
```

Hog evaluators have full access to the event and its properties — common patterns include schema validation, length/token limits, regex matches, and tool-call shape checks. Because they're deterministic, results are reproducible across reruns and trivially diff-able.
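
Before hard-coding a threshold (length, latency, cost) into a Hog rule, it can help to calibrate against live data first. A rough HogQL sketch for a length limit, assuming the output payload is readable as a string property:

```sql
-- Byte-length distribution of recent generation outputs,
-- to pick a sensible cutoff for a length-limit Hog evaluator.
SELECT
    quantile(0.5)(length(toString(properties.$ai_output_choices))) AS p50_bytes,
    quantile(0.95)(length(toString(properties.$ai_output_choices))) AS p95_bytes,
    quantile(0.99)(length(toString(properties.$ai_output_choices))) AS p99_bytes
FROM events
WHERE event = '$ai_generation'
    AND timestamp >= now() - INTERVAL 7 DAY
```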

### LLM-judge evaluator (subjective, prompt-based)

Use this when the criterion is fuzzy and a code rule would be brittle (tone, factuality, helpfulness, on-topic-ness). There's no equivalent of `llma-evaluation-test-hog` for LLM judges — the typical loop is to create the evaluator with `enabled: false`, run it manually against a handful of representative generations via `llma-evaluation-run`, inspect the results, refine the prompt with `llma-evaluation-update`, and then flip `enabled: true` when you're satisfied:

`posthog:llma-evaluation-create`

```json
{
  "name": "Response stays on-topic",
  "description": "LLM judge — fails if the assistant changes topic from the user's question",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "You are evaluating whether the assistant's reply stays on-topic relative to the user's most recent question. Return true if it does, false if the assistant changed the subject. Return N/A if the user did not actually ask a question."
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "openai",
    "model": "gpt-5-mini"
  },
  "enabled": false
}
```

Then dry-run against a known-good and a known-bad generation:

`posthog:llma-evaluation-run`

```json
{
  "evaluationId": "<new_eval_uuid>",
  "target_event_id": "<generation_uuid>",
  "timestamp": "2026-04-01T19:39:20Z"
}
```

LLM judges require organisation-level AI data processing approval. Hog evaluators do not.
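
Judge verdicts can also vary between reruns (see Tips), so before trusting small fail-rate movements it is worth measuring how often repeated runs on the same generation disagree. A sketch via `posthog:execute-sql` (`<evaluation_uuid>` is a placeholder):

```sql
-- Generations where repeated runs of the same judge disagreed.
SELECT
    properties.$ai_target_event_id AS generation_id,
    count() AS runs,
    uniqExact(properties.$ai_evaluation_result) AS distinct_verdicts
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<evaluation_uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY generation_id
HAVING runs > 1 AND distinct_verdicts > 1
ORDER BY runs DESC
```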

## Workflow: manage the evaluation lifecycle

| Action | Tool |
| --- | --- |
| Add a Hog evaluator | `llma-evaluation-create` with `evaluation_type: "hog"` and `evaluation_config.source` |
| Add an LLM-judge evaluator | `llma-evaluation-create` with `evaluation_type: "llm_judge"`, `evaluation_config.prompt`, and a `model_configuration` |
| Tweak the source or prompt | `llma-evaluation-update` (edits `evaluation_config.source` for Hog, `evaluation_config.prompt` for LLM judge) |
| Toggle N/A handling | `llma-evaluation-update` with `output_config.allows_na` |
| Disable temporarily | `llma-evaluation-update` with `enabled: false` |
| Remove | `llma-evaluation-delete` (soft-delete via PATCH `{deleted: true}`) |

`llm_judge` evaluations require AI data processing approval at the org level (`is_ai_data_processing_approved`). The same gate applies to `llma-evaluation-summary-create`. Hog evaluations do not require this gate — they run as plain code on the ingestion pipeline.

## When to use Hog vs LLM judge

Reach for Hog by default. Switch to LLM judge only when the criterion can't be expressed as code.

| Use Hog when… | Use LLM judge when… |
| --- | --- |
| The check is structural (JSON parses, schema matches) | The check is about meaning (on-topic, helpful, factual) |
| You need a deterministic, reproducible result | A small amount of judgement variability is acceptable |
| The criterion is cheap to compute | The criterion requires reading and understanding text |
| You can't get AI data processing approval | You have approval and the criterion is genuinely fuzzy |
| You need to enforce a hard limit (length, cost, etc.) | You need to rate a quality dimension |
| You want sub-millisecond evaluation | A few hundred milliseconds + LLM cost are acceptable |

A common pattern is to layer them: a Hog evaluator gates obvious format/length violations cheaply, and an LLM-judge evaluator only fires on the generations that pass the Hog gate (via `conditions`).
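
To sanity-check a layered setup (or any two evaluators scoring the same traffic), you can measure how often their verdicts agree per generation. A HogQL sketch joining on the target event ID (both evaluation IDs are placeholders):

```sql
SELECT
    countIf(hog.result = judge.result) AS agree,
    countIf(hog.result != judge.result) AS disagree
FROM (
    SELECT properties.$ai_target_event_id AS generation_id,
           properties.$ai_evaluation_result AS result
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '<hog_eval_uuid>'
) AS hog
INNER JOIN (
    SELECT properties.$ai_target_event_id AS generation_id,
           properties.$ai_evaluation_result AS result
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '<judge_eval_uuid>'
) AS judge
ON hog.generation_id = judge.generation_id
```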

## Investigation patterns

The summarisation tool works the same way regardless of whether the evaluator is `hog` or `llm_judge` — it analyses the resulting `$ai_evaluation` events, not the evaluator itself. The fix path differs (edit Hog source vs. edit prompt) but the diagnosis is identical.

"Why is evaluation X suddenly failing more?"

"为什么评估X突然失败次数变多了?"

1. `llma-evaluation-list` — confirm the evaluation is still enabled and unchanged (compare `evaluation_config.source` or `evaluation_config.prompt` to the version you expect)
2. `llma-evaluation-summary-create` with `filter: "fail"` — get the dominant failure patterns and example IDs
3. SQL count of fails per day to confirm the regression window (a rate variant follows this list):

   ```sql
   SELECT toDate(timestamp) AS day, count() AS fails
   FROM events
   WHERE event = '$ai_evaluation'
       AND properties.$ai_evaluation_id = '<uuid>'
       AND properties.$ai_evaluation_result = false
       AND timestamp >= now() - INTERVAL 30 DAY
   GROUP BY day
   ORDER BY day
   ```

4. Drill into a representative trace per pattern via `query-llm-trace`
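
Raw fail counts can rise just because traffic rose. A sketch of step 3 that tracks the fail rate instead, reusing the N/A guard from Step 4:

```sql
SELECT
    toDate(timestamp) AS day,
    count() AS runs,
    countIf(properties.$ai_evaluation_result = false
        AND (properties.$ai_evaluation_applicable IS NULL
             OR properties.$ai_evaluation_applicable != false)) AS fails,
    round(fails / runs, 3) AS fail_rate
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```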

"Are passes and fails caused by the same root content?"

"通过和失败是否由相同的核心内容导致?"

1. Generate two summaries: one with `filter: "pass"`, one with `filter: "fail"` (the raw-reasoning query after this list is a useful cross-check)
2. If `pass_patterns` and `fail_patterns` describe similar content:
   - For an `llm_judge`: the prompt or rubric is probably ambiguous — reword `evaluation_config.prompt` and use `llma-evaluation-update`
   - For a `hog` evaluator: the rule is probably under- or over-matching — read the source via `llma-evaluation-get`, narrow the predicate, and retest with `llma-evaluation-test-hog` before pushing the fix via `llma-evaluation-update`
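
To eyeball the ambiguity directly, pull the evaluator's own reasoning for recent passes and fails side by side (a sketch via `posthog:execute-sql`):

```sql
SELECT
    properties.$ai_evaluation_result AS result,
    properties.$ai_evaluation_reasoning AS reasoning,
    properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY result, timestamp DESC
LIMIT 40
```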

"Did a Hog evaluator regression after a code change?"

"Hog评估器在代码变更后出现回归了吗?"

Hog evaluators are reproducible — if the source hasn't changed, identical inputs should yield identical outputs. When fail rates jump for a Hog evaluator:

1. `llma-evaluation-get` — note the current source and `updated_at`
2. Spot-check the latest failing runs with the SQL query from Step 4 above
3. Re-run the source against those exact generations using `llma-evaluation-test-hog` with a modified `conditions` filter that targets them
4. If the test results match the live results, the change is in the generations, not the evaluator (a model upgrade, prompt change upstream, etc.) — investigate the producer
5. If they diverge, the evaluator was edited; check the edit history of the source field via the activity log

"What kinds of generations does this evaluator skip as N/A?"

"此评估器会跳过哪些类型的生成结果作为N/A?"

`posthog:llma-evaluation-summary-create`

```json
{ "evaluation_id": "<uuid>", "filter": "na" }
```

Inspect `na_patterns` to see whether the N/A logic is doing the right thing. If a pattern in `na_patterns` looks like something that should have been scored:

- For an `llm_judge`: the applicability instruction in the prompt is too broad — narrow it
- For a `hog` evaluator with `output_config.allows_na: true`: the source is returning `null` (or whatever the N/A signal is) too eagerly — tighten the precondition
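
To watch whether the N/A share is drifting over time, a small sketch via `posthog:execute-sql`:

```sql
SELECT
    toDate(timestamp) AS day,
    count() AS runs,
    countIf(properties.$ai_evaluation_applicable = false) AS na_runs,
    round(na_runs / runs, 3) AS na_rate
FROM events
WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = '<uuid>'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```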

"Score this single generation right now"

"立即为这个单一生成结果打分"

Call `llma-evaluation-run` with the trace's generation ID and timestamp. Useful for spot-checking or for wiring evaluations into a larger agent loop.

## Constructing UI links

- Evaluations list: `https://app.posthog.com/llm-analytics/evaluations`
- Single evaluation: `https://app.posthog.com/llm-analytics/evaluations/<evaluation_id>`
- Underlying generation/trace: see the `exploring-llm-traces` skill's URL conventions

Always surface the relevant link so the user can verify in the UI.

## Tips

- The summary tool is rate-limited (burst, sustained, daily) and caches results for one hour — repeated calls with the same `(evaluation_id, filter)` are cheap; use `force_refresh: true` only when you genuinely need fresh analysis
- Pass `generation_ids: [...]` to scope a summary to a specific cohort of runs (max 250)
- The `statistics` block in the summary response is computed from raw data, not the LLM — trust those counts even if a pattern's `frequency` field is qualitative
- For rich filtering not supported by `llma-evaluation-list` (e.g. by author or model configuration), fall back to `execute-sql` against the `evaluations` Postgres table or the `$ai_evaluation` ClickHouse events (see the sketch after this list)
- When showing failure patterns to the user, always include 1-2 example trace links so they can validate the pattern visually
- `llma-evaluation-*` tools use `evaluation:read` for read tools and `evaluation:write` for mutating tools; `llma-evaluation-summary-create` uses `llm_analytics:write`
- Hog evaluators are reproducible — if you suspect a regression, `llma-evaluation-test-hog` with the suspect source against the failing generations is the fastest way to bisect whether the change is in the evaluator or in the producer of the generations
- LLM-judge evaluators are non-deterministic across reruns; expect 1-5% noise even with a fixed prompt and model. If you're chasing a small regression in fail rate, prefer Hog or pin a deterministic provider/seed in the `model_configuration`
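
As one example of the events-side fallback mentioned above, this sketch lists the evaluators that have actually run recently, by volume (via `posthog:execute-sql`):

```sql
SELECT
    properties.$ai_evaluation_id AS evaluation_id,
    any(properties.$ai_evaluation_name) AS name,
    count() AS runs_last_30d
FROM events
WHERE event = '$ai_evaluation'
    AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY evaluation_id
ORDER BY runs_last_30d DESC
```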