Exploring LLM evaluations
PostHog evaluations score `$ai_generation` events. Each evaluation is one of two types,
both first-class:
- `hog` — deterministic Hog code that returns `true`/`false` (and optionally N/A).
Best for objective rule-based checks: format validation (JSON parses, schema matches),
length limits, keyword presence/absence, regex patterns, structural assertions, latency
thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this
when the criterion can be expressed as code.
- `llm_judge` — an LLM scores generations against a prompt you write. Best for
subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic
drift, instruction-following. Costs an LLM call per run and requires AI data
processing approval at the org level.
Results from both types land in ClickHouse as `$ai_evaluation` events with the same
schema, so the read/query/summary workflows are identical regardless of evaluator type —
the only thing that changes is whether `$ai_evaluation_result` was written by Hog code
or by an LLM.
This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or
LLM judge), run them on specific generations, query individual results, and get an
AI-generated summary of pass/fail/N/A patterns across many runs.
Tools
| Tool | Purpose |
|---|---|
| `posthog:llma-evaluation-list` | List/search evaluation configs (filter by name, enabled flag) |
| `posthog:llma-evaluation-get` | Get a single evaluation config by UUID |
| `posthog:llma-evaluation-create` | Create a new `hog` or `llm_judge` evaluation |
| `posthog:llma-evaluation-update` | Update an existing evaluation (name, prompt, enabled, …) |
| `posthog:llma-evaluation-delete` | Soft-delete an evaluation |
| `posthog:llma-evaluation-run` | Run an evaluation against a specific `$ai_generation` event |
| `posthog:llma-evaluation-test-hog` | Dry-run Hog source against recent generations (no save) |
| `posthog:llma-evaluation-summary-create` | AI-powered summary of pass/fail/N/A patterns across runs |
| `posthog:execute-sql` | Ad-hoc HogQL over `$ai_evaluation` events |
| `posthog:query-llm-trace` | Drill into the underlying generation that an evaluation scored |
All `llma-evaluation-*` tools are defined in `products/llm_analytics/mcp/tools.yaml`.
Event schema
Every run of an evaluation emits an `$ai_evaluation` event. Key properties:
| Property | Meaning |
|---|---|
| `$ai_evaluation_id` | UUID of the evaluation config |
| `$ai_evaluation_name` | Human-readable name |
| `$ai_target_event_id` | UUID of the event being scored |
| `$ai_trace_id` | Parent trace ID (for jumping to the trace UI) |
| `$ai_evaluation_result` | `true` = pass, `false` = fail |
| `$ai_evaluation_reasoning` | Free-text explanation (set by the LLM judge or Hog code) |
| `$ai_evaluation_applicable` | `false` when the evaluator decided the generation is N/A |
When `$ai_evaluation_applicable = false`, the run counts as N/A regardless of
`$ai_evaluation_result`. For evaluations that don't support N/A, this property may be
`null` — treat null as "applicable".
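As a concrete illustration of how these properties combine, here is a minimal HogQL sketch (the evaluation UUID is a placeholder) that buckets one evaluation's recent runs into pass/fail/N/A per day:

```sql
posthog:execute-sql

SELECT
  toDate(timestamp) AS day,
  countIf(properties.$ai_evaluation_applicable = false) AS na,
  countIf(properties.$ai_evaluation_result = true
          AND (properties.$ai_evaluation_applicable IS NULL
               OR properties.$ai_evaluation_applicable != false)) AS pass,
  countIf(properties.$ai_evaluation_result = false
          AND (properties.$ai_evaluation_applicable IS NULL
               OR properties.$ai_evaluation_applicable != false)) AS fail
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<evaluation_uuid>'
  AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```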
Workflow: investigate why an evaluation is failing
Works the same way for `hog` and `llm_judge` evaluations — the differences only matter
when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).
Step 1 — Find the evaluation
```json
posthog:llma-evaluation-list
{ "search": "hallucination", "enabled": true }
```
Look at each returned evaluation's ID, name, enabled flag, and either its judge prompt or
its Hog source.
The Hog source is the ground truth for why a hog evaluator passes or fails — read it
before assuming the failure is in the generation.
Step 2 — Get the AI-generated summary
```json
posthog:llma-evaluation-summary-create
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```
Returns:
- a natural-language summary of the analysed runs
- grouped failure patterns, each with a title, description, count, and example generation IDs
- pass and N/A patterns in the same shape, populated when the filter includes them
- actionable recommendations for next steps
- raw counts of pass, fail, N/A, and total runs
The endpoint analyses the most recent ~250 runs (`EVALUATION_SUMMARY_MAX_RUNS`).
Results are cached for one hour per `(evaluation_id, filter, set_of_generation_ids)`.
Pass the refresh option to recompute.
Compare filters in two calls to spot what's distinctive about failures vs passes:
```json
posthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "pass" }
```
Then diff the pass patterns against the failure patterns from Step 2.
Step 3 — Drill into example failing runs
Each pattern surfaces example generation and trace IDs. Pull the underlying trace for the
most representative example:
```json
posthog:query-llm-trace
{ "traceId": "<trace_id>", "dateRange": {"date_from": "-30d"} }
```
(If you only have a generation ID, query for it via `posthog:execute-sql` first to find the
parent trace ID — see the sketch below.)
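A minimal sketch of that lookup, assuming the generation event's UUID is what you have and that `$ai_generation` events carry `$ai_trace_id`:

```sql
posthog:execute-sql

SELECT properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_generation'
  AND uuid = '<generation_event_uuid>'
  AND timestamp >= now() - INTERVAL 30 DAY
LIMIT 1
```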
Step 4 — Verify the pattern with raw SQL
The summary is LLM-generated and should be verified. Use `posthog:execute-sql` to count and
spot-check:
```sql
posthog:execute-sql

SELECT
  properties.$ai_target_event_id AS generation_id,
  properties.$ai_trace_id AS trace_id,
  properties.$ai_evaluation_reasoning AS reasoning,
  timestamp
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<evaluation_uuid>'
  AND properties.$ai_evaluation_result = false
  AND (
    properties.$ai_evaluation_applicable IS NULL
    OR properties.$ai_evaluation_applicable != false
  )
  AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 25
```
The N/A guard (the `$ai_evaluation_applicable` clause) is important — it matches the same
logic the backend uses to bucket runs.
Workflow: run an evaluation against a specific generation
Use this when the user pastes a trace/generation URL and asks "what would evaluation X
say about this?".
```json
posthog:llma-evaluation-run
{
  "evaluationId": "<eval_uuid>",
  "target_event_id": "<generation_event_uuid>",
  "timestamp": "2026-04-01T19:39:20Z",
  "event": "$ai_generation"
}
```
The `timestamp` is required for an efficient ClickHouse lookup of the target event.
Pass `event` if you have it — it speeds up the lookup further.
Workflow: build and test a new evaluator
Hog evaluator (deterministic, code-based)
Reach for this first when the criterion is rule-based — it's cheaper, faster, and
reproducible. Prototype with `posthog:llma-evaluation-test-hog` (no save):
```json
posthog:llma-evaluation-test-hog
{
  "source": "return event.properties.$ai_output_choices[1].content contains 'sorry';",
  "sample_count": 5,
  "allows_na": false
}
```
The handler returns the boolean result for each of the most recent N `$ai_generation`
events. Iterate on the source until it behaves as expected, then promote it via
`posthog:llma-evaluation-create`:
```json
posthog:llma-evaluation-create
{
  "name": "Output is valid JSON",
  "description": "Fails when the assistant message can't be parsed as JSON",
  "evaluation_type": "hog",
  "evaluation_config": {
    "source": "let raw := event.properties.$ai_output_choices[1].content; try { jsonParseStr(raw); return true; } catch { return false; }"
  },
  "output_type": "boolean",
  "enabled": true
}
```
Hog evaluators have full access to the event and its properties — common patterns
include schema validation, length/token limits, regex matches, and tool-call shape
checks. Because they're deterministic, results are reproducible across reruns and
trivially diff-able.
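As one illustration of the length-limit pattern, a hedged `test-hog` sketch; it assumes Hog's `length()` works on the output string the way its ClickHouse counterpart does:

```json
posthog:llma-evaluation-test-hog
{
  "source": "return length(event.properties.$ai_output_choices[1].content) <= 2000;",
  "sample_count": 5,
  "allows_na": false
}
```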
LLM-judge evaluator (subjective, prompt-based)
Use this when the criterion is fuzzy and a code rule would be brittle (tone, factuality,
helpfulness, on-topic-ness). There's no equivalent of `llma-evaluation-test-hog` for LLM
judges — the typical loop is to create the evaluator with `llma-evaluation-create` (leaving
`"enabled": false`), run it manually against a handful of representative generations via
`llma-evaluation-run`, inspect the results, refine the prompt with `llma-evaluation-update`,
and then flip `enabled` to `true` when you're satisfied:
```json
posthog:llma-evaluation-create
{
  "name": "Response stays on-topic",
  "description": "LLM judge — fails if the assistant changes topic from the user's question",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "You are evaluating whether the assistant's reply stays on-topic relative to the user's most recent question. Return true if it does, false if the assistant changed the subject. Return N/A if the user did not actually ask a question."
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "openai",
    "model": "gpt-5-mini"
  },
  "enabled": false
}
```
Then dry-run against a known-good and a known-bad generation:
```json
posthog:llma-evaluation-run
{
  "evaluationId": "<new_eval_uuid>",
  "target_event_id": "<generation_uuid>",
  "timestamp": "2026-04-01T19:39:20Z"
}
```
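If the judge over- or under-triggers, refine the prompt and only then enable it. A hedged sketch of that update step; the field names mirror the create call above, but verify them against the tool schema in `tools.yaml`:

```json
posthog:llma-evaluation-update
{
  "evaluationId": "<new_eval_uuid>",
  "evaluation_config": {
    "prompt": "<revised prompt with a tighter N/A rule>"
  },
  "enabled": true
}
```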
LLM judges require organisation AI data processing approval. Hog evaluators do not.
Workflow: manage the evaluation lifecycle
| Action | Tool |
|---|---|
| Add a Hog evaluator | `llma-evaluation-create` with `evaluation_type: "hog"` and a Hog `source` |
| Add an LLM-judge evaluator | `llma-evaluation-create` with `evaluation_type: "llm_judge"`, a judge `prompt`, and a `model_configuration` |
| Tweak the source or prompt | `llma-evaluation-update` (edits the `source` for Hog, the `prompt` for LLM judge) |
| Toggle N/A handling | `llma-evaluation-update` with `output_config.allows_na` |
| Disable temporarily | `llma-evaluation-update` with `enabled: false` |
| Remove | `llma-evaluation-delete` (soft-delete via PATCH) |
`llm_judge` evaluations require AI data processing approval at the org level
(`is_ai_data_processing_approved`). The same gate applies to
`llma-evaluation-summary-create`. Hog evaluations do not require this gate — they run as
plain code on the ingestion pipeline.
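For instance, a minimal sketch of the "disable temporarily" row; the parameter name is assumed to mirror the run tool, so check the schema before relying on it:

```json
posthog:llma-evaluation-update
{
  "evaluationId": "<eval_uuid>",
  "enabled": false
}
```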
When to use Hog vs LLM judge
Reach for Hog by default. Switch to LLM judge only when the criterion can't be
expressed as code.
| Use Hog when… | Use LLM judge when… |
|---|---|
| The check is structural (JSON parses, schema matches) | The check is about meaning (on-topic, helpful, factual) |
| You need a deterministic, reproducible result | A small amount of judgement variability is acceptable |
| The criterion is cheap to compute | The criterion requires reading and understanding text |
| You can't get AI data processing approval | You have approval and the criterion is genuinely fuzzy |
| You need to enforce a hard limit (length, cost, etc.) | You need to rate a quality dimension |
| You want sub-millisecond evaluation | A few hundred milliseconds + LLM cost are acceptable |
A common pattern is to layer them: a Hog evaluator gates obvious format/length
violations cheaply, and an LLM-judge evaluator only fires on the generations that pass
the Hog gate.
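A hedged sketch of the cheap Hog half of that layering, composing a length guard with the JSON check from the earlier example (`length()` is assumed to behave as in ClickHouse; scoping the LLM judge to generations that pass this gate is configured separately):

```json
posthog:llma-evaluation-create
{
  "name": "Gate: output is short and parses as JSON",
  "description": "Cheap structural gate that runs ahead of the LLM judge",
  "evaluation_type": "hog",
  "evaluation_config": {
    "source": "let raw := event.properties.$ai_output_choices[1].content; if (length(raw) > 4000) { return false; } try { jsonParseStr(raw); return true; } catch { return false; }"
  },
  "output_type": "boolean",
  "enabled": true
}
```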
Investigation patterns
The summarisation tool works the same way regardless of whether the evaluator is `hog`
or `llm_judge` — it analyses the resulting `$ai_evaluation` events, not the evaluator
itself. The fix path differs (edit Hog source vs. edit prompt) but the diagnosis is
identical.
"Why is evaluation X suddenly failing more?"
- `llma-evaluation-get` — confirm the evaluation is still enabled and unchanged
  (compare the judge prompt or the Hog source to the version you expect)
- `llma-evaluation-summary-create` with `"filter": "fail"` — get the dominant
  failure patterns and example IDs
- SQL count of fails per day to confirm the regression window:
```sql
SELECT toDate(timestamp) AS day, count() AS fails
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<uuid>'
  AND properties.$ai_evaluation_result = false
  AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
```
- Drill into a representative trace per pattern via `posthog:query-llm-trace`
"Are passes and fails caused by the same root content?"
- Generate two summaries: one with `"filter": "pass"`, one with `"filter": "fail"`
- If the pass patterns and the fail patterns describe similar content:
  - For an LLM judge: the prompt or rubric is probably ambiguous — reword it and push the
    change with `llma-evaluation-update`
  - For a Hog evaluator: the rule is probably under- or over-matching — read the
    source via `llma-evaluation-get`, narrow the predicate, and retest with
    `llma-evaluation-test-hog` before pushing the fix via `llma-evaluation-update`
"Did a Hog evaluator regression after a code change?"
Hog evaluators are reproducible — if the source hasn't changed, identical inputs should
yield identical outputs. When fail rates jump for a Hog evaluator:
- `llma-evaluation-get` — note the current source and when it was last updated
- Spot-check the latest failing runs with the SQL query from Step 4 above
- Re-run the source against those exact generations using `llma-evaluation-test-hog` with a
  modified filter that targets them
- If the test results match the live results, the change is in the generations, not
the evaluator (a model upgrade, prompt change upstream, etc.) — investigate the
producer
- If they diverge, the evaluator was edited; check the edit history of the source field via
  the activity log
"What kinds of generations does this evaluator skip as N/A?"
```json
posthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "na" }
```
Inspect the returned N/A patterns to see whether the N/A logic is doing the right thing. If a
pattern in the N/A summary looks like something that should have been scored:
- For an LLM judge: the applicability instruction in the prompt is too broad — narrow it
- For a Hog evaluator with `output_config.allows_na: true`: the source is returning `null`
  (or whatever the N/A signal is) too eagerly — tighten the precondition
"Score this single generation right now"
`llma-evaluation-run` with the trace's generation ID and timestamp. Useful for spot-checking
or wiring evaluations into a larger agent loop.
Constructing UI links
- Evaluations list: `https://app.posthog.com/llm-analytics/evaluations`
- Single evaluation: `https://app.posthog.com/llm-analytics/evaluations/<evaluation_id>`
- Underlying generation/trace: see the skill's URL conventions
Always surface the relevant link so the user can verify in the UI.
Tips
- The summary tool is rate-limited (burst, sustained, daily) and caches results
  for one hour — repeated calls with the same arguments are cheap; use the refresh option
  only when you genuinely need fresh analysis
- Pass an explicit list of generation IDs to scope a summary to a specific cohort of runs
  (max 250)
- The counts block in the summary response is computed from raw data, not the LLM
  — trust those counts even if a pattern's description is qualitative
- For rich filtering not supported by `llma-evaluation-list` (e.g. by author or model
  configuration), fall back to raw SQL against the Postgres evaluation table or
  the ClickHouse `$ai_evaluation` events
- When showing failure patterns to the user, always include 1-2 example trace links so
they can validate the pattern visually
- `llma-evaluation-*` read tools and mutating tools use different access scopes; check
  `tools.yaml` for the scope `llma-evaluation-summary-create` requires
- Hog evaluators are reproducible — if you suspect a regression, `llma-evaluation-test-hog`
  with the suspect source against the failing generations is the fastest way to bisect
  whether the change is in the evaluator or in the producer of the generations
- LLM-judge evaluators are non-deterministic across reruns; expect 1-5% noise even with
  a fixed prompt and model. If you're chasing a small regression in fail rate, prefer
  Hog or pin a deterministic provider/seed in the `model_configuration`