Building a feature usage feed via LLM evals
Some PostHog features (group session summaries, single session summaries, replay AI search, error tracking AI debug, etc.) generate hundreds or thousands of LLM traces per week. Reading them by hand is not feasible. This skill covers the end-to-end pattern for turning that trace volume into a live Slack feed of canonical use cases — what users are actually doing with the feature.
The workflow is mixed, and leans UI. Trace inspection and filter discovery (steps 1-2) are MCP-driven. Eval creation, dry-running, and enabling (steps 4-5) are MCP-driven when `posthog:llma-evaluation-*` tools are exposed to your agent — but they often aren't, in which case fall back to the UI (Data pipeline → Workflows for the alert is always UI). Each step flags its UI fallback. Expect to finish in the UI even when you start from chat.
When to use
- "How are people actually using [feature X] in production?"
- "Can we identify the canonical use cases for [feature X] so we can write better docs / prioritize improvements?"
- "I want a Slack feed of representative usage examples without manually skimming traces."
- "Set up a feed of use cases for [feature X] in #team-[area]-usage."
If the user just wants to debug a single trace or tune an existing eval, redirect to the single-trace debugging skill or `exploring-llm-evaluations` instead.
Two filter patterns
This skill supports two different ways to scope an eval to "the feature you care about":
Pattern A — Feature-native trace_id prefix. For standalone features that emit their own `$ai_trace_id` prefix pattern (e.g. group/single session summaries, error-tracking-specific flows). Filter on the prefix.
Pattern B — PostHog AI agent mode. For features the user interacts with via PostHog AI in a specific agent mode (error tracking, product analytics, session replay, SQL, flags, surveys, LLM analytics). Filter on `ai_product = 'posthog_ai' AND agent_mode = '<mode>'`. This requires PR #55160 (merged April 2026) to be deployed, which threads `ai_product` and `agent_mode` onto every `$ai_generation` emitted by the chat agent loop. A useful ergonomic side-effect: the `ai_product` / `agent_mode` filter is a reliable "user-facing chat turn" filter — batch jobs and tool-internal LLM calls go through different code paths and leave these properties null, so they're excluded for free.
If the user asks "what are users trying to DO in [ET / replay / SQL / flags / surveys] mode of PostHog AI", that's Pattern B. If they ask "what use cases does [standalone feature] cover", that's Pattern A. Pick the pattern first — the prompt, filter, and Slack channel naming all follow from it.
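Either scope can be sanity-checked directly in HogQL before touching any eval config. A minimal sketch, with `<your-prefix>` and `<mode>` as placeholders you fill in after step 1:

```sql
-- Quick scope check: how many generations match the pattern you picked?
SELECT count() AS matching_generations
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  -- Pattern A: feature-native trace_id prefix
  AND properties.$ai_trace_id LIKE '<your-prefix>%'
  -- Pattern B: use these two lines instead of the LIKE above
  -- AND properties.ai_product = 'posthog_ai'
  -- AND properties.agent_mode = '<mode>'
```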
Prerequisites
| Requirement | How to verify |
|---|---|
| (Pattern A) Feature emits `$ai_generation` events with a stable `$ai_trace_id` prefix pattern | Group by prefix (step 1 query) and check for distinct prefixes |
| (Pattern B) `agent_mode` is present on recent `$ai_generation` events | Group-by on recent events (step 1 query). Null bucket is normal (batch jobs + tool-internal calls) — you want non-null coverage across the modes you care about. |
| `$session_id` is attached to the `$ai_generation` events (links trace to trigger session) | Query for `countIf($session_id IS NOT NULL) / count()` — see the sketch below |
| `$session_id` is also attached to the `$ai_evaluation` events (lets the Slack alert link to the session) | Same query but on `$ai_evaluation` events after the eval has run once |
| User has organisation-level AI data processing approval | Required for evaluations and the eval summary tool |

If `$session_id` is missing on either event type, file a backend fix before continuing — there is no UI workaround. The session-summary feature has a worked example of the threading pattern in PR #54952. For Pattern B, the agent-mode threading pattern is in PR #55160.
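A minimal sketch of the `$session_id` coverage check from the table (Pattern A predicate shown; swap in the Pattern B filters if that's your scope):

```sql
-- Share of matching generations that carry $session_id (want this close to 1.0).
SELECT
    countIf(properties.$session_id IS NOT NULL) / count() AS session_id_coverage,
    count() AS total_generations
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  AND properties.$ai_trace_id LIKE '<your-prefix>%'
```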
Tools
| Tool | Purpose |
|---|---|
| `posthog:query-llm-traces-list` | Find sample traces matching the feature's pattern |
| | Inspect a specific trace's contents end-to-end |
| | Verify trace volume, session_id coverage, eval result distributions |
| `posthog:llma-evaluation-create` | (often unexposed — UI fallback: LLM analytics → Evaluations → New) Create the LLM-judge eval (disabled at first) |
| `posthog:llma-evaluation-run` | (often unexposed — UI fallback: the eval's detail page has a "Run on event" button) Dry-run the eval against specific generations during prompt iteration |
| `posthog:llma-evaluation-update` | (often unexposed — UI fallback: edit the eval in LLM analytics → Evaluations) Tweak the prompt / enable when ready |
| `posthog:llma-evaluation-summary-create` | (often unexposed — UI fallback: the eval detail page has a "Summarize results" button) After the feed is running, get an AI summary of pass/N/A patterns to validate signal quality |
| / | (often unexposed — UI: Data pipeline → Workflows) Browse existing workflow configs — useful for cloning an existing feed's structure when setting up a new one. Read-only; no create/update tool is exposed yet, so step 6's Slack workflow setup is UI-only. |
Before starting, check which of the `posthog:llma-evaluation-*` tools are actually exposed in your agent's MCP tool set. If they aren't loaded, treat steps 4-5 as UI walkthroughs rather than tool calls.
Workflow
Step 1 — Identify the filter
Pattern A (feature-native trace_id prefix): find the prefix that maps to your feature.
```sql
SELECT
    splitByChar(':', coalesce(properties.$ai_trace_id, ''))[1] AS root,
    splitByChar(':', coalesce(properties.$ai_trace_id, ''))[2] AS subtype,
    count() AS events
FROM events
WHERE timestamp > now() - INTERVAL 3 DAY
  AND event = '$ai_generation'
  AND properties.$ai_trace_id IS NOT NULL
GROUP BY root, subtype
ORDER BY events DESC
LIMIT 25
```
Note: `coalesce(properties.$ai_trace_id, '')` is load-bearing — `splitByChar` on a nullable column errors out in HogQL otherwise.
Pattern B (PostHog AI agent mode): verify coverage and volume for the mode you're targeting.
```sql
SELECT
    properties.agent_mode AS agent_mode,
    properties.supermode AS supermode,
    count() AS events,
    count(DISTINCT properties.$ai_trace_id) AS traces
FROM events
WHERE timestamp > now() - INTERVAL 3 DAY
  AND event = '$ai_generation'
  AND properties.ai_product = 'posthog_ai'
GROUP BY agent_mode, supermode
ORDER BY events DESC
LIMIT 20
```
Expected values for `agent_mode` correspond to the agent modes listed under Pattern B (error tracking, product analytics, session replay, SQL, flags, surveys, LLM analytics); worked example B filters on `error_tracking`. Null ≈ batch jobs + tool-internal calls (not user chat). `supermode` splits planning turns from execution turns — worth calling out separately if your feed is about plan-mode specifically.
Record the mode + rough volume. Low-volume modes (<100 events/day) will produce a trickle-feed that's hard to validate early; high-volume modes (>1k/day) may need sampling to avoid Slack flooding. See the "Tips" section on sampling.
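To size the feed before wiring anything, a per-day volume sketch (Pattern B predicate shown; swap in the Pattern A prefix filter as needed):

```sql
-- Daily volume for the targeted mode; informs whether you need sampling (see Tips).
SELECT
    toDate(timestamp) AS day,
    count() AS generations,
    count(DISTINCT properties.$ai_trace_id) AS traces
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  AND properties.ai_product = 'posthog_ai'
  AND properties.agent_mode = '<mode>'
GROUP BY day
ORDER BY day DESC
```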
Step 2 — Pull a handful of sample traces
Use these for prompt iteration in step 4.
Pattern A:
`posthog:query-llm-traces-list`:

```json
{
  "properties": [
    { "type": "event", "key": "$ai_trace_id", "operator": "icontains", "value": "<your-prefix-here>" }
  ],
  "limit": 10,
  "dateRange": { "date_from": "-2d" },
  "randomOrder": true
}
```
Pattern B:
`posthog:query-llm-traces-list`:

```json
{
  "properties": [
    { "type": "event", "key": "ai_product", "operator": "exact", "value": "posthog_ai" },
    { "type": "event", "key": "agent_mode", "operator": "exact", "value": "<mode-here>" }
  ],
  "limit": 10,
  "dateRange": { "date_from": "-2d" },
  "randomOrder": true
}
```
`randomOrder: true` matters — recency bias produces a non-representative sample. Pick 5-10 traces to test against.
Output size warning: even with `"limit": 10`, `posthog:query-llm-traces-list` routinely returns 3-6MB of JSON (full input/output per generation). This will blow your context window. Immediately delegate the summarization to a subagent the moment you see the "result exceeds maximum allowed tokens" error — ask the subagent to extract, per trace: the trace id, the first user message (truncated to ~300 chars), the sampled `agent_mode` (or trace prefix), and a one-sentence description of what the conversation was about. Don't try to read the raw file in-line.
Watch for topic drift in Pattern B samples. The `agent_mode` tag reflects the user's mode selection at the time of the turn — but chat state retains the mode even if the user drifts off-topic within the same conversation (e.g. user selected "error tracking" mode, then asked an unrelated pricing question three turns later). Your eval prompt's classification step needs to account for topic drift: PASS should mean "user is doing something recognizably in-scope for this mode", FAIL should catch the off-topic drift. If you don't, your feed will include irrelevant PASS entries that happen to carry the mode tag.
Step 3 — Draft the LLM-judge prompt
The prompt has two responsibilities: (a) classify the trace as relevant or not, (b) produce reasoning text that is directly postable to Slack (no preamble, no meta-description). The reasoning field becomes the Slack message body.
Template:
```text
You are analyzing a PostHog [FEATURE NAME] trace to extract its real use case.
Your reasoning text will be posted directly to a Slack channel as a notification.
Write it as a short, ready-to-post message — no preamble, no meta-description.

Step 1 — Classification:
- PASS = this trace is the [feature kind] you care about
- FAIL = a different LLM call or a false match
- N/A = ambiguous from the trace alone

Step 2 — Reasoning (only matters if PASS). Write 2-3 sentences in this exact format:
"[OPENER] [what they targeted/filtered for]. They were
trying to [understand X / debug Y / find Z]. The result surfaced [key pattern
or finding]."

Your output MUST start with the exact phrase "[OPENER]". No other opening is allowed.

Rules:
- No "This is a [feature]..." or "The input contains..." preamble
- No JSON, field names, system-prompt references, or meta-description
- Concrete > generic. "users hitting error tracking for the first time" beats "user behavior"
- If you cannot infer one of the three pieces from the trace, write "(unclear from trace)" in that slot — do not guess
```
Pick an [OPENER] that matches how users actually interact with the feature. The forced opener is load-bearing (it prevents the model from drifting into "this trace is a..." meta-description), but the exact verb has to fit the interaction:
| Feature / mode | OPENER |
|---|---|
| Session summary (group / single) | "A user ran a group summary on" (adapt for single-session summaries) |
| Replay AI search | "A user searched replays for" |
| PostHog AI in error tracking mode | "A user asked PostHog AI about" |
| PostHog AI in session replay mode | "A user asked PostHog AI about" |
| PostHog AI in SQL mode | "A user asked PostHog AI to write SQL for" |
Note: `supermode` is a sub-filter that layers on top of an `agent_mode` row — it's not its own mode. If you want plan-mode-only, filter `agent_mode='<mode>' AND supermode='plan'` and pick an opener like "A user asked PostHog AI to plan".
If you force "A user ran..." on a chat-based feature, the model will produce awkward contortions ("A user ran a question about...") that read wrong in Slack. The forced-opener pattern is the mechanism — the specific phrase is per-feature.
The negative example list ("No 'This is a...' preamble", etc.) is load-bearing regardless of opener. Don't remove it.
Step 4 — Create the eval (disabled), test, iterate
Create with `"enabled": false` so it doesn't immediately fan out to all traces.
If `posthog:llma-evaluation-create` is exposed, use this payload:
`posthog:llma-evaluation-create`:

```json
{
  "name": "[feature] use case feed",
  "description": "Extracts canonical use cases for [feature] for the #team-[area]-usage Slack feed",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "<full prompt from step 3>"
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "<provider>",
    "model": "<model>"
  },
  "enabled": false,
  "conditions": {
    "filters": [
      // Pattern A — feature-native trace_id prefix:
      { "key": "$ai_trace_id", "operator": "icontains", "value": "<your-prefix>" }
      // Pattern B — PostHog AI agent mode (use these INSTEAD of the trace_id filter):
      // { "key": "ai_product", "operator": "exact", "value": "posthog_ai" },
      // { "key": "agent_mode", "operator": "exact", "value": "<mode>" }
    ]
  }
}
```
Leave model choice to the user — LLM-judge cost scales linearly with event volume, and cheap-vs-capable is a real tradeoff they should make based on their own spend tolerance and signal-quality requirements. Don't pick for them.
UI fallback (when `posthog:llma-evaluation-create` isn't exposed): LLM analytics → Evaluations → New evaluation. Type = LLM judge, output = boolean + allow N/A, filters as above, enabled = off. Paste the prompt from step 3.
Then dry-run against your sample traces.
If `posthog:llma-evaluation-run` is exposed:
`posthog:llma-evaluation-run`:

```json
{
  "evaluationId": "<uuid from create>",
  "target_event_id": "<a $ai_generation event id from step 2>",
  "timestamp": "<ISO timestamp of that event>"
}
```
UI fallback: on the eval detail page, use the "Run on event" button with the trace sample's event id.
Look at the returned reasoning. If it preambles, drifts, or describes the input, fix the prompt (via `posthog:llma-evaluation-update` or by editing in the UI) and re-run. Iterate on 3-5 traces before enabling.
Common failure modes during iteration:
| Symptom | Fix |
|---|---|
| Reasoning starts with "This is a..." | Strengthen the forced opener instruction; add a counter-example |
| Reasoning is generic ("user behavior", "various patterns") | Add positive examples of concrete phrasing in the prompt |
| Model classifies everything as PASS | Tighten the FAIL definition; add an example of what a non-match looks like |
| Reasoning is too long for Slack | Add a hard sentence cap ("MAX 3 sentences, hard limit") |
Step 5 — Enable the eval
Once 3-5 sample runs produce clean Slack-ready output.
If `posthog:llma-evaluation-update` is exposed:
`posthog:llma-evaluation-update`:

```json
{
  "evaluationId": "<uuid>",
  "enabled": true
}
```
UI fallback: LLM analytics → Evaluations → open the eval → toggle enabled.
The eval will now run on every new matching `$ai_generation` event.
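Before moving on to the workflow, a quick check that the enabled eval is actually emitting results (a minimal sketch; the 1-hour window assumes the feature has fired recently):

```sql
-- Has the eval started producing results since being enabled?
SELECT count() AS evaluations_last_hour
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 HOUR
```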
Step 6 — Build the workflow (UI only)
Workflow setup is not MCP-accessible for writes (the workflow list / get tools are read-only). The steps below are a UI walkthrough.
Prereq: before you start, invite the PostHog Slack bot to your target channel (`/invite` it in the Slack channel). Without this, the Slack dispatch step will fail with an opaque permission error at send time, not at save time — easy to miss.
6.1 Create the workflow
Data pipeline → Workflows → New workflow. Name it to match the eval name from step 4.
6.2 Trigger step
- Event: the AI evaluation event — i.e. `$ai_evaluation`. This is the event emitted when an eval runs, and it's the only event that carries the `$ai_evaluation_*` properties. The original `$ai_generation` event is not enriched with eval results, so filtering on `$ai_generation` here matches nothing.
- Property filters (both required):
  - `$ai_evaluation_name` equals `<your eval name from step 4>`
  - `AI Evaluation Result (LLM)` equals `Pass`

⚠️ LOAD-BEARING: the stored values for `$ai_evaluation_result` are the strings `True` / `False` (plus an N/A variant) — NOT `Pass` / `Fail` / `N/A` (despite what the prompt template calls them internally). The Workflows UI property filter normalizes `Pass` → `True`, so selecting `Pass` from the dropdown works. But if you were wiring this in raw SQL somewhere else (say a hog function), you'd need the string literal. Verify the stored distribution before saving:
```sql
SELECT DISTINCT toString(properties.$ai_evaluation_result) AS result, count() AS n
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY result
```
If the only values are `True` / `False` (and an N/A variant) and `True` dominates, the UI `Pass` filter will match. If you see anything else, adjust accordingly.
6.3 Slack dispatch step
- Add step → Slack dispatch
- Channel: `#team-[area]-usage` (or wherever the feed should live)
- Sender / bot display name: something that reads well in the channel
- Blocks (Slack block-kit JSON) — paste this and replace `<project_id>` with your actual numeric project ID:
```json
[
  {
    "text": {
      "text": "<emoji> *{event.properties.$ai_evaluation_name}* triggered by *{person.name}*",
      "type": "mrkdwn"
    },
    "type": "section"
  },
  {
    "text": {
      "text": "{event.properties.$ai_evaluation_reasoning}",
      "type": "mrkdwn"
    },
    "type": "section"
  },
  {
    "type": "actions",
    "elements": [
      {
        "url": "https://us.posthog.com/project/<project_id>/llm-analytics/traces/{event.properties.$ai_trace_id}?event={event.properties.$ai_target_event_id}",
        "text": { "text": "View Trace", "type": "plain_text" },
        "type": "button"
      },
      {
        "url": "https://us.posthog.com/project/<project_id>/replay/{event.properties.$session_id}",
        "text": { "text": "View Trigger Session", "type": "plain_text" },
        "type": "button"
      },
      {
        "url": "{person.url}",
        "text": { "text": "View Person", "type": "plain_text" },
        "type": "button"
      }
    ]
  }
]
```
Pick an `<emoji>` that matches the feature's shape: 📊 product analytics, 🐛 error tracking, 🎬 session replay, 🔎 search/AI search, 🧪 experiments, 🚩 flags, 📋 surveys, 🧠 generic AI.
The `{event.properties.*}` and `{person.*}` placeholders are valid PostHog template syntax and resolve at send time.
6.4 Test before enabling
The Workflows Test panel has two modes — this matters because naively hitting "Test" can look like a broken integration when it isn't:
- Synthetic event (default) — the Test panel fabricates an `$ai_evaluation` payload and runs the flow without hitting Slack's real API. Useful as a dry-run of the block template, but placeholders may resolve to empty or null values and Slack's block validator will reject the payload with a validation error. That's a test-harness artifact, not a real bug — don't chase it.
- "Make real HTTPS requests" — flip this toggle on. Workflows then pulls a recent real event matching your filters and runs the flow end-to-end, including the actual Slack post. This is the test that tells you "it works" for real. If no matching real event exists yet (common if the eval was just enabled), trigger the feature yourself, wait ~1 minute, and retry.
Recommended flow: synthetic → sanity-check the block template renders → flip real-requests on → confirm an actual post lands in the channel → save + enable the workflow.
Step 7 — End-to-end verify in production
Once the workflow is enabled, trigger the feature yourself. Within a minute or two:
- The `$ai_generation` event should appear in LLM Analytics
- The eval should auto-run and emit an `$ai_evaluation` event
- The workflow should fire and the Slack post should land in the configured channel
- Click "View Trigger Session" — should land on the recording of you using the feature, not the replay homepage
If "View Trigger Session" lands on the replay homepage,
is missing on the
event (which is separate from the
event — threading is independent for the two). Backend fix needed — see prerequisites.
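To check whether the trigger-session link will keep resolving over time, the same coverage ratio from the prerequisites can be run against the eval events (a sketch, assuming the eval name from step 4):

```sql
-- $session_id coverage on eval events; this is what the "View Trigger Session" button relies on.
SELECT
    countIf(properties.$session_id IS NOT NULL) / count() AS session_id_coverage,
    count() AS total_evaluations
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 DAY
```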
Worked example A (Pattern A): group session summary use cases
Pattern: a `group_summary_use_case_feed` eval streaming to a team usage channel, scoped on the feature's `$ai_trace_id` prefix. Opener: "A user ran a group summary on". Slack channel showed e.g.:
📊 group_summary_use_case_feed triggered by some user
"A user ran a group summary on a company's onboarding sessions from the last 7 days. They were trying to understand why account activation rates are low. The summary surfaced that most users abandon at the company onboarding wizard after creating accounts."
[View Trace] [View Trigger Session] [View Person]
The PRs that made this work (linked here as worked examples of the session_id threading pattern, not as steps in the skill itself):
- PostHog/posthog#54952 — threads `$session_id` through to `$ai_generation` events on the session summary backend
- (Followup PR — threads `$session_id` onto `$ai_evaluation` events specifically)
Worked example B (Pattern B): PostHog AI in error tracking mode
Pattern: an `agent_mode = 'error_tracking'` scoped feed streaming to a team usage channel, answering "what are users actually trying to DO when they chat with PostHog AI in error tracking mode?" Mode sizing varies by an order of magnitude or more across agent modes — spot-check volume per §Step 1 before wiring, because a high-volume mode can flood a channel. Opener: "A user asked PostHog AI about".
Enabling PR: PostHog/posthog#55160 — threads `ai_product` and `agent_mode` onto every `$ai_generation` emitted by the chat agent loop. Wiring lives in `ee/hogai/core/agent_modes/executables.py` (`AgentExecutable._get_model`), which passes the dict through an existing properties field so it ends up on every generation event. Before this PR, scoping a PostHog AI eval to a specific mode wasn't possible — you'd end up evaluating every PostHog AI generation, which produced noisy feeds with low single-digit PASS rates.
Key observation from setup: the `agent_mode` tag reflects the mode at turn-time, but chat state retains mode selection even when users drift off-topic mid-conversation. Spot-check: a random `agent_mode=error_tracking` sample included a conversation that ended up being about session replay pricing. The eval prompt's classification must account for topic drift — PASS only when the turn is recognizably in-scope for the mode, FAIL when the conversation has drifted to something else entirely.
Validating signal quality after launch
Once the feed has been running for a day or two, sanity-check the eval output at scale.
If `posthog:llma-evaluation-summary-create` is exposed:
`posthog:llma-evaluation-summary-create`:

```json
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```
UI fallback: open the eval in LLM analytics → Evaluations → "Summarize results" button, filter = fail.
If the FAIL bucket is large, the classification step is too strict — relax it. If the PASS bucket has lots of generic reasonings, iterate on the prompt to enforce concreteness. The summary tool gives a quick read on this without you having to scroll through individual events.
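Alongside the summary tool, a raw distribution-plus-length check gives a quick read on both failure modes at once (a sketch; remember the stored result strings are `True` / `False`, not `Pass` / `Fail`):

```sql
-- Result mix plus reasoning length; a large FAIL bucket or bloated reasonings both warrant prompt tweaks.
SELECT
    toString(properties.$ai_evaluation_result) AS result,
    count() AS n,
    round(avg(length(toString(properties.$ai_evaluation_reasoning)))) AS avg_reasoning_chars
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 2 DAY
GROUP BY result
ORDER BY n DESC
```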
Spot-check raw events when needed (note: the stored result value is `True`, not `Pass` — see step 6):
```sql
SELECT
    properties.$ai_evaluation_reasoning AS reasoning,
    properties.$ai_trace_id AS trace_id,
    timestamp
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND properties.$ai_evaluation_result = 'True'
  AND timestamp > now() - INTERVAL 1 DAY
ORDER BY timestamp DESC
LIMIT 25
```
Tips
- The reasoning field IS the Slack message — design the prompt for that, not for "chain of thought before classification." Models can produce structured Slack-ready text in one pass.
- LLM judges are non-deterministic across reruns. Expect 1-5% noise even with a fixed prompt and model. If you need reproducibility, pin a deterministic provider/seed in `model_configuration`.
- Keep the eval scoped tightly via the `icontains` filter on the trace prefix (or the `ai_product` + `agent_mode` filters for Pattern B). Otherwise it fans out to every `$ai_generation` event in the project and burns LLM cost.
- For high-volume features (>10k traces/week), consider sampling — set the eval to run on a percentage of matching events rather than all of them. Slack flooding is a real failure mode.
- The "View Trigger Session" button is the highest-value link in the alert. Without it, the feed is just text — you can't watch what the user was actually doing. Verify it works in step 7 before considering the feed shipped.
- Once the feed is live, periodically re-run the eval summary tool with `filter: "pass"` to surface the dominant use case clusters. That's how you turn the feed into actual product insights instead of just a notification stream.