Building a feature usage feed via LLM evals
Some PostHog features (group session summaries, single session summaries, replay AI search, error tracking AI debug, etc.) generate hundreds or thousands of LLM traces per week. Reading them by hand is not feasible. This skill covers the end-to-end pattern for turning that trace volume into a live Slack feed of canonical use cases — what users are actually doing with the feature.
The workflow is mixed, and leans UI. Trace inspection and filter discovery (steps 1-2) are MCP-driven. Eval creation, dry-running, and enabling (steps 4-5) are MCP-driven when `posthog:llma-evaluation-*` tools are exposed to your agent — but they often aren't, in which case fall back to the UI (Data pipeline → Workflows for the alert is always UI). Each step flags its UI fallback. Expect to finish in the UI even when you start from chat.
When to use
- "How are people actually using [feature X] in production?"
- "Can we identify the canonical use cases for [feature X] so we can write better docs / prioritize improvements?"
- "I want a Slack feed of representative usage examples without manually skimming traces."
- "Set up a feed of use cases for [feature X] in #team-[area]-usage."
If the user just wants to debug a single trace or tune an existing eval, redirect to the single-trace debugging skill or `exploring-llm-evaluations` instead.
Two filter patterns
This skill supports two different ways to scope an eval to "the feature you care about":
Pattern A — Feature-native trace_id prefix. For standalone features that emit their own `$ai_trace_id` prefix pattern (e.g. group/single session summaries, error-tracking-specific flows). Filter on the prefix.
Pattern B — PostHog AI agent mode. For features the user interacts with via PostHog AI in a specific agent mode (error tracking, product analytics, session replay, SQL, flags, surveys, LLM analytics). Filter on `ai_product = 'posthog_ai' AND agent_mode = '<mode>'`. This requires PR #55160 (merged April 2026) to be deployed, which threads `ai_product` and `agent_mode` onto every `$ai_generation` emitted by the chat agent loop. A useful ergonomic side-effect: the `ai_product` / `agent_mode` filter is a reliable "user-facing chat turn" filter — batch jobs and tool-internal LLM calls go through different code paths and leave these properties null, so they're excluded for free.
If the user asks "what are users trying to DO in [ET / replay / SQL / flags / surveys] mode of PostHog AI", that's Pattern B. If they ask "what use cases does [standalone feature] cover", that's Pattern A. Pick the pattern first — the prompt, filter, and Slack channel naming all follow from it.
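Either scope can be sanity-checked directly in HogQL before touching any eval config. A minimal sketch, with `<your-prefix>` and `<mode>` as placeholders you fill in after step 1:

```sql
-- Quick scope check: how many generations match the pattern you picked?
SELECT count() AS matching_generations
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  -- Pattern A: feature-native trace_id prefix
  AND properties.$ai_trace_id LIKE '<your-prefix>%'
  -- Pattern B: use these two lines instead of the LIKE above
  -- AND properties.ai_product = 'posthog_ai'
  -- AND properties.agent_mode = '<mode>'
```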
Prerequisites
| Requirement | How to verify |
|---|---|
| (Pattern A) Feature emits `$ai_generation` events with a stable `$ai_trace_id` prefix pattern | Group by prefix (step 1 query) and check for distinct prefixes |
| (Pattern B) `agent_mode` is present on recent `$ai_generation` events | Group-by on recent events (step 1 query). Null bucket is normal (batch jobs + tool-internal calls) — you want non-null coverage across the modes you care about. |
| `$session_id` is attached to the `$ai_generation` events (links trace to trigger session) | Query for `countIf($session_id IS NOT NULL) / count()` — see the sketch below |
| `$session_id` is also attached to the `$ai_evaluation` events (lets the Slack alert link to the session) | Same query but on `$ai_evaluation` events after the eval has run once |
| User has organisation-level AI data processing approval | Required for evaluations and the eval summary tool |

If `$session_id` is missing on either event type, file a backend fix before continuing — there is no UI workaround. The session-summary feature has a worked example of the threading pattern in PR #54952. For Pattern B, the agent-mode threading pattern is in PR #55160.
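A minimal sketch of the `$session_id` coverage check from the table (Pattern A predicate shown; swap in the Pattern B filters if that's your scope):

```sql
-- Share of matching generations that carry $session_id (want this close to 1.0).
SELECT
    countIf(properties.$session_id IS NOT NULL) / count() AS session_id_coverage,
    count() AS total_generations
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  AND properties.$ai_trace_id LIKE '<your-prefix>%'
```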
Tools
| Tool | Purpose |
|---|---|
| `posthog:query-llm-traces-list` | Find sample traces matching the feature's pattern |
| | Inspect a specific trace's contents end-to-end |
| | Verify trace volume, session_id coverage, eval result distributions |
| `posthog:llma-evaluation-create` | (often unexposed — UI fallback: LLM analytics → Evaluations → New) Create the LLM-judge eval (disabled at first) |
| `posthog:llma-evaluation-run` | (often unexposed — UI fallback: the eval's detail page has a "Run on event" button) Dry-run the eval against specific generations during prompt iteration |
| `posthog:llma-evaluation-update` | (often unexposed — UI fallback: edit the eval in LLM analytics → Evaluations) Tweak the prompt / enable when ready |
| `posthog:llma-evaluation-summary-create` | (often unexposed — UI fallback: the eval detail page has a "Summarize results" button) After the feed is running, get an AI summary of pass/N/A patterns to validate signal quality |
| / | (often unexposed — UI: Data pipeline → Workflows) Browse existing workflow configs — useful for cloning an existing feed's structure when setting up a new one. Read-only; no create/update tool is exposed yet, so step 6's Slack workflow setup is UI-only. |
Before starting, check which of the `posthog:llma-evaluation-*` tools are actually exposed in your agent's MCP tool set. If they aren't loaded, treat steps 4-5 as UI walkthroughs rather than tool calls.
Workflow
Step 1 — Identify the filter
Pattern A (feature-native trace_id prefix): find the prefix that maps to your feature.
```sql
SELECT
    splitByChar(':', coalesce(properties.$ai_trace_id, ''))[1] AS root,
    splitByChar(':', coalesce(properties.$ai_trace_id, ''))[2] AS subtype,
    count() AS events
FROM events
WHERE timestamp > now() - INTERVAL 3 DAY
  AND event = '$ai_generation'
  AND properties.$ai_trace_id IS NOT NULL
GROUP BY root, subtype
ORDER BY events DESC
LIMIT 25
```
Note: `coalesce(properties.$ai_trace_id, '')` is load-bearing — `splitByChar` on a nullable column errors out in HogQL otherwise.
Pattern B (PostHog AI agent mode): verify coverage and volume for the mode you're targeting.
```sql
SELECT
    properties.agent_mode AS agent_mode,
    properties.supermode AS supermode,
    count() AS events,
    count(DISTINCT properties.$ai_trace_id) AS traces
FROM events
WHERE timestamp > now() - INTERVAL 3 DAY
  AND event = '$ai_generation'
  AND properties.ai_product = 'posthog_ai'
GROUP BY agent_mode, supermode
ORDER BY events DESC
LIMIT 20
```
Expected values for `agent_mode` correspond to the agent modes listed under Pattern B (error tracking, product analytics, session replay, SQL, flags, surveys, LLM analytics); worked example B filters on `error_tracking`. Null ≈ batch jobs + tool-internal calls (not user chat). `supermode` splits planning turns from execution turns — worth calling out separately if your feed is about plan-mode specifically.
Record the mode + rough volume. Low-volume modes (<100 events/day) will produce a trickle-feed that's hard to validate early; high-volume modes (>1k/day) may need sampling to avoid Slack flooding. See the "Tips" section on sampling.
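To size the feed before wiring anything, a per-day volume sketch (Pattern B predicate shown; swap in the Pattern A prefix filter as needed):

```sql
-- Daily volume for the targeted mode; informs whether you need sampling (see Tips).
SELECT
    toDate(timestamp) AS day,
    count() AS generations,
    count(DISTINCT properties.$ai_trace_id) AS traces
FROM events
WHERE event = '$ai_generation'
  AND timestamp > now() - INTERVAL 7 DAY
  AND properties.ai_product = 'posthog_ai'
  AND properties.agent_mode = '<mode>'
GROUP BY day
ORDER BY day DESC
```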
Step 2 — Pull a handful of sample traces
Use these for prompt iteration in step 4.
Pattern A:
`posthog:query-llm-traces-list`:

```json
{
  "properties": [
    { "type": "event", "key": "$ai_trace_id", "operator": "icontains", "value": "<your-prefix-here>" }
  ],
  "limit": 10,
  "dateRange": { "date_from": "-2d" },
  "randomOrder": true
}
```
Pattern B:
`posthog:query-llm-traces-list`:

```json
{
  "properties": [
    { "type": "event", "key": "ai_product", "operator": "exact", "value": "posthog_ai" },
    { "type": "event", "key": "agent_mode", "operator": "exact", "value": "<mode-here>" }
  ],
  "limit": 10,
  "dateRange": { "date_from": "-2d" },
  "randomOrder": true
}
```
`randomOrder: true` matters — recency bias produces a non-representative sample. Pick 5-10 traces to test against.
Output size warning: even with `"limit": 10`, `posthog:query-llm-traces-list` routinely returns 3-6MB of JSON (full input/output per generation). This will blow your context window. Immediately delegate the summarization to a subagent the moment you see the "result exceeds maximum allowed tokens" error — ask the subagent to extract, per trace: the trace id, the first user message (truncated to ~300 chars), the sampled `agent_mode` (or trace prefix), and a one-sentence description of what the conversation was about. Don't try to read the raw file in-line.
Watch for topic drift in Pattern B samples. The `agent_mode` tag reflects the user's mode selection at the time of the turn — but chat state retains the mode even if the user drifts off-topic within the same conversation (e.g. user selected "error tracking" mode, then asked an unrelated pricing question three turns later). Your eval prompt's classification step needs to account for topic drift: PASS should mean "user is doing something recognizably in-scope for this mode", FAIL should catch the off-topic drift. If you don't, your feed will include irrelevant PASS entries that happen to carry the mode tag.
Step 3 — Draft the LLM-judge prompt
The prompt has two responsibilities: (a) classify the trace as relevant or not, (b) produce reasoning text that is directly postable to Slack (no preamble, no meta-description). The reasoning field becomes the Slack message body.
Template:
```text
You are analyzing a PostHog [FEATURE NAME] trace to extract its real use case.
Your reasoning text will be posted directly to a Slack channel as a notification.
Write it as a short, ready-to-post message — no preamble, no meta-description.

Step 1 — Classification:
- PASS = this trace is the [feature kind] you care about
- FAIL = a different LLM call or a false match
- N/A = ambiguous from the trace alone

Step 2 — Reasoning (only matters if PASS). Write 2-3 sentences in this exact format:
"[OPENER] [what they targeted/filtered for]. They were
trying to [understand X / debug Y / find Z]. The result surfaced [key pattern
or finding]."

Your output MUST start with the exact phrase "[OPENER]". No other opening is allowed.

Rules:
- No "This is a [feature]..." or "The input contains..." preamble
- No JSON, field names, system-prompt references, or meta-description
- Concrete > generic. "users hitting error tracking for the first time" beats "user behavior"
- If you cannot infer one of the three pieces from the trace, write "(unclear from trace)" in that slot — do not guess
```
Pick an [OPENER] that matches how users actually interact with the feature. The forced opener is load-bearing (it prevents the model from drifting into "this trace is a..." meta-description), but the exact verb has to fit the interaction:
| Feature / mode | OPENER |
|---|---|
| Session summary (group / single) | "A user ran a group summary on" (adapt for single-session summaries) |
| Replay AI search | "A user searched replays for" |
| PostHog AI in error tracking mode | "A user asked PostHog AI about" |
| PostHog AI in session replay mode | "A user asked PostHog AI about" |
| PostHog AI in SQL mode | "A user asked PostHog AI to write SQL for" |
Note: `supermode` is a sub-filter that layers on top of an `agent_mode` row — it's not its own mode. If you want plan-mode-only, filter `agent_mode='<mode>' AND supermode='plan'` and pick an opener like "A user asked PostHog AI to plan".
If you force "A user ran..." on a chat-based feature, the model will produce awkward contortions ("A user ran a question about...") that read wrong in Slack. The forced-opener pattern is the mechanism — the specific phrase is per-feature.
The negative example list ("No 'This is a...' preamble", etc.) is load-bearing regardless of opener. Don't remove it.
Step 4 — Create the eval (disabled), test, iterate
Create with `"enabled": false` so it doesn't immediately fan out to all traces.
If `posthog:llma-evaluation-create` is exposed, use this payload:
`posthog:llma-evaluation-create`:

```json
{
  "name": "[feature] use case feed",
  "description": "Extracts canonical use cases for [feature] for the #team-[area]-usage Slack feed",
  "evaluation_type": "llm_judge",
  "evaluation_config": {
    "prompt": "<full prompt from step 3>"
  },
  "output_type": "boolean",
  "output_config": { "allows_na": true },
  "model_configuration": {
    "provider": "<provider>",
    "model": "<model>"
  },
  "enabled": false,
  "conditions": {
    "filters": [
      // Pattern A — feature-native trace_id prefix:
      { "key": "$ai_trace_id", "operator": "icontains", "value": "<your-prefix>" }
      // Pattern B — PostHog AI agent mode (use these INSTEAD of the trace_id filter):
      // { "key": "ai_product", "operator": "exact", "value": "posthog_ai" },
      // { "key": "agent_mode", "operator": "exact", "value": "<mode>" }
    ]
  }
}
```
Leave model choice to the user — LLM-judge cost scales linearly with event volume, and cheap-vs-capable is a real tradeoff they should make based on their own spend tolerance and signal-quality requirements. Don't pick for them.
UI fallback (when `posthog:llma-evaluation-create` isn't exposed): LLM analytics → Evaluations → New evaluation. Type = LLM judge, output = boolean + allow N/A, filters as above, enabled = off. Paste the prompt from step 3.
Then dry-run against your sample traces.
If `posthog:llma-evaluation-run` is exposed:
`posthog:llma-evaluation-run`:

```json
{
  "evaluationId": "<uuid from create>",
  "target_event_id": "<a $ai_generation event id from step 2>",
  "timestamp": "<ISO timestamp of that event>"
}
```
UI fallback: on the eval detail page, use the "Run on event" button with the trace sample's event id.
Look at the returned reasoning. If it preambles, drifts, or describes the input, fix the prompt (via `posthog:llma-evaluation-update` or by editing in the UI) and re-run. Iterate on 3-5 traces before enabling.
Common failure modes during iteration:
| Symptom | Fix |
|---|---|
| Reasoning starts with "This is a..." | Strengthen the forced opener instruction; add a counter-example |
| Reasoning is generic ("user behavior", "various patterns") | Add positive examples of concrete phrasing in the prompt |
| Model classifies everything as PASS | Tighten the FAIL definition; add an example of what a non-match looks like |
| Reasoning is too long for Slack | Add a hard sentence cap ("MAX 3 sentences, hard limit") |
Step 5 — Enable the eval
Once 3-5 sample runs produce clean Slack-ready output.
If `posthog:llma-evaluation-update` is exposed:
`posthog:llma-evaluation-update`:

```json
{
  "evaluationId": "<uuid>",
  "enabled": true
}
```
UI fallback: LLM analytics → Evaluations → open the eval → toggle enabled.
The eval will now run on every new matching `$ai_generation` event.
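Before moving on to the workflow, a quick check that the enabled eval is actually emitting results (a minimal sketch; the 1-hour window assumes the feature has fired recently):

```sql
-- Has the eval started producing results since being enabled?
SELECT count() AS evaluations_last_hour
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 HOUR
```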
Step 6 — Build the workflow (UI only)
Workflow setup is not MCP-accessible for writes (the workflow list / get tools are read-only). The steps below are a UI walkthrough.
Prereq: before you start, invite the PostHog Slack bot to your target channel (`/invite` it in the Slack channel). Without this, the Slack dispatch step will fail with an opaque permission error at send time, not at save time — easy to miss.
6.1 Create the workflow
Data pipeline → Workflows → New workflow. Name it to match the eval name from step 4.
6.2 Trigger step
- Event: the AI evaluation event — i.e. `$ai_evaluation`. This is the event emitted when an eval runs, and it's the only event that carries the `$ai_evaluation_*` properties. The original `$ai_generation` event is not enriched with eval results, so filtering on `$ai_generation` here matches nothing.
- Property filters (both required):
  - `$ai_evaluation_name` equals `<your eval name from step 4>`
  - `AI Evaluation Result (LLM)` equals `Pass`

⚠️ LOAD-BEARING: the stored values for `$ai_evaluation_result` are the strings `True` / `False` (plus an N/A variant) — NOT `Pass` / `Fail` / `N/A` (despite what the prompt template calls them internally). The Workflows UI property filter normalizes `Pass` → `True`, so selecting `Pass` from the dropdown works. But if you were wiring this in raw SQL somewhere else (say a hog function), you'd need the string literal. Verify the stored distribution before saving:
```sql
SELECT DISTINCT toString(properties.$ai_evaluation_result) AS result, count() AS n
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY result
```
If the only values are `True` / `False` (and an N/A variant) and `True` dominates, the UI `Pass` filter will match. If you see anything else, adjust accordingly.
6.3 Slack dispatch step
- Add step → Slack dispatch
- Channel: `#team-[area]-usage` (or wherever the feed should live)
- Sender / bot display name: something that reads well in the channel
- Blocks (Slack block-kit JSON) — paste this and replace `<project_id>` with your actual numeric project ID:
```json
[
  {
    "text": {
      "text": "<emoji> *{event.properties.$ai_evaluation_name}* triggered by *{person.name}*",
      "type": "mrkdwn"
    },
    "type": "section"
  },
  {
    "text": {
      "text": "{event.properties.$ai_evaluation_reasoning}",
      "type": "mrkdwn"
    },
    "type": "section"
  },
  {
    "type": "actions",
    "elements": [
      {
        "url": "https://us.posthog.com/project/<project_id>/llm-analytics/traces/{event.properties.$ai_trace_id}?event={event.properties.$ai_target_event_id}",
        "text": { "text": "View Trace", "type": "plain_text" },
        "type": "button"
      },
      {
        "url": "https://us.posthog.com/project/<project_id>/replay/{event.properties.$session_id}",
        "text": { "text": "View Trigger Session", "type": "plain_text" },
        "type": "button"
      },
      {
        "url": "{person.url}",
        "text": { "text": "View Person", "type": "plain_text" },
        "type": "button"
      }
    ]
  }
]
```
Pick an `<emoji>` that matches the feature's shape: 📊 product analytics, 🐛 error tracking, 🎬 session replay, 🔎 search/AI search, 🧪 experiments, 🚩 flags, 📋 surveys, 🧠 generic AI.
The `{event.properties.*}` and `{person.*}` placeholders are valid PostHog template syntax and resolve at send time.
6.4 Test before enabling
The Workflows Test panel has two modes — this matters because naively hitting "Test" can look like a broken integration when it isn't:
- Synthetic event (default) — the Test panel fabricates an `$ai_evaluation` payload and runs the flow without hitting Slack's real API. Useful as a dry-run of the block template, but placeholders may resolve to empty or null values and Slack's block validator will reject the payload with a validation error. That's a test-harness artifact, not a real bug — don't chase it.
- "Make real HTTPS requests" — flip this toggle on. Workflows then pulls a recent real event matching your filters and runs the flow end-to-end, including the actual Slack post. This is the test that tells you "it works" for real. If no matching real event exists yet (common if the eval was just enabled), trigger the feature yourself, wait ~1 minute, and retry.
Recommended flow: synthetic → sanity-check the block template renders → flip real-requests on → confirm an actual post lands in the channel → save + enable the workflow.
Step 7 — End-to-end verify in production
Once the workflow is enabled, trigger the feature yourself. Within a minute or two:
- The `$ai_generation` event should appear in LLM Analytics
- The eval should auto-run and emit an `$ai_evaluation` event
- The workflow should fire and the Slack post should land in the configured channel
- Click "View Trigger Session" — should land on the recording of you using the feature, not the replay homepage
If "View Trigger Session" lands on the replay homepage,
is missing on the
event (which is separate from the
event — threading is independent for the two). Backend fix needed — see prerequisites.
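To check whether the trigger-session link will keep resolving over time, the same coverage ratio from the prerequisites can be run against the eval events (a sketch, assuming the eval name from step 4):

```sql
-- $session_id coverage on eval events; this is what the "View Trigger Session" button relies on.
SELECT
    countIf(properties.$session_id IS NOT NULL) / count() AS session_id_coverage,
    count() AS total_evaluations
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 1 DAY
```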
Worked example A (Pattern A): group session summary use cases
Pattern: a `group_summary_use_case_feed` eval streaming to a team usage channel, scoped on the feature's `$ai_trace_id` prefix. Opener: "A user ran a group summary on". Slack channel showed e.g.:
📊 group_summary_use_case_feed triggered by some user
"A user ran a group summary on a company's onboarding sessions from the last 7 days. They were trying to understand why account activation rates are low. The summary surfaced that most users abandon at the company onboarding wizard after creating accounts."
[View Trace] [View Trigger Session] [View Person]
The PRs that made this work (linked here as worked examples of the session_id threading pattern, not as steps in the skill itself):
- PostHog/posthog#54952 — threads `$session_id` through to `$ai_generation` events on the session summary backend
- (Followup PR — threads `$session_id` onto `$ai_evaluation` events specifically)
Worked example B (Pattern B): PostHog AI in error tracking mode
Pattern: an `agent_mode = 'error_tracking'` scoped feed streaming to a team usage channel, answering "what are users actually trying to DO when they chat with PostHog AI in error tracking mode?" Mode sizing varies by an order of magnitude or more across agent modes — spot-check volume per §Step 1 before wiring, because a high-volume mode can flood a channel. Opener: "A user asked PostHog AI about".
Enabling PR: PostHog/posthog#55160 — threads `ai_product` and `agent_mode` onto every `$ai_generation` emitted by the chat agent loop. Wiring lives in `ee/hogai/core/agent_modes/executables.py` (`AgentExecutable._get_model`), which passes the dict through an existing properties field so it ends up on every generation event. Before this PR, scoping a PostHog AI eval to a specific mode wasn't possible — you'd end up evaluating every PostHog AI generation, which produced noisy feeds with low single-digit PASS rates.
Key observation from setup: the `agent_mode` tag reflects the mode at turn-time, but chat state retains mode selection even when users drift off-topic mid-conversation. Spot-check: a random `agent_mode=error_tracking` sample included a conversation that ended up being about session replay pricing. The eval prompt's classification must account for topic drift — PASS only when the turn is recognizably in-scope for the mode, FAIL when the conversation has drifted to something else entirely.
Validating signal quality after launch
Once the feed has been running for a day or two, sanity-check the eval output at scale.
If `posthog:llma-evaluation-summary-create` is exposed:
`posthog:llma-evaluation-summary-create`:

```json
{
  "evaluation_id": "<uuid>",
  "filter": "fail"
}
```
UI fallback: open the eval in LLM analytics → Evaluations → "Summarize results" button, filter = fail.
If the FAIL bucket is large, the classification step is too strict — relax it. If the PASS bucket has lots of generic reasonings, iterate on the prompt to enforce concreteness. The summary tool gives a quick read on this without you having to scroll through individual events.
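Alongside the summary tool, a raw distribution-plus-length check gives a quick read on both failure modes at once (a sketch; remember the stored result strings are `True` / `False`, not `Pass` / `Fail`):

```sql
-- Result mix plus reasoning length; a large FAIL bucket or bloated reasonings both warrant prompt tweaks.
SELECT
    toString(properties.$ai_evaluation_result) AS result,
    count() AS n,
    round(avg(length(toString(properties.$ai_evaluation_reasoning)))) AS avg_reasoning_chars
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND timestamp > now() - INTERVAL 2 DAY
GROUP BY result
ORDER BY n DESC
```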
Spot-check raw events when needed (note: the stored result value is `True`, not `Pass` — see step 6):
```sql
SELECT
    properties.$ai_evaluation_reasoning AS reasoning,
    properties.$ai_trace_id AS trace_id,
    timestamp
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_name = '<your eval name>'
  AND properties.$ai_evaluation_result = 'True'
  AND timestamp > now() - INTERVAL 1 DAY
ORDER BY timestamp DESC
LIMIT 25
```
Tips
- The reasoning field IS the Slack message — design the prompt for that, not for "chain of thought before classification." Models can produce structured Slack-ready text in one pass.
- LLM judges are non-deterministic across reruns. Expect 1-5% noise even with a fixed prompt and model. If you need reproducibility, pin a deterministic provider/seed in `model_configuration`.
- Keep the eval scoped tightly via the `icontains` filter on the trace prefix (or the `ai_product` + `agent_mode` filters for Pattern B). Otherwise it fans out to every `$ai_generation` event in the project and burns LLM cost.
- For high-volume features (>10k traces/week), consider sampling — set the eval to run on a percentage of matching events rather than all of them. Slack flooding is a real failure mode.
- The "View Trigger Session" button is the highest-value link in the alert. Without it, the feed is just text — you can't watch what the user was actually doing. Verify it works in step 7 before considering the feed shipped.
- Once the feed is live, periodically re-run the eval summary tool with `filter: "pass"` to surface the dominant use case clusters. That's how you turn the feed into actual product insights instead of just a notification stream.