exploring-llm-clusters

Exploring LLM clusters

Use this skill when investigating LLM analytics clusters — understanding what patterns exist in your AI/LLM traffic, comparing cluster behavior, and drilling into individual clusters.

Tools

  • posthog:llma-clustering-job-list — List clustering job configurations for the team
  • posthog:llma-clustering-job-get — Get a specific clustering job by ID
  • posthog:execute-sql — Query cluster run events and compute metrics
  • posthog:query-llm-traces-list — Find traces belonging to a cluster
  • posthog:query-llm-trace — Inspect a specific trace in detail

How clustering works

PostHog clusters LLM traces (or individual generations) by embedding similarity. A Temporal workflow runs periodically or on demand, producing cluster events stored as $ai_trace_clusters (trace-level) or $ai_generation_clusters (generation-level).
Each cluster event contains:
  • $ai_clustering_run_id — unique run identifier (format: <team_id>_<level>_<YYYYMMDD>_<HHMMSS>[_<job_id>])
  • $ai_clustering_level — "trace" or "generation"
  • $ai_window_start / $ai_window_end — time window analyzed
  • $ai_total_items_analyzed — number of traces/generations processed
  • $ai_clusters — JSON array of cluster objects
  • $ai_clustering_params — algorithm parameters used
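The run-identifier format above can be unpacked with a small helper. A minimal sketch in Python (the parse_run_id helper and its field names are mine, not part of any PostHog API):

```python
import re

# Matches <team_id>_<level>_<YYYYMMDD>_<HHMMSS>[_<job_id>]
RUN_ID_RE = re.compile(
    r"^(?P<team_id>\d+)_(?P<level>trace|generation)"
    r"_(?P<date>\d{8})_(?P<time>\d{6})(?:_(?P<job_id>.+))?$"
)

def parse_run_id(run_id: str) -> dict:
    """Split a $ai_clustering_run_id into its components."""
    m = RUN_ID_RE.match(run_id)
    if m is None:
        raise ValueError(f"unrecognized run_id: {run_id!r}")
    return m.groupdict()
```

Useful when grouping run IDs by job or level before querying further.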

Cluster object shape (inside $ai_clusters)

```json
{
  "cluster_id": 0,
  "size": 42,
  "title": "User authentication flows",
  "description": "Traces involving login, signup, and token refresh operations",
  "traces": {
    "<trace_or_generation_id>": {
      "distance_to_centroid": 0.123,
      "rank": 0,
      "x": -2.34,
      "y": 1.56,
      "timestamp": "2026-03-28T10:00:00Z",
      "trace_id": "abc-123",
      "generation_id": "gen-456"
    }
  },
  "centroid_x": -2.1,
  "centroid_y": 1.4
}
```
  • cluster_id: -1 is the noise/outlier cluster (items that didn't fit any cluster)
  • Items in traces are keyed by trace ID (trace-level) or generation event UUID (generation-level)
  • rank orders items by proximity to centroid (0 = closest)
  • x, y are 2D coordinates for visualization (UMAP/PCA/t-SNE reduced)
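Given the object shape above, a $ai_clusters payload can be reduced to a readable overview. A sketch under that shape (the summarize_clusters helper is illustrative, not a PostHog API):

```python
import json

def summarize_clusters(ai_clusters_json: str, top_n: int = 3) -> list[dict]:
    """Summarize each cluster: size, title, and the member IDs
    closest to the centroid (lowest rank first), largest cluster first."""
    clusters = json.loads(ai_clusters_json)
    out = []
    for c in sorted(clusters, key=lambda c: c["size"], reverse=True):
        # Sort members by rank: rank 0 is the most representative item.
        items = sorted(c["traces"].items(), key=lambda kv: kv[1]["rank"])
        out.append({
            "cluster_id": c["cluster_id"],
            "size": c["size"],
            "title": c.get("title"),
            "is_noise": c["cluster_id"] == -1,
            "top_items": [item_id for item_id, _ in items[:top_n]],
        })
    return out
```

The top_items of each summary are good candidates for deep inspection with query-llm-trace.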

Clustering jobs

Each team can have up to 5 clustering jobs. A job defines:
  • name — human-readable label
  • analysis_level — "trace" or "generation"
  • event_filters — property filters scoping which traces are included
  • enabled — whether the job runs on schedule
Default jobs named "Default - trace" and "Default - generation" are created automatically; a default job is disabled when a custom job is created for the same level.

Workflow: explore clusters

Step 1 — List recent clustering runs

posthog:execute-sql

```sql
SELECT
    JSONExtractString(properties, '$ai_clustering_run_id') as run_id,
    JSONExtractString(properties, '$ai_clustering_level') as level,
    JSONExtractString(properties, '$ai_window_start') as window_start,
    JSONExtractString(properties, '$ai_window_end') as window_end,
    JSONExtractInt(properties, '$ai_total_items_analyzed') as total_items,
    timestamp
FROM events
WHERE event IN ('$ai_trace_clusters', '$ai_generation_clusters')
    AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 10
```

Step 2 — Get clusters from a specific run

posthog:execute-sql

```sql
SELECT
    JSONExtractString(properties, '$ai_clustering_run_id') as run_id,
    JSONExtractString(properties, '$ai_clustering_level') as level,
    JSONExtractString(properties, '$ai_clustering_job_id') as job_id,
    JSONExtractString(properties, '$ai_clustering_job_name') as job_name,
    JSONExtractString(properties, '$ai_window_start') as window_start,
    JSONExtractString(properties, '$ai_window_end') as window_end,
    JSONExtractInt(properties, '$ai_total_items_analyzed') as total_items,
    JSONExtractRaw(properties, '$ai_clusters') as clusters,
    JSONExtractRaw(properties, '$ai_clustering_params') as params
FROM events
WHERE event IN ('$ai_trace_clusters', '$ai_generation_clusters')
    AND JSONExtractString(properties, '$ai_clustering_run_id') = '<run_id>'
LIMIT 1
```

The clusters field is a JSON array. Parse it to see cluster titles, sizes, and descriptions.
Important: the clusters JSON can be very large (thousands of trace IDs with coordinates). When the result is too large for inline display, it auto-persists to a file. Use print_clusters.py from scripts/ to get a readable summary.
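Step 3 needs the member IDs of each cluster for its IN (...) lists. One way to pull them out of the parsed clusters array (a sketch; the helper name is mine):

```python
import json

def trace_ids_by_cluster(ai_clusters_json: str) -> dict[int, list[str]]:
    """Map each cluster_id to its member trace/generation IDs,
    skipping the noise cluster (-1)."""
    clusters = json.loads(ai_clusters_json)
    return {
        c["cluster_id"]: list(c["traces"].keys())
        for c in clusters
        if c["cluster_id"] != -1
    }
```

Each list of IDs can then be interpolated into the IN clause of the Step 3 queries, one query per cluster or one query covering all clusters.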

Step 3 — Compute metrics for clusters

For trace-level clusters, compute cost/latency/token metrics:
posthog:execute-sql

```sql
SELECT
    JSONExtractString(properties, '$ai_trace_id') as trace_id,
    sum(toFloat(properties.$ai_total_cost_usd)) as total_cost,
    max(toFloat(properties.$ai_latency)) as latency,
    sum(toInt(properties.$ai_input_tokens)) as input_tokens,
    sum(toInt(properties.$ai_output_tokens)) as output_tokens,
    countIf(properties.$ai_is_error = 'true') as error_count
FROM events
WHERE event IN ('$ai_generation', '$ai_embedding', '$ai_span')
    AND timestamp >= parseDateTimeBestEffort('<window_start>')
    AND timestamp <= parseDateTimeBestEffort('<window_end>')
    AND JSONExtractString(properties, '$ai_trace_id') IN ('<trace_id_1>', '<trace_id_2>', ...)
GROUP BY trace_id
```

For generation-level clusters, match by event UUID:
posthog:execute-sql

```sql
SELECT
    toString(uuid) as generation_id,
    toFloat(properties.$ai_total_cost_usd) as cost,
    toFloat(properties.$ai_latency) as latency,
    toInt(properties.$ai_input_tokens) as input_tokens,
    toInt(properties.$ai_output_tokens) as output_tokens,
    if(properties.$ai_is_error = 'true', 1, 0) as is_error
FROM events
WHERE event = '$ai_generation'
    AND timestamp >= parseDateTimeBestEffort('<window_start>')
    AND timestamp <= parseDateTimeBestEffort('<window_end>')
    AND toString(uuid) IN ('<gen_uuid_1>', '<gen_uuid_2>', ...)
```
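The per-trace rows returned by the trace-level query can then be rolled up per cluster. A minimal aggregation sketch, assuming rows shaped like the query's output and a cluster-to-IDs mapping (the cluster_metrics helper is illustrative):

```python
from statistics import mean

def cluster_metrics(rows: list[dict], ids_by_cluster: dict[int, list[str]]) -> dict[int, dict]:
    """Roll per-trace metric rows (trace_id, total_cost, latency,
    error_count) up into per-cluster aggregates."""
    by_trace = {r["trace_id"]: r for r in rows}
    out = {}
    for cid, ids in ids_by_cluster.items():
        members = [by_trace[t] for t in ids if t in by_trace]
        if not members:
            continue
        out[cid] = {
            "n": len(members),
            "avg_cost": mean(r["total_cost"] for r in members),
            "sum_cost": sum(r["total_cost"] for r in members),
            "avg_latency": mean(r["latency"] for r in members),
            # Fraction of traces that saw at least one error.
            "error_rate": sum(r["error_count"] > 0 for r in members) / len(members),
        }
    return out
```

Sorting the result by sum_cost or avg_latency answers "which cluster is most expensive / slowest" directly.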

Step 4 — Drill into specific traces

Once you've identified interesting clusters, use the trace tools to inspect individual traces:
posthog:query-llm-trace

```json
{
  "traceId": "<trace_id_from_cluster>",
  "dateRange": {"date_from": "<window_start>", "date_to": "<window_end>"}
}
```

Investigation patterns

"What kinds of LLM usage do we have?"

  1. List recent clustering runs (Step 1)
  2. Load the latest run's clusters (Step 2)
  3. Review cluster titles and descriptions — each represents a distinct usage pattern
  4. Compare cluster sizes to understand traffic distribution

"Which cluster is most expensive / slowest?"

  1. Load clusters from a run (Step 2)
  2. Extract trace IDs from each cluster
  3. Compute metrics per cluster (Step 3)
  4. Aggregate avg(cost), avg(latency), and sum(cost) per cluster
  5. Compare across clusters

"What's in this cluster?"

  1. Load the cluster's traces (from the traces field)
  2. Sort by rank (closest to centroid = most representative)
  3. Inspect the top 3-5 traces via query-llm-trace to understand the pattern
  4. Check the cluster title and description for the AI-generated summary

"Are there error-heavy clusters?"

  1. Compute metrics (Step 3) with error_count
  2. Calculate error rate per cluster: items_with_errors / total_items
  3. Focus on clusters with high error rates
  4. Drill into errored traces to find root causes

"How do clusters compare across runs?"

  1. List multiple runs (Step 1)
  2. Load clusters from each run
  3. Compare cluster titles — similar titles across runs indicate stable patterns
  4. Track cluster size changes to detect shifts in traffic patterns
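The cross-run comparison above can be sketched by matching clusters on their AI-generated titles. Exact-title matching is a simplification, since titles can vary between runs; the compare_runs helper is illustrative:

```python
def compare_runs(run_a: list[dict], run_b: list[dict]) -> dict:
    """Diff two runs' cluster lists by title: which patterns persist
    (with their sizes in each run), which are new, which disappeared."""
    a = {c["title"]: c["size"] for c in run_a if c["cluster_id"] != -1}
    b = {c["title"]: c["size"] for c in run_b if c["cluster_id"] != -1}
    return {
        "stable": {t: (a[t], b[t]) for t in a.keys() & b.keys()},
        "new": sorted(b.keys() - a.keys()),
        "gone": sorted(a.keys() - b.keys()),
    }
```

Size deltas in the "stable" entries are the traffic-pattern shifts the fourth step above is after.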

Constructing UI links

  • Clusters overview: https://app.posthog.com/llm-analytics/clusters
  • Specific run: https://app.posthog.com/llm-analytics/clusters/<url_encoded_run_id>
  • Cluster detail: https://app.posthog.com/llm-analytics/clusters/<url_encoded_run_id>/<cluster_id>
Always surface these links so the user can verify visually in the PostHog UI.

Tips

  • Always set a time range in SQL queries — scans of cluster events without time bounds are slow
  • Start with the run listing to orient, then drill into specific clusters
  • Cluster titles and descriptions are AI-generated summaries — verify by inspecting traces
  • The noise cluster (cluster_id: -1) contains outliers that didn't fit any pattern
  • Use llma-clustering-job-list to understand which clustering configs are active
  • Trace IDs in clusters can be used directly with query-llm-trace for deep inspection
  • For large clusters, inspect the top-ranked traces (closest to centroid) for representative examples