prometheus-cardinality-troubleshooter

Prometheus Cardinality Troubleshooter

You are an expert in diagnosing live Prometheus cardinality problems. When a user reports a Prometheus performance, memory, or cost issue that smells like cardinality, use this guide to triage systematically.
This skill is diagnostic and operational. For schema design and prevention, route to `prometheus-label-strategy`.

Symptom → Likely Cause

| Symptom | Likely Cause | First Action |
|---|---|---|
| Prometheus OOMKilled or memory growing linearly | Active series growth (often from a new bad metric or label) | Active Series triage |
| Single PromQL query slow or OOMs the querier | One or more metrics in the query have high cardinality | Per-metric drill-down (for each metric in the query) |
| Remote write lagging, WAL growing | Sample throughput spike: series count OR scrape interval changed | Active Series triage + check scrape intervals |
| `429 Too Many Samples` / `out of bounds` errors | Hitting Mimir/Cortex ingester per-tenant series limit | Per-metric drill-down, find the new offender |
| Grafana Cloud Active Series bill spiked | New metric, new label, or rollout creating churn | Per-metric drill-down + churn check |
| Grafana Cloud DPM bill spiked but Active Series flat | Scrape interval shortened, OR remote_write sending duplicates | DPM-side issue: route to `dpm-finder` |
| `series_limit_per_user` errors after a deploy | Application change introduced a new bad label | Recent change diff |
| Series count grows then resets every restart | Series churn from ephemeral label values | Churn diagnosis |

Step 1: Active Series Triage

Get the headline number

```promql
# Total active series in the local Prometheus
prometheus_tsdb_head_series

# Or for Mimir / Grafana Cloud Metrics (per tenant)
cortex_ingester_memory_series{user="<tenant>"}
```

Compare to recent history:

```promql
# Growth over the last 7 days, in series per day
deriv(prometheus_tsdb_head_series[7d]) * 86400
```

The result is an absolute slope; divide it by the current series count to read it as a percentage. A growth rate of more than a few percent per day on a stable application set is a red flag.
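The "few percent per day" rule needs the slope expressed relative to the current head size. A minimal sketch of that arithmetic follows; the helper name `daily_growth_percent` and the sample readings are hypothetical, not taken from any real system.

```python
def daily_growth_percent(slope_per_second: float, current_series: float) -> float:
    """Convert a deriv() slope (series per second) into percent growth per day."""
    if current_series <= 0:
        raise ValueError("current_series must be positive")
    series_per_day = slope_per_second * 86400  # same scaling as the PromQL expression
    return 100.0 * series_per_day / current_series


# Hypothetical readings: deriv() returned 0.5 series/s on a 1.2M-series head.
growth = daily_growth_percent(0.5, 1_200_000)
print(f"{growth:.2f}% per day")
```

With these made-up inputs the head is growing by a few percent per day, which would be worth investigating on a stable application set.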

Use the TSDB status endpoint

Prometheus exposes a built-in cardinality breakdown:

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | jq
```

Returns:
  • `seriesCountByMetricName`: top metrics by series count
  • `labelValueCountByLabelName`: top labels by unique value count
  • `memoryInBytesByLabelName`: top labels by memory footprint
  • `seriesCountByLabelValuePair`: top label-value pairs by series count

This is usually the fastest path to "which metric / which label is the problem."

For Grafana Cloud: same endpoint, authenticated against the per-tenant Mimir.
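Once the JSON is saved, filtering it for outliers is easy to script. The sketch below assumes the usual `{"status": ..., "data": {...}}` envelope; the embedded payload is a trimmed, made-up sample, and `over_threshold` is a hypothetical helper rather than part of any Prometheus tooling.

```python
import json

# Hypothetical, trimmed response from /api/v1/status/tsdb (values are illustrative).
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {"name": "http_request_duration_seconds_bucket", "value": 184320},
      {"name": "go_gc_duration_seconds", "value": 80}
    ],
    "labelValueCountByLabelName": [
      {"name": "url", "value": 84210},
      {"name": "trace_id", "value": 41000},
      {"name": "pod", "value": 1820}
    ]
  }
}
""")


def over_threshold(payload: dict, section: str, threshold: int) -> list:
    """Return (name, value) pairs from one section whose value exceeds threshold, largest first."""
    entries = payload.get("data", {}).get(section, [])
    hits = [(e["name"], int(e["value"])) for e in entries if int(e["value"]) > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Labels with more than 10K unique values are the first suspects.
suspect_labels = over_threshold(SAMPLE, "labelValueCountByLabelName", 10_000)
for name, value in suspect_labels:
    print(f"{value:>8}  {name}")
```

In practice the payload would come from the `curl` call rather than an inline string; the threshold is a parameter so it can be tuned per environment.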

Step 2: Read the Output

Top metrics by series count

```json
"seriesCountByMetricName": [
  { "name": "http_request_duration_seconds_bucket", "value": 184320 },
  { "name": "go_gc_duration_seconds",               "value": 80 },
  ...
]
```

Heuristics:
  • A histogram (`_bucket`) at the top is almost always the answer. Each label combination on a histogram costs bucket count + 3 series (the explicit buckets plus the `+Inf` bucket, `_sum`, and `_count`), which is 14× for a histogram with 11 explicit buckets. The fix is usually trimming labels on the underlying histogram, not the buckets themselves.
  • A metric in the top 5 you don't recognize → grep the codebase for it; it's likely a new feature flag or a debug metric that shipped to prod.
  • The same metric showing up under multiple variants (`_bucket`, `_count`, `_sum`) is a histogram or summary; count all variants together for the true impact.
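The multiplier is easier to reason about with numbers. The sketch below works backwards from the sample `_bucket` count above, assuming 11 explicit buckets (an assumption; check the histogram's actual bucket list). `histogram_series` is a hypothetical helper.

```python
def histogram_series(label_combinations: int, explicit_buckets: int) -> dict:
    """Series emitted by one classic histogram, split by metric-name suffix."""
    bucket_series = label_combinations * (explicit_buckets + 1)  # +1 for le="+Inf"
    return {
        "_bucket": bucket_series,
        "_sum": label_combinations,
        "_count": label_combinations,
        "total": bucket_series + 2 * label_combinations,
    }


# 184320 _bucket series with 11 explicit buckets (assumed) plus +Inf = 12 per combination.
combos = 184320 // (11 + 1)
breakdown = histogram_series(combos, 11)
print(combos, breakdown["total"])
```

Under that assumption the metric has 15360 distinct label combinations, and halving the combinations (for example by dropping a two-valued label) halves every row of the breakdown.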

Top labels by unique value count

```json
"labelValueCountByLabelName": [
  { "name": "url",       "value": 84210 },
  { "name": "trace_id",  "value": 41000 },
  { "name": "pod",       "value": 1820 }
]
```

Red flags:
  • Any label with >10K unique values is almost certainly a bug. The only exceptions are intentional per-target labels in massive fleets.
  • `trace_id`, `request_id`, `session_id`, `query`, `email`, and raw (untemplated) `path` or `url` values should never be labels. They belong in exemplars, logs, or traces.
  • `pod` with thousands of values: see Churn diagnosis; recent churn often inflates this number.

Step 3: Per-Metric Drill-Down

Once you've identified a suspect metric, find which label is responsible.

Count distinct label values per label, for one metric

```promql
# How many unique values does each label have on this metric?
count by (__name__) (
  count by (__name__, label_name_here) (
    http_request_duration_seconds_bucket
  )
)
```

Repeat per label, or use the helper:

```bash
# Via the Prometheus HTTP API (-g stops curl from treating [] as a glob pattern)
curl -sg "http://prometheus:9090/api/v1/labels?match[]=http_request_duration_seconds_bucket" \
  | jq -r '.data[]' \
  | while read -r label; do
      count=$(curl -sg "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length')
      echo "${count} ${label}"
    done \
  | sort -rn | head -20
```

Find the top label values for one label

```promql
# Top 20 path values for http_requests_total
topk(20, count by (path) (http_requests_total))
```

If you see UUIDs, hashes, timestamps, or numeric IDs in the top values → that label has unbounded values from the source.
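Eyeballing works for twenty values; for a longer list a rough classifier helps. The patterns below are heuristics chosen for illustration (not an exhaustive or authoritative set), and `classify` / `suspicious_share` are hypothetical helper names.

```python
import re
from typing import Optional

# Heuristic patterns, checked in this order; tune them for your own data.
PATTERNS = {
    "uuid": re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
    ),
    "hex_hash": re.compile(r"\b[0-9a-f]{16,}\b", re.I),
    "numeric_id": re.compile(r"(?<![\w.])\d{3,}(?![\w.])"),
}


def classify(value: str) -> Optional[str]:
    """Return the first pattern name that matches the label value, or None."""
    for kind, pattern in PATTERNS.items():
        if pattern.search(value):
            return kind
    return None


def suspicious_share(values: list) -> float:
    """Fraction of label values that look machine-generated."""
    if not values:
        return 0.0
    return sum(1 for v in values if classify(v) is not None) / len(values)


# Made-up label values of the kind a topk() query might return.
sample_paths = [
    "/users/123456",
    "/users/987654",
    "/orders/3f2b8c1e-9a4d-4c7b-8e21-5d6f7a8b9c0d",
    "/healthz",
]
kinds = [classify(p) for p in sample_paths]
share = suspicious_share(sample_paths)
print(kinds, share)
```

A high share suggests the label is carrying identifiers rather than a bounded set of categories.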

Per-metric series count, grouped

```promql
# Series-per-instance breakdown: if uneven, one instance is misbehaving
count by (job, instance) ({__name__=~"my_metric.*"})
```

---

Step 4: Recent Change Diff

If the cardinality fire started recently, the cause is almost always a recent change. Diff what's there now against what was there before.

List of metrics, current vs. last week

Via Grafana Cloud cardinality dashboard, or:

```promql
# Current metrics
group by (__name__) ({__name__!=""})

# Compare to last week (offset)
group by (__name__) ({__name__!=""} offset 7d)
```

Diff externally. A new metric near the top of `seriesCountByMetricName` that wasn't there a week ago → that's your offender.
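The external diff can be as simple as a set difference over metric names. The sketch below assumes the two name lists and the per-metric series counts have already been exported; all names and numbers are made up, and `new_metrics` is a hypothetical helper.

```python
def new_metrics(current: set, previous: set, series_counts: dict) -> list:
    """Metrics present now but not before, ordered by series count (largest first)."""
    added = current - previous
    ranked = [(name, series_counts.get(name, 0)) for name in added]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Illustrative inputs: names from the two group-by queries, counts from the TSDB status output.
last_week = {"http_requests_total", "go_gc_duration_seconds"}
today = {"http_requests_total", "go_gc_duration_seconds",
         "checkout_request_details", "feature_flag_evaluations_total"}
counts = {"checkout_request_details": 52000, "feature_flag_evaluations_total": 340}

offenders = new_metrics(today, last_week, counts)
for name, count in offenders:
    print(f"{count:>8}  {name}")
```

Sorting by series count puts the likely offender first, so a handful of harmless new metrics does not hide the expensive one.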

Correlate with deploys

```promql
# Active series correlated with build_info
prometheus_tsdb_head_series

# Overlay with:
changes(app_build_info[1d])
```

A vertical step in series count aligned with a deploy is conclusive.

---

Step 5: Churn Diagnosis

High churn means series are being created and abandoned faster than they age out. Symptoms: series count keeps climbing, then drops sharply on Prometheus restart.

Churn signal

```promql
# Series created vs. removed per second
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])

# Ratio of churned to live
prometheus_tsdb_head_series_created_total / prometheus_tsdb_head_series
```

A creation rate that materially exceeds the removal rate, sustained, means cardinality is on a one-way trip up. Common causes:

| Cause | Tell |
|---|---|
| Pod rollouts emitting `pod` label | Churn spike aligns with deploy timing; affects pod-discovered scrapes |
| `version` / `git_sha` / `image_tag` label on every metric | Churn spike on every deploy across many metrics |
| Ephemeral hostnames in `instance` | Cloud autoscaling event timing |
| Bug: dynamic label names | Churn climbs forever, never plateaus |
| Application bug emitting fresh UUIDs as labels | Linear unbounded growth, no deploy correlation |
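To turn the two rates into something actionable, subtract them and project against a limit. This is a rough sketch with made-up numbers; `hours_until_limit` is a hypothetical helper, and because series removal tends to happen in bursts, the rates should be averaged over a long window before being fed in.

```python
from typing import Optional


def hours_until_limit(current_series: float, limit: float,
                      created_per_s: float, removed_per_s: float) -> Optional[float]:
    """Hours until the series limit is reached at the current net growth, or None if not growing."""
    net_per_s = created_per_s - removed_per_s
    if net_per_s <= 0:
        return None  # removal keeps up with creation
    return max(limit - current_series, 0.0) / net_per_s / 3600.0


# Hypothetical readings: 12 series/s created, 2 series/s removed, 1.0M live, 1.5M limit.
eta = hours_until_limit(1_000_000, 1_500_000, 12.0, 2.0)
print(f"{eta:.1f} hours of headroom")
```

A short projected headroom argues for the emergency drop first and the source fix second.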

Memory impact of churn

```promql
# A churn-driven head block carries old series until tsdb compaction
prometheus_tsdb_head_chunks
go_memstats_heap_inuse_bytes{job="prometheus"}
```

Restarting Prometheus drops churned series but is not a fix. The fix is at the source.

---

Common-Culprit Gallery

Histogram blowup

Tell: `*_bucket` metric at the top of `seriesCountByMetricName`. Multiplier ≈ 14×.
Fix:
  1. First, reduce labels on the histogram: every label combination eliminated saves about 14 series. Trim `path`, `method`, or `status_code` before touching bucket count.
  2. Then, reduce bucket count if appropriate (custom buckets vs. defaults).
  3. For high-resolution latency tracking, consider native histograms (Prometheus 2.40+): a single sparse series replaces the bucket family.

kube-state-metrics label explosion

Tell: `kube_pod_labels` or `kube_pod_annotations` at the top, with `label_*` or `annotation_*` labels driving cardinality.
Fix: configure kube-state-metrics with narrow `--metric-labels-allowlist` and `--metric-annotations-allowlist` values. Older v1.x releases emitted all labels and annotations by default; v2 exposes only what is allowlisted, so look for an over-broad allowlist such as `pods=[*]`.

```yaml
# kube-state-metrics flags
--metric-labels-allowlist=pods=[app,team,version]
--metric-annotations-allowlist=pods=[checksum/config]
```

Path / route blowup from a new endpoint

Tell: `http_requests_total` (or framework equivalent) grew 10×+ overnight. `topk(20, count by (path) (http_requests_total))` shows hundreds of `/users/123456`-style values.
Fix: route the user to `prometheus-label-strategy` for the long-term fix (template paths in code). For the immediate stop:

```yaml
# In Prometheus scrape config: emergency drop
metric_relabel_configs:
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
```

Or normalize via relabel:

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
```

Application emitting a debug metric in prod

Tell: A metric you don't recognize in the top 10. Grep the source; it is often a `_details` or `_per_request` debug metric the developer forgot to gate.
Fix: drop entirely at scrape:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: my_app_request_details
    action: drop
```

Open a ticket against the team to remove it from the code.

Broken relabel rule allowing app-emitted target labels

Tell: Series count for one job is several × what it should be. Looking at one series, you see both an app-emitted `instance=...` and the target `instance=...`; with the default `honor_labels: false`, the app's copy is typically renamed to `exported_instance`.
Fix: ensure your `metric_relabel_configs` drops the app-emitted copies of labels that should come only from targets. `metric_relabel_configs` runs after target labels are attached, so match the `exported_` names; dropping `instance` or `job` themselves at this stage would remove the target labels too:

```yaml
metric_relabel_configs:
  - regex: exported_(instance|pod|node|host|job)
    action: labeldrop
```

Then the target labels from `relabel_configs` remain the only copies.

Federation amplifying cardinality

Tell: A federated Prometheus or Mimir global view has way more series than expected. Each source has its own `cluster` / `region` label, multiplying.
Fix: this is usually expected, since federation by design preserves source labels. If the series count is too high, federate only aggregated recording rules, not raw metrics:

```yaml
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
      - '{__name__=~".*:.*"}'  # Recording-rule naming convention only
```

Remediation Decision Tree

Cardinality fire confirmed
├── Need to stop the bleeding NOW (production OOM, ingest 429s)
│   └── Apply emergency drop via metric_relabel_configs at scrape config
│       (also applies to Alloy/Agent, using their equivalent relabel rules)
│       Then schedule the proper fix.
├── It's a Grafana Cloud Active Series bill issue, not a perf issue
│   ├── Cardinality is structural and you can't fix the app
│   │   └── Route to `adaptive-metrics` skill (post-ingest aggregation rules)
│   └── You want metric-by-metric DPM breakdown
│       └── Route to `dpm-finder` skill
├── It's a fixable application bug (unbounded label, debug metric in prod)
│   ├── Short-term: metric_relabel_configs drop at scrape
│   └── Long-term: fix in code; route to `prometheus-label-strategy` for design guidance
├── It's histogram cardinality
│   ├── Trim labels on the underlying histogram (14× win per label)
│   ├── Reduce bucket count if appropriate
│   └── Consider native histograms for high-resolution latency
└── It's churn (deploy-driven)
    ├── Remove `pod`, `version`, `git_sha` from application metrics
    ├── Use info-metric pattern for `version` (route to `prometheus-label-strategy`)
    └── Verify K8s SD relabel rules aren't carrying `uid` or other ephemeral fields

Emergency Drop Patterns (copy-paste ready)

For Prometheus `scrape_configs`:

```yaml
metric_relabel_configs:
  # Drop a specific bad metric
  - source_labels: [__name__]
    regex: bad_metric_name
    action: drop

  # Drop a high-cardinality label from all metrics
  - regex: bad_label_name
    action: labeldrop

  # Drop all labels matching a prefix (e.g., debug_*)
  - regex: debug_.*
    action: labeldrop

  # Drop only the series of a metric whose label value matches an unbounded pattern
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop

  # Aggregate path patterns to a template (replace the value)
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id

  # Aggregate status codes to classes
  - source_labels: [status_code]
    regex: (\d)\d\d
    target_label: status_code
    replacement: ${1}xx
```

For Grafana Alloy (`prometheus.relabel` component):

```alloy
prometheus.relabel "drop_bad_metric" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex = "bad_metric_name"
    action = "drop"
  }

  rule {
    regex = "bad_label_name"
    action = "labeldrop"
  }
}
```

Always test in staging first. A misconfigured `labeldrop` regex can wipe critical labels across every metric.
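Before a rule reaches staging, its regex can be dry-run offline. The sketch below approximates two relabel actions in Python under stated assumptions: Prometheus anchors relabel regexes at both ends (modelled here with `re.fullmatch`), source labels are joined with `;`, and only `${N}`-style group references are translated. Python's `re` is not RE2, so treat this as a sanity check rather than a faithful emulator; `would_drop` and `apply_replace` are hypothetical helpers.

```python
import re


def would_drop(labels: dict, source_labels: list, regex: str, separator: str = ";") -> bool:
    """Approximate `action: drop`: True when the joined source label values fully match."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


def apply_replace(value: str, regex: str, replacement: str) -> str:
    """Approximate `action: replace` on one source label; unmatched values pass through."""
    match = re.fullmatch(regex, value)
    if match is None:
        return value
    # Translate ${1}-style references into Python's \g<1> syntax.
    template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    return match.expand(template)


series = {"__name__": "http_requests_total", "path": "/users/42"}
dropped = would_drop(series, ["__name__", "path"], r"http_requests_total;/users/\d+")
normalized = apply_replace("/users/42", r"/users/\d+", "/users/:id")
nested = apply_replace("/users/42/orders", r"/users/\d+", "/users/:id")
status_class = apply_replace("503", r"(\d)\d\d", "${1}xx")
print(dropped, normalized, nested, status_class)
```

Note how the nested path passes through unchanged: because the regex is fully anchored, `/users/\d+` does not cover `/users/42/orders`, which is the kind of gap a dry run catches before the rule ships.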

When to Hand Off

  • "Now design a label strategy so this doesn't happen again" → `prometheus-label-strategy`
  • "We need to keep these metrics but reduce cost" → `adaptive-metrics`
  • "Which metric is the most expensive in DPM terms?" → `dpm-finder`
  • "Write the PromQL to find this" → `promql`
  • "Configure this in Alloy" → `alloy`
  • "Why is my Loki slow?" → `loki-label-analyzer` (different system, same family of problems)

This skill's lane is diagnosis under pressure. Prevention, design, and post-ingest cost optimization live elsewhere.