prometheus-cardinality-troubleshooter
Prometheus Cardinality Troubleshooter
You are an expert in diagnosing live Prometheus cardinality problems. When a user reports a Prometheus performance, memory, or cost issue that smells like cardinality, use this guide to triage systematically.
This skill is diagnostic and operational. For schema design and prevention, route to `prometheus-label-strategy`.

Symptom → Likely Cause
| Symptom | Likely Cause | First Action |
|---|---|---|
| Prometheus OOMKilled or memory growing linearly | Active series growth (often from a new bad metric or label) | Active Series triage |
| Single PromQL query slow or OOMs the querier | One or more metrics in the query have high cardinality | Per-query drill-down |
| Remote write lagging, WAL growing | Sample throughput spike: series count OR scrape interval changed | Active Series triage + check scrape intervals |
| | Hitting Mimir/Cortex ingester per-tenant series limit | Per-metric drill-down, find the new offender |
| Grafana Cloud Active Series bill spiked | New metric, new label, or rollout creating churn | Per-metric drill-down + churn check |
| Grafana Cloud DPM bill spiked but Active Series flat | Scrape interval shortened, OR remote_write sending duplicates | DPM-side issue: route to `dpm-finder` |
| Appears after a deploy | Application change introduced a new bad label | Recent change diff |
| Series count grows then resets every restart | Series churn from ephemeral label values | Churn diagnosis |
Step 1: Active Series Triage
Get the headline number

```promql
# Total active series in the local Prometheus
prometheus_tsdb_head_series

# Or for Mimir / Grafana Cloud Metrics (per tenant)
cortex_ingester_memory_series{user="<tenant>"}
```

Compare to recent history:

```promql
# Growth over the last 7 days
deriv(prometheus_tsdb_head_series[7d]) * 86400
```

This expression returns the average number of series added per day; divide it by `prometheus_tsdb_head_series` to read it as a fraction. Growth of more than a few percent per day on a stable application set is a red flag.

Use the TSDB status endpoint
Prometheus exposes a built-in cardinality breakdown:

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | jq
```

Returns:
- `seriesCountByMetricName`: top metrics by series count
- `labelValueCountByLabelName`: top labels by unique value count
- `memoryInBytesByLabelName`: top labels by memory footprint
- `seriesCountByLabelValuePair`: top label-value pairs by series count

This is usually the fastest path to "which metric / which label is the problem."
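To make the breakdown easier to scan, the response can be ranked and annotated with each entry's share. The sketch below is a minimal illustration, not part of Prometheus: `top_offenders` is a hypothetical helper, the sample payload reuses the figures from this guide, and the share is computed only over the entries the endpoint lists, not over every series in the head.

```python
from typing import Any


def top_offenders(
    tsdb_status: dict[str, Any], key: str, limit: int = 5
) -> list[tuple[str, int, float]]:
    """Return (name, value, share of listed total) for the largest entries under `key`."""
    # The HTTP API wraps its payload in a "data" object; accept either shape.
    data = tsdb_status.get("data", tsdb_status)
    entries = data.get(key, [])
    # Guard against division by zero when the list is empty.
    total = sum(entry["value"] for entry in entries) or 1
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)[:limit]
    return [(e["name"], e["value"], e["value"] / total) for e in ranked]


# Hypothetical sample shaped like the endpoint's payload.
sample = {
    "seriesCountByMetricName": [
        {"name": "go_gc_duration_seconds", "value": 80},
        {"name": "http_request_duration_seconds_bucket", "value": 184320},
    ]
}

for name, value, share in top_offenders(sample, "seriesCountByMetricName"):
    print(f"{value:>8} {share:6.1%} {name}")
```

An entry that holds most of the listed total is the first candidate for the per-metric drill-down.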
For Grafana Cloud:

```bash
# Same endpoint, authenticated against the per-tenant Mimir
curl -s -u "<user>:<token>" \
  "https://prometheus-prod-XX.grafana.net/api/prom/api/v1/status/tsdb" | jq
```
---

Step 2: Read the Output
Top metrics by series count
```json
"seriesCountByMetricName": [
  { "name": "http_request_duration_seconds_bucket", "value": 184320 },
  { "name": "go_gc_duration_seconds", "value": 80 },
  ...
]
```

Heuristics:
- A histogram (`_bucket`) at the top is almost always the answer: with 11 buckets, each label combination costs 14 series (bucket count + 3: the `+Inf` bucket plus `_sum` and `_count`). The fix is usually trimming labels on the underlying histogram, not the buckets themselves.
- A metric in the top 5 you don't recognize → grep the codebase for it; it's likely a new feature flag or a debug metric that shipped to prod
- The same metric showing up under multiple variants (`_sum`, `_total`, `_count`): that's a histogram or summary, count all variants together for the true impact
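The histogram multiplier can be made concrete with a little arithmetic. The sketch below is an estimate under stated assumptions: `histogram_series` is a hypothetical helper, the default of 11 finite buckets is chosen to match the 14× figure above, the label cardinalities are invented, and multiplying them gives an upper bound that is only reached if every label combination actually occurs.

```python
from math import prod


def histogram_series(label_cardinalities: dict[str, int], buckets: int = 11) -> int:
    """Upper-bound series count for one classic histogram.

    Each label combination produces `buckets` finite buckets, one +Inf bucket,
    and the _sum and _count series, which is buckets + 3 in total.
    """
    per_combination = buckets + 3
    # prod() of an empty collection is 1, i.e. a histogram with no extra labels.
    return per_combination * prod(label_cardinalities.values())


with_path = histogram_series({"method": 5, "status_code": 8, "path": 40})
without_path = histogram_series({"method": 5, "status_code": 8})
print(with_path, without_path)
```

Under these made-up cardinalities, removing `path` shrinks the worst case by a factor of 40, which is why trimming labels usually beats trimming buckets.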
Top labels by unique value count
```json
"labelValueCountByLabelName": [
  { "name": "url", "value": 84210 },
  { "name": "trace_id", "value": 41000 },
  { "name": "pod", "value": 1820 }
]
```

Red flags:
- Any label with >10K unique values is almost certainly a bug. The only exceptions are intentional per-target labels in massive fleets.
- `url`, `trace_id`, `request_id`, `session_id`, `query`, `email`, `path`: these should never be labels. They belong in exemplars, logs, or traces.
- `pod` with thousands of values: see Churn diagnosis; recent churn often inflates this number
Step 3: Per-Metric Drill-Down
步骤3:按指标深度排查
Once you've identified a suspect metric, find which label is responsible.
一旦确定了可疑指标,找出对应的问题标签。
Count distinct label values per label, for one metric
统计单个指标下每个标签的不同值数量
promql
undefinedpromql
undefinedHow many unique values does each label have on this metric?
该指标下每个标签有多少个唯一值?
count by (name) (
count by (name, label_name_here) (
http_request_duration_seconds_bucket
)
)
Repeat per label, or use the helper:
```bashcount by (name) (
count by (name, label_name_here) (
http_request_duration_seconds_bucket
)
)
针对每个标签重复执行,或使用以下脚本:
```bashVia the Prometheus HTTP API
通过Prometheus HTTP API
curl -s "http://prometheus:9090/api/v1/labels?match[]=http_request_duration_seconds_bucket" | jq -r '.data[]' |
while read label; do count=$(curl -s "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length') echo "${count} ${label}" done | sort -rn | head -20
while read label; do count=$(curl -s "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length') echo "${count} ${label}" done | sort -rn | head -20
undefinedcurl -s "http://prometheus:9090/api/v1/labels?match[]=http_request_duration_seconds_bucket" | jq -r '.data[]' |
while read label; do count=$(curl -s "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length') echo "${count} ${label}" done | sort -rn | head -20
while read label; do count=$(curl -s "http://prometheus:9090/api/v1/label/${label}/values?match[]=http_request_duration_seconds_bucket" | jq '.data | length') echo "${count} ${label}" done | sort -rn | head -20
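The same per-label count can be done offline once the label sets for a metric have been saved to disk. The sketch below assumes the input is a list of label dictionaries, the shape the `/api/v1/series` endpoint returns under `data`; `label_cardinality` and the sample series are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Count distinct values per label name across a list of series label sets."""
    values: dict[str, set[str]] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Highest-cardinality label first; ties broken by name for stable output.
    return sorted(
        ((name, len(seen)) for name, seen in values.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical label sets for one metric.
sample_series = [
    {"__name__": "http_requests_total", "method": "GET", "path": "/users/1"},
    {"__name__": "http_requests_total", "method": "GET", "path": "/users/2"},
    {"__name__": "http_requests_total", "method": "POST", "path": "/users/3"},
]

for name, count in label_cardinality(sample_series):
    print(f"{count:>6} {name}")
```

The label at the top of the output is the one to inspect with the `topk` query in the next subsection.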
Find the top label values for one label

```promql
# Top 20 path values for http_requests_total
topk(20,
  count by (path) (http_requests_total)
)
```

If you see UUIDs, hashes, timestamps, or numeric IDs in the top values → that label has unbounded values from the source.

Per-metric series count, grouped

```promql
# Series-per-instance breakdown: if uneven, one instance is misbehaving
count by (job, instance) ({__name__=~"my_metric.*"})
```
---

Step 4: Recent Change Diff
If the cardinality fire started recently, the cause is almost always a recent change. Diff what's there now against what was there before.
List of metrics, current vs. last week

Via Grafana Cloud cardinality dashboard, or:

```promql
# Current metrics
group by (__name__) ({__name__!=""})

# Compare to last week (offset)
group by (__name__) ({__name__!=""} offset 7d)
```

Diff externally. A new metric near the top of `seriesCountByMetricName` that wasn't there a week ago → that's your offender.
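One way to do the external diff is a set difference ranked by current series count. The sketch below is illustrative only: `new_offenders` is a hypothetical helper, and both inputs are assumed to have been exported already (current per-metric series counts, and the set of metric names seen a week ago).

```python
def new_offenders(
    current_counts: dict[str, int], previous_names: set[str]
) -> list[tuple[str, int]]:
    """Metrics present now but absent a week ago, largest series count first."""
    added = {
        name: count
        for name, count in current_counts.items()
        if name not in previous_names
    }
    return sorted(added.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical exports of the two queries above.
current = {
    "http_requests_total": 1200,
    "my_app_request_details": 90000,
    "go_goroutines": 40,
    "new_small_metric": 3,
}
last_week = {"http_requests_total", "go_goroutines"}

for name, count in new_offenders(current, last_week):
    print(f"{count:>8} {name}")
```

Ranking by series count keeps small, harmless additions from hiding the one metric that matters.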
Correlate with deploys
```promql
# Active series correlated with build_info
prometheus_tsdb_head_series

# Overlay with:
changes(app_build_info[1d])
```

A vertical step in series count aligned with a deploy is conclusive.
---

Step 5: Churn Diagnosis
High churn means series are being created and abandoned faster than they age out. Symptoms: series count keeps climbing, then drops sharply on Prometheus restart.
Churn signal

```promql
# Series created vs. removed per second
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])

# Ratio of churned to live
prometheus_tsdb_head_series_created_total / prometheus_tsdb_head_series
```

A creation rate that materially exceeds the removal rate, sustained, means cardinality is on a one-way trip up. Common causes:

| Cause | Tell |
|---|---|
| Pod rollouts emitting `pod` label | Churn spike aligns with deploy timing; affects pod-discovered scrapes |
| `version` / `git_sha` / `image_tag` label on every metric | Churn spike on every deploy across many metrics |
| Ephemeral hostnames in `instance` | Cloud autoscaling event timing |
| Bug: dynamic label names | Churn climbs forever, never plateaus |
| Application bug emitting fresh UUIDs as labels | Linear unbounded growth, no deploy correlation |
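The "sustained" part of the churn signal can be checked mechanically once the two rates have been sampled over time. The sketch below is a rough illustration: `sustained_net_growth` and its 10% `margin` are assumptions rather than an established threshold, and the samples are assumed to be averaged over windows of several hours, since removals tend to arrive in bursts when the head is truncated.

```python
def sustained_net_growth(
    created: list[float], removed: list[float], margin: float = 0.10
) -> bool:
    """True when every sample's creation rate exceeds its removal rate by more than `margin`."""
    if not created or len(created) != len(removed):
        raise ValueError("need two non-empty, equally long rate series")
    return all(c > r * (1 + margin) for c, r in zip(created, removed))


# Hypothetical series/second rates, one sample per window.
print(sustained_net_growth([120.0, 130.0, 125.0], [40.0, 45.0, 50.0]))
print(sustained_net_growth([50.0, 52.0], [49.0, 60.0]))
```

A single window where creation outpaces removal is normal during a deploy; it is the run of consecutive windows that points at the causes in the table above.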
Memory impact of churn

```promql
# A churn-driven head block carries old series until tsdb compaction
prometheus_tsdb_head_chunks
go_memstats_heap_inuse_bytes{job="prometheus"}
```

Restarting Prometheus drops churned series but is not a fix. The fix is at the source.
---

Common-Culprit Gallery
Histogram blowup
Tell: `*_bucket` metric at the top of `seriesCountByMetricName`. Multiplier ≈ 14×.

Fix:
- First, reduce labels on the histogram: every label combination you eliminate removes about 14 series. Trim `status_code`, `path`, or `method` before touching bucket count.
- Then, reduce bucket count if appropriate (custom buckets vs. defaults).
- For high-resolution latency tracking, consider native histograms (Prometheus 2.40+): a single sparse series replaces the bucket family.
kube-state-metrics label explosion
Tell: `kube_pod_labels` or `kube_pod_annotations` at the top, with `label_*` or `annotation_*` labels driving cardinality.

Fix: configure kube-state-metrics with `--metric-labels-allowlist` and `--metric-annotations-allowlist`, naming only the keys you need. Older (pre-v2) releases emitted every Kubernetes label by default. In v2 and later nothing beyond the allowlists is emitted, so check whether a wildcard such as `pods=[*]` is configured.

```yaml
# kube-state-metrics flags
--metric-labels-allowlist=pods=[app,team,version]
--metric-annotations-allowlist=pods=[checksum/config]
```

Path / route blowup from a new endpoint
Tell: `http_requests_total` (or framework equivalent) grew 10×+ overnight. `topk(20, count by (path) (http_requests_total))` shows hundreds of `/users/123456`-style values.

Fix: route the user to `prometheus-label-strategy` for the long-term fix (template paths in code). For the immediate stop:

```yaml
# In Prometheus scrape config: emergency drop
metric_relabel_configs:
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
```

Or normalize via relabel:

```yaml
metric_relabel_configs:
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
```

Relabeling rewrites labels but does not sum values: series that collapse onto the same label set within one scrape collide, so treat normalization as a stopgap until the application emits templated paths.

Application emitting a debug metric in prod
Tell: A metric you don't recognize in the top 10. Grep the source: often a `_details` or `_per_request` debug metric the developer forgot to gate.

Fix: drop entirely at scrape:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: my_app_request_details
    action: drop
```

Open a ticket against the team to remove it from the code.
Broken relabel rule allowing app-emitted target labels
Tell: Series count for one job is several × what it should be. Looking at one series, you see both an app-emitted `instance=...` AND the target `instance=...` collided into something weird.

Fix: `metric_relabel_configs` runs after target labels are attached, so a `labeldrop` on `instance` or `job` there would remove the target's own labels too. With `honor_labels: false` (the default), Prometheus renames the conflicting app-emitted copies to `exported_<name>`; drop those instead:

```yaml
metric_relabel_configs:
  - regex: exported_(instance|pod|node|host|job)
    action: labeldrop
```

Then only the target labels from `relabel_configs` remain.

Federation amplifying cardinality
Tell: A federated Prometheus or Mimir global view has way more series than expected. Each source has its own `cluster` / `region` label, multiplying.

Fix: this is usually expected, because federation by design preserves source labels. If the series count is too high, federate only aggregated recording rules, not raw metrics:

```yaml
- job_name: federate
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
      - '{__name__=~".*:.*"}'  # Recording-rule naming convention only
```

Remediation Decision Tree
```
Cardinality fire confirmed
│
├── Need to stop the bleeding NOW (production OOM, ingest 429s)
│   └── Apply emergency drop via metric_relabel_configs at scrape config
│       (also applies to Alloy/Agent; same rule fields)
│       Then schedule the proper fix.
│
├── It's a Grafana Cloud Active Series bill issue, not a perf issue
│   ├── Cardinality is structural and you can't fix the app
│   │   └── Route to `adaptive-metrics` skill (post-ingest aggregation rules)
│   └── You want metric-by-metric DPM breakdown
│       └── Route to `dpm-finder` skill
│
├── It's a fixable application bug (unbounded label, debug metric in prod)
│   ├── Short-term: metric_relabel_configs drop at scrape
│   └── Long-term: fix in code; route to `prometheus-label-strategy` for design guidance
│
├── It's histogram cardinality
│   ├── Trim labels on the underlying histogram (about 14 series saved per label combination removed)
│   ├── Reduce bucket count if appropriate
│   └── Consider native histograms for high-resolution latency
│
└── It's churn (deploy-driven)
    ├── Remove `pod`, `version`, `git_sha` from application metrics
    ├── Use info-metric pattern for `version` (route to `prometheus-label-strategy`)
    └── Verify K8s SD relabel rules aren't carrying `uid` or other ephemeral fields
```

Emergency Drop Patterns (copy-paste ready)
For Prometheus `scrape_configs`:

```yaml
metric_relabel_configs:
  # Drop a specific bad metric
  - source_labels: [__name__]
    regex: bad_metric_name
    action: drop
  # Drop a high-cardinality label from all metrics
  - regex: bad_label_name
    action: labeldrop
  # Drop all labels matching a prefix (e.g., debug_*)
  - regex: debug_.*
    action: labeldrop
  # Drop a metric only when a specific label has unbounded values
  - source_labels: [__name__, path]
    regex: http_requests_total;/users/\d+
    action: drop
  # Aggregate path patterns to a template (replace the value)
  - source_labels: [path]
    regex: /users/\d+
    target_label: path
    replacement: /users/:id
  # Aggregate status codes to classes
  - source_labels: [status_code]
    regex: (\d)\d\d
    target_label: status_code
    replacement: ${1}xx
```

For Grafana Alloy (`prometheus.relabel` component):

```alloy
prometheus.relabel "drop_bad_metric" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    source_labels = ["__name__"]
    regex         = "bad_metric_name"
    action        = "drop"
  }

  rule {
    regex  = "bad_label_name"
    action = "labeldrop"
  }
}
```

Always test in staging first. A misconfigured `labeldrop` regex can wipe critical labels across every metric.
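Before a rule reaches staging, its regex can be dry-run against a few sample label sets. The sketch below is a simplified simulation, not Prometheus's implementation: `apply_rules` is a hypothetical helper that handles only `replace`, `drop`, and `labeldrop`, joins source labels with `;`, and uses Python's `re.fullmatch` to mimic the fully anchored matching of relabel regexes. Python's regex dialect differs from RE2 in places, so treat a passing dry run as a sanity check only.

```python
import re
from typing import Optional


def apply_rules(labels: dict[str, str], rules: list[dict]) -> Optional[dict[str, str]]:
    """Apply a subset of relabel actions; return None when the series is dropped."""
    labels = dict(labels)
    for rule in rules:
        pattern = re.compile(rule["regex"])
        action = rule.get("action", "replace")
        if action == "labeldrop":
            # labeldrop matches label names, not values.
            labels = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
            continue
        source = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = pattern.fullmatch(source)
        if match is None:
            continue
        if action == "drop":
            return None
        if action == "replace":
            # Translate ${1}-style references into Python's \g<1> form.
            template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", rule["replacement"])
            labels[rule["target_label"]] = match.expand(template)
    return labels


rules = [
    {"source_labels": ["path"], "regex": r"/users/\d+",
     "target_label": "path", "replacement": "/users/:id"},
    {"source_labels": ["status_code"], "regex": r"(\d)\d\d",
     "target_label": "status_code", "replacement": "${1}xx"},
    {"regex": "debug_.*", "action": "labeldrop"},
]

print(apply_rules(
    {"__name__": "http_requests_total", "path": "/users/123456",
     "status_code": "404", "debug_trace": "x"},
    rules,
))
```

Running the candidate rules over a handful of real label sets, including ones that must survive untouched, shows quickly whether a `labeldrop` pattern is broader than intended.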
When to Hand Off
- "Now design a label strategy so this doesn't happen again" → `prometheus-label-strategy`
- "We need to keep these metrics but reduce cost" → `adaptive-metrics`
- "Which metric is the most expensive in DPM terms?" → `dpm-finder`
- "Write the PromQL to find this" → `promql`
- "Configure this in Alloy" → `alloy`
- "Why is my Loki slow?" → `loki-label-analyzer` (different system, same family of problems)

This skill's lane is diagnosis under pressure. Prevention, design, and post-ingest cost optimization live elsewhere.