`victoriametrics-query`, `match[]`

`cardinality-tsdb`

```bash
# TSDB status for yesterday (GNU date syntax)
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50&date=$(date -d 'yesterday' +%Y-%m-%d)" | jq '.data'
```

```bash
# TSDB status for today
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | jq '.data'
```

```bash
# Per-label breakdown for labels that commonly drive cardinality
for label in pod instance container path url user_id request_id session_id trace_id le name; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, focus: .data.seriesCountByFocusLabelValue}'
done
```

`totalSeries`, `totalLabelValuePairs`, `seriesCountByMetricName`, `seriesCountByLabelName`, `seriesCountByLabelValuePair`

`cardinality-usage`

```bash
# Metric names with queryRequestsCount <= 0
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=0&limit=500" | jq '.'
```

```bash
# Metric names with queryRequestsCount <= 5
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?le=5&limit=500" | jq '.'
```

```bash
# How long usage stats have been collected
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/metric_names_stats?limit=1" | \
  jq '{statsCollectedSince: .statsCollectedSince, statsCollectedRecordsTotal: .statsCollectedRecordsTotal}'
```

`storage.trackMetricNamesStats`

```bash
# Expressions of all alerting and recording rules
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/rules" | jq '[.data.groups[].rules[].query]'
```

`cardinality-labels`

`/api/v1/labels`, `/api/v1/label/.../values`

```bash
# Query 1: unique values and series count per label name
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '{labelValueCountByLabelName: .data.labelValueCountByLabelName, seriesCountByLabelName: .data.seriesCountByLabelName}'
```

`labelValueCountByLabelName`, `/values`, `seriesCountByLabelName`

```bash
# Replace the placeholder with the top labels found by Query 1
for label in <top labels from Query 1>; do
  echo "=== focusLabel=$label ===" && \
  curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
    "$VM_METRICS_URL/api/v1/status/tsdb?topN=20&focusLabel=$label" | \
    jq --arg l "$label" '{label: $l, topValues: .data.seriesCountByFocusLabelValue}'
done
```

`seriesCountByFocusLabelValue`

```bash
# Series count per label=value pair
curl -s ${VM_AUTH_HEADER:+-H "$VM_AUTH_HEADER"} \
  "$VM_METRICS_URL/api/v1/status/tsdb?topN=50" | \
  jq '.data.seriesCountByLabelValuePair'
```

`label=value`

| Pattern | Regex hint | Indicates |
|---|---|---|
| UUIDs | | Request/session/trace IDs as labels |
| IP addresses | | Per-client or per-pod IP tracking |
| Long strings (>50 chars) | length check | Error messages, SQL, stack traces |
| SQL keywords | | Query text stored as label |
| URL paths with IDs | | Unsanitized HTTP paths |
| Timestamps | epoch or ISO8601 | Time values as labels (unbounded) |
| Stack traces | | Error details as labels |
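
The pattern table above can be applied mechanically to label values. The following is a minimal sketch, assuming values are checked one at a time; the function name `classify_label_value` and every regex in it are illustrative assumptions, not the hints from the original table.

```bash
# Classify one label value against the patterns in the table above.
# All regexes here are illustrative assumptions, not the document's own hints.
classify_label_value() {
  v=$1
  if printf '%s\n' "$v" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
    echo "uuid"
  elif printf '%s\n' "$v" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]+)?$'; then
    echo "ip"
  elif printf '%s\n' "$v" | grep -Eq '^[0-9]{10}([0-9]{3})?$|^[0-9]{4}-[0-9]{2}-[0-9]{2}T'; then
    echo "timestamp"            # epoch seconds/milliseconds or ISO8601 prefix
  elif [ "${#v}" -gt 50 ]; then
    echo "long_string"          # error messages, SQL, stack traces
  elif printf '%s\n' "$v" | grep -Eq '/[0-9]+(/|$)'; then
    echo "url_path_with_id"     # numeric path segment, e.g. /users/12345
  else
    echo "ok"
  fi
}
```

Calling `classify_label_value "$value"` for each value of a suspect label gives a rough first pass; anything it flags still deserves a manual look, since the checks are deliberately simple.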

`queryRequestsCount=0`, `queryRequestsCount≤5`

| Label pattern | Assessment | Typical remedy |
|---|---|---|
| | Should NEVER be metric labels; belongs in logs/traces | Drop label |
| | Correlation IDs: never metric labels | Drop label |
| | Unbounded strings | Drop label or replace with error code |
| | Query text in labels (unbounded) | Drop label |
| | Unbounded if not sanitized | Relabel to normalize, or stream aggregate without |
| | Normal for k8s but high churn | Stream aggregate without, if per-pod not needed |
| | Normal for node metrics, wasteful for app metrics | Stream aggregate without for app-level metrics |
| | Fine-grained buckets multiply every label combination | Reduce bucket count |

(series with this label) - (series without) ≈ series saved

`_bucket`, `le`

`dedup_interval`, `-search.maxStalenessInterval`

| Metric | Value |
|---|---|
| Total active series (today) | X |
| Total series (yesterday) | X |
| Churn ratio (yesterday / today) | X:1 |
| Unique metric names | X |
| Stats tracking since | <date> |
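
The churn ratio row divides yesterday's series total by today's. A small sketch of that arithmetic follows; it assumes both numbers were read from `totalSeries` in the two TSDB status responses, and the helper name `churn_ratio` is hypothetical.

```bash
# churn_ratio YESTERDAY_TOTAL TODAY_TOTAL
# Prints the ratio in the "X:1" form used by the template.
# Assumption: both arguments are totalSeries values from /api/v1/status/tsdb.
churn_ratio() {
  awk -v y="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.1f:1\n", y / t
  }'
}
```

For example, `churn_ratio 300000 100000` reports three times as many series yesterday as are active today, which usually points at short-lived series rather than steady growth.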
| Metric | Series | Last Queried | In Alert Rules | Action |
|---|---|---|---|---|
| ... | ... | never | no | Drop |
| ... | ... | never | yes — verify | Check rule |
| Label | Unique Values | Top Affected Metrics | Pattern | Action |
|---|---|---|---|---|
| user_id | 50,000 | http_requests_total | UUID | Drop |
| path | 10,000 | http_request_duration | URL paths | Aggregate |
| error_message | 5,000 | app_errors_total | Long strings | Drop |
| Metric | Bucket Count | Recommendation |
|---|---|---|
| ... | 30 | Reduce to standard 11 buckets |
| Observation | Value |
|---|---|
| Yesterday / today ratio | X:1 |
| Primary driver | Pod restarts / short-lived jobs |
| Category | Est. Series Saved | % of Total | Effort |
|---|---|---|---|
| Drop unused metrics | X | Y% | Low — relabeling only |
| Drop bad labels | X | Y% | Low — labeldrop |
| Stream aggregation | X | Y% | Medium — new config |
| Histogram reduction | X | Y% | Low — bucket filtering |
| Total | X | Y% | |
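
The "Est. Series Saved" and "% of Total" columns follow from the estimate (series with this label) - (series without) ≈ series saved. The sketch below shows one way to fill them in; the helper names `series_saved` and `pct_of_total` are hypothetical, and the inputs are assumed to be plain series counts taken from the TSDB status output.

```bash
# series_saved WITH_LABEL WITHOUT_LABEL
# (series with this label) - (series without) ~ series saved
series_saved() {
  echo $(( $1 - $2 ))
}

# pct_of_total SAVED TOTAL
# Share of the total active series that the saving represents.
pct_of_total() {
  awk -v s="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.1f%%\n", 100 * s / t
  }'
}
```

With 120000 series carrying a label and 8000 remaining after it is dropped, `series_saved 120000 8000` gives the first column and `pct_of_total 112000 400000` gives the second for a 400000-series total. These are estimates only; the real saving depends on how label combinations overlap.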
Adapt the template to actual findings — omit sections with no findings, expand sections
with significant findings.
---

```yaml
# Drop whole metrics by name
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric_to_drop|another_metric"
    action: drop
```

```yaml
# Drop labels by name
metric_relabel_configs:
  - regex: "label_to_drop|another_label"
    action: labeldrop
```

```yaml
# Normalize a high-cardinality path label
metric_relabel_configs:
  - source_labels: [path]
    regex: "/api/v1/users/[^/]+"
    target_label: path
    replacement: "/api/v1/users/:id"
```

```yaml
- match: '{__name__=~"http_.*"}'
  interval: 1m
  without: [instance, pod]
  outputs: [total]
```

```yaml
- match: 'http_requests_total'
  interval: 30s
  without: [path, user_id]
  outputs: [total]
```

```yaml
- match: '{__name__=~".*_bucket"}'
  interval: 1m
  without: [pod, instance]
  # Quoted so the commas are not parsed as YAML list separators
  outputs: ["quantiles(0.5, 0.9, 0.99)"]
  keep_metric_names: true
```

| Function | Use for | Example |
|---|---|---|
| | Counters (running sum) | request counts |
| | Gauge sums | memory usage across pods |
| | Sample counts | number of reporting instances |
| | Latest gauge value | current temperature |
| | Extremes | peak latency |
| | Averages | mean CPU usage |
| | Distribution estimation | latency percentiles |
| | Re-bucket histograms | reduce bucket granularity |

`total`, `last`, `avg`, `sum_samples`

| Method | CRD / Config | Scope |
|---|---|---|
| | | Per scrape target |
| Global relabeling | VMAgent | All metrics |
| Stream aggregation | VMAgent | All remote-written metrics |
| Per-remote-write SA | VMAgent | Per destination |
| Mistake | Fix |
|---|---|
| Dropping a metric used by alerts | Always cross-check |
| Enabling without testing | Verify aggregation output matches expectations first |
| Stream aggregating gauges with | Use |
| Forgetting | Without it, output gets long auto-generated suffix |
| Dropping | Only drop specific |
| Not considering recording rule dependencies | Check both alerting AND recording rules |
| Applying relabeling without testing | Use |
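
The cross-check against alerting and recording rules can be scripted. This is a rough sketch, assuming the rule expressions from `/api/v1/rules` have already been saved one per line to a local file; the helper name `metric_in_rules` and the file are hypothetical.

```bash
# metric_in_rules METRIC_NAME RULES_FILE
# Succeeds (exit 0) if the metric name appears as a whole word in any saved
# rule expression. Assumption: RULES_FILE holds one expression per line.
metric_in_rules() {
  # -F: literal match, -w: whole word only, -q: no output, exit status only
  grep -Fwq -- "$1" "$2"
}
```

A typical use is `metric_in_rules http_requests_total rules.txt || echo "not referenced by any rule"`. Whole-word matching keeps `http_requests` from matching `http_requests_total`, but a hit inside a label value or regex selector is still possible, so treat the result as a hint and confirm by reading the rule.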