cx-metrics-query
Metrics Query Skill
Use this skill to investigate production issues, answer performance questions, and explore metrics data using the `cx metrics` CLI commands. The workflow guides metric discovery, label exploration, and PromQL query construction.
CLI Commands
All metrics operations use `cx metrics` with four subcommands:

| Command | Purpose | Key flags |
|---|---|---|
| `cx metrics search` | Find metrics by name (wildcard or substring) | `--name` |
| `cx metrics get-labels` | List available label names for a metric | - |
| `cx metrics query` | Instant PromQL query (single point in time) | `--time` |
| `cx metrics query-range` | Range PromQL query (time series) | `--start`, `--end`, `--step` |

Output format: append `-o json` or `-o agents` to any command for machine-readable output.
Search examples
```bash
# Exact substring match
cx metrics search --name http_requests

# Wildcard: find all CPU metrics
cx metrics search --name '*cpu*'

# List all metrics
cx metrics search --name '*'
```
Instant query examples
```bash
# Current state
cx metrics query 'up'

# At a specific time
cx metrics query 'rate(http_requests_total[5m])' --time 2024-01-01T12:00:00Z

# Machine-readable output for further processing
cx metrics query 'sum by (service) (rate(http_errors_total[5m]))' -o agents
```
Range query examples
```bash
# Last hour, default step (1m)
cx metrics query-range 'rate(http_requests_total[5m])'

# Custom window and step
cx metrics query-range 'sum by (service) (rate(http_requests_total[5m]))' \
  --start now-6h --end now --step 5m

# Daily aggregation over the last week
cx metrics query-range 'max by () (max_over_time(cpu_usage[1d]))' \
  --start now-7d --end now --step 1d
```
Label discovery example
```bash
cx metrics get-labels http_requests_total
# Returns: job, instance, method, route, status_code, ...
```
Time Syntax
All time arguments accept:
- Relative: `now`, `now-1h`, `now-30m`, `now-2d`, `now-1w`
- Absolute: RFC3339/ISO 8601 - `2024-01-01T00:00:00Z`
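As a rough pre-flight check, the two accepted forms can be told apart with a pair of regular expressions. This is only a sketch: `classify_time` is a hypothetical helper, the unit set (`m`, `h`, `d`, `w`) is inferred from the examples above, and the CLI itself may accept forms that this check rejects.

```bash
# Classify a time argument as "relative", "absolute", or "invalid".
# Hypothetical helper; the accepted units are an assumption based on the examples.
classify_time() {
  if printf '%s\n' "$1" | grep -Eq '^now(-[0-9]+[mhdw])?$'; then
    echo relative
  elif printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})$'; then
    echo absolute
  else
    echo invalid
  fi
}

classify_time now-30m
classify_time 2024-01-01T00:00:00Z
```

Running such a check before passing a value to `--time`, `--start`, or `--end` can catch typos early without spending a query attempt.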
Investigation Workflow
1. Initial Assessment
When given a vague problem, ask 1–2 focused clarifying questions before proceeding:
- What exactly is failing or behaving unexpectedly?
- When did it start? What is the affected time window?
If the question is already specific enough, start investigating immediately.
2. Metric Discovery
Always start by searching for relevant metrics before querying:
```bash
# Try domain-specific patterns first
cx metrics search --name 'http'
cx metrics search --name 'error'
cx metrics search --name 'latency'
cx metrics search --name 'cpu'
cx metrics search --name 'memory'

# If nothing found, broaden the search
cx metrics search --name 'request'
cx metrics search --name '*'  # full list as last resort
```
When two similar metrics are found and one is suffixed with `_count`, prefer the one without the suffix: `_count` typically tracks the number of observations, not the measured value itself.
3. Label Discovery
Once a relevant metric is identified, discover its labels before filtering:
```bash
cx metrics get-labels <metric_name>
```
Use the returned label names to build precise PromQL filters. Note: label values are not directly queryable via the CLI; infer them from query results or domain knowledge.
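One common PromQL idiom for inferring label values from query results is to group the metric by the label and read the distinct values off the returned series. A minimal sketch, where `label_values_query` is a hypothetical helper and `route` is taken from the example label list earlier in this document:

```bash
# Print a PromQL expression that returns one series per distinct value of a label.
# Hypothetical helper: $1 is the metric name, $2 is the label name.
label_values_query() {
  printf 'count by (%s) (%s)\n' "$2" "$1"
}

label_values_query http_requests_total route

# To run it (assumes cx is installed and configured):
#   cx metrics query "$(label_values_query http_requests_total route)"
```

Each series in the result carries one value of the label, and the sample value is the number of underlying series that share it.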
4. Query Construction & Execution
Choose the right query type:
- Instant query (`cx metrics query`) - use for current state, single values, or absolute aggregations over a window. Use `--time` to query historical data at a specific moment.
- Range query (`cx metrics query-range`) - use when comparing across time periods (e.g., per-day DAU, hourly error rate trend). Set `--step` to match any `[window]` used in temporal functions.

Start simple, add complexity as needed:
```bash
# Step 1: Check if metric exists and has data
cx metrics query 'http_requests_total'

# Step 2: Add label filters and aggregation
cx metrics query 'sum by (status) (rate(http_requests_total[5m]))'

# Step 3: Build the final diagnostic query
cx metrics query 'sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
```
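The advice to match `--step` to the `[window]` can be made mechanical by deriving both from a single value, so the two cannot drift apart. The sketch below only prints the command; `aligned_range_cmd` is a hypothetical helper, and `max_over_time` is used purely because it appears in the examples in this document.

```bash
# Print a query-range command whose --step equals the function's window.
# Hypothetical helper: $1 = metric, $2 = window (also used as step), $3 = lookback span.
aligned_range_cmd() {
  printf "cx metrics query-range 'max_over_time(%s[%s])' --start now-%s --end now --step %s\n" \
    "$1" "$2" "$3" "$2"
}

aligned_range_cmd cpu_usage 1d 7d
```

With a step equal to the window, each returned point summarizes its own non-overlapping interval, which is what a per-day or per-hour comparison usually needs.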
5. Retry Logic
If a query returns no results or an error:
- Check metric name - run `cx metrics search --name '*<keyword>*'` with a broader term
- Check label names - run `cx metrics get-labels <metric>` to verify filter keys
- Widen the time range or shorten the rate window
- If filtering on a label that may be empty, exclude empty values: `{label!=""}`
- Try an alternative metric name or structure

Maximum 5 retry attempts per query, each with a concrete improvement.
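The retry budget above can be sketched as a loop over progressively improved candidate queries. This is an illustration under assumptions, not the CLI's own behavior: `run_query` and `try_queries` are hypothetical helpers, and treating empty output or a non-zero exit status as "no results" is a guess about how `cx` reports an empty result set.

```bash
# Hypothetical wrapper around the CLI; replace with whatever invocation fits.
run_query() {
  cx metrics query "$1" -o json
}

# Try each candidate query in order, stopping at the first one that yields output.
# Gives up after 5 attempts, matching the retry limit described above.
try_queries() {
  attempt=0
  for q in "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt 5 ]; then
      break
    fi
    # Assumption: a failed or empty query produces no output.
    result=$(run_query "$q" 2>/dev/null) || result=""
    if [ -n "$result" ]; then
      printf 'attempt %d succeeded: %s\n' "$attempt" "$q"
      return 0
    fi
  done
  echo "no results within the retry budget" >&2
  return 1
}

# Example call, each candidate being a concrete improvement on the previous one:
#   try_queries 'rate(http_errors_total[5m])' 'rate(http_requests_total{status=~"5.."}[5m])'
```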
6. Pattern Recognition & Root Cause Analysis
After collecting results:
- Correlate across metrics (e.g., does an error spike coincide with a CPU spike?)
- Look for temporal patterns - recurring peaks, sudden step changes
- Cross-layer analysis: app → services → infrastructure → dependencies
- Provide actionable next steps, not just data
7. Summarize Frequently
PromQL results can be large. After every few queries, summarize:
- Key findings so far
- Queries already run
- Next planned queries
- Ask to continue if more investigation is needed
Common Investigation Patterns
HTTP Errors
- Check error rate: `sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))`
- Compare to total RPS: `sum by (service) (rate(http_requests_total[5m]))`
- Check pod/deployment health metrics
- Check dependency latency
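Dividing the error rate by the total rate gives a per-service error ratio, which is often easier to compare across services than two raw rates. The sketch below only assembles and prints the expression. It assumes the metric and label names from the examples above; both sides are aggregated by `service` so the division matches series one to one.

```bash
# Compose a per-service error ratio from the two expressions above.
errors='sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
total='sum by (service) (rate(http_requests_total[5m]))'
ratio="$errors / $total"

printf '%s\n' "$ratio"

# To run it (assumes cx is installed and configured):
#   cx metrics query "$ratio"
```

A service with no 5xx responses in the window has no series on the left-hand side, so it is simply absent from the result rather than reported as zero.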
Performance / Latency
- Check p95/p99 latency via histograms: `histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))`
- Check resource saturation: CPU, memory, disk
- Check autoscaling metrics
- Check dependency response times
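Because p95 and p99 differ only in the quantile argument, the expression can be parameterized. A small sketch, where `latency_quantile` is a hypothetical helper and the histogram name is the one used in the example above:

```bash
# Print the histogram_quantile expression for a given quantile (e.g. 0.95, 0.99).
# Hypothetical helper; the `le` label must stay in the grouping for the quantile to work.
latency_quantile() {
  printf 'histogram_quantile(%s, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))\n' "$1"
}

latency_quantile 0.99

# To run it (assumes cx is installed and configured):
#   cx metrics query "$(latency_quantile 0.99)"
```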
Availability
- Check `up` metric across services: `cx metrics query 'up'`
- Check pod restart counts
- Check node health
- Check service discovery metrics
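Under the usual Prometheus convention, `up` is 1 for a reachable scrape target and 0 otherwise, so filtering on `up == 0` narrows the output to the targets that are actually down. The sketch below assumes the backend follows that convention and that `up` carries a `job` label, as in the label list shown earlier.

```bash
# Count unreachable targets per job instead of listing every healthy one.
down_by_job='count by (job) (up == 0)'

printf '%s\n' "$down_by_job"

# To run it (assumes cx is installed and configured):
#   cx metrics query "$down_by_job"
```

An empty result here means no target is currently reporting as down, which is different from the metric being missing, so it is worth confirming that a plain `up` query returns data first.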
Key Principles
- Discover before querying: always search for metric names first
- Instant over range: prefer instant queries unless the question requires a time series
- Align step with window: when using `max_over_time(metric[1d])`, set `--step 1d`
- Filter empty labels: if results have blank label values, add `{label!=""}` to the filter
- Aggregate early: use `sum by (...)` to reduce cardinality before further operations
Additional Resources
Reference Files
- `references/promql-guidelines.md` - Complete PromQL reference: value types, counter vs gauge, histograms, common tasks, gotchas, and a mini cheat-sheet