observability-service-health
APM Service Health
Assess APM service health using Observability APIs,
ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use
SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.
Where to look
- Observability APIs: Use the SLOs API (Stack | Serverless) to get SLO definitions, status, burn rate, and error budget. Use the Alerting API (Stack | Serverless) to list and manage alerting rules and their alerts for the service. Use the APM annotations API to create or search annotations when needed.
- ES|QL and Elasticsearch: Query `traces*apm*,traces*otel*` and `metrics*apm*,metrics*otel*` with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. `POST _query` for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
- APM Correlations: Run the apm-correlations script to get attributes that correlate with high-latency or failed transactions for a given service. It tries the Kibana internal APM correlations API first, then falls back to Elasticsearch significant_terms on `traces*apm*,traces*otel*`. See APM Correlations script.
- Infrastructure: Correlate via resource attributes (e.g. `host.name`, `k8s.pod.name`, `container.id`) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Logs: Use ES|QL or Elasticsearch search on log indices filtered by `service.name` or `trace.id` to explain behavior and root cause.
- Observability Labs: Observability Labs and the APM tag for patterns and troubleshooting.
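The `POST _query` endpoint mentioned above runs ES|QL over HTTP. A minimal sketch (assumes the `ELASTICSEARCH_URL`/`ELASTICSEARCH_API_KEY` environment described under the APM Correlations script; the query itself is a placeholder):

```bash
# Run an ES|QL query via the Elasticsearch _query endpoint (text output).
curl -s -X POST "$ELASTICSEARCH_URL/_query?format=txt" \
  -H "Authorization: ApiKey $ELASTICSEARCH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "FROM traces*apm*,traces*otel* | WHERE service.name == \"<service_name>\" | STATS COUNT(*)"
  }'
```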
Health criteria
Synthesize health from all of the following when available:
| Signal | What to check |
|---|---|
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES|QL, APIs, Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |
Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe
degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to
explain why and suggest next steps.
Using ES|QL for APM metrics
When querying APM data from Elasticsearch (`traces*apm*,traces*otel*`, `metrics*apm*,metrics*otel*`), use ES|QL by default where available.
- Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
- Scoping to a service: Always filter by `service.name` (and `service.environment` when relevant). Combine with a time range on `@timestamp`:

```esql
WHERE service.name == "my-service-name" AND service.environment == "production"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
```

- Example patterns: Throughput, latency, and error rate over time: see Kibana (`trace_charts_definition.ts`: `getThroughputChart`, `getLatencyChart`, `getErrorRateChart`). Use `from(index)` → `where(...)`/`stats(...)` with `evaluate(...)` and `BUCKET(@timestamp, ...)`, scoped by `WHERE service.name == "<service_name>"`.
- Performance: Add `LIMIT n` to cap rows and token usage. Prefer a coarser `BUCKET(@timestamp, ...)` (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
APM Correlations script
When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on `traces*apm*,traces*otel*`.

Latency correlations (attributes over-represented in slow transactions)
```bash
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
```
Failed transaction correlations

```bash
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
```
Test Kibana connection

```bash
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
```

**Environment:** `KIBANA_URL` and `KIBANA_API_KEY` (or `KIBANA_USERNAME`/`KIBANA_PASSWORD`) for Kibana; for the fallback, `ELASTICSEARCH_URL` and `ELASTICSEARCH_API_KEY`. Use the same time range as the investigation.

Workflow
```text
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs/Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
```
Step 1: Identify the service
Confirm service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` (e.g. `WHERE service.name == "<name>"`) or Kibana repo APM routes to obtain service-level data. If the user has not provided the time range, assume the last hour.
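If the service has to be resolved from the data itself, a minimal ES|QL sketch (same index patterns as above) lists the most active candidates:

```esql
FROM traces*apm*,traces*otel*
| WHERE @timestamp >= NOW() - 1 hour
| STATS doc_count = COUNT(*) BY service.name
| SORT doc_count DESC
| LIMIT 20
```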
Step 2: Check SLOs and firing alerts
SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability), healthy/degrading/violated, burn rate, error budget. Alerts: For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, include both rules where `params.serviceName` matches the service and rules where `params.serviceName` is absent (all-services rules). Do not query `.alerts*` indices for active-state checks. Correlate with SLO violations or metric changes.
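One way to issue that Alerting API call (a sketch; assumes the `KIBANA_URL`/`KIBANA_API_KEY` environment described under the APM Correlations script):

```bash
# Find active APM alerting rules; --data-urlencode handles the filter syntax.
curl -sG "$KIBANA_URL/api/alerting/rules/_find" \
  -H "Authorization: ApiKey $KIBANA_API_KEY" \
  --data-urlencode "search=apm" \
  --data-urlencode "search_fields=tags" \
  --data-urlencode "per_page=100" \
  --data-urlencode "filter=alert.attributes.executionStatus.status:active"
```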
Step 3: Check ML anomalies
If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the
service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to
narrow Steps 4–5.
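A hedged ES|QL sketch for scanning anomaly records: the `.ml-anomalies-shared` results index and the `result_type`/`record_score` fields follow standard ML results conventions, but verify them against your job configuration (note that ML results use `timestamp`, not `@timestamp`):

```esql
FROM .ml-anomalies-shared
| WHERE result_type == "record" AND record_score >= 75
  AND timestamp >= "2025-03-01T00:00:00Z" AND timestamp <= "2025-03-01T23:59:59Z"
| STATS anomalies = COUNT(*), max_score = MAX(record_score) BY job_id
| SORT max_score DESC
| LIMIT 50
```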
Step 4: Review throughput, latency, and error rate
Use ES|QL against `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), and error rate (failed/total or 5xx/total). Example: `FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS ...`. Compare to the prior period or SLO targets. See Using ES|QL for APM metrics.
Step 5: Assess dependency health
Obtain dependency and service-map data via ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
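A hedged ES|QL sketch for downstream aggregations, assuming exit spans carry the standard APM fields `span.destination.service.resource`, `span.duration.us`, and `event.outcome`:

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>"
  AND span.destination.service.resource IS NOT NULL
  AND @timestamp >= NOW() - 1 hour
| STATS calls = COUNT(*),
        avg_duration_us = AVG(span.duration.us),
        failures = COUNT(*) WHERE event.outcome == "failure"
    BY span.destination.service.resource
| EVAL failure_rate = failures::double / calls
| SORT avg_duration_us DESC
| LIMIT 20
```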
Step 6: Correlate with infrastructure and logs
- APM Correlations (when only a subpopulation is affected): Run `node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]` to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script.
- Infrastructure: Use resource attributes from traces (e.g. `host.name`, `k8s.pod.name`, `container.id`) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation.
- Logs: Use ES|QL or Elasticsearch on log indices with `service.name == "<service_name>"` or `trace.id == "<trace_id>"` to explain behavior and root cause (exceptions, timeouts, restarts).
Step 7: Summarize and recommend
State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.
Examples
Example: ES|QL for a specific service
Scope with `WHERE service.name == "<service_name>"` and a time range. Throughput and error rate (1-hour buckets; `LIMIT` caps rows and tokens):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY hour = BUCKET(@timestamp, 1 hour)
| EVAL error_rate = failures::double / request_count
| SORT hour
| LIMIT 500
```

Latency percentiles and exact field names: see Kibana `trace_charts_definition.ts`.
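For latency percentiles, a hedged sketch assuming transaction documents expose `transaction.duration.us` (verify exact field names in `trace_charts_definition.ts`):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway" AND transaction.duration.us IS NOT NULL
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS avg_us = AVG(transaction.duration.us),
        p95_us = PERCENTILE(transaction.duration.us, 95),
        p99_us = PERCENTILE(transaction.duration.us, 99)
    BY hour = BUCKET(@timestamp, 1 hour)
| SORT hour
| LIMIT 500
```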
Example: "Is service X healthy?"
- Resolve service X and time range. Call the SLOs API and Alerting API; run ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo).
- Evaluate SLO status (violated/degrading?), firing rules, ML anomalies, and dependency health.
- Answer: Healthy / Degraded / Unhealthy with reasons and next steps (e.g. Observability Labs).
Example: "Why is service Y slow?"
- Service Y and slowness time range. Call SLOs API and Alerting API; run ES|QL for Y and dependencies; query ML anomaly results.
- Compare latency (avg/p95/p99) to prior period via ES|QL; from dependency data identify high-latency or failing deps.
- Summarize (e.g. p99 up; dependency Z elevated) and recommend (investigate Z; Observability Labs for latency).
Example: Correlate service to infrastructure (OpenTelemetry)
Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:
- From the service's traces or metrics, read resource attributes such as `host.name`, `k8s.pod.name`, `k8s.namespace.name`, or `container.id`.
- Run ES|QL or Elasticsearch search on infrastructure/metrics indices filtered by those resource values and the incident time range. Check CPU usage and memory consumption (e.g. `system.cpu.total.norm.pct`); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
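A hedged ES|QL sketch, assuming infrastructure metrics land in `metrics-*` with ECS-style `k8s.pod.name` and `system.cpu.total.norm.pct` fields (adjust the index pattern and field names to your setup):

```esql
FROM metrics-*
| WHERE k8s.pod.name == "<pod_name>"
  AND @timestamp >= "2025-03-01T10:00:00Z" AND @timestamp <= "2025-03-01T12:00:00Z"
| STATS avg_cpu = AVG(system.cpu.total.norm.pct),
        max_cpu = MAX(system.cpu.total.norm.pct)
    BY bucket = BUCKET(@timestamp, 10 minutes)
| SORT bucket
| LIMIT 500
```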
Example: Filter logs by service or trace ID
To understand behavior for a specific service or a single trace, filter logs accordingly:
- By service: Run ES|QL or Elasticsearch search on log indices with `service.name == "<service_name>"` and a time range to get application logs (errors, warnings, restarts) in the service context.
- By trace ID: When investigating a specific request, take the `trace.id` from the APM trace and filter logs by `trace.id == "<trace_id>"` (or the equivalent field in your log schema). Logs with that trace ID show the full request path and help explain failures or latency.
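A minimal ES|QL sketch for the trace-ID case (assuming ECS-style `logs-*` indices with `trace.id`, `log.level`, and `message` fields):

```esql
FROM logs-*
| WHERE trace.id == "<trace_id>"
| KEEP @timestamp, service.name, log.level, message
| SORT @timestamp
| LIMIT 200
```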
Guidelines
- Use Observability APIs (SLOs API, Alerting API) and ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (8.11+ or Serverless), filtering by `service.name` (and `service.environment` when relevant). For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, evaluate both rule types: rules where `params.serviceName` matches the target service, and rules where `params.serviceName` is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query `.alerts*` indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.
- Always use the user's time range; avoid assuming "last 1 hour" if the issue is historical.
- When SLOs exist, anchor the health summary to SLO status and burn rate; when they do not, rely on alerts, anomalies, throughput, latency, error rate, and dependencies.
- When analyzing only application metrics ingested via OpenTelemetry, use the ES|QL TS (time series) command for efficient metrics queries. The TS command is available in Elasticsearch 9.3+ and is always available in Elastic Observability Serverless.
- Summary: one short health verdict plus bullet points for evidence and next steps.
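As a sketch of the TS shape (the syntax is evolving; `<counter_field>` is a placeholder for an OpenTelemetry counter metric, and the availability of `rate(...)` here is an assumption, so check the ES|QL reference for your version):

```esql
TS metrics*otel*
| WHERE service.name == "<service_name>" AND @timestamp >= NOW() - 1 hour
| STATS per_min = SUM(rate(<counter_field>)) BY minute = BUCKET(@timestamp, 1 minute)
| SORT minute
```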