observability-service-health

# APM Service Health

Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.

## Where to look

- Observability APIs: Use the SLOs API (Stack | Serverless) to get SLO definitions, status, burn rate, and error budget. Use the Alerting API (Stack | Serverless) to list and manage alerting rules and their alerts for the service. Use the APM annotations API to create or search annotations when needed.
- ES|QL and Elasticsearch: Query `traces*apm*,traces*otel*` and `metrics*apm*,metrics*otel*` with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. `POST _query` for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
- APM Correlations: Run the apm-correlations script to get attributes that correlate with high-latency or failed transactions for a given service. It tries the Kibana internal APM correlations API first, then falls back to Elasticsearch `significant_terms` on `traces*apm*,traces*otel*`. See APM Correlations script.
- Infrastructure: Correlate via resource attributes (e.g. `k8s.pod.name`, `container.id`, `host.name`) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
- Logs: Use ES|QL or Elasticsearch search on log indices filtered by `service.name` or `trace.id` to explain behavior and root cause.
- Observability Labs: See Observability Labs and its APM tag for patterns and troubleshooting.

## Health criteria

Synthesize health from all of the following when available:

| Signal | What to check |
| --- | --- |
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES\|QL, APIs, Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |

Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.
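
As a concrete sketch of the latency signal, the following ES|QL query computes avg/p95/p99 latency for one service. It assumes classic APM transaction documents where `transaction.duration.us` holds duration in microseconds; OTel-mapped indices may use a different duration field, so verify field names against your data (or `trace_charts_definition.ts`):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>" AND @timestamp >= NOW() - 1 hour
| STATS avg_ms = AVG(transaction.duration.us) / 1000,
        p95_ms = PERCENTILE(transaction.duration.us, 95) / 1000,
        p99_ms = PERCENTILE(transaction.duration.us, 99) / 1000
```

Compare the results against the SLO latency target, or run the same query over a prior window for a baseline.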

## Using ES|QL for APM metrics

When querying APM data from Elasticsearch (`traces*apm*,traces*otel*`, `metrics*apm*,metrics*otel*`), use ES|QL by default where available.

- Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
- Scoping to a service: Always filter by `service.name` (and `service.environment` when relevant). Combine with a time range on `@timestamp`:

  ```esql
  WHERE service.name == "my-service-name" AND service.environment == "production"
    AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  ```

- Example patterns: For throughput, latency, and error rate over time, see Kibana `trace_charts_definition.ts` (`getThroughputChart`, `getLatencyChart`, `getErrorRateChart`). Use `from(index)`, `where(...)`, `stats(...)`/`evaluate(...)` with `BUCKET(@timestamp, ...)` and `WHERE service.name == "<service_name>"`.
- Performance: Add `LIMIT n` to cap rows and token usage. Prefer coarser `BUCKET(@timestamp, ...)` (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
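
For instance, a week-long throughput trend at 1-hour resolution keeps the result small. This is a sketch only, using the index patterns from this document (the bucket is aliased so it can be sorted after `STATS`):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>" AND @timestamp >= NOW() - 7 days
| STATS request_count = COUNT(*) BY ts_bucket = BUCKET(@timestamp, 1 hour)
| SORT ts_bucket
| LIMIT 200
```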

## APM Correlations script

When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch `significant_terms` on `traces*apm*,traces*otel*`.

### Latency correlations (attributes over-represented in slow transactions)

```bash
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
```

### Failed transaction correlations

```bash
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]
```

### Test Kibana connection

```bash
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]
```

**Environment:** `KIBANA_URL` and `KIBANA_API_KEY` (or `KIBANA_USERNAME`/`KIBANA_PASSWORD`) for Kibana; for fallback,
`ELASTICSEARCH_URL` and `ELASTICSEARCH_API_KEY`. Use the same time range as the investigation.

## Workflow

```text
Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions
```

### Step 1: Identify the service

Confirm the service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` (e.g. `WHERE service.name == "<name>"`) or Kibana repo APM routes to obtain service-level data. If the user has not provided a time range, assume the last hour.
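
If the service name is ambiguous, one way (a sketch, not a prescribed step) to enumerate candidate services in the time range is:

```esql
FROM traces*apm*,traces*otel*
| WHERE @timestamp >= NOW() - 1 hour
| STATS doc_count = COUNT(*) BY service.name
| SORT doc_count DESC
| LIMIT 20
```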

### Step 2: Check SLOs and firing alerts

**SLOs:** Call the SLOs API to get SLO definitions and status for the service (latency, availability): healthy/degrading/violated, burn rate, and error budget.

**Alerts:** For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, include both rules where `params.serviceName` matches the service and rules where `params.serviceName` is absent (all-services rules). Do not query `.alerts*` indices for active-state checks. Correlate with SLO violations or metric changes.

### Step 3: Check ML anomalies

If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
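
As a sketch, high-severity anomaly records can be pulled directly from the ML results indices. This assumes the default shared results index `.ml-anomalies-shared` and standard anomaly record fields (`result_type`, `record_score`); job configuration may route results elsewhere:

```esql
FROM .ml-anomalies-shared
| WHERE result_type == "record" AND record_score >= 75
  AND timestamp >= NOW() - 24 hours
| KEEP timestamp, job_id, record_score, partition_field_value
| SORT record_score DESC
| LIMIT 20
```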

### Step 4: Review throughput, latency, and error rate

Use ES|QL against `traces*apm*,traces*otel*` or `metrics*apm*,metrics*otel*` for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), and error rate (failed/total or 5xx/total). Example: `FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS ...`. Compare to the prior period or to SLO targets. See Using ES|QL for APM metrics.

### Step 5: Assess dependency health

Obtain dependency and service-map data via ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
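
A rough dependency breakdown can be derived from exit spans. This sketch assumes classic APM span fields (`span.destination.service.resource`, `span.duration.us`); OTel-mapped data may use different names:

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>" AND span.destination.service.resource IS NOT NULL
  AND @timestamp >= NOW() - 1 hour
| STATS calls = COUNT(*),
        avg_ms = AVG(span.duration.us) / 1000,
        failures = COUNT(*) WHERE event.outcome == "failure"
  BY span.destination.service.resource
| EVAL error_rate = failures / calls
| SORT avg_ms DESC
| LIMIT 25
```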

### Step 6: Correlate with infrastructure and logs

- APM Correlations (when only a subpopulation is affected): Run `node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...]` to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm the root cause. See APM Correlations script.
- Infrastructure: Use resource attributes from traces (e.g. `k8s.pod.name`, `container.id`, `host.name`) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation.
- Logs: Use ES|QL or Elasticsearch on log indices with `service.name == "<service_name>"` or `trace.id == "<trace_id>"` to explain behavior and root cause (exceptions, timeouts, restarts).

### Step 7: Summarize and recommend

State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.

## Examples

### Example: ES|QL for a specific service

Scope with `WHERE service.name == "<service_name>"` and a time range. Throughput and error rate (1-hour buckets; `LIMIT` caps rows and tokens; the bucket is aliased so it can be sorted after `STATS`):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY ts_bucket = BUCKET(@timestamp, 1 hour)
| EVAL error_rate = failures / request_count
| SORT ts_bucket
| LIMIT 500
```

For latency percentiles and exact field names, see Kibana `trace_charts_definition.ts`.

### Example: "Is service X healthy?"

1. Resolve service X and the time range. Call the SLOs API and Alerting API; run ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo).
2. Evaluate SLO status (violated/degrading?), firing rules, ML anomalies, and dependency health.
3. Answer: Healthy / Degraded / Unhealthy with reasons and next steps (e.g. Observability Labs).

### Example: "Why is service Y slow?"

1. Confirm service Y and the slowness time range. Call the SLOs API and Alerting API; run ES|QL for Y and its dependencies; query ML anomaly results.
2. Compare latency (avg/p95/p99) to the prior period via ES|QL; from dependency data, identify high-latency or failing dependencies.
3. Summarize (e.g. p99 up; dependency Z elevated) and recommend next steps (investigate Z; see Observability Labs for latency troubleshooting).
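
The comparison in step 2 can be done in a single query by labeling the current and previous hour; a sketch assuming `transaction.duration.us` holds latency in microseconds (OTel-mapped indices may differ):

```esql
FROM traces*apm*,traces*otel*
| WHERE service.name == "<service_name>" AND @timestamp >= NOW() - 2 hours
| EVAL period = CASE(@timestamp >= NOW() - 1 hour, "current", "previous")
| STATS p99_ms = PERCENTILE(transaction.duration.us, 99) / 1000 BY period
```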

### Example: Correlate service to infrastructure (OpenTelemetry)

Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:

- From the service's traces or metrics, read resource attributes such as `k8s.pod.name`, `k8s.namespace.name`, `container.id`, or `host.name`.
- Run ES|QL or Elasticsearch search on infrastructure/metrics indices filtered by those resource values and the incident time range. Check CPU usage and memory consumption (e.g. `system.cpu.total.norm.pct`); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
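
For instance, once a `host.name` (or pod/container) is known from the traces, CPU can be inspected over the incident window. This sketch assumes a `metrics-*` index pattern and the `system.cpu.total.norm.pct` field mentioned above; substitute your own index pattern and timestamps:

```esql
FROM metrics-*
| WHERE host.name == "<host_name>"
  AND @timestamp >= "<incident_start>" AND @timestamp <= "<incident_end>"
| STATS max_cpu = MAX(system.cpu.total.norm.pct) BY ts_bucket = BUCKET(@timestamp, 5 minutes)
| SORT ts_bucket
```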

### Example: Filter logs by service or trace ID

To understand behavior for a specific service or a single trace, filter logs accordingly:

- By service: Run ES|QL or Elasticsearch search on log indices with `service.name == "<service_name>"` and a time range to get application logs (errors, warnings, restarts) in the service context.
- By trace ID: When investigating a specific request, take the `trace.id` from the APM trace and filter logs by `trace.id == "<trace_id>"` (or the equivalent field in your log schema). Logs with that trace ID show the full request path and help explain failures or latency.
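
A sketch of the trace-scoped log query, assuming a `logs-*` index pattern and ECS log fields (`log.level`, `message`); adjust both to your log schema:

```esql
FROM logs-*
| WHERE trace.id == "<trace_id>"
| KEEP @timestamp, log.level, service.name, message
| SORT @timestamp
| LIMIT 100
```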

## Guidelines

- Use Observability APIs (SLOs API, Alerting API) and ES|QL on `traces*apm*,traces*otel*`/`metrics*apm*,metrics*otel*` (8.11+ or Serverless), filtering by `service.name` (and `service.environment` when relevant). For active APM alerts, call `/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active`. When checking one service, evaluate both rule types: rules where `params.serviceName` matches the target service, and rules where `params.serviceName` is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query `.alerts*` indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.
- Always use the user's time range; avoid assuming "last 1 hour" if the issue is historical.
- When SLOs exist, anchor the health summary to SLO status and burn rate; when they do not, rely on alerts, anomalies, throughput, latency, error rate, and dependencies.
- When analyzing only application metrics ingested via OpenTelemetry, use the ES|QL TS (time series) command for efficient metrics queries. The TS command is available in Elasticsearch 9.3+ and is always available in Elastic Observability Serverless.
- Summary: one short health verdict plus bullet points for evidence and next steps.
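
A sketch of the TS form (Elasticsearch 9.3+ only). The metric name here is a hypothetical OTel counter, not a field guaranteed to exist; substitute the counter your application actually ships:

```esql
TS metrics*otel*
| WHERE service.name == "<service_name>" AND @timestamp >= NOW() - 1 hour
| STATS requests_per_min = SUM(RATE(http.server.request.count))
  BY BUCKET(@timestamp, 1 minute)
```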