signoz-explaining-alerts

Alert Explain

Decode an existing SigNoz alert's configuration into a plain-language explanation. The skill is read-only and stays focused on the rule itself: what it watches, when it fires, where it notifies. A single line of fire-frequency data is included to ground the explanation, but this skill does not investigate any specific fire — that is signoz-investigating-alerts's job.

Prerequisites

This skill calls SigNoz MCP server tools (signoz:signoz_get_alert, signoz:signoz_list_alert_rules, signoz:signoz_get_alert_history). Before running the workflow, confirm the signoz:signoz_* tools are available. If they are not, the SigNoz MCP server is not installed or configured — stop and direct the user to set it up: https://signoz.io/docs/ai/signoz-mcp-server/. Do not guess at alert configuration from the rule name alone.
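
A minimal sketch of that availability check, assuming a hypothetical list_available_tools() helper for however the host exposes the MCP tool registry:

```python
# Sketch only: list_available_tools() is a hypothetical stand-in for the
# host's MCP tool registry; the check itself mirrors the rule above.
SETUP_URL = "https://signoz.io/docs/ai/signoz-mcp-server/"

def ensure_signoz_tools(list_available_tools) -> None:
    tools = list_available_tools()  # e.g. ["signoz:signoz_get_alert", ...]
    if not any(name.startswith("signoz:signoz_") for name in tools):
        # Stop here: do not guess at alert configuration from the rule name alone.
        raise RuntimeError(f"SigNoz MCP server not configured; see {SETUP_URL}")
```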

When to use

Use this skill when the user wants to:
  • Understand or interpret an existing alert rule.
  • Confirm what signal an alert watches and at what threshold.
  • Audit whether an alert is reasonably configured.
  • Translate raw alert JSON into operational language.
Do NOT use when the user wants to:
  • Create a new alert → signoz-creating-alerts.
  • Diagnose why an alert fired or correlate signals around a fire window → signoz-investigating-alerts.
  • Modify an existing alert → call signoz:signoz_update_alert directly.

Required inputs

| Input | Required | Source if missing |
| --- | --- | --- |
| Alert identifier (rule ID or name) | yes | $ARGUMENTS, recent context, or fuzzy match |

If the input is missing or ambiguous, this skill is best-effort (not strict — read-only operations are cheap to recover from):
  1. Call signoz:signoz_list_alert_rules, paginate through every page, and find the closest name match.
  2. State the interpretation in the response: "Interpreting your request as alert 'High Error Rate — Checkout' (id 42). If you meant a different one, tell me the name or id."
  3. Proceed with the explanation. The user can correct after.

Workflow

Step 1: Resolve the alert

If the user provided a numeric id, skip to Step 2. Otherwise:
  1. Call signoz:signoz_list_alert_rules and paginate every page — keep going while pagination.hasMore is true — until the full list is walked.
  2. Match by name (case-insensitive substring). If multiple match, present the candidates and ask which one (interactive) or pick the closest and flag the assumption (autonomous). A sketch of this loop follows.
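
A minimal sketch of the resolve loop, assuming a generic call_tool helper for MCP invocations; the response fields (rules, pagination.hasMore) and the page parameter are assumptions about the response shape, not a documented schema:

```python
# Sketch only: call_tool is a hypothetical MCP invocation helper; the
# response fields ("rules", "pagination.hasMore") and the "page" parameter
# are assumed shapes, not the documented signoz_list_alert_rules schema.
def resolve_alert(call_tool, user_text: str):
    rules, page = [], 1
    while True:
        resp = call_tool("signoz:signoz_list_alert_rules", {"page": page})
        rules.extend(resp["rules"])
        if not resp["pagination"]["hasMore"]:  # walk every page before giving up
            break
        page += 1
    needle = user_text.lower()
    matches = [r for r in rules if needle in r["name"].lower()]  # case-insensitive substring
    if len(matches) > 1:
        return matches  # interactive: ask which one; autonomous: pick closest, flag it
    return matches[0] if matches else None
```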

Step 2: Fetch the full configuration

Call signoz:signoz_get_alert with the rule id. This is mandatory — the list response does not include the full condition / thresholds / notification settings, and explanations based on the name alone are guesses.

Step 3: Pull a one-line fire-frequency summary

Call signoz:signoz_get_alert_history for the rule with a 7-day lookback. From the response, derive a single line: Fired N times in the last 7d (last fire: <relative-time>). If the alert never fired in the window, say so explicitly: "Has not fired in the last 7d." If the alert is disabled, mention that and skip the history line.
This single line grounds the explanation. Do not drill into specific fires here — that's signoz-investigating-alerts.
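
A sketch of deriving that line; the history response shape (a list of fire timestamps) is an assumption for illustration:

```python
# Illustrative only: "fires" as a list of epoch-second timestamps is an
# assumed shape, not the documented signoz_get_alert_history schema.
import time

def fire_frequency_line(fires: list[float], disabled: bool) -> str | None:
    if disabled:
        return None  # mention the disabled status and skip the history line
    if not fires:
        return "Has not fired in the last 7d."
    hours_ago = (time.time() - max(fires)) / 3600
    return f"Fired {len(fires)} times in the last 7d (last fire: {hours_ago:.0f}h ago)."
```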

Step 4: Build the explanation

The single most useful thing for the user is a tight summary. Lead with a TL;DR that directly answers the question they asked, not a generic alert summary. The TL;DR is the only thing some users will read — burying their answer under a fixed template forces them to scroll for what they wanted in the first place.
Match the TL;DR shape to the user's question:
  • "What does this alert do?" / "Explain X" — describe what fires: TL;DR: Fires when <condition> for <scope>, notifies <channel>. <fire-frequency line>.
  • "Is it configured correctly?" / "Audit this" / "Anything I should change?" — lead with the verdict and the top 1–3 changes, not the description of what fires: TL;DR: Mostly well-configured, but recommend: (1) add alertOnAbsent — currently a crashed service stays silent; (2) fix annotation template {{$topic}} → {{$labels.topic}} (won't interpolate); (3) split critical to PagerDuty (both tiers currently route to Slack). <fire-frequency line>.
  • "How does X work?" / "Explain the count guard" — answer the mechanism in 1–2 sentences before any framing: TL;DR: The count guard is a having: count() > 50 clause on query A — any 1-minute bucket with ≤50 spans is dropped before evaluation, so low-traffic minutes can't fire the alert.
  • "What's the threshold?" / focused config question — state the exact thing they asked about: TL;DR: Threshold is 3 standard deviations (z-score), not a raw rate value. Daily seasonality means the model compares each hour against historical norms for that hour.
Always include the fire-frequency line and disabled status if non-default — those ground every kind of TL;DR. But put the answer to the user's specific question first.
After the TL;DR, write the explanation in prose, organized into the four sections below. Skip any section that has nothing meaningful to add — empty severity labels, default notification settings, vanilla annotations don't deserve a header. Short and skimmable beats perfunctorily complete; the user is not reading a checklist.
1. What it watches — one short paragraph. Combine signal type (metrics / logs / traces / exceptions), what the query measures, and scope. Translate the query to operational language; for formulas, name each sub-query (A, B, …) and state what F1 (or whichever selectedQueryName triggers) computes — e.g. "F1 = A × 100 / B → error percentage". Decode filter operators (= equals, != not equals, IN / NOT IN, LIKE / ILIKE, CONTAINS, REGEXP, EXISTS / NOT EXISTS); enumerate IN / NOT IN value lists so the user can verify them. Name each groupBy dimension and its practical effect ("fires separately per service" for service.name).
For anomaly rules (ruleType: anomaly_rule), explicitly state that the threshold is in standard deviations from the learned pattern, not the raw value — this is the most common point of confusion. Include algorithm (zscore), seasonality (hourly / daily / weekly), and how lower/higher targets shift sensitivity (lower → more noise, higher → only extreme deviations).
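
For illustration, a hypothetical formula rule and its decoded reading (the condition shape here is simplified and every value is invented):

```python
# Hypothetical, simplified condition -- invented values, not a real rule.
condition = {
    "queries": {
        "A": "count of spans WHERE service.name = 'checkout' AND hasError = true",
        "B": "count of spans WHERE service.name = 'checkout'",
    },
    "formulas": {"F1": "A * 100 / B"},
    "selectedQueryName": "F1",
    "groupBy": ["service.name"],
}
# Decoded: "A counts errored checkout spans, B counts all checkout spans,
# F1 = A x 100 / B is the error percentage; the alert triggers on F1 and
# fires separately per service."
```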
2. When it fires — one paragraph covering threshold + timing. Decode the threshold spec into plain English using these mappings (a code sketch follows this item):
  • op codes: 1 above, 2 below, 3 equal, 4 not equal.
  • matchType codes: 1 at_least_once (any point in window), 2 all_the_times (entire window), 3 on_average (window average), 4 in_total (window sum), 5 last (most recent point).
State each threshold tier's name, target, targetUnit, and attached channels. Always state the threshold in targetUnit, not the native query unit (e.g. "fires when p99 exceeds 500 ms", not "…exceeds 500 000 000 ns"). Note recoveryTarget if set (hysteresis); if absent, mention flap risk when the value hovers near the boundary. Describe timing as "checks every <frequency> over the last <evalWindow>", and mention that with at_least_once a single-point breach triggers, while all_the_times requires the full window.
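
A minimal sketch of those decode mappings; the integer codes come from the lists above, while the decoder function itself is illustrative:

```python
# The op / matchType codes mirror the mappings above; the decoder function
# is an illustrative sketch, not SigNoz code.
OPS = {1: "above", 2: "below", 3: "equal to", 4: "not equal to"}
MATCH = {
    1: "at any point in the window",   # at_least_once
    2: "for the entire window",        # all_the_times
    3: "on average over the window",   # on_average
    4: "summed over the window",       # in_total
    5: "at the most recent point",     # last
}

def decode_threshold(op: int, match_type: int, target: float, unit: str) -> str:
    return f"fires when the value is {OPS[op]} {target} {unit} {MATCH[match_type]}"

# decode_threshold(1, 3, 500, "ms")
# -> "fires when the value is above 500 ms on average over the window"
```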
3. Where it notifies — channels per tier (resolved by name from signoz_list_notification_channels if needed), notificationSettings.groupBy (how notifications are bundled), renotify (interval + which states), usePolicy (label-based routing). Skip this section entirely if notification settings are vanilla and the user already saw the channel in the TL;DR.
4. Notable concerns — flag only what's non-default and worth the user's attention. Don't list every absent field; focus on the high-leverage ones:
  • alertOnAbsent missing when the signal is critical: silent data loss (crashed service, broken instrumentation) won't trigger the alert. Always call this out for production-tier rules.
  • alertOnAbsent: true but nodata not in renotify.alertStates: the absent-data fire pages once and then goes silent — easy to miss.
  • Template variable bugs: {{$topic}} won't interpolate; the correct form is {{$labels.topic}}. Dots in label keys become underscores (service.name → {{$labels.service_name}}); a before/after example follows this step.
  • Multiple severity tiers but labels.severity missing on the rule — breaks label-based routing policies. Common gap.
  • All tiers route to the same channel — defeats the point of graduated thresholds.
  • High-cardinality groupBy (e.g. pod.name × partition) → notification-storm risk during cluster-wide events.
  • Annotation/description text contradicts matchType (e.g. description says "for over 5 minutes" but matchType=at_least_once fires on first breach within the window).
  • Alert name doesn't match the filter target (e.g. name says "checkout" but filter targets payments) — call this out.
If none of these apply, omit the section. Better silent than padded.
If the user asked only "what does this alert do", stop here. The audit (Step 5) is for "is it configured correctly" / "audit this" / "anything I should change" requests.
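
A before/after for the template-variable bullet above; the annotation text is invented:

```python
# Invented annotation text, illustrating the template-variable fixes above.
broken = "Consumer lag high on {{$topic}} for {{$labels.service.name}}"
# {{$topic}} won't interpolate, and the dotted label key is wrong.
fixed = "Consumer lag high on {{$labels.topic}} for {{$labels.service_name}}"
# Correct form: $labels prefix, dots in label keys turned into underscores.
```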

Step 5: Assess the configuration (only if asked)

The user may ask "is this alert reasonable" alongside the explanation. Only assess when asked or when the request implies it (audit, review, "is this configured correctly"). Keep assessment grounded in what's actually in the config:
  • Threshold calibration — appropriate for the signal? Consider service criticality and traffic.
  • matchType fit — at_least_once is sensitive (catches transients); all_the_times is conservative; on_average smooths noise.
  • Window vs frequency — short window + at_least_once can be noisy. Long window can delay detection.
  • Multi-severity — alerts with both warning and critical thresholds enable graduated response. Single-severity alerts miss this.
  • Notification routing — critical → high-urgency channels (PagerDuty); warning → low-urgency (Slack).
  • Missing runbook / description — if annotations are empty or default, suggest adding context.
  • Absent-data monitoring — for critical signals, recommend alertOnAbsent: true if it isn't set.
  • GroupBy cardinality — high-cardinality groupBy fields can produce many independent alert series; flag potential notification storms.
  • Filter completeness — for IN / NOT IN filters with explicit value lists, flag values that look out of place or missing values that seem expected.
  • Fire frequency vs threshold — if Step 3 shows the alert fires many times a day (>10/day in the 7d window), the threshold is likely too tight; if it never fires and the user is asking because they expected it to, the threshold may be too loose or the query may be wrong. A sketch of this heuristic follows.
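
A sketch of that last heuristic, encoding the >10/day cut-off from above; the function is illustrative:

```python
# Illustrative: encodes the fire-frequency heuristic above (>10/day over 7d).
def threshold_hint(fires_in_7d: int, user_expected_fires: bool) -> str | None:
    if fires_in_7d > 70:  # more than 10 fires/day on average
        return "Threshold is likely too tight for this signal."
    if fires_in_7d == 0 and user_expected_fires:
        return "Threshold may be too loose, or the query may be wrong."
    return None
```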

Step 6: Offer next steps

End with two or three actionable follow-ups, picked from:
  • "Want me to investigate the most recent fire?" (→ signoz-investigating-alerts)
  • "Want me to run the underlying query to see current values?" (→ signoz-generating-queries)
  • "Want me to adjust the threshold or add a severity level?" (→ signoz:signoz_update_alert directly — out of scope here)
  • "Want me to create a related alert for [gap noticed]?" (→ signoz-creating-alerts)

Guardrails

  • Fetch before explaining. Always call signoz:signoz_get_alert. Do not base explanations on the rule name or list response alone.
  • Always pull fire history. The one-line frequency summary is cheap (one MCP call) and grounds the explanation. Skip it only if the alert is disabled.
  • Decode, don't dump. Translate op, matchType, filter expressions, and query JSON into operational language. Show raw JSON only if the user asks.
  • Decode every sub-query. For formula alerts, explain each builder query and the formula. Don't gloss over a formula as "error rate" — show the user what A, B, and F1 each compute.
  • Anchor to the actual config. Don't speculate based on the alert name. If the name says "checkout" but the filter targets payments, call that out.
  • No live queries by default. Do not run the alert's query unless the user asks. Explain is about understanding the rule, not inspecting current data.
  • Paginate alert listing. Before concluding an alert does not exist, walk every page of signoz:signoz_list_alert_rules.
  • SigNoz evaluation model. SigNoz uses evalWindow + frequency + matchType, not Prometheus-style for durations. Don't describe the evaluation model in Prometheus terms (see the sketch below).
  • Stay in scope. Investigation of a specific fire window belongs to signoz-investigating-alerts. Modifications belong to signoz:signoz_update_alert.
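
To make the evaluation-model guardrail concrete, a sketch of the matchType semantics; this is illustrative, not SigNoz source:

```python
# Illustrative semantics only. Every `frequency`, SigNoz evaluates the query
# over the trailing `evalWindow` and applies matchType; there is no
# Prometheus-style `for` duration.
def window_fires(points: list[float], target: float, match_type: int) -> bool:
    if not points:
        return False  # absent data is handled by alertOnAbsent, not here
    breaches = [p > target for p in points]   # assuming op=1 (above)
    if match_type == 1:                       # at_least_once
        return any(breaches)
    if match_type == 2:                       # all_the_times
        return all(breaches)
    if match_type == 3:                       # on_average
        return sum(points) / len(points) > target
    if match_type == 4:                       # in_total
        return sum(points) > target
    return points[-1] > target                # 5: last
```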

Examples

User: "Explain my checkout error rate alert"
Agent:
  1. signoz:signoz_list_alert_rules
    (paginated) → finds "High Error Rate — Checkout" id 42.
  2. signoz:signoz_get_alert id=42
    → traces formula (A errored spans / B total spans × 100), single critical threshold at 5%,
    op=1
    ,
    matchType=1
    , channel
    pagerduty-oncall
    .
  3. signoz:signoz_get_alert_history id=42 lookback=7d
    → fired 3 times, last fire 2h ago.
  4. Replies:
    TL;DR: Fires when checkout error rate (errored spans / total spans) exceeds 5% at any point in the last 5 min, pages
    pagerduty-oncall
    . Fired 3 times in the last 7d (last 2h ago).
    What it watches — traces from
    service.name = 'checkout'
    . Query A counts spans with
    hasError = true
    , query B counts all spans, F1 = A × 100 / B is the error percentage; the alert triggers on F1.
    When it fires
    op=above
    ,
    matchType=at_least_once
    , target 5%. Checks every 1 min over the last 5 min — a single elevated minute fires it.
    Notable — no
    recoveryTarget
    (could flap near 5%); no
    alertOnAbsent
    (a checkout outage that drops traffic to zero won't trigger this rule).
  5. Offers: "Want me to investigate the fire from 2h ago?"

User: "Explain alert rule 17 — is it well configured?"
Agent:
  1. signoz:signoz_get_alert id=17
    → metrics,
    system.cpu.utilization
    , two thresholds (warning 80%, critical 90%),
    matchType=3
    on_average, filter
    deployment.environment.name = 'production'
    , groupBy
    host.name
    .
  2. signoz:signoz_get_alert_history id=17
    → has not fired in last 7d.
  3. The user asked "is it well configured?" — the TL;DR leads with the verdict, not a generic description. Then explanation + full audit:
    TL;DR: Mostly well-configured (dual-threshold +
    on_average
    is the right call for CPU), but two gaps to fix: (1) no
    alertOnAbsent
    — a crashed agent silently stops reporting and this rule won't catch it; (2) annotations are default — a runbook URL would help on-call. Has not fired in the last 7d (could mean production is healthy or thresholds too loose).
    What it watches
    system.cpu.utilization
    on
    deployment.environment.name = 'production'
    , grouped by
    host.name
    (fires per host).
    When it fires
    op=above
    ,
    matchType=on_average
    , two tiers (warning 80%, critical 90%) over a 5-min window. Short spikes don't fire — smoothing is correct for CPU.
    Notable concerns — same as TL;DR fixes above; nothing else non-default.
  4. Offers next steps.

User: "Help me understand this anomaly detection alert" (alert id 88)
Agent:
  1. signoz:signoz_get_alert id=88
    ruleType: anomaly_rule
    ,
    algorithm=zscore
    ,
    seasonality=daily
    , target 3, metric
    http.server.request.duration
    , scope
    service.name = 'api-gateway'
    .
  2. History: fired 1 time in last 7d.
  3. Replies:
    TL;DR: Fires when api-gateway request latency deviates by more than 3 standard deviations (not raw latency, not a fixed value) from its learned daily pattern. Fired once in the last 7d.
    What it watches
    http.server.request.duration
    for
    service.name = 'api-gateway'
    , evaluated as a Z-score anomaly with daily seasonality — the model learns the typical pattern for each hour of day, so peak-hour latency won't false-trigger if it matches the historical norm for that hour.
    When it fires — when |Z-score| > 3, i.e. the value is more than 3 standard deviations away from the expected pattern. Lower target → more sensitive (more noise); higher → only extreme deviations. The threshold is not in seconds or milliseconds.
  4. Offers to investigate the recent fire.