monitoring-authoring

Monitoring Resource Authoring

This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.
Resource Types Overview

| Resource | API Group | Purpose | CRD Provider |
|---|---|---|---|
| PrometheusRule | `monitoring.coreos.com/v1` | Alert rules and recording rules | kube-prometheus-stack |
| ServiceMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Services | kube-prometheus-stack |
| PodMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Pods directly | kube-prometheus-stack |
| ScrapeConfig | `monitoring.coreos.com/v1alpha1` | Advanced scrape configuration (relabeling, multi-target) | kube-prometheus-stack |
| AlertmanagerConfig | `monitoring.coreos.com/v1alpha1` | Routing, receivers, silencing | kube-prometheus-stack |
| Silence | `observability.giantswarm.io/v1alpha2` | Declarative Alertmanager silences | silence-operator |
| Canary | `canaries.flanksource.com/v1` | Synthetic health checks (HTTP, TCP, K8s) | canary-checker |

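Of the resources above, ScrapeConfig is the only one not illustrated later in this skill. A minimal static-target sketch might look roughly like this — the resource name, target addresses, and interval are invented placeholders, not taken from this repo:

```yaml
# Hypothetical ScrapeConfig — targets and name are placeholders; adapt them
# to the actual exporter being scraped.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: example-node-exporters
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # same discovery label as the other resources
spec:
  staticConfigs:
    - targets:
        - 192.168.1.10:9100
        - 192.168.1.11:9100
  metricsPath: /metrics
  scrapeInterval: 30s
```

See `hardware-monitoring-scrape.yaml` in `config/monitoring/` for a real in-repo example.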
File Placement

Monitoring resources go in different locations depending on scope:
| Scope | Path | When to Use |
|---|---|---|
| Platform-wide alerts/monitors | `kubernetes/platform/config/monitoring/` | Alerts for platform components (Cilium, Istio, cert-manager, etc.) |
| Subsystem-specific alerts | `kubernetes/platform/config/<subsystem>/` | Alerts bundled with the subsystem they monitor (e.g., `dragonfly/prometheus-rules.yaml`) |
| Cluster-specific silences | `kubernetes/clusters/<cluster>/config/silences/` | Silences for known issues on specific clusters |
| Cluster-specific alerts | `kubernetes/clusters/<cluster>/config/` | Alerts that only apply to a specific cluster |
| Canary health checks | `kubernetes/platform/config/canary-checker/` | Platform-wide synthetic checks |

File Naming Conventions

Observed patterns in the `config/monitoring/` directory:

| Pattern | Example | When |
|---|---|---|
| `<component>-alerts.yaml` | `cilium-alerts.yaml`, `grafana-alerts.yaml` | PrometheusRule files |
| `<component>-recording-rules.yaml` | `loki-mixin-recording-rules.yaml` | Recording rules |
| `<component>-servicemonitors.yaml` | `istio-servicemonitors.yaml` | ServiceMonitor/PodMonitor files |
| `<component>-canary.yaml` | `alertmanager-canary.yaml` | Canary health checks |
| `<component>-route.yaml` | `grafana-route.yaml` | HTTPRoute for gateway access |
| `<component>-secret.yaml` | `discord-secret.yaml` | ExternalSecrets for monitoring |
| `<component>-scrape.yaml` | `hardware-monitoring-scrape.yaml` | ScrapeConfig resources |

Registration

After creating a file in `config/monitoring/`, add it to the kustomization:

```yaml
# kubernetes/platform/config/monitoring/kustomization.yaml
resources:
  # ...existing resources...
  - my-new-alerts.yaml # Add alphabetically by component
```

For subsystem-specific alerts (e.g., `config/dragonfly/prometheus-rules.yaml`), add to that subsystem's `kustomization.yaml` instead.

PrometheusRule Authoring

Required Structure

Every PrometheusRule must include the `release: kube-prometheus-stack` label for Prometheus to discover it. A YAML schema comment at the top of the file enables editor validation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means, and
              what to investigate. Use template variables for context.
```

Label Requirements

| Label | Required | Purpose |
|---|---|---|
| `release: kube-prometheus-stack` | Yes | Prometheus discovery selector |
| `app.kubernetes.io/name: <component>` | Recommended | Organizational grouping |

Some files use additional labels like `prometheus: kube-prometheus-stack` (e.g., dragonfly), but `release: kube-prometheus-stack` is the critical one for discovery.
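As background, the `release` label matters because the kube-prometheus-stack chart configures the Prometheus custom resource with label selectors for discovery. The sketch below shows the general shape of that selector; it is assumed from chart defaults, not copied from this repo's values, so verify against the deployed Prometheus CR before relying on it:

```yaml
# Assumed default shape of the kube-prometheus-stack Prometheus spec.
spec:
  ruleSelector:
    matchLabels:
      release: kube-prometheus-stack
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
```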

Severity Conventions

| Severity | `for` Duration | Use Case | Alertmanager Routing |
|---|---|---|---|
| critical | 2m-5m | Service down, data loss risk, immediate action needed | Routed to Discord |
| warning | 5m-15m | Degraded performance, approaching limits, needs attention | Default receiver (Discord) |
| info | 10m-30m | Informational, capacity planning, non-urgent | Silenced by InfoInhibitor |

Guidelines for `for` duration:

- Shorter `for` = faster alert, more noise. Longer = quieter, slower response.
- `for: 0m` (immediate) only for truly instant failures (e.g., SMART health check fail).
- Most alerts: 5m is a good default.
- Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.
- Absence detection: 5m (the metric may genuinely disappear briefly during restarts).

Annotation Templates

Standard annotations used across this repository:

```yaml
annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact,
    and what to investigate. Reference threshold values and current
    values using template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"
```

The `runbook_url` annotation is optional but recommended for critical alerts that have established recovery procedures.

PromQL Template Functions

Functions available in `summary` and `description` annotations:

| Function | Input | Output | Example |
|---|---|---|---|
| `humanize` | Number | Human-readable number | `{{ $value \| humanize }}` -> "1.234k" |
| `humanizePercentage` | Float (0-1) | Percentage string | `{{ $value \| humanizePercentage }}` -> "45.6%" |
| `humanizeDuration` | Seconds | Duration string | `{{ $value \| humanizeDuration }}` -> "2h 30m" |
| `printf` | Format string | Formatted value | `{{ printf "%.2f" $value }}` -> "1.23" |

Label Variables in Annotations

Access alert labels via `{{ $labels.<label_name> }}` and the expression value via `{{ $value }}`:

```yaml
summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
```

Common Alert Patterns

Target down (availability):

```yaml
- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"
```

Absence detection (component disappeared entirely):

```yaml
- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"
```

Error rate (ratio):

```yaml
- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

Latency (histogram quantile):

```yaml
- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"
```

Resource pressure (capacity):

```yaml
- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"
```

PVC space low:

```yaml
- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"
```

Alert Grouping

Group related alerts in named rule groups. The `name` field groups alerts in the Prometheus UI and affects evaluation order:

```yaml
spec:
  groups:
    - name: cilium-agent       # Agent availability and health
      rules: [...]
    - name: cilium-bpf         # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy      # Network policy alerts
      rules: [...]
    - name: cilium-network     # General networking alerts
      rules: [...]
```


Recording Rules

Recording rules pre-compute expensive queries for dashboard performance. Place them alongside alerts in the same PrometheusRule file or in a dedicated `*-recording-rules.yaml` file.

```yaml
spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>
```

Naming Convention

Recording rule names follow the pattern `level:metric:operations`:

- `loki:request_duration_seconds:p99`
- `loki:requests_total:rate5m`
- `loki:requests_error_rate:ratio5m`

When to Create Recording Rules

  • Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)
  • Queries used by multiple alerts (avoids redundant computation)
  • Complex multi-step computations that are hard to read inline

Example: Loki Recording Rules

```yaml
- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )

- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
```


ServiceMonitor and PodMonitor

Via Helm Values (Preferred)

Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:

```yaml
# kubernetes/platform/charts/<app-name>.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```

Manual ServiceMonitor

When a chart does not support ServiceMonitor creation, create one manually. The resource lives in the `monitoring` namespace and uses `namespaceSelector` to reach across namespaces.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace> # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component> # Must match service labels
  endpoints:
    - port: http-monitoring # Must match service port name
      path: /metrics
      interval: 30s
```

Manual PodMonitor

Use PodMonitor when pods expose metrics but don't have a Service (e.g., DaemonSets, sidecars):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020" # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s
```

Cross-Namespace Pattern

All ServiceMonitors and PodMonitors in this repo live in the `monitoring` namespace and use `namespaceSelector` to reach pods in other namespaces. This centralizes monitoring configuration and avoids needing `release: kube-prometheus-stack` labels on resources in app namespaces.

Advanced: matchExpressions

For selecting multiple pod labels (e.g., all Flux controllers):
```yaml
selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
```


AlertmanagerConfig

The platform Alertmanager configuration lives in `config/monitoring/alertmanager-config.yaml`. It defines routing and receivers for the entire platform.

Current Routing Architecture

All alerts
  ├── InfoInhibitor → null receiver (silenced)
  ├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
  ├── severity=critical → discord receiver
  └── (default) → discord receiver
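Expressed as AlertmanagerConfig YAML, that routing tree has roughly the shape below. This is a simplified, hand-written sketch — receiver names, intervals, and field values are assumptions; the authoritative version is `config/monitoring/alertmanager-config.yaml`:

```yaml
# Illustrative sketch only — consult alertmanager-config.yaml for the
# real routing configuration.
route:
  receiver: discord # default receiver
  routes:
    - receiver: "null" # drop InfoInhibitor entirely
      matchers:
        - name: alertname
          value: InfoInhibitor
          matchType: "="
    - receiver: heartbeat # Watchdog heartbeat to healthchecks.io
      matchers:
        - name: alertname
          value: Watchdog
          matchType: "="
      repeatInterval: 2m
    - receiver: discord
      matchers:
        - name: severity
          value: critical
          matchType: "="
```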

Receivers

| Receiver | Type | Purpose |
|---|---|---|
| `"null"` | None | Silences matched alerts (e.g., InfoInhibitor) |
| `heartbeat` | Webhook | Sends Watchdog heartbeat to healthchecks.io |
| `discord` | Discord webhook | Sends alerts to Discord channel |

Adding a New Route

To route specific alerts differently (e.g., to a different channel or receiver), add a route entry in `alertmanager-config.yaml`:

```yaml
routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: =
```

Secrets for Alertmanager

| Secret | Source | File |
|---|---|---|
| `alertmanager-discord-webhook` | ExternalSecret (AWS SSM) | `discord-secret.yaml` |
| `alertmanager-heartbeat-ping-url` | Replicated from `kube-system` | `heartbeat-secret.yaml` |


Silence CRs (silence-operator)

Silences suppress known alerts declaratively. They are per-cluster resources because different clusters have different expected alert profiles.

Placement

kubernetes/clusters/<cluster>/config/silences/
  ├── kustomization.yaml
  └── <descriptive-name>.yaml

Template

```yaml
---
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~" # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>
```

Matcher Reference

| matchType | Meaning | Example |
|---|---|---|
| `=` | Exact match | `value: "KubePodCrashLooping"` |
| `!=` | Not equal | `value: "Watchdog"` |
| `=~` | Regex match | `value: "KubePod.*\|TargetDown"` |
| `!~` | Regex negation | `value: "Info.*"` |

Requirements

  • Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)
  • Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)
  • Silences are a LAST RESORT — every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs
  • Never leave alerts firing without action — either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed

Adding a Silence to a Cluster

1. Create the `config/silences/` directory if it does not exist
2. Add the Silence YAML file
3. Create or update `config/silences/kustomization.yaml`:

   ```yaml
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
     - <silence-name>.yaml
   ```

4. Reference `silences` in `config/kustomization.yaml`


Canary Health Checks

Canary resources provide synthetic monitoring using Flanksource canary-checker. They live in `config/canary-checker/` for platform checks or alongside app config for app-specific checks.

HTTP Health Check

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7 # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000 # Fail if response takes >5s
```

TCP Port Check

```yaml
spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000
```

Kubernetes Resource Check with CEL

Test that pods are actually healthy using CEL expressions (preferred over `ready: true` because the built-in flag penalizes pods with restart history):

```yaml
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )
```

Canary Metrics and Alerting

canary-checker exposes metrics that are already monitored by the platform:

- `canary_check == 1` triggers `CanaryCheckFailure` (critical, 2m)
- High failure rates trigger `CanaryCheckHighFailureRate` (warning, 5m)

These alerts are defined in `config/canary-checker/prometheus-rules.yaml` -- you do not need to create separate alerts for each canary.

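For orientation, the platform rule behind `CanaryCheckFailure` presumably has roughly this shape. This is a hand-written sketch — the label name in the summary and the exact expression are assumptions; the real rule lives in `config/canary-checker/prometheus-rules.yaml` and should not be duplicated:

```yaml
# Sketch only — do not copy into the repo; the authoritative rule already
# exists in config/canary-checker/prometheus-rules.yaml.
- alert: CanaryCheckFailure
  expr: canary_check == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Canary check {{ $labels.name }} is failing" # label name assumed
```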

Workflow: Adding Monitoring for a New Component

Step 1: Determine What Exists

Check if the Helm chart already provides monitoring:

```bash
# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule
```

Enable via Helm values if available (see [deploy-app skill](../deploy-app/SKILL.md)).

Step 2: Create Missing Resources

If the chart does not provide monitoring, create resources manually:
  1. ServiceMonitor or PodMonitor for metrics scraping
  2. PrometheusRule for alert rules
  3. Canary for synthetic health checks (HTTP/TCP)

Step 3: Place Files Correctly

- If the component has its own config subsystem (`config/<component>/`), add monitoring resources there alongside other config
- If it is a standalone monitoring addition, add to `config/monitoring/`

Step 4: Register in Kustomization

Add new files to the appropriate `kustomization.yaml`.

Step 5: Validate

```bash
task k8s:validate
```

Step 6: Verify After Deployment

Prometheus is behind OAuth2 Proxy -- use `kubectl exec` or port-forward for API queries:

```bash
# Check ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' |
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'
```

```bash
# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' |
  jq '.data.groups[] | select(.name | contains("<component>"))'
```

```bash
# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>
```

---

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Missing `release: kube-prometheus-stack` label | Prometheus ignores the resource | Add the label to `metadata.labels` |
| PrometheusRule in wrong namespace without namespaceSelector | Prometheus does not discover it | Place in the `monitoring` namespace or ensure Prometheus watches the target namespace |
| ServiceMonitor selector does not match any Service | No metrics scraped, no error raised | Verify labels match with `kubectl get svc -n <ns> --show-labels` |
| Using `ready: true` in canary-checker Kubernetes checks | False negatives after pod restarts | Use a CEL `test.expr` instead |
| Hardcoding domains in canary URLs | Breaks across clusters | Use the `${internal_domain}` substitution variable |
| Very short `for` duration on flappy metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for metrics that do not exist yet | Alert never fires, giving false confidence | Verify the metrics exist in Prometheus before writing rules |


Reference: Existing Alert Files

| File | Component | Alert Count | Subsystem |
|---|---|---|---|
| `monitoring/cilium-alerts.yaml` | Cilium | 14 | Agent, BPF, Policy, Network |
| `monitoring/istio-alerts.yaml` | Istio | ~10 | Control plane, mTLS, Gateway |
| `monitoring/cert-manager-alerts.yaml` | cert-manager | 5 | Expiry, Renewal, Issuance |
| `monitoring/network-policy-alerts.yaml` | Network Policy | 2 | Enforcement escape hatch |
| `monitoring/external-secrets-alerts.yaml` | External Secrets | 3 | Sync, Ready, Store health |
| `monitoring/grafana-alerts.yaml` | Grafana | 4 | Datasource, Errors, Plugins, Down |
| `monitoring/loki-mixin-alerts.yaml` | Loki | ~5 | Requests, Latency, Ingester |
| `monitoring/alloy-alerts.yaml` | Alloy | 3 | Dropped entries, Errors, Lag |
| `monitoring/hardware-monitoring-alerts.yaml` | Hardware | 7 | Temperature, Fans, Disks, Power |
| `dragonfly/prometheus-rules.yaml` | Dragonfly | 2+ | Down, Memory |
| `canary-checker/prometheus-rules.yaml` | canary-checker | 2 | Check failure, High failure rate |


Keywords

PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat