monitoring-authoring

Monitoring Resource Authoring

This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.
Resource Types Overview

| Resource | API Group | Purpose | CRD Provider |
|---|---|---|---|
| PrometheusRule | `monitoring.coreos.com/v1` | Alert rules and recording rules | kube-prometheus-stack |
| ServiceMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Services | kube-prometheus-stack |
| PodMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Pods directly | kube-prometheus-stack |
| ScrapeConfig | `monitoring.coreos.com/v1alpha1` | Advanced scrape configuration (relabeling, multi-target) | kube-prometheus-stack |
| AlertmanagerConfig | `monitoring.coreos.com/v1alpha1` | Routing, receivers, silencing | kube-prometheus-stack |
| Silence | `observability.giantswarm.io/v1alpha2` | Declarative Alertmanager silences | silence-operator |
| Canary | `canaries.flanksource.com/v1` | Synthetic health checks (HTTP, TCP, K8s) | canary-checker |

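Of the resources above, ScrapeConfig is the only one not illustrated later in this skill. A minimal static-target sketch might look roughly like this — the resource name, target addresses, and interval are invented placeholders, not taken from this repo:

```yaml
# Hypothetical ScrapeConfig — targets and name are placeholders; adapt them
# to the actual exporter being scraped.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: example-node-exporters
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # same discovery label as the other resources
spec:
  staticConfigs:
    - targets:
        - 192.168.1.10:9100
        - 192.168.1.11:9100
  metricsPath: /metrics
  scrapeInterval: 30s
```

See `hardware-monitoring-scrape.yaml` in `config/monitoring/` for a real in-repo example.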
File Placement

Monitoring resources go in different locations depending on scope:
| Scope | Path | When to Use |
|---|---|---|
| Platform-wide alerts/monitors | `kubernetes/platform/config/monitoring/` | Alerts for platform components (Cilium, Istio, cert-manager, etc.) |
| Subsystem-specific alerts | `kubernetes/platform/config/<subsystem>/` | Alerts bundled with the subsystem they monitor (e.g., `dragonfly/prometheus-rules.yaml`) |
| Cluster-specific silences | `kubernetes/clusters/<cluster>/config/silences/` | Silences for known issues on specific clusters |
| Cluster-specific alerts | `kubernetes/clusters/<cluster>/config/` | Alerts that only apply to a specific cluster |
| Canary health checks | `kubernetes/platform/config/canary-checker/` | Platform-wide synthetic checks |

File Naming Conventions

Observed patterns in the `config/monitoring/` directory:

| Pattern | Example | When |
|---|---|---|
| `<component>-alerts.yaml` | `cilium-alerts.yaml`, `grafana-alerts.yaml` | PrometheusRule files |
| `<component>-recording-rules.yaml` | `loki-mixin-recording-rules.yaml` | Recording rules |
| `<component>-servicemonitors.yaml` | `istio-servicemonitors.yaml` | ServiceMonitor/PodMonitor files |
| `<component>-canary.yaml` | `alertmanager-canary.yaml` | Canary health checks |
| `<component>-route.yaml` | `grafana-route.yaml` | HTTPRoute for gateway access |
| `<component>-secret.yaml` | `discord-secret.yaml` | ExternalSecrets for monitoring |
| `<component>-scrape.yaml` | `hardware-monitoring-scrape.yaml` | ScrapeConfig resources |

Registration

After creating a file in `config/monitoring/`, add it to the kustomization:

```yaml
# kubernetes/platform/config/monitoring/kustomization.yaml
resources:
  # ...existing resources...
  - my-new-alerts.yaml # Add alphabetically by component
```

For subsystem-specific alerts (e.g., `config/dragonfly/prometheus-rules.yaml`), add to that subsystem's `kustomization.yaml` instead.

PrometheusRule Authoring

Required Structure

Every PrometheusRule must include the `release: kube-prometheus-stack` label for Prometheus to discover it. A YAML schema comment at the top of the file enables editor validation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means, and
              what to investigate. Use template variables for context.
```

Label Requirements

| Label | Required | Purpose |
|---|---|---|
| `release: kube-prometheus-stack` | Yes | Prometheus discovery selector |
| `app.kubernetes.io/name: <component>` | Recommended | Organizational grouping |

Some files use additional labels like `prometheus: kube-prometheus-stack` (e.g., dragonfly), but `release: kube-prometheus-stack` is the critical one for discovery.
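As background, the `release` label matters because the kube-prometheus-stack chart configures the Prometheus custom resource with label selectors for discovery. The sketch below shows the general shape of that selector; it is assumed from chart defaults, not copied from this repo's values, so verify against the deployed Prometheus CR before relying on it:

```yaml
# Assumed default shape of the kube-prometheus-stack Prometheus spec.
spec:
  ruleSelector:
    matchLabels:
      release: kube-prometheus-stack
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
```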

Severity Conventions

| Severity | `for` Duration | Use Case | Alertmanager Routing |
|---|---|---|---|
| critical | 2m-5m | Service down, data loss risk, immediate action needed | Routed to Discord |
| warning | 5m-15m | Degraded performance, approaching limits, needs attention | Default receiver (Discord) |
| info | 10m-30m | Informational, capacity planning, non-urgent | Silenced by InfoInhibitor |

Guidelines for `for` duration:

- Shorter `for` = faster alert, more noise. Longer = quieter, slower response.
- `for: 0m` (immediate) only for truly instant failures (e.g., SMART health check fail).
- Most alerts: 5m is a good default.
- Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.
- Absence detection: 5m (the metric may genuinely disappear briefly during restarts).

Annotation Templates

Standard annotations used across this repository:

```yaml
annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact,
    and what to investigate. Reference threshold values and current
    values using template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"
```

The `runbook_url` annotation is optional but recommended for critical alerts that have established recovery procedures.

PromQL Template Functions

Functions available in `summary` and `description` annotations:

| Function | Input | Output | Example |
|---|---|---|---|
| `humanize` | Number | Human-readable number | `{{ $value \| humanize }}` -> "1.234k" |
| `humanizePercentage` | Float (0-1) | Percentage string | `{{ $value \| humanizePercentage }}` -> "45.6%" |
| `humanizeDuration` | Seconds | Duration string | `{{ $value \| humanizeDuration }}` -> "2h 30m" |
| `printf` | Format string | Formatted value | `{{ printf "%.2f" $value }}` -> "1.23" |

Label Variables in Annotations

Access alert labels via `{{ $labels.<label_name> }}` and the expression value via `{{ $value }}`:

```yaml
summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
```

Common Alert Patterns

Target down (availability):

```yaml
- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"
```

Absence detection (component disappeared entirely):

```yaml
- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"
```

Error rate (ratio):

```yaml
- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

Latency (histogram quantile):

```yaml
- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"
```

Resource pressure (capacity):

```yaml
- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"
```

PVC space low:

```yaml
- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"
```

Alert Grouping

Group related alerts in named rule groups. The `name` field groups alerts in the Prometheus UI and affects evaluation order:

```yaml
spec:
  groups:
    - name: cilium-agent       # Agent availability and health
      rules: [...]
    - name: cilium-bpf         # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy      # Network policy alerts
      rules: [...]
    - name: cilium-network     # General networking alerts
      rules: [...]
```


Recording Rules

Recording rules pre-compute expensive queries for dashboard performance. Place them alongside alerts in the same PrometheusRule file or in a dedicated `*-recording-rules.yaml` file.

```yaml
spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>
```

Naming Convention

Recording rule names follow the pattern `level:metric:operations`:

- `loki:request_duration_seconds:p99`
- `loki:requests_total:rate5m`
- `loki:requests_error_rate:ratio5m`

When to Create Recording Rules

  • Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)
  • Queries used by multiple alerts (avoids redundant computation)
  • Complex multi-step computations that are hard to read inline

Example: Loki Recording Rules

```yaml
- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )

- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
```


ServiceMonitor and PodMonitor

Via Helm Values (Preferred)

Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:

```yaml
# kubernetes/platform/charts/<app-name>.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```

Manual ServiceMonitor

When a chart does not support ServiceMonitor creation, create one manually. The resource lives in the `monitoring` namespace and uses `namespaceSelector` to reach across namespaces.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace> # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component> # Must match service labels
  endpoints:
    - port: http-monitoring # Must match service port name
      path: /metrics
      interval: 30s
```

Manual PodMonitor

Use PodMonitor when pods expose metrics but don't have a Service (e.g., DaemonSets, sidecars):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020" # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s
```

Cross-Namespace Pattern

All ServiceMonitors and PodMonitors in this repo live in the `monitoring` namespace and use `namespaceSelector` to reach pods in other namespaces. This centralizes monitoring configuration and avoids needing `release: kube-prometheus-stack` labels on resources in app namespaces.

Advanced: matchExpressions

For selecting multiple pod labels (e.g., all Flux controllers):
```yaml
selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
```


AlertmanagerConfig

The platform Alertmanager configuration lives in `config/monitoring/alertmanager-config.yaml`. It defines routing and receivers for the entire platform.

Current Routing Architecture

All alerts
  ├── InfoInhibitor → null receiver (silenced)
  ├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
  ├── severity=critical → discord receiver
  └── (default) → discord receiver
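Expressed as AlertmanagerConfig YAML, that routing tree has roughly the shape below. This is a simplified, hand-written sketch — receiver names, intervals, and field values are assumptions; the authoritative version is `config/monitoring/alertmanager-config.yaml`:

```yaml
# Illustrative sketch only — consult alertmanager-config.yaml for the
# real routing configuration.
route:
  receiver: discord # default receiver
  routes:
    - receiver: "null" # drop InfoInhibitor entirely
      matchers:
        - name: alertname
          value: InfoInhibitor
          matchType: "="
    - receiver: heartbeat # Watchdog heartbeat to healthchecks.io
      matchers:
        - name: alertname
          value: Watchdog
          matchType: "="
      repeatInterval: 2m
    - receiver: discord
      matchers:
        - name: severity
          value: critical
          matchType: "="
```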

Receivers

| Receiver | Type | Purpose |
|---|---|---|
| `"null"` | None | Silences matched alerts (e.g., InfoInhibitor) |
| `heartbeat` | Webhook | Sends Watchdog heartbeat to healthchecks.io |
| `discord` | Discord webhook | Sends alerts to Discord channel |

Adding a New Route

To route specific alerts differently (e.g., to a different channel or receiver), add a route entry in `alertmanager-config.yaml`:

```yaml
routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: =
```

Secrets for Alertmanager

| Secret | Source | File |
|---|---|---|
| `alertmanager-discord-webhook` | ExternalSecret (AWS SSM) | `discord-secret.yaml` |
| `alertmanager-heartbeat-ping-url` | Replicated from `kube-system` | `heartbeat-secret.yaml` |


Silence CRs (silence-operator)

Silences suppress known alerts declaratively. They are per-cluster resources because different clusters have different expected alert profiles.

Placement

kubernetes/clusters/<cluster>/config/silences/
  ├── kustomization.yaml
  └── <descriptive-name>.yaml

Template

```yaml
---
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~" # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>
```

Matcher Reference

| matchType | Meaning | Example |
|---|---|---|
| `=` | Exact match | `value: "KubePodCrashLooping"` |
| `!=` | Not equal | `value: "Watchdog"` |
| `=~` | Regex match | `value: "KubePod.*\|TargetDown"` |
| `!~` | Regex negation | `value: "Info.*"` |

Requirements

  • Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)
  • Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)
  • Silences are a LAST RESORT — every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs
  • Never leave alerts firing without action — either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed

Adding a Silence to a Cluster

1. Create the `config/silences/` directory if it does not exist
2. Add the Silence YAML file
3. Create or update `config/silences/kustomization.yaml`:

   ```yaml
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
     - <silence-name>.yaml
   ```

4. Reference `silences` in `config/kustomization.yaml`


Canary Health Checks

Canary resources provide synthetic monitoring using Flanksource canary-checker. They live in `config/canary-checker/` for platform checks or alongside app config for app-specific checks.

HTTP Health Check

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7 # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000 # Fail if response takes >5s
```

TCP Port Check

```yaml
spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000
```

Kubernetes Resource Check with CEL

Test that pods are actually healthy using CEL expressions (preferred over `ready: true` because the built-in flag penalizes pods with restart history):

```yaml
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )
```

Canary Metrics and Alerting

canary-checker exposes metrics that are already monitored by the platform:

- `canary_check == 1` triggers `CanaryCheckFailure` (critical, 2m)
- High failure rates trigger `CanaryCheckHighFailureRate` (warning, 5m)

These alerts are defined in `config/canary-checker/prometheus-rules.yaml` -- you do not need to create separate alerts for each canary.

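For orientation, the platform rule behind `CanaryCheckFailure` presumably has roughly this shape. This is a hand-written sketch — the label name in the summary and the exact expression are assumptions; the real rule lives in `config/canary-checker/prometheus-rules.yaml` and should not be duplicated:

```yaml
# Sketch only — do not copy into the repo; the authoritative rule already
# exists in config/canary-checker/prometheus-rules.yaml.
- alert: CanaryCheckFailure
  expr: canary_check == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Canary check {{ $labels.name }} is failing" # label name assumed
```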

Workflow: Adding Monitoring for a New Component

Step 1: Determine What Exists

Check if the Helm chart already provides monitoring:

```bash
# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule
```

Enable via Helm values if available (see [deploy-app skill](../deploy-app/SKILL.md)).

Step 2: Create Missing Resources

If the chart does not provide monitoring, create resources manually:
  1. ServiceMonitor or PodMonitor for metrics scraping
  2. PrometheusRule for alert rules
  3. Canary for synthetic health checks (HTTP/TCP)

Step 3: Place Files Correctly

- If the component has its own config subsystem (`config/<component>/`), add monitoring resources there alongside other config
- If it is a standalone monitoring addition, add to `config/monitoring/`

Step 4: Register in Kustomization

Add new files to the appropriate `kustomization.yaml`.

Step 5: Validate

```bash
task k8s:validate
```

Step 6: Verify After Deployment

Prometheus is behind OAuth2 Proxy -- use `kubectl exec` or port-forward for API queries:

```bash
# Check ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' |
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'
```

```bash
# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' |
  jq '.data.groups[] | select(.name | contains("<component>"))'
```

```bash
# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>
```

---

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Missing `release: kube-prometheus-stack` label | Prometheus ignores the resource | Add the label to `metadata.labels` |
| PrometheusRule in wrong namespace without namespaceSelector | Prometheus does not discover it | Place in the `monitoring` namespace or ensure Prometheus watches the target namespace |
| ServiceMonitor selector does not match any Service | No metrics scraped, no error raised | Verify labels match with `kubectl get svc -n <ns> --show-labels` |
| Using `ready: true` in canary-checker Kubernetes checks | False negatives after pod restarts | Use a CEL `test.expr` instead |
| Hardcoding domains in canary URLs | Breaks across clusters | Use the `${internal_domain}` substitution variable |
| Very short `for` duration on flappy metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for metrics that do not exist yet | Alert never fires, giving false confidence | Verify the metrics exist in Prometheus before writing rules |


Reference: Existing Alert Files

| File | Component | Alert Count | Subsystem |
|---|---|---|---|
| `monitoring/cilium-alerts.yaml` | Cilium | 14 | Agent, BPF, Policy, Network |
| `monitoring/istio-alerts.yaml` | Istio | ~10 | Control plane, mTLS, Gateway |
| `monitoring/cert-manager-alerts.yaml` | cert-manager | 5 | Expiry, Renewal, Issuance |
| `monitoring/network-policy-alerts.yaml` | Network Policy | 2 | Enforcement escape hatch |
| `monitoring/external-secrets-alerts.yaml` | External Secrets | 3 | Sync, Ready, Store health |
| `monitoring/grafana-alerts.yaml` | Grafana | 4 | Datasource, Errors, Plugins, Down |
| `monitoring/loki-mixin-alerts.yaml` | Loki | ~5 | Requests, Latency, Ingester |
| `monitoring/alloy-alerts.yaml` | Alloy | 3 | Dropped entries, Errors, Lag |
| `monitoring/hardware-monitoring-alerts.yaml` | Hardware | 7 | Temperature, Fans, Disks, Power |
| `dragonfly/prometheus-rules.yaml` | Dragonfly | 2+ | Down, Memory |
| `canary-checker/prometheus-rules.yaml` | canary-checker | 2 | Check failure, High failure rate |


Keywords

PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat