# Monitoring Resource Authoring
This skill covers creating and modifying monitoring resources. For querying Prometheus
or investigating alerts, see the prometheus skill and
sre skill.
## Resource Types Overview
| Resource | API Group | Purpose | CRD Provider |
|---|---|---|---|
| `PrometheusRule` | `monitoring.coreos.com` | Alert rules and recording rules | kube-prometheus-stack |
| `ServiceMonitor` | `monitoring.coreos.com` | Scrape metrics from Services | kube-prometheus-stack |
| `PodMonitor` | `monitoring.coreos.com` | Scrape metrics from Pods directly | kube-prometheus-stack |
| `ScrapeConfig` | `monitoring.coreos.com` | Advanced scrape configuration (relabeling, multi-target) | kube-prometheus-stack |
| `AlertmanagerConfig` | `monitoring.coreos.com` | Routing, receivers, silencing | kube-prometheus-stack |
| `Silence` | `observability.giantswarm.io` | Declarative Alertmanager silences | silence-operator |
| `Canary` | `canaries.flanksource.com` | Synthetic health checks (HTTP, TCP, K8s) | canary-checker |
## File Placement
Monitoring resources go in different locations depending on scope:
| Scope | Path | When to Use |
|---|---|---|
| Platform-wide alerts/monitors | `kubernetes/platform/config/monitoring/` | Alerts for platform components (Cilium, Istio, cert-manager, etc.) |
| Subsystem-specific alerts | `kubernetes/platform/config/<subsystem>/` | Alerts bundled with the subsystem they monitor (e.g., `config/dragonfly/`) |
| Cluster-specific silences | `kubernetes/clusters/<cluster>/config/silences/` | Silences for known issues on specific clusters |
| Cluster-specific alerts | `kubernetes/clusters/<cluster>/config/` | Alerts that only apply to a specific cluster |
| Canary health checks | `kubernetes/platform/config/canary-checker/` | Platform-wide synthetic checks |
## File Naming Conventions
Observed patterns in the `config/monitoring/` directory:

| Pattern | Example | When |
|---|---|---|
| `<component>-alerts.yaml` | `my-new-alerts.yaml` | PrometheusRule files |
| `*-recording-rules.yaml` | | Recording rules |
| | | ServiceMonitor/PodMonitor files |
| | | Canary health checks |
| | | HTTPRoute for gateway access |
| | | ExternalSecrets for monitoring |
| | | ScrapeConfig resources |
## Registration
After creating a file in `config/monitoring/`, add it to the kustomization:

```yaml
# kubernetes/platform/config/monitoring/kustomization.yaml
resources:
  - ...existing resources...
  - my-new-alerts.yaml # Add alphabetically by component
```
For subsystem-specific alerts (e.g., `config/dragonfly/prometheus-rules.yaml`), add to that
subsystem's `kustomization.yaml` instead.
## PrometheusRule Authoring
### Required Structure
Every PrometheusRule must include the `release: kube-prometheus-stack` label for Prometheus to discover it. The YAML schema comment enables editor validation.

```yaml
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/prometheusrule_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means,
              and what to investigate. Use template variables for context.
```
### Label Requirements
| Label | Required | Purpose |
|---|---|---|
| `release: kube-prometheus-stack` | Yes | Prometheus discovery selector |
| `app.kubernetes.io/name` | Recommended | Organizational grouping |

Some files use additional labels like `prometheus: kube-prometheus-stack` (e.g., dragonfly), but `release: kube-prometheus-stack` is the critical one for discovery.
### Severity Conventions
| Severity | Typical `for` | Use Case | Alertmanager Routing |
|---|---|---|---|
| `critical` | 2m-5m | Service down, data loss risk, immediate action needed | Routed to Discord |
| `warning` | 5m-15m | Degraded performance, approaching limits, needs attention | Default receiver (Discord) |
| `info` | 10m-30m | Informational, capacity planning, non-urgent | Silenced by InfoInhibitor |
Guidelines for `for` duration:

- Shorter = faster alert, more noise. Longer = quieter, slower response.
- `for: 0m` (immediate) only for truly instant failures (e.g., SMART health check fail).
- Most alerts: 5m is a good default.
- Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.
- Absence detection: 5m (metric may genuinely disappear briefly during restarts).
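A common layout pairs a warning and a critical rule on the same expression, with thresholds and `for` windows tiered. A minimal sketch (the alert names, metric placeholders, and thresholds here are illustrative):

```yaml
# Warning fires early with a long window; critical fires only when urgent.
- alert: <Component>DiskFillingUp
  expr: <used_bytes> / <total_bytes> > 0.8
  for: 15m
  labels:
    severity: warning
- alert: <Component>DiskFull
  expr: <used_bytes> / <total_bytes> > 0.95
  for: 5m
  labels:
    severity: critical
```

The warning leaves time to act; the critical keeps the Discord route quiet until the situation genuinely demands attention.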
### Annotation Templates
Standard annotations used across this repository:

```yaml
annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact,
    and what to investigate. Reference threshold values and current
    values using template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"
```

The `runbook_url` annotation is optional but recommended for critical alerts that have established recovery procedures.
### PromQL Template Functions
Functions available in `summary` and `description` annotations:

| Function | Input | Output | Example |
|---|---|---|---|
| `humanize` | Number | Human-readable number | `{{ $value \| humanize }}` |
| `humanizePercentage` | Float (0-1) | Percentage string | `{{ $value \| humanizePercentage }}` |
| `humanizeDuration` | Seconds | Duration string | `{{ $value \| humanizeDuration }}` |
| `printf` | Format string | Formatted value | `{{ printf "%.2f" $value }}` |
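These functions compose with label variables inside a single annotation. A brief sketch (the metric semantics and labels here are illustrative, not taken from this repository):

```yaml
annotations:
  summary: "Backup job on {{ $labels.instance }} running long"
  description: >-
    Last run took {{ $value | humanizeDuration }}
    ({{ printf "%.0f" $value }} seconds).
```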
### Label Variables in Annotations
Access alert labels via `{{ $labels.<label_name> }}` and the expression value via `{{ $value }}`:

```yaml
summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
```
{{ $labels.instance }}上的BPF map {{ $labels.map_name }}使用率已达{{ $value | humanizePercentage }}。Common Alert Patterns
Target down (availability):

```yaml
- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"
```

Absence detection (component disappeared entirely):

```yaml
- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"
```

Error rate (ratio):

```yaml
- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

Latency (histogram quantile):

```yaml
- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"
```

Resource pressure (capacity):

```yaml
- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"
```

PVC space low:

```yaml
- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"
```
### Alert Grouping
Group related alerts in named rule groups. The `name` field groups alerts in the Prometheus UI and affects evaluation order:

```yaml
spec:
  groups:
    - name: cilium-agent # Agent availability and health
      rules: [...]
    - name: cilium-bpf # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy # Network policy alerts
      rules: [...]
    - name: cilium-network # General networking alerts
      rules: [...]
```
## Recording Rules
Recording rules pre-compute expensive queries for dashboard performance. Place them alongside alerts in the same PrometheusRule file or in a dedicated `*-recording-rules.yaml` file.

```yaml
spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>
```
### Naming Convention
Recording rule names follow the pattern `level:metric:operations`:

```
loki:request_duration_seconds:p99
loki:requests_total:rate5m
loki:requests_error_rate:ratio5m
```
### When to Create Recording Rules
- Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)
- Queries used by multiple alerts (avoids redundant computation)
- Complex multi-step computations that are hard to read inline
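For instance, an alert can consume a recording rule directly, so the ratio is computed once per evaluation interval rather than once per alert. A sketch reusing the rule name from this document (the threshold is illustrative):

```yaml
- alert: LokiHighErrorRate
  expr: loki:requests_error_rate:ratio5m > 0.05
  for: 10m
  labels:
    severity: warning
```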
### Example: Loki Recording Rules
```yaml
- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )
- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
```
## ServiceMonitor and PodMonitor
### Via Helm Values (Preferred)
Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:

```yaml
# kubernetes/platform/charts/<app-name>.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```
### Manual ServiceMonitor
When a chart does not support ServiceMonitor creation, create one manually. The resource lives in the `monitoring` namespace and uses `namespaceSelector` to reach across namespaces.

```yaml
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace> # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component> # Must match service labels
  endpoints:
    - port: http-monitoring # Must match service port name
      path: /metrics
      interval: 30s
```
### Manual PodMonitor
Use PodMonitor when pods expose metrics but don't have a Service (e.g., DaemonSets, sidecars):

```yaml
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/podmonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020" # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s
```
### Cross-Namespace Pattern
All ServiceMonitors and PodMonitors in this repo live in the `monitoring` namespace and use `namespaceSelector` to reach pods in other namespaces. This centralizes monitoring configuration and avoids needing `release: kube-prometheus-stack` labels on resources in app namespaces.
### Advanced: matchExpressions
For selecting multiple pod labels (e.g., all Flux controllers):

```yaml
selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
```
## AlertmanagerConfig
The platform Alertmanager configuration lives in `config/monitoring/alertmanager-config.yaml`. It defines routing and receivers for the entire platform.
### Current Routing Architecture
```
All alerts
├── InfoInhibitor → null receiver (silenced)
├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
├── severity=critical → discord receiver
└── (default) → discord receiver
```
### Receivers
| Receiver | Type | Purpose |
|---|---|---|
| `null` | None | Silences matched alerts (e.g., InfoInhibitor) |
| `heartbeat` | Webhook | Sends Watchdog heartbeat to healthchecks.io |
| `discord` | Discord webhook | Sends alerts to Discord channel |
### Adding a New Route
To route specific alerts differently (e.g., to a different channel or receiver), add a route entry in `alertmanager-config.yaml`:

```yaml
routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: =
```
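For example, a dedicated receiver for backup alerts might be wired up like this (the receiver and alert names are hypothetical; substitute a receiver you have actually defined):

```yaml
routes:
  - receiver: "backups-discord"
    matchers:
      - name: alertname
        value: "VolsyncBackupFailed"
        matchType: =
```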
### Secrets for Alertmanager
| Secret | Source | File |
|---|---|---|
| | ExternalSecret (AWS SSM) | |
| | Replicated from | |
## Silence CRs (silence-operator)
Silences suppress known alerts declaratively. They are per-cluster resources because
different clusters have different expected alert profiles.
### Placement
```
kubernetes/clusters/<cluster>/config/silences/
├── kustomization.yaml
└── <descriptive-name>.yaml
```

### Template
```yaml
---
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~" # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>
```
### Matcher Reference
| matchType | Meaning | Example |
|---|---|---|
| `=` | Exact match | |
| `!=` | Not equal | |
| `=~` | Regex match | `value: "Alert1\|Alert2"` |
| `!~` | Regex negation | |
### Requirements
- Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)
- Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)
- Silences are a LAST RESORT — every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs
- Never leave alerts firing without action — either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed
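As a concrete sketch, a silence for an architectural limitation might look like the following (the cluster situation and alert names are hypothetical):

```yaml
---
# Single-node cluster: Spegel has no peers, so mirror redundancy
# alerts can never resolve (architectural limitation).
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: spegel-single-node
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~"
      value: "SpegelNoPeers|SpegelMirrorDegraded"
```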
### Adding a Silence to a Cluster
1. Create the `config/silences/` directory if it does not exist
2. Add the Silence YAML file
3. Create or update `config/silences/kustomization.yaml`:

   ```yaml
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
     - <silence-name>.yaml
   ```

4. Reference the `silences` directory in `config/kustomization.yaml`
## Canary Health Checks
Canary resources provide synthetic monitoring using Flanksource canary-checker. They live in `config/canary-checker/` for platform checks or alongside app config for app-specific checks.
### HTTP Health Check
```yaml
---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7 # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000 # Fail if response takes >5s
```
### TCP Port Check
```yaml
spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000
```
### Kubernetes Resource Check with CEL
Test that pods are actually healthy using CEL expressions (preferred over `ready: true` because the built-in flag penalizes pods with restart history):

```yaml
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True")
          )
```
### Canary Metrics and Alerting
canary-checker exposes metrics that are already monitored by the platform:

- `canary_check == 1` triggers `CanaryCheckFailure` (critical, 2m)
- High failure rates trigger `CanaryCheckHighFailureRate` (warning, 5m)

These alerts are defined in `config/canary-checker/prometheus-rules.yaml` -- you do not need to create separate alerts for each canary.
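Based on the metric and thresholds above, the platform failure alert has roughly this shape (a sketch for orientation only; the authoritative definition lives in `config/canary-checker/prometheus-rules.yaml`):

```yaml
- alert: CanaryCheckFailure
  expr: canary_check == 1
  for: 2m
  labels:
    severity: critical
```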
config/canary-checker/prometheus-rules.yamlcanary-checker暴露的指标已经被平台监控覆盖:
- 触发
canary_check == 1(critical,2m)CanaryCheckFailure - 高失败率触发(warning,5m)
CanaryCheckHighFailureRate
这些告警已经在中定义——不需要为每个canary单独创建告警。
## Workflow: Adding Monitoring for a New Component
### Step 1: Determine What Exists
Check if the Helm chart already provides monitoring:

```bash
# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule
```

Enable via Helm values if available (see [deploy-app skill](../deploy-app/SKILL.md)).

### Step 2: Create Missing Resources
If the chart does not provide monitoring, create resources manually:
- ServiceMonitor or PodMonitor for metrics scraping
- PrometheusRule for alert rules
- Canary for synthetic health checks (HTTP/TCP)
### Step 3: Place Files Correctly
- If the component has its own config subsystem (`config/<component>/`), add monitoring resources there alongside other config
- If it is a standalone monitoring addition, add to `config/monitoring/`
### Step 4: Register in Kustomization
Add new files to the appropriate `kustomization.yaml`.
### Step 5: Validate
```bash
task k8s:validate
```

### Step 6: Verify After Deployment
Prometheus is behind OAuth2 Proxy — use `kubectl exec` or port-forward for API queries:

```bash
# Check ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' \
  | jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'
```
```bash
# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' \
  | jq '.data.groups[] | select(.name | contains("<component>"))'
```
```bash
# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>
```

## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Missing `release: kube-prometheus-stack` label | Prometheus ignores the resource | Add the label to `metadata.labels` |
| PrometheusRule in wrong namespace without namespaceSelector | Prometheus does not discover it | Place in the `monitoring` namespace |
| ServiceMonitor selector does not match any service | No metrics scraped, no error raised | Verify the selector labels match the Service |
| Using `ready: true` in canary-checker Kubernetes checks | False positives after pod restarts | Use CEL `test.expr` instead |
| Hardcoding domains in canary URLs | Breaks across clusters | Use `${internal_domain}` substitution |
| Very short `for` on flap-prone metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for metrics that do not exist yet | Alert permanently in "pending" state | Verify metrics exist in Prometheus before writing rules |
## Reference: Existing Alert Files
| File | Component | Alert Count | Subsystem |
|---|---|---|---|
| | Cilium | 14 | Agent, BPF, Policy, Network |
| | Istio | ~10 | Control plane, mTLS, Gateway |
| | cert-manager | 5 | Expiry, Renewal, Issuance |
| | Network Policy | 2 | Enforcement escape hatch |
| | External Secrets | 3 | Sync, Ready, Store health |
| | Grafana | 4 | Datasource, Errors, Plugins, Down |
| | Loki | ~5 | Requests, Latency, Ingester |
| | Alloy | 3 | Dropped entries, Errors, Lag |
| | Hardware | 7 | Temperature, Fans, Disks, Power |
| | Dragonfly | 2+ | Down, Memory |
| | canary-checker | 2 | Check failure, High failure rate |
## Keywords
PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence,
silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring,
observability, scrape targets, prometheus, alertmanager, discord, heartbeat