dd-monitors

Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

This requires Go or the pup binary in your path.

pup - go install github.com/datadog-labs/pup@latest

Ensure ~/go/bin is in $PATH.

Quick Start

bash
pup auth login

Common Operations

List Monitors

bash
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"

Get Monitor

bash
pup monitors get <id> --json

Create Monitor

bash
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"

Mute/Unmute

bash
# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

| Rule | Why |
|---|---|
| No flapping alerts | Use last_Xm (for example last_5m), not last_1m |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action needed, don't alert |
| Include runbook | @runbook-url in message |

python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
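
To see why a short window flaps, the sketch below replays a made-up per-minute CPU series through a rolling average, once with a 1-minute window and once with a 5-minute window. The series, the function name `count_alert_transitions`, and the simplified evaluation model are assumptions for illustration only; Datadog's real evaluation is more involved.

```python
from collections import deque


def count_alert_transitions(samples, window, threshold):
    """Count OK -> ALERT transitions when alerting on a rolling average."""
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    alerting = False
    transitions = 0
    for value in samples:
        recent.append(value)
        average = sum(recent) / len(recent)
        now_alerting = average > threshold
        if now_alerting and not alerting:
            transitions += 1
        alerting = now_alerting
    return transitions


# Hypothetical per-minute CPU percentages: brief spikes on an otherwise quiet host.
cpu = [40, 95, 40, 92, 40, 96, 40, 40, 91, 40]

one_minute = count_alert_transitions(cpu, window=1, threshold=80)
five_minute = count_alert_transitions(cpu, window=5, threshold=80)
print(one_minute, five_minute)
```

With this sample data every isolated spike trips the 1-minute window, while the 5-minute average never crosses the threshold.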

2. Use Proper Scoping

python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
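
A check like this could run before a monitor is created. The following is a minimal sketch, assuming a catch-all scope always appears literally as `{*}` in the query string; the helper name `is_unscoped` is hypothetical and not part of pup or the Datadog API.

```python
import re


def is_unscoped(query: str) -> bool:
    """Return True if the query contains the catch-all {*} scope."""
    return re.search(r"\{\s*\*\s*\}", query) is not None


unscoped = "avg(last_5m):avg:system.cpu.user{*} > 80"
scoped = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"

print(is_unscoped(unscoped))
print(is_unscoped(scoped))
```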

3. Set Recovery Thresholds

python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
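
The effect of a recovery threshold can be illustrated with a small simulation. This is a simplified model, not Datadog's evaluator: it assumes the monitor alerts when the value exceeds `critical` and recovers only once the value falls to `critical_recovery` or below. The function name and the sample values are made up.

```python
def count_notifications(values, critical, critical_recovery=None):
    """Count OK -> ALERT transitions under a simple hysteresis model."""
    # Without a recovery threshold, the monitor recovers as soon as the
    # value is no longer above the critical threshold.
    recover_at = critical if critical_recovery is None else critical_recovery
    alerting = False
    notifications = 0
    for value in values:
        if not alerting and value > critical:
            alerting = True
            notifications += 1
        elif alerting and value <= recover_at:
            alerting = False
    return notifications


# Hypothetical evaluated values hovering around the 80% threshold.
hovering = [79, 81, 79, 82, 78, 81, 75, 69, 81]

without_recovery = count_notifications(hovering, critical=80)
with_recovery = count_notifications(hovering, critical=80, critical_recovery=70)
print(without_recovery, with_recovery)
```

In this sample the monitor without a recovery threshold re-alerts each time the value dips just under 80, while the one with `critical_recovery` set to 70 stays in alert until the value really drops.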

4. Include Context in Messages

python
message = """
High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

Runbook

  1. Check top processes:
    ssh {{host.name}} 'top -bn1 | head -20'
  2. Check recent deploys
  3. Scale if needed

@slack-ops @pagerduty-oncall
"""
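
A message can be checked for this context before the monitor is saved. The sketch below is one possible lint, assuming the three template variables used above are the ones you want to require; `REQUIRED_VARIABLES` and `message_problems` are hypothetical names, and the `@handle` pattern is a rough approximation.

```python
import re

# Template variables this sketch treats as mandatory (an assumption).
REQUIRED_VARIABLES = ("{{host.name}}", "{{value}}", "{{threshold}}")


def message_problems(message: str) -> list[str]:
    """Return a list of context items missing from a monitor message."""
    problems = [f"missing {var}" for var in REQUIRED_VARIABLES if var not in message]
    # An @handle must start a whitespace-separated token, so emails do not count.
    if not re.search(r"(?<!\S)@[\w.-]+", message):
        problems.append("no @notification handle")
    return problems


good = "Host: {{host.name}} Current Value: {{value}} Threshold: {{threshold}} @slack-ops"
bad = "CPU is high"

print(message_problems(good))
print(message_problems(bad))
```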

⚠️ NEVER Delete Monitors Directly

Use safe deletion workflow (same as dashboards):
python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
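
The workflow can be dry-run without touching a real account. The sketch below uses a hypothetical in-memory client that exposes only the two methods the function calls, `get_monitor` and `update_monitor`; it is a test double, not a Datadog SDK class. A condensed copy of the marking function is repeated so the example runs on its own.

```python
def safe_mark_monitor_for_deletion(monitor_id, client):
    """Condensed copy of the function above, repeated so this sketch is standalone."""
    name = client.get_monitor(monitor_id).get("name", "")
    if "[MARKED FOR DELETION]" in name:
        return False
    client.update_monitor(monitor_id, {"name": f"[MARKED FOR DELETION] {name}"})
    return True


class InMemoryMonitorClient:
    """Hypothetical stand-in for a real API client, backed by a dict."""

    def __init__(self, monitors):
        self._monitors = monitors

    def get_monitor(self, monitor_id):
        return self._monitors[monitor_id]

    def update_monitor(self, monitor_id, changes):
        self._monitors[monitor_id].update(changes)


client = InMemoryMonitorClient({"12345": {"name": "High CPU on web servers"}})
first = safe_mark_monitor_for_deletion("12345", client)   # marks the monitor
second = safe_mark_monitor_for_deletion("12345", client)  # already marked, no change
print(first, second, client.get_monitor("12345")["name"])
```

Marking is idempotent: the second call leaves the name unchanged, so rerunning the workflow does not stack prefixes.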

Monitor Types

| Type | Use Case |
|---|---|
| metric alert | CPU, memory, custom metrics |
| query alert | Complex metric queries |
| service check | Agent check status |
| event alert | Event stream patterns |
| log alert | Log pattern matching |
| composite | Combine multiple monitors |
| apm | APM metrics |

Audit Monitors

bash
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find possibly noisy monitors: the 10 whose state changed most recently
# (a rough proxy; this does not count alerts)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
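
The ownership audit can also be done in Python when the result feeds further tooling. This sketch assumes the JSON output is an array of objects with `id`, `name`, and `tags` fields, as the jq filters above imply; the sample data and the function name `monitors_without_team` are invented for illustration.

```python
import json


def monitors_without_team(monitors):
    """Return id and name of every monitor that has no team:<name> tag."""
    return [
        {"id": m["id"], "name": m["name"]}
        for m in monitors
        # Treat a missing or null tags field as "no tags".
        if not any(tag.startswith("team:") for tag in (m.get("tags") or []))
    ]


# Hypothetical sample shaped like the assumed JSON output.
sample = json.loads("""
[
  {"id": 1, "name": "High CPU", "tags": ["team:platform", "env:prod"]},
  {"id": 2, "name": "Disk full", "tags": ["env:prod"]},
  {"id": 3, "name": "Old check", "tags": null}
]
""")

orphans = monitors_without_team(sample)
print(orphans)
```

Unlike the jq filter, this version also tolerates monitors whose `tags` field is null or absent.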

Downtime vs Muting

| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |

bash
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"

Failure Handling

| Problem | Fix |
|---|---|
| Alert not firing | Check that the query returns data and that the thresholds are correct |
| Too many alerts | Increase the window, add a recovery threshold |
| No data alerts | Check agent connectivity and that the metric exists |
| Auth error | pup auth refresh |
