dd-monitors
Datadog Monitors
Create, manage, and maintain monitors for alerting.
Prerequisites
This requires Go or the pup binary in your path.
Install pup with go install github.com/datadog-labs/pup@latest and make sure ~/go/bin is in your $PATH.
Quick Start
bash
pup auth login
Common Operations
List Monitors
bash
pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"
Get Monitor
bash
pup monitors get <id> --json
Create Monitor
bash
pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"
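When monitor creation is scripted, passing the arguments as a list avoids shell-quoting problems with queries that contain braces and spaces. The following is a minimal sketch, not part of pup itself: the helper name `pup_create_args` is hypothetical, and it uses only the flags shown in the command above.

```python
import shlex
from typing import List


def pup_create_args(name: str, monitor_type: str, query: str, message: str) -> List[str]:
    """Build the argument list for `pup monitors create` (hypothetical helper)."""
    return [
        "pup", "monitors", "create",
        "--name", name,
        "--type", monitor_type,
        "--query", query,
        "--message", message,
    ]


args = pup_create_args(
    name="High CPU on web servers",
    monitor_type="metric alert",
    query="avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    message="CPU above 80% @slack-ops",
)
# Print a shell-quoted version for review before running anything.
print(shlex.join(args))
```

To execute the command, the same list can be handed to `subprocess.run(args, check=True)`, which requires pup to be installed and authenticated.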
Mute/Unmute
bash
# Mute with duration
pup monitors mute --id 12345 --duration 1h
# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"
# Unmute
pup monitors unmute --id 12345
⚠️ Monitor Creation Best Practices
1. Avoid Alert Fatigue
| Rule | Why |
|---|---|
| No flapping alerts | Use |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action needed, don't alert |
| Include runbook | |
python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive
# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
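The window check can be automated before a monitor is created. This is a rough sketch under stated assumptions: the helper names are hypothetical, the regular expression only handles windows written as `last_<n>m`, `last_<n>h`, or `last_<n>d`, and the 5-minute minimum is taken from the example above rather than from any Datadog rule.

```python
import re
from typing import Optional

# Matches the evaluation window in queries such as "avg(last_5m):...".
WINDOW_RE = re.compile(r"\(last_(\d+)([mhd])\)")
MINUTES_PER_UNIT = {"m": 1, "h": 60, "d": 1440}


def evaluation_window_minutes(query: str) -> Optional[int]:
    """Return the evaluation window in minutes, or None if none is found."""
    match = WINDOW_RE.search(query)
    if match is None:
        return None
    return int(match.group(1)) * MINUTES_PER_UNIT[match.group(2)]


def is_window_too_short(query: str, minimum_minutes: int = 5) -> bool:
    """Flag queries whose window is shorter than the chosen minimum."""
    window = evaluation_window_minutes(query)
    return window is not None and window < minimum_minutes
```

A query with no recognizable window is not flagged, so the check errs on the side of silence rather than false warnings.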
2. Use Proper Scoping
python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope
# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
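Unscoped queries can be caught with a small check before submission. The sketch below is illustrative only: `query_scope` and `is_unscoped` are hypothetical names, and the parser looks only at the first `{...}` group, which is the tag scope in the simple queries shown here but may not hold for multi-metric formulas.

```python
import re
from typing import List

# Captures the first tag scope in braces, e.g. "{env:prod,service:api}".
SCOPE_RE = re.compile(r"\{([^{}]*)\}")


def query_scope(query: str) -> List[str]:
    """Return the tag filters from the first {...} scope in the query."""
    match = SCOPE_RE.search(query)
    if match is None:
        return []
    return [tag.strip() for tag in match.group(1).split(",") if tag.strip()]


def is_unscoped(query: str) -> bool:
    """True when the query has no scope or matches everything with {*}."""
    scope = query_scope(query)
    return not scope or scope == ["*"]
```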
3. Set Recovery Thresholds
python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
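Recovery thresholds only prevent flapping when they sit below the threshold they recover from. A minimal sketch of that check follows, assuming a monitor that alerts with a `>` comparison as in the example above (the inequalities would flip for `<` monitors). The function name is hypothetical.

```python
from typing import Dict, List


def recovery_threshold_problems(thresholds: Dict[str, float]) -> List[str]:
    """Return problems found in the thresholds of a '>' monitor."""
    problems = []
    for level in ("critical", "warning"):
        trigger = thresholds.get(level)
        recovery = thresholds.get(f"{level}_recovery")
        if trigger is None:
            continue  # Nothing to validate for this level.
        if recovery is None:
            problems.append(f"{level}: no recovery threshold set")
        elif recovery >= trigger:
            problems.append(f"{level}: recovery {recovery} is not below {trigger}")
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    if critical is not None and warning is not None and warning >= critical:
        problems.append(f"warning {warning} is not below critical {critical}")
    return problems
```

An empty list means the thresholds leave a gap between alerting and recovering, which is what keeps a value hovering near 80 from toggling the monitor repeatedly.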
4. Include Context in Messages
python
message = """
High CPU Alert
Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}
Runbook
- Check top processes: ssh {{host.name}} 'top -bn1 | head -20'
- Check recent deploys
- Scale if needed
@slack-ops @pagerduty-oncall
"""
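A message can be checked for the three kinds of context the template above carries: template variables, a runbook, and a notification handle. This is a rough sketch with a hypothetical function name. The patterns are simple heuristics chosen for illustration, not Datadog's own message validation.

```python
import re
from typing import List


def missing_message_context(message: str) -> List[str]:
    """List the context elements that a monitor message appears to lack."""
    missing = []
    # Template variables look like {{value}} or {{host.name}}.
    if not re.search(r"\{\{\s*[\w.]+\s*\}\}", message):
        missing.append("template variable such as {{value}}")
    # Heuristic: any mention of a runbook counts.
    if "runbook" not in message.lower():
        missing.append("runbook section")
    # Handles start with @ at the beginning of a word, e.g. @slack-ops.
    if not re.search(r"(?<!\S)@[\w.-]+", message):
        missing.append("notification handle such as @slack-ops")
    return missing
```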
⚠️ NEVER Delete Monitors Directly
Use safe deletion workflow (same as dashboards):
python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    # Skip monitors that already carry the marker.
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
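Marking is only half of the workflow, since someone still has to review what was marked. The sketch below collects marked monitors for that review. It assumes monitors are plain dicts with `id` and `name` keys, as the marking function implies, and the helper name is hypothetical.

```python
from typing import List, Tuple

MARKER = "[MARKED FOR DELETION]"


def marked_for_deletion(monitors: List[dict]) -> List[Tuple[object, str]]:
    """Return (id, name) pairs for monitors whose name carries the marker."""
    marked = []
    for monitor in monitors:
        name = monitor.get("name") or ""
        if MARKER in name:
            marked.append((monitor.get("id"), name))
    return marked
```

The resulting list can be shared with the owning team before anything is actually removed.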
Monitor Types
| Type | Use Case |
|---|---|
| | CPU, memory, custom metrics |
| | Complex metric queries |
| | Agent check status |
| | Event stream patterns |
| | Log pattern matching |
| | Combine multiple monitors |
| | APM metrics |
Audit Monitors
bash
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
# Find potentially noisy monitors: the 10 whose state changed most recently (a proxy, not an alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
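The ownership audit can also be run in Python once the JSON is saved to a file. This is a sketch under an assumption the jq filters already make: that `pup monitors list --json` prints a JSON array of monitor objects with a `tags` list. The function name is hypothetical. Unlike jq's `contains`, which matches `team:` anywhere inside a tag, this version requires the tag to start with `team:`.

```python
from typing import List


def monitors_without_team(monitors: List[dict]) -> List[dict]:
    """Return id and name for monitors with no tag starting with 'team:'."""
    result = []
    for monitor in monitors:
        tags = monitor.get("tags") or []  # Treat a missing or null list as empty.
        if not any(tag.startswith("team:") for tag in tags):
            result.append({"id": monitor.get("id"), "name": monitor.get("name")})
    return result
```

With the output saved through `pup monitors list --json > monitors.json`, the list can be loaded with `json.load` and passed straight to this function.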
Downtime vs Muting
| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |
bash
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
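The `--start` and `--end` values are UTC timestamps, which are easy to get wrong when computed by hand from local time. Here is a minimal sketch for deriving both from a start time and a duration. The helper name is hypothetical, and the output format simply mirrors the timestamps in the command above.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def downtime_window(start: datetime, hours: float) -> Tuple[str, str]:
    """Return (start, end) as UTC strings like 2024-01-15T02:00:00Z."""
    if start.tzinfo is None:
        # Refuse naive datetimes rather than guess the intended time zone.
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start_utc.strftime(fmt), end_utc.strftime(fmt)
```

For a four-hour maintenance window starting at 02:00 UTC on 2024-01-15, the function yields the same pair of values used in the command above.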
Failure Handling
| Problem | Fix |
|---|---|
| Alert not firing | Check query returns data, thresholds |
| Too many alerts | Increase window, add recovery threshold |
| No data alerts | Check agent connectivity, metric exists |
| Auth error | |