dd-monitors

Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

You need either the `pup` binary on your path or Go to install it.

Install `pup` with Go:

go install github.com/datadog-labs/pup@latest

Ensure `~/go/bin` is in `$PATH`.

Quick Start

bash

pup auth login

Common Operations

List Monitors

bash

pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"

Get Monitor

bash

pup monitors get <id> --json

Create Monitor

bash

pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"

Mute/Unmute

bash

# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345
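The two mute forms above describe the same thing in different ways: a duration from now, or an absolute end time. As a rough sketch of how they relate, the hypothetical helper `mute_end` below converts a duration string into the ISO 8601 UTC timestamp format used in the `--end` example. The accepted units (`m`, `h`, `d`) are an assumption for illustration; the source only shows `1h`, and `pup` may accept other forms.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; only "1h" appears in the examples above.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def mute_end(start: datetime, duration: str) -> str:
    """Return the end of a mute window as an ISO 8601 UTC timestamp."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    match = re.fullmatch(r"(\d+)([mhd])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    amount, unit = match.groups()
    end = start + timedelta(**{_UNITS[unit]: int(amount)})
    return end.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


start = datetime(2024, 1, 15, 17, 0, tzinfo=timezone.utc)
print(mute_end(start, "1h"))
```

Computing the end time yourself is mainly useful when the same window has to be recorded somewhere else, such as a change ticket.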

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

Rule	Why
No flapping alerts	Use a multi-minute window such as `last_5m`, not `last_1m`
Meaningful thresholds	Based on SLOs, not guesses
Actionable alerts	If no action is needed, don't alert
Include runbook	`@runbook-url` in message

python

# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
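The window rule can be checked mechanically before a monitor is created. The sketch below is one possible approach, not part of `pup`: the hypothetical helpers `evaluation_window_minutes` and `is_flap_prone` parse the `last_<N><unit>` prefix of a metric query and flag windows shorter than five minutes. The five-minute floor and the supported units (`m`, `h`, `d`) are assumptions taken from the examples in this section.

```python
import re

# Matches the leading "<aggregator>(last_<N><unit>):" part of a monitor query.
_WINDOW_RE = re.compile(r"^\s*\w+\(last_(\d+)([mhd])\):")
_UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}


def evaluation_window_minutes(query: str) -> int:
    """Return the evaluation window of a metric monitor query, in minutes."""
    match = _WINDOW_RE.match(query)
    if match is None:
        raise ValueError(f"no evaluation window found in query: {query!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MINUTES[unit]


def is_flap_prone(query: str, min_minutes: int = 5) -> bool:
    """True when the window is shorter than the assumed minimum."""
    return evaluation_window_minutes(query) < min_minutes


print(is_flap_prone("avg(last_1m):avg:system.cpu.user{*} > 50"))
print(is_flap_prone("avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"))
```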

2. Use Proper Scoping

python

# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
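One way to make scoping hard to forget is to assemble queries from parts and refuse an empty scope. The following is a minimal sketch under that assumption; `build_metric_query` is a hypothetical helper, and it only covers the simple `time_agg(window):space_agg:metric{scope} by {group} > threshold` shape used in the examples here.

```python
def build_metric_query(metric, scope, threshold, *, window="last_5m",
                       time_agg="avg", space_agg="avg",
                       group_by=(), comparator=">"):
    """Assemble a scoped metric monitor query string."""
    if not scope:
        # An empty scope would be equivalent to {*}: alerting on everything.
        raise ValueError("scope must not be empty")
    scope_str = ",".join(f"{key}:{value}" for key, value in scope.items())
    query = f"{time_agg}({window}):{space_agg}:{metric}{{{scope_str}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return f"{query} {comparator} {threshold}"


print(build_metric_query(
    "system.cpu.user",
    {"env": "prod", "service": "api"},
    80,
    group_by=["host"],
))
```

The resulting string can be passed to `--query` in the `pup monitors create` command shown earlier.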

3. Set Recovery Thresholds

python

monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
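Recovery thresholds only prevent flapping if they sit on the correct side of the alert thresholds. The sketch below checks that ordering for an "above" (`>`) monitor such as the one shown. The helper `threshold_problems` and the exact rules it enforces are assumptions for illustration; Datadog's own validation is authoritative and may be stricter.

```python
from __future__ import annotations


def threshold_problems(thresholds: dict) -> list[str]:
    """Return ordering problems for the thresholds of a '>' monitor."""
    critical = thresholds.get("critical")
    if critical is None:
        return ["critical threshold is required"]

    problems = []
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    if warning is not None and warning >= critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and critical_recovery >= critical:
        problems.append("critical_recovery must be below critical")
    if warning_recovery is not None:
        if warning is None:
            problems.append("warning_recovery requires warning")
        elif warning_recovery >= warning:
            problems.append("warning_recovery must be below warning")
    return problems


print(threshold_problems({
    "critical": 80, "critical_recovery": 70,
    "warning": 60, "warning_recovery": 50,
}))
```

For a "below" (`<`) monitor the comparisons would be reversed; that case is left out to keep the sketch short.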

4. Include Context in Messages

python

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes:
   ```
   ssh {{host.name}} 'top -bn1 | head -20'
   ```
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
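Before saving a monitor it can help to see roughly how the message will read once the template variables are filled in. The sketch below is a local approximation only: `preview_message` is a hypothetical helper that substitutes `{{name}}` placeholders from a sample dictionary and leaves unknown placeholders untouched. Datadog performs the real rendering at notification time and supports features this sketch ignores.

```python
import re

# Matches simple placeholders such as {{value}} or {{host.name}}.
_VAR_RE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def preview_message(template: str, sample: dict) -> str:
    """Fill placeholders from sample values; leave unknown ones as-is."""
    def substitute(match):
        name = match.group(1)
        return str(sample.get(name, match.group(0)))
    return _VAR_RE.sub(substitute, template)


template = "Host: {{host.name}} Current Value: {{value}} Threshold: {{threshold}}"
print(preview_message(template, {"host.name": "web-01", "value": 93.5}))
```

Placeholders that survive the preview unchanged are a quick hint that the sample data, or the template, is missing something.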

⚠️ NEVER Delete Monitors Directly

Use the safe deletion workflow (same as for dashboards): mark the monitor by renaming it instead of deleting it.

python

def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
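Marking is only half of the workflow; someone eventually has to review what was marked. As a sketch of that review step, the hypothetical helpers below work on a plain list of monitor dictionaries (for example, the parsed output of `pup monitors list --json`, assuming it is a JSON array of objects with a `name` field). They find marked monitors and recover the original name in case a mark needs to be undone.

```python
MARKER = "[MARKED FOR DELETION]"  # same marker string as the function above


def marked_monitors(monitors):
    """Return the monitors whose name carries the deletion marker."""
    return [m for m in monitors if MARKER in m.get("name", "")]


def restored_name(name):
    """Remove the first marker from a name, e.g. to undo an accidental mark."""
    return name.replace(MARKER, "", 1).strip()


sample = [
    {"id": 1, "name": "High CPU on web servers"},
    {"id": 2, "name": "[MARKED FOR DELETION] Old disk check"},
]
for monitor in marked_monitors(sample):
    print(monitor["id"], restored_name(monitor["name"]))
```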

Monitor Types

Type	Use Case
`metric alert`	CPU, memory, custom metrics
`query alert`	Complex metric queries
`service check`	Agent check status
`event alert`	Event stream patterns
`log alert`	Log pattern matching
`composite`	Combine multiple monitors
`apm`	APM metrics

Audit Monitors

bash

# Find monitors without owners (no team: tag)
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find possibly noisy monitors (the 10 whose state changed most recently)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | reverse | .[:10] | .[] | {id, name, status: .overall_state}'
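The ownership check can also be done in Python when the audit needs to feed a report or a script. The sketch below assumes the JSON from `pup monitors list --json` has been parsed into a list of dictionaries with `id`, `name`, and `tags` fields, as the `jq` filter above implies. `monitors_without_owner` is a hypothetical helper. It matches tags that start with `team:`, which is slightly stricter than `jq`'s `contains`, since `contains` does substring matching on strings.

```python
def monitors_without_owner(monitors):
    """Return id and name of monitors that have no team:<name> tag."""
    unowned = []
    for monitor in monitors:
        tags = monitor.get("tags") or []  # tolerate a missing or null tag list
        if not any(tag.startswith("team:") for tag in tags):
            unowned.append({"id": monitor["id"], "name": monitor["name"]})
    return unowned


sample = [
    {"id": 1, "name": "High CPU on web servers", "tags": ["team:platform"]},
    {"id": 2, "name": "Disk almost full", "tags": ["env:prod"]},
    {"id": 3, "name": "Legacy check", "tags": None},
]
print(monitors_without_owner(sample))
```

In practice the input would come from something like `json.load(sys.stdin)` with the CLI output piped in.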

Downtime vs Muting

Use	When
Mute monitor	Quick one-off, < 1 hour
Downtime	Scheduled maintenance, recurring

bash

# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"
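The table's rule of thumb is simple enough to encode, which can be handy in automation that silences alerts on behalf of users. The sketch below is only an illustration of that rule; `suggest_silencing` is a hypothetical helper, and the one-hour cutoff comes directly from the table above.

```python
def suggest_silencing(duration_minutes: float, *, scheduled: bool = False,
                      recurring: bool = False) -> str:
    """Return "mute" for quick one-offs under an hour, else "downtime"."""
    if scheduled or recurring or duration_minutes >= 60:
        return "downtime"
    return "mute"


print(suggest_silencing(20))
print(suggest_silencing(240, scheduled=True))
```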

Failure Handling

Problem	Fix
Alert not firing	Check query returns data, thresholds
Too many alerts	Increase window, add recovery threshold
No data alerts	Check agent connectivity, metric exists
Auth error	`pup auth refresh`
