Loading...
Loading...
Monitor management - create, update, mute, and alerting best practices.
npx skill4agent add datadog-labs/agent-skills dd-monitorspupgo install github.com/datadog-labs/pup@latest~/go/bin$PATHpup auth loginpup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"pup monitors get <id> --jsonpup monitors create \
--name "High CPU on web servers" \
--type "metric alert" \
--query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
--message "CPU above 80% @slack-ops"# Mute with duration
pup monitors mute --id 12345 --duration 1h
# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"
# Unmute
pup monitors unmute --id 12345| Rule | Why |
|---|---|
| No flapping alerts | Use |
| Meaningful thresholds | Based on SLOs, not guesses |
| Actionable alerts | If no action needed, don't alert |
| Include runbook | |
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50" # ❌ Too sensitive
# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80" # ✅ Reasonable window# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80" # ❌ No scope
# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80" # ✅monitor = {
"query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
"options": {
"thresholds": {
"critical": 80,
"critical_recovery": 70, # ✅ Prevents flapping
"warning": 60,
"warning_recovery": 50
}
}
}message = """
## High CPU Alert
Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}
### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed
@slack-ops @pagerduty-oncall
"""def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
"""Mark monitor instead of deleting."""
monitor = client.get_monitor(monitor_id)
name = monitor.get("name", "")
if "[MARKED FOR DELETION]" in name:
print(f"Already marked: {name}")
return False
new_name = f"[MARKED FOR DELETION] {name}"
client.update_monitor(monitor_id, {"name": new_name})
print(f"✓ Marked: {new_name}")
return True| Type | Use Case |
|---|---|
| CPU, memory, custom metrics |
| Complex metric queries |
| Agent check status |
| Event stream patterns |
| Log pattern matching |
| Combine multiple monitors |
| APM metrics |
# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'
# Find noisy monitors (high alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | .[:10] | .[] | {id, name, status: .overall_state}'| Use | When |
|---|---|
| Mute monitor | Quick one-off, < 1 hour |
| Downtime | Scheduled maintenance, recurring |
# Downtime (preferred)
pup downtime create \
--scope "env:prod" \
--monitor-tags "team:platform" \
--start "2024-01-15T02:00:00Z" \
--end "2024-01-15T06:00:00Z"| Problem | Fix |
|---|---|
| Alert not firing | Check query returns data, thresholds |
| Too many alerts | Increase window, add recovery threshold |
| No data alerts | Check agent connectivity, metric exists |
| Auth error | |