Create, update, and manage Oodle monitors — alerting thresholds, query scoping, and best practices to avoid alert fatigue.
npx skill4agent add oodle-ai/agent-skills oodle-monitors

# Install + configure (see oodle-cli skill)
brew install oodle-ai/oodle/oodle
oodle configure
# or
export OODLE_API_KEY=<key>
export OODLE_INSTANCE=<instance>
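export OODLE_DEPLOYMENT=<url>

oodle monitors list -o json | jq 'length'

oodle monitors list -o json

`delete`, `get`

| Task | Command |
|---|---|
| List all monitors | `oodle monitors list -o json` |
| Filter by status | `oodle monitors list --status alert -o json` |
| Filter by labels | `oodle monitors list --labels env=prod,team=platform -o json` |
| Get one monitor | `oodle monitors get mon_123 -o json` |
| Create from file | `oodle monitors create -f monitor.json` |
| Update from file | `oodle monitors update mon_123 -f monitor.json` |
| Delete (CI) | `oodle monitors delete mon_123 --force` |
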
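
Scripts that run these commands in CI may benefit from failing early when credentials are missing. The following is a minimal sketch, assuming the environment-variable setup (`OODLE_API_KEY`, `OODLE_INSTANCE`, `OODLE_DEPLOYMENT`) is used instead of `oodle configure`. The function name `check_oodle_env` is hypothetical and is not part of the oodle CLI.

```shell
#!/bin/sh
# Hypothetical helper (not part of the oodle CLI): verify that the three
# environment variables from the setup step are set and non-empty.
check_oodle_env() {
  missing=""
  for var in OODLE_API_KEY OODLE_INSTANCE OODLE_DEPLOYMENT; do
    # Indirect variable lookup that works in POSIX sh.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      missing="$missing $var"
    fi
  done
  if [ -n "$missing" ]; then
    # Report every missing variable at once, on stderr.
    echo "missing environment variables:$missing" >&2
    return 1
  fi
  return 0
}
```

A CI step could then guard each call, for example `check_oodle_env && oodle monitors list -o json`, so a missing key fails at the guard instead of at the API request.
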
# ✅ CORRECT — JSON output for scripting
oodle monitors list -o json
# ✅ CORRECT — narrow with --status to find only firing alerts
oodle monitors list --status alert -o json
# ✅ CORRECT — narrow with --labels for a specific team
oodle monitors list --labels env=prod,team=platform -o json
# ❌ WRONG — pulling everything then grepping
oodle monitors list | grep CPU

# ✅ CORRECT — fetch the full definition, edit, then update
oodle monitors get mon_123 -o yaml > monitor.yaml
$EDITOR monitor.yaml
oodle monitors update mon_123 -f monitor.yaml
# ❌ WRONG — building update payload from memory; overwrites unrelated fields
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":90}}}')

{
  "name": "High CPU on web servers",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
  "message": "CPU above 80% on {{host.name}}. Runbook: https://runbooks.example.com/cpu\n@slack-ops",
  "labels": {"team": "platform", "env": "prod"},
  "options": {
    "thresholds": {
      "critical": 80,
      "critical_recovery": 70,
      "warning": 60,
      "warning_recovery": 50
    }
  }
}

# ✅ CORRECT
oodle monitors create -f monitor.json
# ❌ WRONG — no `type`, no `options.thresholds`, monitor will be rejected
oodle monitors create -f <(echo '{"name":"x","query":"y"}')

# ✅ CORRECT — get → edit → update
oodle monitors get mon_123 -o json > monitor.json
jq '.options.thresholds.critical = 85' monitor.json > monitor.new.json
oodle monitors update mon_123 -f monitor.new.json
# ❌ WRONG — sending only the changed field; missing fields become null
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":85}}}')

# ✅ CORRECT — verify first, then delete
oodle monitors get mon_123 -o json > /dev/null
oodle monitors delete mon_123 --force
# ❌ WRONG — speculative delete by name match
oodle monitors delete "$(oodle monitors list | grep CPU | head -1 | awk '{print $1}')" --force

`last_1m`, `last_5m`, `last_15m`

# ✅ CORRECT
"query": "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"
# ❌ WRONG — flaps on every brief spike
"query": "avg(last_1m):avg:system.cpu.user{env:prod} by {host} > 80"

`{*}`

# ✅ CORRECT — scoped to a specific env + service
"query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"
# ❌ WRONG — alerts on every host in the org
"query": "avg(last_5m):avg:system.cpu.user{*} > 80"

`*_recovery`, `alert`

# ✅ CORRECT — clear recovery band (10pt below trigger)
"thresholds": {"critical": 80, "critical_recovery": 70, "warning": 60, "warning_recovery": 50}
# ❌ WRONG — no recovery values; monitor never cleanly recovers
"thresholds": {"critical": 80, "warning": 60}

`@notifier`, `message`

# ✅ CORRECT
"message": "CPU above 80% on {{host.name}} (env=prod, service=api).\nRunbook: https://runbooks.example.com/cpu\n@slack-ops @pagerduty-platform"
# ❌ WRONG — no actionable content, no routing
"message": "CPU is high"

`team`, `env`, `oodle monitors list --labels ...`

# ✅ CORRECT
"labels": {"team": "platform", "env": "prod", "service": "api"}
# ❌ WRONG
"labels": {}

| Error | Cause | Fix |
|---|---|---|
| 401 Unauthorized | Invalid or missing API key | Run |
| 404 Not Found | Monitor ID does not exist | Verify with |
| connection refused | Wrong | Check |
| | PromQL/Datadog-style query has a syntax error | Test the query in the UI metrics explorer; ensure |
| Alert never fires | Query returns no data, or threshold is unreachable | Run |
| Too many alerts (flapping) | Evaluation window too short, missing recovery thresholds | Increase window to |
| | Agent not reporting, or label filter excludes all hosts | Verify the agent is alive ( |
| 429 Too Many Requests | Bulk monitor creation hit rate limit | Add |
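
For the rate-limit row above, one common mitigation is to retry with an increasing pause between attempts. The sketch below is a generic shell helper, not an oodle feature: the name `retry_with_backoff` is hypothetical, and it assumes the wrapped command exits non-zero when it is rate limited. It cannot tell a 429 apart from other failures, so it retries on any non-zero exit.

```shell
#!/bin/sh
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Assumption: the command exits non-zero when the request is rejected.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command in the current shell; stop on the first success.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    # Double the pause before the next attempt.
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

During bulk creation, each call could be wrapped as `retry_with_backoff 5 1 oodle monitors create -f "$file"`, which would wait 1, 2, 4, and 8 seconds between the five attempts. Because the helper retries every failure, a persistent error such as a rejected payload only surfaces after the last attempt.
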