oodle-monitors
Oodle Monitors: CRUD and Alerting
This skill teaches the agent to build, validate, and update Oodle monitors so that alerts are actionable, scoped, and free from flapping.
Prerequisites
```bash
# Install + configure (see oodle-cli skill)
brew install oodle-ai/oodle/oodle
oodle configure

# or
export OODLE_API_KEY=<key>
export OODLE_INSTANCE=<instance>
export OODLE_DEPLOYMENT=<url>
```

Confirm authentication and that at least one monitor list call succeeds before creating new monitors:

```bash
oodle monitors list -o json | jq 'length'
```

Command Execution Order
Before running any oodle command:
- Check whether the required resource ID or name is already in context.
- If not, run the discovery command (e.g., `oodle monitors list -o json`).
- If the result is ambiguous, ask the user to confirm before proceeding.
- Run the target command with the resolved ID.
- Do not run speculative commands (e.g., do not `delete` without first `get`-ing the resource).
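The "ask before proceeding when the result is ambiguous" step can be sketched as a small shell helper. This is only an illustration, not part of the oodle CLI: `pick_unique_id` is a hypothetical function that reads candidate monitor IDs from standard input and succeeds only when exactly one is present, so an ambiguous or empty match stops the script instead of acting on a guess.

```bash
# Hypothetical helper (not an oodle command): read candidate IDs, one per
# line, and print the ID only when exactly one candidate was found.
pick_unique_id() {
  local count=0 first="" line
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -z "$line" ]; then
      continue
    fi
    count=$((count + 1))
    if [ "$count" -eq 1 ]; then
      first=$line
    fi
  done
  if [ "$count" -eq 1 ]; then
    printf '%s\n' "$first"
    return 0
  fi
  # Zero or several matches: stop and let the user confirm the target.
  echo "expected exactly one monitor, found $count; ask the user to confirm" >&2
  return 1
}
```

One possible use, assuming the list output is a JSON array whose entries carry `id` and `name` fields (both field names are assumptions): `oodle monitors list -o json | jq -r '.[] | select(.name == "High CPU on web servers") | .id' | pick_unique_id`.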
Quick Reference
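| Task | Command |
|---|---|
| List all monitors | `oodle monitors list -o json` |
| Filter by status | `oodle monitors list --status alert -o json` |
| Filter by labels | `oodle monitors list --labels env=prod,team=platform -o json` |
| Get one monitor | `oodle monitors get mon_123 -o json` |
| Create from file | `oodle monitors create -f monitor.json` |
| Update from file | `oodle monitors update mon_123 -f monitor.json` |
| Delete (CI) | `oodle monitors delete mon_123 --force` |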
Common Operations
Listing monitors
```bash
# ✅ CORRECT: JSON output for scripting
oodle monitors list -o json

# ✅ CORRECT: narrow with --status to find only firing alerts
oodle monitors list --status alert -o json

# ✅ CORRECT: narrow with --labels for a specific team
oodle monitors list --labels env=prod,team=platform -o json

# ❌ WRONG: pulling everything then grepping
oodle monitors list | grep CPU
```

Reading a monitor before changing it
```bash
# ✅ CORRECT: fetch the full definition (YAML here), edit, then update
oodle monitors get mon_123 -o yaml > monitor.yaml
$EDITOR monitor.yaml
oodle monitors update mon_123 -f monitor.yaml

# ❌ WRONG: building the update payload from memory; overwrites unrelated fields
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":90}}}')
```

Creating a monitor
A complete, valid monitor JSON:

```json
{
  "name": "High CPU on web servers",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
  "message": "CPU above 80% on {{host.name}}. Runbook: https://runbooks.example.com/cpu\n@slack-ops",
  "labels": {"team": "platform", "env": "prod"},
  "options": {
    "thresholds": {
      "critical": 80,
      "critical_recovery": 70,
      "warning": 60,
      "warning_recovery": 50
    }
  }
}
```

```bash
# ✅ CORRECT
oodle monitors create -f monitor.json

# ❌ WRONG: no type and no options.thresholds, so the monitor will be rejected
oodle monitors create -f <(echo '{"name":"x","query":"y"}')
```
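Because a payload without `type` or `options.thresholds` is rejected, a pre-flight check can catch the problem before the CLI is called. The sketch below is an illustration under assumptions: `preflight_monitor` is a hypothetical helper, it relies on `jq`, and it checks only the four fields shown in the example above rather than the full monitor schema, which this document does not specify.

```bash
# Hypothetical pre-flight check (not an oodle command). Requires jq.
# Succeeds only when the file has string name, type, and query fields
# and an options.thresholds object.
preflight_monitor() {
  local file=$1
  jq -e '
    (.name | type == "string") and
    (.type | type == "string") and
    (.query | type == "string") and
    (.options.thresholds | type == "object")
  ' "$file" > /dev/null
}
```

A possible use is `preflight_monitor monitor.json && oodle monitors create -f monitor.json`, so the create call runs only when the check passes.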
Updating a monitor

```bash
# ✅ CORRECT: get → edit → update
oodle monitors get mon_123 -o json > monitor.json
jq '.options.thresholds.critical = 85' monitor.json > monitor.new.json
oodle monitors update mon_123 -f monitor.new.json

# ❌ WRONG: sending only the changed field; missing fields become null
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":85}}}')
```

Deleting a monitor
```bash
# ✅ CORRECT: verify first, then delete only if the get succeeds
oodle monitors get mon_123 -o json > /dev/null \
  && oodle monitors delete mon_123 --force

# ❌ WRONG: speculative delete by name match
oodle monitors delete "$(oodle monitors list | grep CPU | head -1 | awk '{print $1}')" --force
```
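The verify-then-delete pattern can be wrapped in a function so that the delete never runs when the lookup fails. This is a minimal sketch: `safe_delete` is a hypothetical name, and it assumes `oodle monitors get` exits non-zero when the ID does not exist, which this document implies but does not state.

```bash
# Hypothetical wrapper around the get and delete commands: delete only
# after `get` confirms that the monitor ID resolves.
safe_delete() {
  local id=$1
  if [ -z "$id" ]; then
    echo "usage: safe_delete <monitor-id>" >&2
    return 2
  fi
  if ! oodle monitors get "$id" -o json > /dev/null; then
    echo "monitor $id not found; nothing deleted" >&2
    return 1
  fi
  oodle monitors delete "$id" --force
}
```

Calling `safe_delete mon_123` then behaves like the correct example, while an empty or unknown ID stops before any delete is attempted.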
Best Practices

Use a `last_5m` (or longer) evaluation window, not `last_1m`

Short windows cause alert flapping on normal traffic spikes. Use `last_5m` as the minimum for production alerts and `last_15m` for noisy metrics.

```bash
# ✅ CORRECT
"query": "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"

# ❌ WRONG: flaps on every brief spike
"query": "avg(last_1m):avg:system.cpu.user{env:prod} by {host} > 80"
```
Scope queries with explicit labels; never use `{*}`

```bash
# ✅ CORRECT: scoped to a specific env + service
"query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"

# ❌ WRONG: alerts on every host in the org
"query": "avg(last_5m):avg:system.cpu.user{*} > 80"
```
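Both query rules (evaluation window and scope) lend themselves to a quick lint before a monitor is created. The following is a rough sketch, not an oodle feature: `lint_query` is a hypothetical function, and its patterns match only the query syntax used in this document's examples.

```bash
# Hypothetical lint (not an oodle command): flag a last_1m window and an
# unscoped {*} selector in a query string.
lint_query() {
  local query=$1 failed=0
  case $query in
    *"(last_1m)"*)
      echo "window is last_1m; use last_5m or longer" >&2
      failed=1
      ;;
  esac
  case $query in
    *"{*}"*)
      echo "query is unscoped ({*}); add explicit labels" >&2
      failed=1
      ;;
  esac
  return "$failed"
}
```

It could be combined with `jq`, for example `lint_query "$(jq -r '.query' monitor.json)"`, to check a monitor file before running `oodle monitors create`.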
Always set `*_recovery` thresholds below the trigger thresholds

Without recovery thresholds, the monitor leaves the `alert` state only when the metric drops below the critical threshold itself, so small oscillations around that value keep re-triggering the alert.

```bash
# ✅ CORRECT: clear recovery band (10pt below trigger)
"thresholds": {"critical": 80, "critical_recovery": 70, "warning": 60, "warning_recovery": 50}

# ❌ WRONG: no recovery values; monitor never cleanly recovers
"thresholds": {"critical": 80, "warning": 60}
```
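The recovery-band rule can also be checked mechanically. This sketch makes several assumptions: `check_recovery_band` is a hypothetical helper, it needs `jq`, and it treats the monitor as one that fires when the metric rises above the threshold, as in the examples here (a monitor that fires below a threshold would need the comparison reversed).

```bash
# Hypothetical check (not an oodle command). Requires jq.
# critical must have a recovery value strictly below it; warning, when set,
# must have one too.
check_recovery_band() {
  local file=$1
  jq -e '
    .options.thresholds
    | (.critical_recovery != null and .critical_recovery < .critical)
      and (.warning == null
           or (.warning_recovery != null and .warning_recovery < .warning))
  ' "$file" > /dev/null
}
```

Running `check_recovery_band monitor.json` before an update gives a non-zero exit status when a recovery value is missing or sits at or above its trigger.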
Put a runbook URL and an `@notifier` handle in `message`

Alerts without an action are noise. Every monitor message must answer "what do I do?" and "who is paged?".

```bash
# ✅ CORRECT
"message": "CPU above 80% on {{host.name}} (env=prod, service=api).\nRunbook: https://runbooks.example.com/cpu\n@slack-ops @pagerduty-platform"

# ❌ WRONG: no actionable content, no routing
"message": "CPU is high"
```
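A message can be screened for the two required ingredients with plain shell pattern matching. Treat this as a heuristic sketch: `message_is_actionable` is a hypothetical function, it accepts any URL as a runbook link, and it recognizes a handle only as `@` followed by a lowercase letter at the start of the message or after whitespace.

```bash
# Hypothetical lint (not an oodle command): a message passes when it
# contains a URL and at least one @handle.
message_is_actionable() {
  local message=$1
  case $message in
    *http://*|*https://*) ;;
    *) echo "message has no runbook URL" >&2; return 1 ;;
  esac
  case $message in
    "@"[a-z]*|*[[:space:]]"@"[a-z]*) ;;
    *) echo "message has no @notifier handle" >&2; return 1 ;;
  esac
}
```

The function expects the decoded message text, so `jq -r '.message' monitor.json` (which turns `\n` into real newlines) is a suitable way to feed it.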
Tag every monitor with at least `team` and `env` labels

Labels are how notification policies route alerts and how `oodle monitors list --labels ...` filters work.

```bash
# ✅ CORRECT
"labels": {"team": "platform", "env": "prod", "service": "api"}

# ❌ WRONG
"labels": {}
```
Failure Handling

| Error | Cause | Fix |
|---|---|---|
| 401 Unauthorized | Invalid or missing API key | Run `oodle configure` |
| 404 Not Found | Monitor ID does not exist | Verify with `oodle monitors list -o json` |
| connection refused | Wrong | Check |
| | PromQL/Datadog-style query has a syntax error | Test the query in the UI metrics explorer; ensure |
| Alert never fires | Query returns no data, or threshold is unreachable | Run |
| Too many alerts (flapping) | Evaluation window too short, missing recovery thresholds | Increase window to `last_5m` or longer and set `*_recovery` thresholds |
| | Agent not reporting, or label filter excludes all hosts | Verify the agent is alive ( |
| 429 Too Many Requests | Bulk monitor creation hit rate limit | Add |
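For the rate-limit case, one common approach is to retry with an increasing pause between attempts. The sketch below is not taken from the oodle documentation: `retry_with_backoff` is a hypothetical helper, and because this document does not say how the CLI reports a 429, it retries on any non-zero exit status rather than on rate limits specifically.

```bash
# Hypothetical helper (not an oodle command): run a command up to
# max_attempts times, doubling the pause (in seconds) after each failure.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

During bulk creation this might be used as `retry_with_backoff 5 1 oodle monitors create -f monitor.json`, which waits 1, 2, 4, and 8 seconds between the five attempts.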