oodle-monitors

Oodle Monitors — CRUD and Alerting

This skill teaches the agent to build, validate, and update Oodle monitors so that alerts are actionable, scoped, and free from flapping.

Prerequisites

```bash
# Install + configure (see oodle-cli skill)
brew install oodle-ai/oodle/oodle
oodle configure

# or
export OODLE_API_KEY=<key>
export OODLE_INSTANCE=<instance>
export OODLE_DEPLOYMENT=<url>
```

Confirm authentication and that at least one monitor list call succeeds before creating new monitors:

```bash
oodle monitors list -o json | jq 'length'
```

Command Execution Order

Before running any oodle command:
  1. Check whether the required resource ID or name is already in context.
  2. If not, run the discovery command (e.g., `oodle monitors list -o json`).
  3. If the result is ambiguous, ask the user to confirm before proceeding.
  4. Run the target command with the resolved ID.
  5. Do not run speculative commands (e.g., do not `delete` without first `get`-ing the resource).
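
The resolve-then-act order can be sketched as a small shell helper. This is a minimal sketch, not part of the oodle CLI: `resolve_monitor_id` is a hypothetical name, and it assumes that `oodle monitors list -o json` prints a JSON array whose objects carry `id` and `name` fields, which this skill does not confirm.

```bash
# Hypothetical helper: resolve a monitor name to exactly one ID, or fail.
# Assumption: the list output is a JSON array of objects with "id" and "name".
resolve_monitor_id() {
  local name="$1" ids count
  ids=$(oodle monitors list -o json |
    jq -r --arg n "$name" '.[] | select(.name == $n) | .id') || return 1
  # Count non-empty lines; grep -c prints 0 (and exits 1) when none match.
  count=$(printf '%s\n' "$ids" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected 1 monitor named '$name', found $count; ask the user" >&2
    return 1
  fi
  printf '%s\n' "$ids"
}

# Usage (not executed here):
#   id=$(resolve_monitor_id "High CPU on web servers") &&
#     oodle monitors get "$id" -o json
```

Failing on zero or multiple matches, rather than picking the first one, is what keeps step 3 (ask the user when the result is ambiguous) from being skipped.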

Quick Reference

| Task | Command |
|---|---|
| List all monitors | `oodle monitors list -o json` |
| Filter by status | `oodle monitors list --status alert -o json` |
| Filter by labels | `oodle monitors list --labels env=prod,team=platform -o json` |
| Get one monitor | `oodle monitors get <id> -o json` |
| Create from file | `oodle monitors create -f monitor.json` |
| Update from file | `oodle monitors update <id> -f monitor.json` |
| Delete (CI) | `oodle monitors delete <id> --force` |

Common Operations

Listing monitors

```bash
# ✅ CORRECT: JSON output for scripting
oodle monitors list -o json

# ✅ CORRECT: narrow with --status to find only firing alerts
oodle monitors list --status alert -o json

# ✅ CORRECT: narrow with --labels for a specific team
oodle monitors list --labels env=prod,team=platform -o json

# ❌ WRONG: pulling everything then grepping
oodle monitors list | grep CPU
```

Reading a monitor before changing it

```bash
# ✅ CORRECT: fetch the full definition (YAML here), edit, then update
oodle monitors get mon_123 -o yaml > monitor.yaml
$EDITOR monitor.yaml
oodle monitors update mon_123 -f monitor.yaml

# ❌ WRONG: building the update payload from memory; overwrites unrelated fields
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":90}}}')
```

Creating a monitor

A complete, valid monitor JSON:

```json
{
  "name": "High CPU on web servers",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
  "message": "CPU above 80% on {{host.name}}. Runbook: https://runbooks.example.com/cpu\n@slack-ops",
  "labels": {"team": "platform", "env": "prod"},
  "options": {
    "thresholds": {
      "critical": 80,
      "critical_recovery": 70,
      "warning": 60,
      "warning_recovery": 50
    }
  }
}
```

```bash
# ✅ CORRECT
oodle monitors create -f monitor.json

# ❌ WRONG: no `type`, no `options.thresholds`; monitor will be rejected
oodle monitors create -f <(echo '{"name":"x","query":"y"}')
```
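
A local pre-flight check can catch the rejected payload before it is sent. The sketch below is illustrative only: `validate_monitor_file` is a hypothetical name, and the required-field list (`name`, `type`, `query`, `options.thresholds`) is inferred from the examples in this section, so the server's actual validation may be stricter or different.

```bash
# Hypothetical pre-flight check. The required fields are inferred from the
# examples in this document and may not match the server's validation.
validate_monitor_file() {
  local file="$1"
  # jq -e exits non-zero when the expression is false or the file is not JSON.
  jq -e '
    (.name  | type == "string" and length > 0) and
    (.type  | type == "string" and length > 0) and
    (.query | type == "string" and length > 0) and
    (.options.thresholds | type == "object")
  ' "$file" > /dev/null
}

# Usage (not executed here):
#   validate_monitor_file monitor.json && oodle monitors create -f monitor.json
```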

Updating a monitor

```bash
# ✅ CORRECT: get → edit → update
oodle monitors get mon_123 -o json > monitor.json
jq '.options.thresholds.critical = 85' monitor.json > monitor.new.json
oodle monitors update mon_123 -f monitor.new.json

# ❌ WRONG: sending only the changed field; missing fields become null
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":85}}}')
```
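
Before sending the edited file, it can help to see exactly which fields changed. The helper below is a sketch under that assumption; `preview_monitor_edit` is a hypothetical name and works purely on local files, so it makes no claim about how `oodle monitors update` treats the payload.

```bash
# Hypothetical helper: apply a jq filter to a saved monitor definition, write
# the result to a new file, and print a diff against the original.
preview_monitor_edit() {
  local src="$1" filter="$2" out="$3" sorted
  sorted=$(mktemp) || return 1
  # -S sorts keys in both files so the diff shows only real changes.
  if ! jq -S . "$src" > "$sorted" || ! jq -S "$filter" "$src" > "$out"; then
    rm -f "$sorted"
    return 1
  fi
  # diff exits 1 when the files differ, which is the expected case here.
  diff "$sorted" "$out" || true
  rm -f "$sorted"
}

# Usage (not executed here):
#   oodle monitors get mon_123 -o json > monitor.json
#   preview_monitor_edit monitor.json '.options.thresholds.critical = 85' monitor.new.json
#   oodle monitors update mon_123 -f monitor.new.json
```

If the diff shows more than the one intended line, the edit touched something it should not have, and the update should not be sent.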

Deleting a monitor

```bash
# ✅ CORRECT: verify first, then delete
# (&& skips the delete if the get fails, assuming get exits non-zero for a missing ID)
oodle monitors get mon_123 -o json > /dev/null &&
  oodle monitors delete mon_123 --force

# ❌ WRONG: speculative delete by name match
oodle monitors delete "$(oodle monitors list | grep CPU | head -1 | awk '{print $1}')" --force
```

Best Practices

Use a `last_5m` (or longer) evaluation window, not `last_1m`

Short windows cause alert flapping on normal traffic spikes. Use at least `last_5m` for production alerts and `last_15m` for noisy metrics.

```bash
# ✅ CORRECT
"query": "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"

# ❌ WRONG: flaps on every brief spike
"query": "avg(last_1m):avg:system.cpu.user{env:prod} by {host} > 80"
```
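
The window rule is easy to check mechanically before a monitor is created. The function below is a minimal sketch: `window_at_least_5m` is a hypothetical name, and it only understands the `(last_<N>m)` form used in this document, not hour-based or other window syntaxes the query language may support.

```bash
# Hypothetical lint: succeed only if the query's evaluation window is at
# least 5 minutes. Only the (last_<N>m) form is recognized.
window_at_least_5m() {
  local minutes
  # Pull N out of "(last_Nm)"; prints nothing if the pattern is absent.
  minutes=$(printf '%s\n' "$1" | sed -n 's/.*(last_\([0-9][0-9]*\)m).*/\1/p')
  [ -n "$minutes" ] && [ "$minutes" -ge 5 ]
}

# Usage (not executed here):
#   window_at_least_5m "$(jq -r '.query' monitor.json)" || echo "window too short" >&2
```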

Scope queries with explicit labels; never use `{*}`

`{*}` matches every series in the system and produces alerts for resources you don't own.

```bash
# ✅ CORRECT: scoped to a specific env + service
"query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"

# ❌ WRONG: alerts on every host in the org
"query": "avg(last_5m):avg:system.cpu.user{*} > 80"
```

Always set `*_recovery` thresholds below the trigger thresholds

Without recovery thresholds the monitor leaves the `alert` state only when the metric drops back below the critical threshold itself, so small oscillations around that value keep the alert from clearing cleanly.

```bash
# ✅ CORRECT: clear recovery band (10pt below trigger)
"thresholds": {"critical": 80, "critical_recovery": 70, "warning": 60, "warning_recovery": 50}

# ❌ WRONG: no recovery values; monitor never cleanly recovers
"thresholds": {"critical": 80, "warning": 60}
```
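
The recovery-band rule can be checked with a short comparison. This is a sketch, not an oodle feature: `recovery_below_trigger` is a hypothetical name, and it assumes an "above threshold" monitor like the CPU example, where recovery must sit strictly below the trigger. A monitor that alerts when a value drops too low would need the comparison reversed.

```bash
# Hypothetical lint: succeed only if a recovery value exists and is strictly
# below its trigger. Assumes an "alert when above" monitor.
recovery_below_trigger() {
  local trigger="$1" recovery="$2"
  # A missing recovery value (empty, or "null" as printed by jq -r) fails.
  case "$recovery" in '' | null) return 1 ;; esac
  # awk handles fractional thresholds, which [ -lt ] cannot compare.
  awk -v t="$trigger" -v r="$recovery" 'BEGIN { exit !(r + 0 < t + 0) }'
}

# Usage (not executed here):
#   recovery_below_trigger \
#     "$(jq -r '.options.thresholds.critical' monitor.json)" \
#     "$(jq -r '.options.thresholds.critical_recovery' monitor.json)"
```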

Put a runbook URL and an `@notifier` handle in `message`

Alerts without an action are noise. Every monitor message must answer "what do I do?" and "who is paged?".

```bash
# ✅ CORRECT
"message": "CPU above 80% on {{host.name}} (env=prod, service=api).\nRunbook: https://runbooks.example.com/cpu\n@slack-ops @pagerduty-platform"

# ❌ WRONG: no actionable content, no routing
"message": "CPU is high"
```
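
Both questions can be approximated with two pattern checks on the message text. The function below is a rough sketch: `message_is_actionable` is a hypothetical name, and it only tests for the presence of some URL and some `@handle`; it cannot tell whether the URL is a real runbook or whether the handle is a configured notifier.

```bash
# Hypothetical lint: succeed only if the message contains a URL (runbook) and
# at least one @handle at the start of a line or after whitespace (routing).
message_is_actionable() {
  printf '%s\n' "$1" | grep -Eq 'https?://[^[:space:]]+' &&
    printf '%s\n' "$1" | grep -Eq '(^|[[:space:]])@[A-Za-z0-9_-]+'
}

# Usage (not executed here); jq -r turns the JSON "\n" escapes into newlines:
#   message_is_actionable "$(jq -r '.message' monitor.json)"
```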

Tag every monitor with at least `team` and `env` labels

Labels are how notification policies route alerts and how `oodle monitors list --labels ...` filters work.

```bash
# ✅ CORRECT
"labels": {"team": "platform", "env": "prod", "service": "api"}

# ❌ WRONG
"labels": {}
```
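
A label check fits in one jq expression. As with the other lints here, this is an illustrative sketch: `has_required_labels` is a hypothetical name, and it assumes `labels` is a flat object of string values as in the examples.

```bash
# Hypothetical lint: succeed only if the file has non-empty team and env labels.
has_required_labels() {
  # `// ""` maps a missing label to the empty string so both cases fail alike.
  jq -e '((.labels.team // "") != "") and ((.labels.env // "") != "")' \
    "$1" > /dev/null
}

# Usage (not executed here):
#   has_required_labels monitor.json || echo "add team and env labels" >&2
```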

Failure Handling

| Error | Cause | Fix |
|---|---|---|
| `401 Unauthorized` | Invalid or missing API key | Run `oodle configure` or set `OODLE_API_KEY` |
| `404 Not Found` | Monitor ID does not exist | Verify with `oodle monitors list -o json` |
| `connection refused` | Wrong `OODLE_DEPLOYMENT` URL | Check `OODLE_DEPLOYMENT` env var |
| `invalid query` | PromQL/Datadog-style query has a syntax error | Test the query in the UI metrics explorer; ensure `avg(last_5m):` prefix and `by {label}` suffix |
| Alert never fires | Query returns no data, or threshold is unreachable | Run `oodle metrics list --match <metric>` to confirm the metric exists; lower the threshold temporarily and re-check |
| Too many alerts (flapping) | Evaluation window too short, missing recovery thresholds | Increase window to `last_5m` or `last_15m`; set `critical_recovery` and `warning_recovery` |
| `no data` alerts | Agent not reporting, or label filter excludes all hosts | Verify the agent is alive (`oodle metrics list --match up`); widen the label filter to confirm any series match |
| `429 Too Many Requests` | Bulk monitor creation hit rate limit | Add `--retries 3`, throttle to <10 creates per second |
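
For bulk creation scripts, the throttling advice in the 429 row can be paired with a generic retry wrapper. The sketch below is separate from the CLI's own `--retries` flag: `with_retry` is a hypothetical name, the `monitors/` directory in the usage comment is made up for illustration, and the wrapper retries on any non-zero exit because it cannot see whether the failure was a 429.

```bash
# Hypothetical wrapper: run a command up to <attempts> times, doubling the
# pause (in whole seconds) between attempts. Retries on ANY non-zero exit.
with_retry() {
  local attempts="$1" delay="$2" try=1
  shift 2
  until "$@"; do
    if [ "$try" -ge "$attempts" ]; then
      echo "giving up after $try attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    try=$((try + 1))
  done
}

# Usage (not executed here); the 1-second pause keeps the loop far below
# 10 creates per second:
#   for f in monitors/*.json; do
#     with_retry 3 1 oodle monitors create -f "$f"
#     sleep 1
#   done
```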

References
