oodle-monitors

Oodle Monitors — CRUD and Alerting

This skill teaches the agent to build, validate, and update Oodle monitors so that alerts are actionable, scoped, and free from flapping.

Prerequisites

```bash
# Install + configure (see oodle-cli skill)
brew install oodle-ai/oodle/oodle
oodle configure

# or
export OODLE_API_KEY=<key>
export OODLE_INSTANCE=<instance>
export OODLE_DEPLOYMENT=<url>
```

Confirm authentication and that at least one monitor list call succeeds before creating new monitors:

```bash
oodle monitors list -o json | jq 'length'
```

Command Execution Order

Before running any oodle command:

1. Check whether the required resource ID or name is already in context.
2. If not, run the discovery command (e.g., `oodle monitors list -o json`).
3. If the result is ambiguous, ask the user to confirm before proceeding.
4. Run the target command with the resolved ID.

Do not run speculative commands (e.g., do not `delete` without first `get`-ing the resource).
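
The ordering above can be wrapped in a small shell function so that a destructive command never runs before a successful `get`. This is a minimal sketch, not part of the Oodle CLI: the function name `safe_delete_monitor` is hypothetical, and it assumes `oodle monitors get` exits non-zero when the ID does not exist.

```bash
# Hypothetical helper (not an oodle subcommand): delete only after `get` succeeds.
# Assumes `oodle monitors get` exits non-zero for an unknown ID.
safe_delete_monitor() {
  if [ -z "$1" ]; then
    echo "usage: safe_delete_monitor <monitor-id>" >&2
    return 2
  fi
  # Step 1: verify the resource exists before touching it.
  if ! oodle monitors get "$1" -o json > /dev/null; then
    echo "monitor $1 not found; nothing deleted" >&2
    return 1
  fi
  # Step 2: run the target command with the verified ID.
  oodle monitors delete "$1" --force
}
```

Calling `safe_delete_monitor mon_123` then behaves like the verify-then-delete pattern, while an unknown ID stops before any delete is attempted.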

Quick Reference

| Task | Command |
| --- | --- |
| List all monitors | `oodle monitors list -o json` |
| Filter by status | `oodle monitors list --status alert -o json` |
| Filter by labels | `oodle monitors list --labels env=prod,team=platform -o json` |
| Get one monitor | `oodle monitors get <id> -o json` |
| Create from file | `oodle monitors create -f monitor.json` |
| Update from file | `oodle monitors update <id> -f monitor.json` |
| Delete (CI) | `oodle monitors delete <id> --force` |

Common Operations

Listing monitors

```bash
# ✅ CORRECT: JSON output for scripting
oodle monitors list -o json

# ✅ CORRECT: narrow with --status to find only firing alerts
oodle monitors list --status alert -o json

# ✅ CORRECT: narrow with --labels for a specific team
oodle monitors list --labels env=prod,team=platform -o json

# ❌ WRONG: pulling everything then grepping
oodle monitors list | grep CPU
```

Reading a monitor before changing it

```bash
# ✅ CORRECT: fetch the full definition (YAML here), edit, then update
oodle monitors get mon_123 -o yaml > monitor.yaml
$EDITOR monitor.yaml
oodle monitors update mon_123 -f monitor.yaml

# ❌ WRONG: building update payload from memory; overwrites unrelated fields
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":90}}}')
```

Creating a monitor

A complete, valid monitor JSON:

```json
{
  "name": "High CPU on web servers",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
  "message": "CPU above 80% on {{host.name}}. Runbook: https://runbooks.example.com/cpu\n@slack-ops",
  "labels": {"team": "platform", "env": "prod"},
  "options": {
    "thresholds": {
      "critical": 80,
      "critical_recovery": 70,
      "warning": 60,
      "warning_recovery": 50
    }
  }
}
```

```bash
# ✅ CORRECT
oodle monitors create -f monitor.json

# ❌ WRONG: no type, no options.thresholds; monitor will be rejected
oodle monitors create -f <(echo '{"name":"x","query":"y"}')
```

Updating a monitor

```bash
# ✅ CORRECT: get → edit → update
oodle monitors get mon_123 -o json > monitor.json
jq '.options.thresholds.critical = 85' monitor.json > monitor.new.json
oodle monitors update mon_123 -f monitor.new.json

# ❌ WRONG: sending only the changed field; missing fields become null
oodle monitors update mon_123 -f <(echo '{"options":{"thresholds":{"critical":85}}}')
```

Deleting a monitor

```bash
# ✅ CORRECT: verify first, then delete (the delete runs only if the get succeeds)
oodle monitors get mon_123 -o json > /dev/null \
  && oodle monitors delete mon_123 --force

# ❌ WRONG: speculative delete by name match
oodle monitors delete "$(oodle monitors list | grep CPU | head -1 | awk '{print $1}')" --force
```

Best Practices

Use a `last_5m` (or longer) evaluation window, not `last_1m`

Short windows cause alert flapping on normal traffic spikes. Use `last_5m` minimum for production alerts, `last_15m` for noisy metrics.

```bash
# ✅ CORRECT
"query": "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"

# ❌ WRONG: flaps on every brief spike
"query": "avg(last_1m):avg:system.cpu.user{env:prod} by {host} > 80"
```

Scope queries with explicit labels; never use `{*}`

`{*}` matches every series in the system and produces alerts for resources you don't own.

```bash
# ✅ CORRECT: scoped to a specific env + service
"query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"

# ❌ WRONG: alerts on every host in the org
"query": "avg(last_5m):avg:system.cpu.user{*} > 80"
```
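
The window and scope rules above can be checked mechanically before a monitor is created. The following is a minimal sketch, assuming the Datadog-style query strings used in this document. `lint_query` is a hypothetical helper, not an Oodle command, and it only pattern-matches the string rather than parsing the query.

```bash
# Hypothetical pre-flight check: pattern-matches the query string only.
lint_query() {
  # Reject the catch-all selector, which matches every series.
  case "$1" in
    *'{*}'*)
      echo "query is unscoped ({*}); add explicit labels such as {env:prod,service:api}" >&2
      return 1 ;;
  esac
  # Reject the one-minute window, which flaps on brief spikes.
  case "$1" in
    *'(last_1m)'*)
      echo "evaluation window last_1m is too short; use last_5m or longer" >&2
      return 1 ;;
  esac
  return 0
}

lint_query 'avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80' && echo "query ok"
```

A check like this catches only the two patterns named here; it does not confirm that the query is otherwise valid.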

Always set `*_recovery` thresholds below the trigger thresholds

Without recovery thresholds, the monitor stays in `alert` state until the metric drops back below the critical threshold itself, so small oscillations around that value keep the alert active or flapping.

```bash
# ✅ CORRECT: clear recovery band (10pt below trigger)
"thresholds": {"critical": 80, "critical_recovery": 70, "warning": 60, "warning_recovery": 50}

# ❌ WRONG: no recovery values; monitor never cleanly recovers
"thresholds": {"critical": 80, "warning": 60}
```
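
The recovery-band rule is a simple numeric comparison, so it can be verified before the payload is sent. This is a minimal sketch: `check_recovery_band` is a hypothetical helper that takes the trigger and recovery values as arguments, and extracting them from `monitor.json` is left to the caller.

```bash
# Hypothetical helper: succeeds only when the recovery value sits below its trigger.
# Usage: check_recovery_band <trigger> <recovery>
check_recovery_band() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "missing threshold: trigger='$1' recovery='$2'" >&2
    return 1
  fi
  # awk handles decimal thresholds, which shell integer tests cannot compare.
  awk -v trigger="$1" -v recovery="$2" 'BEGIN { exit !(recovery + 0 < trigger + 0) }'
}

check_recovery_band 80 70 && echo "critical band ok"
check_recovery_band 60 50 && echo "warning band ok"
```

A missing recovery value is treated as a failure, which matches the "no recovery values" case shown above.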

Put a runbook URL and an `@notifier` handle in `message`

Alerts without an action are noise. Every monitor message must answer "what do I do?" and "who is paged?".

```bash
# ✅ CORRECT
"message": "CPU above 80% on {{host.name}} (env=prod, service=api).\nRunbook: https://runbooks.example.com/cpu\n@slack-ops @pagerduty-platform"

# ❌ WRONG: no actionable content, no routing
"message": "CPU is high"
```

Tag every monitor with at least `team` and `env` labels

Labels are how notification policies route alerts and how `oodle monitors list --labels ...` filters work.

```bash
# ✅ CORRECT
"labels": {"team": "platform", "env": "prod", "service": "api"}

# ❌ WRONG
"labels": {}
```

Failure Handling

| Error | Cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Invalid or missing API key | Run `oodle configure` or set `OODLE_API_KEY` |
| 404 Not Found | Monitor ID does not exist | Verify with `oodle monitors list -o json` |
| connection refused | Wrong `OODLE_DEPLOYMENT` URL | Check `OODLE_DEPLOYMENT` env var |
| `invalid query` | PromQL/Datadog-style query has a syntax error | Test the query in the UI metrics explorer; ensure `avg(last_5m):` prefix and `by {label}` suffix |
| Alert never fires | Query returns no data, or threshold is unreachable | Run `oodle metrics list --match <metric>` to confirm the metric exists; lower the threshold temporarily and re-check |
| Too many alerts (flapping) | Evaluation window too short, missing recovery thresholds | Increase window to `last_5m` or `last_15m`; set `critical_recovery` and `warning_recovery` |
| `no data` alerts | Agent not reporting, or label filter excludes all hosts | Verify the agent is alive (`oodle metrics list --match up`); widen the label filter to confirm any series match |
| 429 Too Many Requests | Bulk monitor creation hit rate limit | Add `--retries 3`, throttle to <10 creates per second |
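
For the 429 row above, bulk creation can be paced with a simple loop. This is a minimal sketch: the function name `create_monitors_throttled` and the `OODLE_CREATE_DELAY` variable are hypothetical, the `--retries 3` flag is taken from the table, and the 0.2 s default pause (roughly 5 creates per second) is an assumed value chosen to stay under the stated limit. Fractional `sleep` works with GNU and BSD `sleep` but is not guaranteed by POSIX.

```bash
# Hypothetical helper: create monitors one file at a time with a pause between calls.
# Usage: create_monitors_throttled monitors/*.json
create_monitors_throttled() {
  local failed=0
  local file
  for file in "$@"; do
    if ! oodle monitors create -f "$file" --retries 3; then
      echo "create failed for $file" >&2
      failed=$((failed + 1))
    fi
    # Assumed pacing: 0.2 s between calls keeps the rate near 5 creates per second.
    sleep "${OODLE_CREATE_DELAY:-0.2}"
  done
  # Non-zero exit status if any file could not be created.
  [ "$failed" -eq 0 ]
}
```

The loop keeps going after a failed file and reports the failure through its exit status, so one rejected payload does not block the rest of the batch.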
