job-babysitter

Job Babysitter

Purpose

Stop manually polling long-running background jobs. Instead of dozens-to-hundreds of `ls -lh` and `ps` checks while guessing at completion, start one background watcher that detects the terminal state via plateau heuristics, then routes a verdict (done, needs-attention, or blocked) with the exact next command.

A night-shift nurse for background jobs: it checks vitals on a schedule and escalates only when something is actually wrong.

When to use

Use when a job will run long enough that babysitting it by hand wastes attention:

Media encodes / transcodes (ffmpeg, video-transcribe, audio extraction)
Embedding or vector-DB builds (qmd embed, index builds)
Batch agent / LLM pipelines run in the background
Browser / scrape daemons (real-browser, agent-browser) prone to hanging

Do NOT use for jobs that finish in seconds, or where a single `Bash` call already returns the result.

Core principle: stay thin, lean on the harness

This skill orchestrates Claude Code's own primitives — do not reimplement them:

Start the watcher with `run_in_background: true`. When it exits, the harness re-invokes the agent automatically, so no manual polling loop is needed.
The watcher (`scripts/watch_job.py`) owns the deterministic part: poll with backoff, detect plateau, distinguish done from stuck, emit a verdict JSON.
The skill's value is the per-job-type heuristics, the safe-recovery playbook, and notification routing, all in `references/playbook.md`.

Workflow

1. Identify the job's signals

Determine what can be watched, in order of reliability:

PID: the process ID (most reliable completion signal). Get it from the job's launch, `pgrep`, or `ps`.
Output file: a file that grows as the job progresses (e.g. ffmpeg target).
Log file: a log that gets appended (e.g. an embed progress log).

Read `references/playbook.md` § "Completion heuristics by job type" to pick flags for the specific job type (ffmpeg, embed, batch, browser).

2. Launch the watcher in the background

Run with `run_in_background: true`. Always pass `--pid` when known; add file/log signals as corroboration. Write the verdict to a known path.

```bash
scripts/watch_job.py \
  --label "lab05 stream encode" \
  --pid <PID> \
  --output-file /path/to/output.mp4 \
  --plateau-bytes 65536 --plateau-polls 5 --stuck-after 120 \
  --max-wait 7200 \
  --verdict-out /tmp/job-babysitter-<label>.json
```

The watcher prints a one-line JSON heartbeat per poll (tail it for live progress) and writes the final verdict JSON to the `--verdict-out` path on exit.

Tuning lives in the playbook; sensible defaults: `--interval 10` (backs off to 60), `--plateau-polls 4`, `--stuck-after 300`, `--max-wait 7200`.
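
To illustrate what "`--interval 10` (backs off to 60)" could look like, here is a small sketch of a capped backoff schedule. The doubling factor and the generator name `poll_intervals` are assumptions for illustration; the source only states the starting interval and the ceiling.

```python
from itertools import islice
from typing import Iterator


def poll_intervals(start: float = 10.0, cap: float = 60.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield sleep intervals that grow geometrically from start and never exceed cap."""
    interval = start
    while True:
        yield min(interval, cap)
        # factor=2.0 is an assumed growth rate, not a documented watcher setting
        interval = min(interval * factor, cap)


print(list(islice(poll_intervals(), 5)))  # [10.0, 20.0, 40.0, 60.0, 60.0]
```

A capped schedule like this keeps early polls responsive for short jobs while bounding the polling cost of a multi-hour run.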

3. On watcher exit, read the verdict and route it

The harness re-invokes the agent when the background watcher finishes. Read the verdict JSON. It has `status` ∈ {done, needs-attention, blocked}, a `reason`, `suggested_next`, elapsed time, and final size.

done → verify the output is real (see the job-type "Done check" in the playbook, e.g. `ffprobe` for media, count match for embeds), then proceed with the original task.
needs-attention → the job plateaued while still alive (possibly wedged). Follow the recovery playbook: diagnose read-only FIRST. Never kill or run destructive recovery (pkill, WAL checkpoint, VACUUM) without asking the user.
blocked → the watcher gave up after `--max-wait`. Report honestly: "gave up waiting" ≠ "failed". Offer to re-check or extend the ceiling.
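
The three-way routing above can be sketched as a small dispatcher. This is an illustration under assumptions: the keys `status`, `reason`, and `suggested_next` come from the description above, but the function name `route_verdict` and the wording of the returned strings are hypothetical.

```python
import json
from pathlib import Path


def route_verdict(verdict_path: str) -> str:
    """Read the verdict JSON and return a one-line description of the next step."""
    verdict = json.loads(Path(verdict_path).read_text())
    status = verdict.get("status")
    reason = verdict.get("reason", "")
    suggested = verdict.get("suggested_next", "")

    if status == "done":
        # The watcher's "done" is still verified before the task continues.
        return f"verify output, then continue: {suggested}"
    if status == "needs-attention":
        # Read-only diagnosis only; destructive recovery needs user approval.
        return f"diagnose read-only and ask the user before recovery ({reason})"
    if status == "blocked":
        # Giving up on waiting is reported as such, never as a failure.
        return f"gave up waiting, not a failure; offer to re-check or extend ({reason})"
    return f"unrecognized status {status!r}; report it as-is"
```

Falling through to an explicit "unrecognized status" branch avoids silently treating an unexpected verdict as success.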

4. Notify per the chosen channel

Default to in-session resume. If the user picked a channel (Telegram, voice/TTS, desktop notification), route per `references/playbook.md` § "Notification routing". Always include the status emoji, label, elapsed time, and the exact next command.
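
A notification carrying the four required pieces (status emoji, label, elapsed time, next command) might be assembled as below. The emoji mapping, the message layout, and the function names are assumptions for illustration; the playbook may define different ones.

```python
# Assumed emoji per status; the playbook may specify different ones.
STATUS_EMOJI = {"done": "✅", "needs-attention": "⚠️", "blocked": "⛔"}


def format_elapsed(seconds: float) -> str:
    """Render seconds as e.g. '1h02m05s' or '3m07s'."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}h{minutes:02d}m{secs:02d}s"
    return f"{minutes}m{secs:02d}s"


def format_notification(status: str, label: str,
                        elapsed_seconds: float, next_command: str) -> str:
    """Build a single-line message suitable for any text channel."""
    emoji = STATUS_EMOJI.get(status, "❔")
    return (f"{emoji} {label}: {status} after "
            f"{format_elapsed(elapsed_seconds)}. Next: {next_command}")


print(format_notification("done", "lab05 stream encode", 3725,
                          "ffprobe /path/to/output.mp4"))
# ✅ lab05 stream encode: done after 1h02m05s. Next: ffprobe /path/to/output.mp4
```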

Guardrails (non-negotiable)

Never act on a single slow poll. "Stuck" requires plateau AND elapsed past `--stuck-after`; the watcher already enforces this before returning needs-attention.
Ask before any destructive recovery: `pkill`, `kill`, WAL checkpoint, `VACUUM`, daemon restart. Diagnose read-only first.
Report honestly. Distinguish done from "gave up waiting" from "wedged". Never imply a success the watcher did not observe.
Poll with backoff, not tight loops. The watcher handles this; never wrap it in a manual fast-polling loop.

Resources

`scripts/watch_job.py`: background watcher with plateau detection, stuck-vs-done logic, and verdict JSON. Stdlib only, Python 3.11+.
`references/playbook.md`: per-job-type completion heuristics, the safe-recovery table, and notification routing. Load when picking watcher flags or handling a needs-attention/blocked verdict.
