service-health-check

Service Health Check Skill

Operator Context

This skill operates as an operator for service health monitoring workflows, configuring Claude's behavior for structured, read-only health assessment. It implements the Discover-Check-Report pattern — find services, gather health signals, produce actionable output — with deterministic process and health file evaluation.

Hardcoded Behaviors (Always Apply)

  • Read-Only: NEVER restart, stop, or modify services — report only
  • CLAUDE.md Compliance: Read and follow repository CLAUDE.md before checking
  • No Side Effects: Only read process tables, health files, and ports — no writes
  • Structured Output: Always produce machine-parseable health report
  • Evidence-Based Status: Every status determination requires at least one concrete signal (process check, health file, or port probe)

Default Behaviors (ON unless disabled)

  • Process Verification: Check process existence via pgrep/ps before anything else
  • Staleness Detection: Flag health files older than configured threshold (default 300s)
  • Port Listening Check: Verify expected ports are bound when port is configured
  • Actionable Recommendations: Provide specific commands to resolve issues
  • Staleness Threshold Enforcement: Default 300s, configurable per service

Optional Behaviors (OFF unless enabled)

  • Auto-Restart Execution: Run restart commands (requires explicit user flag)
  • Metrics Collection: Gather detailed performance metrics from health files
  • Alert Integration: Format output for monitoring system ingestion
  • Historical Comparison: Compare against previous health snapshots

What This Skill CAN Do

  • Check if processes are running via pgrep/ps
  • Parse JSON health files for status, connection state, and metrics
  • Detect stale health data based on configurable thresholds
  • Verify ports are listening with ss/netstat
  • Produce structured health reports with actionable restart recommendations
  • Evaluate service degradation (disconnected, reconnecting states)

What This Skill CANNOT Do

  • Restart, stop, or modify services (report-only by design)
  • Perform deep log analysis (use systematic-debugging instead)
  • Probe remote health endpoints over HTTP (use endpoint-validator instead)
  • Inspect container internals (basic host-level process checks only)
  • Authenticate against secured health endpoints
  • Skip the Discover phase — services must be identified before checking

Instructions

Phase 1: DISCOVER

Goal: Identify all services to check before running any health probes.
Step 1: Locate service definitions
Search for service configuration in this order:
  1. `services.json` in project root
  2. Docker/docker-compose files for service definitions
  3. systemd unit files or process manager configs
  4. User-provided service specification
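As an illustration, a `services.json` might look like the following. The field names here (`process_pattern`, `health_file`, `port`, `stale_threshold_s`) are hypothetical; they simply mirror the columns of the service manifest this skill builds:

```json
{
  "services": [
    {
      "name": "api-server",
      "process_pattern": "gunicorn.*app:app",
      "health_file": "/tmp/api_health.json",
      "port": 8000,
      "stale_threshold_s": 300
    },
    {
      "name": "cache",
      "process_pattern": "redis-server",
      "port": 6379
    }
  ]
}
```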
Step 2: Build service manifest
For each service, establish:

Service Manifest

| Service | Process Pattern | Health File | Port | Stale Threshold |
| --- | --- | --- | --- | --- |
| api-server | `gunicorn.*app:app` | /tmp/api_health.json | 8000 | 300s |
| worker | `celery.*worker` | /tmp/worker_health.json | - | 300s |
| cache | `redis-server` | - | 6379 | - |

**Step 3: Validate manifest**
- Confirm each process pattern is specific enough to avoid false matches
- Verify health file paths are absolute
- Ensure port numbers are within valid range (1-65535)

**Gate**: Service manifest complete with at least one service. Proceed only when gate passes.
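The Step 3 checks can be sketched as a small read-only helper. This is a sketch only; the entry field names are the hypothetical ones from the `services.json` example, and a real check would adapt to whatever configuration was discovered:

```python
import os.path

def validate_manifest(services):
    """Phase 1 gate: non-empty manifest, absolute health-file paths,
    ports within 1-65535. Returns a list of problems; an empty list
    means the gate passes."""
    problems = []
    if not services:
        problems.append("manifest is empty")
    for svc in services:
        health_file = svc.get("health_file")
        if health_file and not os.path.isabs(health_file):
            problems.append(f"{svc['name']}: health file path is not absolute")
        port = svc.get("port")
        if port is not None and not 1 <= port <= 65535:
            problems.append(f"{svc['name']}: port {port} out of range 1-65535")
    return problems
```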

Phase 2: CHECK

Goal: Gather health signals for every service in the manifest.
Step 1: Check process status
For each service, run process check:
```bash
pgrep -f "<process_pattern>"
```
Record: running (true/false), PIDs, process count.
Step 2: Parse health files (if configured)
Read and parse JSON health files. Evaluate:
  • Does the file exist?
  • Does it parse as valid JSON?
  • How old is the timestamp (staleness)?
  • What status does the service self-report?
  • What is the connection state?
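These health-file checks can be sketched as a read-only Python helper. The function and its returned `signal` values are illustrative, not a defined interface; the `timestamp` field follows the Health File Format Reference at the end of this skill:

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path

def evaluate_health_file(path, stale_threshold_s=300, now=None):
    """Read-only evaluation of one JSON health file: existence, parseability,
    staleness, and the service's self-reported status/connection."""
    now = time.time() if now is None else now
    p = Path(path)
    if not p.exists():
        return {"signal": "missing"}
    try:
        data = json.loads(p.read_text())
    except (json.JSONDecodeError, OSError) as exc:
        return {"signal": "unparseable", "detail": str(exc)}
    # Staleness: compare the ISO8601 timestamp against the threshold.
    ts = datetime.fromisoformat(data["timestamp"])
    age_s = now - ts.timestamp()
    return {
        "signal": "stale" if age_s > stale_threshold_s else "fresh",
        "age_s": age_s,
        "status": data.get("status"),
        "connection": data.get("connection"),
    }
```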
Step 3: Probe ports (if configured)
Check if expected ports are listening:
```bash
ss -tlnp "sport = :<port>"
```
Flag processes that are running but not listening on expected ports.
Step 4: Evaluate health per service
Apply this decision tree:
  1. Process not running → DOWN
  2. Process running + health file missing → WARNING
  3. Process running + health file stale → WARNING (restart recommended)
  4. Process running + status=error → ERROR (restart recommended)
  5. Process running + disconnected > 30min → WARNING (restart recommended)
  6. Process running + disconnected < 30min → DEGRADED (allow reconnection)
  7. Process running + healthy → HEALTHY
  8. Process running + no health file configured → RUNNING (limited visibility)
Gate: All services evaluated with evidence-based status. Proceed only when gate passes.
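The decision tree above can be sketched in Python. This is a sketch only; the `health` dict uses a hypothetical signal shape (`signal` in `missing`/`stale`/`fresh`, plus the self-reported `status` and `connection`), with `None` meaning no health file is configured:

```python
def classify(process_running, health=None, disconnected_min=0.0):
    """Map gathered signals to a status, mirroring the decision tree."""
    if not process_running:
        return "DOWN"                      # 1
    if health is None:
        return "RUNNING"                   # 8: no health file configured
    if health.get("signal") == "missing":
        return "WARNING"                   # 2: configured but file absent
    if health.get("signal") == "stale":
        return "WARNING"                   # 3: restart recommended
    if health.get("status") == "error":
        return "ERROR"                     # 4: restart recommended
    if health.get("connection") == "disconnected":
        # 5 / 6: allow short reconnection windows, flag long outages
        return "WARNING" if disconnected_min > 30 else "DEGRADED"
    return "HEALTHY"                       # 7
```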

Phase 3: REPORT

Goal: Produce structured, actionable health report.
Step 1: Generate summary
```text
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name         [OK  ] HEALTHY     PID 12345, uptime 2d 4h
  background-worker    [WARN] WARNING     Health file stale (15 min)
  cache-service        [DOWN] DOWN        Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service
```
Step 2: Set exit status
  • All HEALTHY/RUNNING → exit 0
  • Any WARNING/DEGRADED/ERROR/DOWN → exit 1
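The exit-status rule reduces to a one-line check (a sketch; status strings as listed in the decision tree):

```python
OK_STATUSES = {"HEALTHY", "RUNNING"}

def exit_code(statuses):
    """0 only when every service is HEALTHY or RUNNING, else 1."""
    return 0 if all(s in OK_STATUSES for s in statuses) else 1
```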
Step 3: Present to user
  • Lead with the summary line (X/N healthy)
  • Highlight any services needing action
  • Provide copy-pasteable commands for remediation
  • If user has auto-restart enabled, confirm before executing
Gate: Report delivered with actionable recommendations for all non-healthy services.


Examples

Example 1: Routine Health Check

User says: "Are all services up?"
Actions:
  1. Locate services.json, build manifest (DISCOVER)
  2. Check each process, parse health files, probe ports (CHECK)
  3. Output structured report showing 3/3 healthy (REPORT)
Result: Clean report, no action needed

Example 2: Stale Worker Detection

User says: "The background worker seems stuck"
Actions:
  1. Identify worker service from config (DISCOVER)
  2. Find process running but health file 20 minutes stale (CHECK)
  3. Report WARNING with restart recommendation (REPORT)
Result: Specific diagnosis with actionable command

Error Handling

Error: "No Service Configuration Found"

Cause: No services.json, docker-compose, or systemd units discovered
Solution:
  1. Ask user for service name and process pattern
  2. Build minimal manifest from user input
  3. Proceed with manual configuration

Error: "Process Pattern Matches Too Many PIDs"

Cause: Pattern too broad (e.g., "python" matches all Python processes)
Solution:
  1. Narrow the pattern with a full command path or arguments
  2. Use `ps aux | grep` to identify distinguishing arguments
  3. Update the manifest with a more specific pattern

Error: "Health File Exists But Cannot Parse"

Cause: Malformed JSON, a permissions issue, or the file being written during the read
Solution:
  1. Check file permissions with `ls -la`
  2. Attempt a raw read to inspect the content
  3. If mid-write, retry after a 2-second delay
  4. Report as WARNING with parse error details

Anti-Patterns

Anti-Pattern 1: Restarting Without Diagnosing

What it looks like: Service shows WARNING; immediately run `systemctl restart`.
Why wrong: Masks the root cause. The service may crash again immediately.
Do instead: Report the finding and let the user decide. Never auto-restart without an explicit flag.

Anti-Pattern 2: Trusting Health File Alone

What it looks like: Health file says "healthy", so the process check is skipped.
Why wrong: The process could be a zombie, and the health file could be stale from before a crash.
Do instead: Always check process status independently of health file content.

Anti-Pattern 3: Ignoring Port Mismatch

What it looks like: Process is running, so the port check is skipped and HEALTHY is reported.
Why wrong: The process may have started but failed to bind its port, which leaves it effectively down.
Do instead: When a port is configured, always verify it is listening.

Anti-Pattern 4: Broad Process Patterns

What it looks like: Using "python" as the process pattern for a Flask app.
Why wrong: Matches every Python process on the system, giving false positives.
Do instead: Use specific patterns like `gunicorn.*myapp:app` or full command paths.

References

This skill uses these shared patterns:
  • Anti-Rationalization - Prevents shortcut rationalizations
  • Verification Checklist - Pre-completion checks

Domain-Specific Anti-Rationalization

| Rationalization | Why It's Wrong | Required Action |
| --- | --- | --- |
| "Process is running, must be healthy" | Running ≠ functional | Check health file and port |
| "Health file looks fine" | File could be stale from before a crash | Verify timestamp freshness |
| "Just restart it" | Restart masks the root cause | Report first, restart only if flagged |
| "No config, skip the check" | User still needs an answer | Ask user for service details |

Health File Format Reference

Services should write health files as:
```json
{
    "timestamp": "ISO8601, updated every 30-60s",
    "status": "healthy|degraded|error",
    "connection": "connected|disconnected|reconnecting",
    "last_activity": "ISO8601 of last meaningful action",
    "running": true,
    "uptime_seconds": 12345,
    "metrics": {}
}
```
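A service-side sketch of writing this file. It assumes a POSIX atomic rename (temp file plus `os.replace`) to avoid the mid-write parse failures noted under Error Handling; the function name and defaults are illustrative:

```python
import json
import os
from datetime import datetime, timezone

def write_health_file(path, status="healthy", connection="connected",
                      uptime_seconds=0, metrics=None):
    """Write the health file atomically so a checker never observes a
    half-written JSON document."""
    now = datetime.now(timezone.utc).isoformat()
    payload = {
        "timestamp": now,
        "status": status,
        "connection": connection,
        "last_activity": now,
        "running": True,
        "uptime_seconds": uptime_seconds,
        "metrics": metrics or {},
    }
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)  # atomic rename on POSIX
```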