service-health-check
Service Health Check Skill
Operator Context
This skill operates as an operator for service health monitoring workflows, configuring Claude's behavior for structured, read-only health assessment. It implements the Discover-Check-Report pattern — find services, gather health signals, produce actionable output — with deterministic process and health file evaluation.
Hardcoded Behaviors (Always Apply)
- Read-Only: NEVER restart, stop, or modify services — report only
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before checking
- No Side Effects: Only read process tables, health files, and ports — no writes
- Structured Output: Always produce machine-parseable health report
- Evidence-Based Status: Every status determination requires at least one concrete signal (process check, health file, or port probe)
Default Behaviors (ON unless disabled)
- Process Verification: Check process existence via pgrep/ps before anything else
- Staleness Detection: Flag health files older than configured threshold (default 300s)
- Port Listening Check: Verify expected ports are bound when port is configured
- Actionable Recommendations: Provide specific commands to resolve issues
- Staleness Threshold Enforcement: Default 300s, configurable per service
Optional Behaviors (OFF unless enabled)
- Auto-Restart Execution: Run restart commands (requires explicit user flag)
- Metrics Collection: Gather detailed performance metrics from health files
- Alert Integration: Format output for monitoring system ingestion
- Historical Comparison: Compare against previous health snapshots
What This Skill CAN Do
- Check if processes are running via pgrep/ps
- Parse JSON health files for status, connection state, and metrics
- Detect stale health data based on configurable thresholds
- Verify ports are listening with ss/netstat
- Produce structured health reports with actionable restart recommendations
- Evaluate service degradation (disconnected, reconnecting states)
What This Skill CANNOT Do
- Restart, stop, or modify services (report-only by design)
- Perform deep log analysis (use systematic-debugging instead)
- Probe remote health endpoints over HTTP (use endpoint-validator instead)
- Inspect container internals (basic host-level process checks only)
- Authenticate against secured health endpoints
- Skip the Discover phase — services must be identified before checking
Instructions
Phase 1: DISCOVER
Goal: Identify all services to check before running any health probes.
Step 1: Locate service definitions
Search for service configuration in this order:
- `services.json` in project root
- Docker/docker-compose files for service definitions
- systemd unit files or process manager configs
- User-provided service specification
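The skill does not mandate a specific schema, but for illustration a hypothetical `services.json` carrying the fields the manifest needs might look like:

```json
{
  "services": [
    {
      "name": "api-server",
      "process_pattern": "gunicorn.*app:app",
      "health_file": "/tmp/api_health.json",
      "port": 8000,
      "stale_threshold_s": 300
    },
    {
      "name": "worker",
      "process_pattern": "celery.*worker",
      "health_file": "/tmp/worker_health.json",
      "stale_threshold_s": 300
    }
  ]
}
```

The field names here are assumptions chosen to mirror the manifest columns, not a fixed format.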
Step 2: Build service manifest
For each service, establish:
Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---|---|---|---|---|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |
**Step 3: Validate manifest**
- Confirm each process pattern is specific enough to avoid false matches
- Verify health file paths are absolute
- Ensure port numbers are within valid range (1-65535)
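The Step 3 checks above can be sketched as a small validator. This is a sketch only; the dict field names (`process_pattern`, `health_file`, `port`) are assumptions mirroring the manifest columns, and the broad-pattern denylist is illustrative:

```python
import os.path

# Patterns that would match unrelated processes (illustrative denylist).
TOO_BROAD = {"python", "node", "java", "sh", "bash"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems for one manifest entry; empty means valid."""
    problems = []
    pattern = entry.get("process_pattern", "")
    if not pattern or pattern.lower() in TOO_BROAD:
        problems.append(f"pattern too broad: {pattern!r}")
    health_file = entry.get("health_file")
    if health_file is not None and not os.path.isabs(health_file):
        problems.append(f"health file path not absolute: {health_file!r}")
    port = entry.get("port")
    if port is not None and not (1 <= port <= 65535):
        problems.append(f"port out of range: {port}")
    return problems
```

A manifest passes the gate when every entry returns an empty problem list.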
**Gate**: Service manifest complete with at least one service. Proceed only when gate passes.
Phase 2: CHECK
Goal: Gather health signals for every service in the manifest.
Step 1: Check process status
For each service, run process check:
```bash
pgrep -f "<process_pattern>"
```
Record: running (true/false), PIDs, process count.
Step 2: Parse health files (if configured)
Read and parse JSON health files. Evaluate:
- Does the file exist?
- Does it parse as valid JSON?
- How old is the timestamp (staleness)?
- What status does the service self-report?
- What is the connection state?
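These Step 2 questions can be answered with a minimal sketch, assuming the JSON layout from the Health File Format Reference and the documented default 300s staleness threshold:

```python
import json
import time
from datetime import datetime, timezone

def evaluate_health_file(path: str, stale_after_s: float = 300.0) -> dict:
    """Answer the Step 2 questions for one health file."""
    result = {"exists": False, "valid_json": False, "stale": None,
              "status": None, "connection": None}
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        return result
    except json.JSONDecodeError:
        result["exists"] = True  # present but unparseable
        return result
    result["exists"] = True
    result["valid_json"] = True
    ts = datetime.fromisoformat(data["timestamp"])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    result["stale"] = (time.time() - ts.timestamp()) > stale_after_s
    result["status"] = data.get("status")
    result["connection"] = data.get("connection")
    return result
```

The returned dict feeds directly into the Step 4 decision tree below.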
Step 3: Probe ports (if configured)
Check if expected ports are listening:
```bash
ss -tlnp "sport = :<port>"
```
Flag processes that are running but not listening on expected ports.
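Where `ss`/`netstat` are unavailable, one fallback is a local TCP connect probe. Note the assumption: this tests whether something accepts connections on the port from localhost, not which process owns the socket:

```python
import socket

def port_listening(port: int, host: str = "127.0.0.1",
                   timeout_s: float = 1.0) -> bool:
    """Best-effort check: can we complete a TCP connect to the port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```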
Step 4: Evaluate health per service
Apply this decision tree:
- Process not running → DOWN
- Process running + health file missing → WARNING
- Process running + health file stale → WARNING (restart recommended)
- Process running + status=error → ERROR (restart recommended)
- Process running + disconnected > 30min → WARNING (restart recommended)
- Process running + disconnected < 30min → DEGRADED (allow reconnection)
- Process running + healthy → HEALTHY
- Process running + no health file configured → RUNNING (limited visibility)
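The decision tree above can be expressed as one deterministic function. A minimal sketch; the health-summary field names (`exists`, `stale`, `status`, `connection`) are illustrative:

```python
def evaluate_service(proc_running: bool, health,
                     disconnected_minutes: float = 0.0) -> str:
    """Apply the Step 4 decision tree.

    `health` is None when no health file is configured, otherwise a dict
    summarizing the file checks."""
    if not proc_running:
        return "DOWN"
    if health is None:
        return "RUNNING"       # limited visibility
    if not health.get("exists"):
        return "WARNING"       # health file missing
    if health.get("stale"):
        return "WARNING"       # restart recommended
    if health.get("status") == "error":
        return "ERROR"         # restart recommended
    if health.get("connection") == "disconnected":
        return "WARNING" if disconnected_minutes > 30 else "DEGRADED"
    return "HEALTHY"
```

Ordering matters: process state is checked before any health-file evidence, matching the Evidence-Based Status rule.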
Gate: All services evaluated with evidence-based status. Proceed only when gate passes.
Phase 3: REPORT
Goal: Produce structured, actionable health report.
Step 1: Generate summary
```
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name        [OK  ] HEALTHY  PID 12345, uptime 2d 4h
  background-worker   [WARN] WARNING  Health file stale (15 min)
  cache-service       [DOWN] DOWN     Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service
```
Step 2: Set exit status
- All HEALTHY/RUNNING → exit 0
- Any WARNING/DEGRADED/ERROR/DOWN → exit 1
Step 3: Present to user
- Lead with the summary line (X/N healthy)
- Highlight any services needing action
- Provide copy-pasteable commands for remediation
- If user has auto-restart enabled, confirm before executing
Gate: Report delivered with actionable recommendations for all non-healthy services.
Examples
Example 1: Routine Health Check
User says: "Are all services up?"
Actions:
- Locate services.json, build manifest (DISCOVER)
- Check each process, parse health files, probe ports (CHECK)
- Output structured report showing 3/3 healthy (REPORT)
Result: Clean report, no action needed
Example 2: Stale Worker Detection
User says: "The background worker seems stuck"
Actions:
- Identify worker service from config (DISCOVER)
- Find process running but health file 20 minutes stale (CHECK)
- Report WARNING with restart recommendation (REPORT)
Result: Specific diagnosis with actionable command
Error Handling
Error: "No Service Configuration Found"
Cause: No services.json, docker-compose, or systemd units discovered
Solution:
- Ask user for service name and process pattern
- Build minimal manifest from user input
- Proceed with manual configuration
Error: "Process Pattern Matches Too Many PIDs"
Cause: Pattern too broad (e.g., "python" matches all Python processes)
Solution:
- Narrow pattern with full command path or arguments
- Use `ps aux | grep` to identify distinguishing arguments
- Update manifest with more specific pattern
Error: "Health File Exists But Cannot Parse"
Cause: Malformed JSON, permissions issue, or file being written during read
Solution:
- Check file permissions with `ls -la`
- Attempt raw read to inspect content
- If mid-write, retry after 2-second delay
- Report as WARNING with parse error details
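The retry-then-report flow can be sketched as follows; the single retry after a 2-second delay mirrors the steps above, and the return shape is illustrative:

```python
import json
import time

def read_health_file_with_retry(path: str, delay_s: float = 2.0) -> dict:
    """Try to parse the file; on a parse error, retry once after a delay
    in case the file was mid-write, then surface the error details."""
    for attempt in (1, 2):
        try:
            with open(path) as f:
                return {"ok": True, "data": json.load(f)}
        except json.JSONDecodeError as e:
            if attempt == 1:
                time.sleep(delay_s)
            else:
                # Caller reports this as WARNING with the parse details.
                return {"ok": False, "error": f"parse error: {e}"}
        except OSError as e:
            return {"ok": False, "error": f"read error: {e}"}
```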
Anti-Patterns
Anti-Pattern 1: Restarting Without Diagnosing
What it looks like: Service shows WARNING, immediately run `systemctl restart`
Why wrong: Masks root cause. Service may crash again immediately.
Do instead: Report finding, let user decide. Never auto-restart without explicit flag.
Anti-Pattern 2: Trusting Health File Alone
What it looks like: Health file says "healthy" so skip process check
Why wrong: Process could be zombie, health file could be stale from before crash.
Do instead: Always check process status independently of health file content.
Anti-Pattern 3: Ignoring Port Mismatch
What it looks like: Process running, skip port check, report HEALTHY
Why wrong: Process may have started but failed to bind port — effectively down.
Do instead: When port is configured, always verify it is listening.
Anti-Pattern 4: Broad Process Patterns
What it looks like: Using "python" as process pattern for a Flask app
Why wrong: Matches every Python process on the system, giving false positives.
Do instead: Use specific patterns like `gunicorn.*myapp:app` or full command paths.
References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
Domain-Specific Anti-Rationalization
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Process is running, must be healthy" | Running ≠ functional | Check health file and port |
| "Health file looks fine" | File could be stale from before crash | Verify timestamp freshness |
| "Just restart it" | Restart masks root cause | Report first, restart only if flagged |
| "No config, skip the check" | User still needs an answer | Ask user for service details |
Health File Format Reference
Services should write health files as:
```json
{
  "timestamp": "ISO8601, updated every 30-60s",
  "status": "healthy|degraded|error",
  "connection": "connected|disconnected|reconnecting",
  "last_activity": "ISO8601 of last meaningful action",
  "running": true,
  "uptime_seconds": 12345,
  "metrics": {}
}
```
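On the writing side, a service can avoid the torn-read problem described under Error Handling by publishing the file atomically. A sketch assuming a POSIX filesystem, where `os.replace` is an atomic rename:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def write_health_file(path: str, status: str = "healthy",
                      connection: str = "connected") -> None:
    """Write the health payload to a temp file in the same directory,
    then rename it into place so readers never see a half-written file."""
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "connection": connection,
        "running": True,
    }
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp, path)  # atomic on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live on the same filesystem as the target, which is why it is created in the destination directory.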