runbookhermes-aiops-agent
RunbookHermes AIOps Agent Skill
Skill by ara.so — Hermes Skills collection.
RunbookHermes is a Hermes-native AIOps agent that specializes in incident response workflows. It extends Hermes Agent's runtime with evidence collection from observability tools (Prometheus, Loki, Jaeger), approval-gated remediation, checkpoint/rollback capabilities, and automatic runbook skill generation from resolved incidents.
What RunbookHermes Does
- Evidence-driven incident analysis: Collects metrics, logs, traces, and deployment history
- Approval-gated remediation: Requires human approval before risky actions
- Runbook learning: Converts successful incident resolutions into reusable skills
- Multi-channel intake: Accepts incidents from Web UI, Alertmanager, Feishu, WeCom, API
- EvidenceStack context engine: Compresses observability data for model reasoning
- IncidentMemory: Remembers service profiles, incident patterns, team preferences
Installation
Prerequisites
- Python 3.10+
- Hermes Agent (included as the `agent/` subdirectory)
- Docker and Docker Compose (for the local payment demo environment)
Clone and Install
```bash
git clone https://github.com/Tommy-yw/RunbookHermes.git
cd RunbookHermes

# Install dependencies
pip install -r requirements.txt

# Or use Poetry
poetry install
```
Environment Configuration
Create a `.env` file in the project root:
```bash
# Model provider (optional, for AI-assisted summaries)
OPENAI_API_KEY=${OPENAI_API_KEY}
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o

# Observability backends
PROMETHEUS_URL=http://localhost:9090
LOKI_URL=http://localhost:3100
JAEGER_URL=http://localhost:16686

# Deploy history backend
DEPLOY_BACKEND_TYPE=local_json
DEPLOY_HISTORY_PATH=./data/payment_demo/deploy_history.json

# Execution backend (for rollback/remediation)
EXECUTION_BACKEND_TYPE=local_reference
EXECUTION_CONFIG_PATH=./data/payment_demo/execution_config.json

# Feishu integration (optional)
FEISHU_APP_ID=${FEISHU_APP_ID}
FEISHU_APP_SECRET=${FEISHU_APP_SECRET}

# WeCom integration (optional)
WECOM_CORP_ID=${WECOM_CORP_ID}
WECOM_AGENT_SECRET=${WECOM_AGENT_SECRET}

# Web API
RUNBOOK_API_HOST=0.0.0.0
RUNBOOK_API_PORT=8000
```
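To make the role of these variables concrete, here is a minimal sketch of reading a few of them in Python. The `RunbookSettings` class and `load_settings` function are hypothetical names used for illustration, not part of RunbookHermes, and treating the example values above as fallback defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunbookSettings:
    """Hypothetical container for a subset of the variables listed above."""
    prometheus_url: str
    loki_url: str
    jaeger_url: str
    api_host: str
    api_port: int
    openai_api_key: Optional[str]  # optional: only used for AI-assisted summaries


def load_settings(env: Mapping[str, str] = os.environ) -> RunbookSettings:
    """Read settings from a mapping, falling back to the example values (an assumption)."""
    return RunbookSettings(
        prometheus_url=env.get("PROMETHEUS_URL", "http://localhost:9090"),
        loki_url=env.get("LOKI_URL", "http://localhost:3100"),
        jaeger_url=env.get("JAEGER_URL", "http://localhost:16686"),
        api_host=env.get("RUNBOOK_API_HOST", "0.0.0.0"),
        api_port=int(env.get("RUNBOOK_API_PORT", "8000")),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"RUNBOOK_API_PORT": "9000"})
print(settings.api_port, settings.prometheus_url)
```

Accepting the environment as a parameter makes the loader easy to exercise in tests without touching the real process environment.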
Start Local Payment Demo Environment
```bash
cd demo/payment_system
docker-compose up -d
cd ../..

# Verify services are running
curl http://localhost:8001/health  # payment-service
curl http://localhost:8002/health  # coupon-service
curl http://localhost:8003/health  # order-service
```
Start RunbookHermes API Server
```bash
# From project root
python -m apps.runbook_api.main

# Or with uvicorn directly
uvicorn apps.runbook_api.main:app --host 0.0.0.0 --port 8000 --reload
```
Access the Web Console at `http://localhost:8000`.
Core Concepts
1. Hermes Profile Integration
RunbookHermes runs as a Hermes Agent profile located at `profiles/runbook-hermes/`:
```yaml
# profiles/runbook-hermes/profile.yaml
name: runbook-hermes
version: 1.0.0
description: AIOps agent for incident response
persona: incident_responder
tools:
  - runbook-hermes
context_engine: evidence_stack
memory_provider: incident_memory
```
2. Evidence Collection Tools
The `runbook-hermes` tool plugin provides incident-response capabilities:
```python
# Example: Query metrics evidence
from plugins.runbook_hermes.tools import query_metrics

evidence = query_metrics(
    service="payment-service",
    metric_type="http_5xx_rate",
    time_window="5m"
)
```
Available tools in the plugin:
- `query_metrics` - Prometheus metrics collection
- `query_logs` - Loki log search
- `query_traces` - Jaeger trace analysis
- `get_deploy_history` - Recent deployment records
- `create_checkpoint` - Save system state before remediation
- `request_approval` - Gate risky actions
- `execute_rollback` - Controlled rollback execution
- `verify_recovery` - Post-remediation health check
3. EvidenceStack Context Engine
Compresses observability data for model consumption:
```python
from plugins.context_engine.evidence_stack.engine import EvidenceStackEngine

engine = EvidenceStackEngine()

# Add evidence
engine.add_evidence({
    "type": "metric",
    "service": "payment-service",
    "signal": "http_503_rate_spike",
    "value": "45 req/s",
    "severity": "critical"
})

# Get compressed context
context = engine.get_context()
# Returns: alert summary, key evidence, hypotheses, action plan
```
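The source does not describe how EvidenceStack compresses its input. As one plausible illustration of the idea, the sketch below deduplicates evidence by service and signal and keeps only the most severe entries. The function name `compress_evidence`, the severity ordering, and the `max_items` cutoff are all assumptions, not the engine's actual behavior.

```python
from collections import OrderedDict

# Assumed ordering: lower rank means more severe.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def compress_evidence(items, max_items=3):
    """Deduplicate by (service, signal), then keep the most severe entries."""
    unique = OrderedDict()
    for item in items:
        key = (item["service"], item["signal"])
        unique.setdefault(key, item)  # the first occurrence of a signal wins
    ranked = sorted(
        unique.values(),
        key=lambda e: SEVERITY_RANK.get(e.get("severity"), 99),  # unknown severities sort last
    )
    return ranked[:max_items]


evidence = [
    {"service": "payment-service", "signal": "http_503_rate_spike", "severity": "critical"},
    {"service": "payment-service", "signal": "p95_latency_high", "severity": "medium"},
    {"service": "payment-service", "signal": "http_503_rate_spike", "severity": "critical"},
    {"service": "coupon-service", "signal": "timeout_errors", "severity": "high"},
]
top = compress_evidence(evidence, max_items=2)
print([e["signal"] for e in top])
```

Bounding the number of items is what keeps the context small enough for model reasoning; a real engine would likely also summarize values and attach hypotheses.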
4. IncidentMemory Provider
Stores operational knowledge:
```python
from plugins.memory.incident_memory.provider import IncidentMemoryProvider

memory = IncidentMemoryProvider()

# Remember service profile
memory.save_service_profile("payment-service", {
    "critical_metrics": ["http_5xx_rate", "p95_latency"],
    "dependencies": ["coupon-service", "order-service"],
    "rollback_safe": True
})

# Recall incident patterns
similar = memory.recall_similar_incidents(
    service="payment-service",
    symptom="http_503_spike"
)
```
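To illustrate what recalling similar incidents could mean, here is a small in-memory sketch. `InMemoryIncidentStore` and its methods are hypothetical; the real provider's storage and similarity logic are not described in the source, and exact string matching on the symptom is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIncidentStore:
    """Hypothetical stand-in for an incident memory; not the RunbookHermes provider."""
    incidents: list = field(default_factory=list)

    def remember(self, service, symptom, resolution):
        self.incidents.append(
            {"service": service, "symptom": symptom, "resolution": resolution}
        )

    def recall_similar(self, service, symptom):
        """Return past incidents for the service, exact symptom matches first."""
        same_service = [i for i in self.incidents if i["service"] == service]
        # False sorts before True, so matching symptoms come first; the sort is stable.
        return sorted(same_service, key=lambda i: i["symptom"] != symptom)


store = InMemoryIncidentStore()
store.remember("payment-service", "p95_latency", "scaled up replicas")
store.remember("payment-service", "http_503_spike", "rolled back to v1.2.3")
store.remember("coupon-service", "http_503_spike", "restarted pods")

matches = store.recall_similar("payment-service", "http_503_spike")
print(matches[0]["resolution"])
```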
Creating and Managing Incidents
Via Web Console
Navigate to `http://localhost:8000/incidents/create` and fill in the form:
- Service name
- Severity (critical, high, medium, low)
- Description
- Alert data (optional)
Via API
```python
import requests

response = requests.post("http://localhost:8000/api/incidents", json={
    "service": "payment-service",
    "severity": "critical",
    "description": "HTTP 503 rate spike detected",
    "alert": {
        "metric": "http_5xx_rate",
        "value": 45.2,
        "threshold": 5.0
    },
    "metadata": {
        "source": "alertmanager",
        "runbook_url": "https://wiki.example.com/payment-503"
    }
})
incident_id = response.json()["incident_id"]
```
Via Hermes CLI
```bash
# Run incident response through Hermes profile
hermes run \
  --profile runbook-hermes \
  --input "Payment service showing HTTP 503 errors at 45 req/s" \
  --context '{"service": "payment-service", "severity": "critical"}'
```
Via Alertmanager Webhook
Configure Alertmanager to send webhooks:
```yaml
# alertmanager.yml
receivers:
  - name: runbook-hermes
    webhook_configs:
      - url: http://localhost:8000/gateway/alertmanager
        send_resolved: true
```
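Alertmanager posts a JSON body containing an `alerts` list, where each alert carries `status`, `labels`, and `annotations`. How the RunbookHermes gateway maps that body to incidents is not documented here, so the following is only a sketch: the function name `alerts_to_incidents`, the use of `service` and `severity` labels, and the default severity are assumptions.

```python
def alerts_to_incidents(payload):
    """Turn firing alerts from an Alertmanager webhook body into incident dicts."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not open new incidents in this sketch
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append({
            "service": labels.get("service", "unknown"),   # assumed label name
            "severity": labels.get("severity", "high"),    # assumed default
            "description": annotations.get("summary", labels.get("alertname", "")),
            "metadata": {"source": "alertmanager"},
        })
    return incidents


sample_payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "Http5xxHigh", "service": "payment-service",
                       "severity": "critical"},
            "annotations": {"summary": "HTTP 503 rate spike detected"},
        },
        {"status": "resolved", "labels": {"alertname": "DiskFull"}, "annotations": {}},
    ],
}
incidents = alerts_to_incidents(sample_payload)
print(incidents[0]["service"], incidents[0]["severity"])
```

Because `send_resolved: true` is set in the receiver above, resolved alerts also arrive at the endpoint; a fuller handler would probably use them to close incidents rather than ignore them.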
Approval Workflow
RunbookHermes gates risky actions behind approval:
```python
# In your incident response logic
from runbook_hermes.approval import ApprovalManager
# execute_rollback is listed among the plugin tools; importing it from the
# same module as query_metrics is an assumption.
from plugins.runbook_hermes.tools import execute_rollback

approval_mgr = ApprovalManager()

# Request approval for rollback
approval_id = approval_mgr.request_approval(
    incident_id="inc_001",
    action_type="rollback",
    target_service="payment-service",
    target_version="v1.2.3",
    risk_level="high",
    reason="Rollback to last known good version due to 503 spike",
    checkpoint_id="chk_001"
)

# Check approval status
status = approval_mgr.get_status(approval_id)
if status == "approved":
    # Execute rollback
    execute_rollback(service="payment-service", version="v1.2.3")
```
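The gate above can be thought of as a small state machine: a request stays pending until someone decides, or it expires. The sketch below models that idea with the standard library only. `ApprovalRequest`, its fields, and the `expired` state are hypothetical and do not describe the internals of `ApprovalManager`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    """Hypothetical approval record with a timeout; not the ApprovalManager API."""
    action_type: str
    timeout_seconds: float
    created_at: float = field(default_factory=time.monotonic)
    decision: str = "pending"  # pending | approved | rejected

    def status(self, now=None) -> str:
        """Report the decision, or 'expired' if nobody decided in time."""
        now = time.monotonic() if now is None else now
        if self.decision == "pending" and now - self.created_at > self.timeout_seconds:
            return "expired"
        return self.decision

    def decide(self, approved: bool) -> None:
        """Record a human decision; closed or expired requests cannot be changed."""
        if self.status() != "pending":
            raise RuntimeError("request is already closed")
        self.decision = "approved" if approved else "rejected"


request = ApprovalRequest(action_type="rollback", timeout_seconds=30 * 60)
request.decide(approved=True)
print(request.status())
```

Treating an unanswered request as expired rather than approved keeps the default safe: a risky action never runs just because nobody was watching.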
Approve via Web Console
Navigate to `http://localhost:8000/approvals` to review and approve or reject pending actions.
Approve via API
```python
import requests

# approval_id is the value returned by request_approval() in the workflow above
requests.post(f"http://localhost:8000/api/approvals/{approval_id}/approve", json={
    "operator": "alice",
    "comment": "Approved after verifying checkpoint"
})
```
Checkpoint and Rollback
Create Checkpoint Before Remediation
```python
from runbook_hermes.checkpoint import CheckpointManager

checkpoint_mgr = CheckpointManager()
checkpoint = checkpoint_mgr.create(
    incident_id="inc_001",
    service="payment-service",
    snapshot_type="deployment",
    metadata={
        "current_version": "v1.2.4",
        "replica_count": 3,
        "config_hash": "abc123"
    }
)
```
Execute Rollback
```python
from runbook_hermes.remediation import RemediationExecutor

executor = RemediationExecutor()

# `checkpoint` is the object created in the previous example
result = executor.rollback(
    service="payment-service",
    target_version="v1.2.3",
    checkpoint_id=checkpoint.id,
    dry_run=False
)

# Verify recovery
recovery_status = executor.verify_recovery(
    service="payment-service",
    expected_metrics={"http_5xx_rate": "<5"}
)
```
Runbook Skill Generation
After resolving an incident, generate a reusable skill:
```python
from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()
skill = generator.generate_from_incident(
    incident_id="inc_001",
    skill_name="payment-http-503-rollback",
    trigger_conditions=["payment service 503 spike", "payment 5xx rate > 40"],
    steps=[
        "collect_evidence",
        "verify_deploy_change",
        "create_checkpoint",
        "request_approval",
        "rollback_deployment",
        "verify_recovery"
    ]
)

# Save to skills directory
skill.save("skills/runbooks/payment-http-503-rollback.yaml")
```
Generated skill format:
```yaml
# skills/runbooks/payment-http-503-rollback.yaml
name: payment-http-503-rollback
version: 1.0.0
triggers:
  - payment service 503 spike
  - payment 5xx rate > 40
steps:
  - name: collect_evidence
    tool: query_metrics
    params:
      service: payment-service
      metric: http_5xx_rate
  - name: verify_deploy_change
    tool: get_deploy_history
    params:
      service: payment-service
      limit: 5
  - name: create_checkpoint
    tool: create_checkpoint
  - name: request_approval
    tool: request_approval
    risk_level: high
  - name: rollback_deployment
    tool: execute_rollback
  - name: verify_recovery
    tool: verify_recovery
```
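The generated file pairs each step name with a tool. A minimal sketch of assembling that structure is shown below; `build_skill` and `STEP_TOOLS` are hypothetical helpers, the step-to-tool mapping is copied from the example file, and per-step `params` are omitted for brevity. This is not how `SkillGenerator` is implemented.

```python
import json

# Step name -> tool name, mirroring the generated skill file shown above.
STEP_TOOLS = {
    "collect_evidence": "query_metrics",
    "verify_deploy_change": "get_deploy_history",
    "create_checkpoint": "create_checkpoint",
    "request_approval": "request_approval",
    "rollback_deployment": "execute_rollback",
    "verify_recovery": "verify_recovery",
}


def build_skill(name, triggers, steps, version="1.0.0"):
    """Assemble a skill document as a plain dict, rejecting steps without a tool."""
    unknown = [s for s in steps if s not in STEP_TOOLS]
    if unknown:
        raise ValueError(f"no tool mapped for steps: {unknown}")
    return {
        "name": name,
        "version": version,
        "triggers": list(triggers),
        "steps": [{"name": s, "tool": STEP_TOOLS[s]} for s in steps],
    }


skill_doc = build_skill(
    name="payment-http-503-rollback",
    triggers=["payment service 503 spike", "payment 5xx rate > 40"],
    steps=["collect_evidence", "create_checkpoint", "request_approval",
           "rollback_deployment", "verify_recovery"],
)
# JSON is used for display only because the standard library has no YAML writer.
print(json.dumps(skill_doc, indent=2))
```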
Observability Integration
Prometheus Metrics
```python
from integrations.observability.prometheus_adapter import PrometheusAdapter

prom = PrometheusAdapter(base_url="http://localhost:9090")

# Query current 5xx rate
result = prom.query_range(
    query='rate(http_requests_total{status=~"5..", service="payment-service"}[5m])',
    start="-15m",
    end="now",
    step="30s"
)

# Extract evidence
if result.has_spike(threshold=5.0):
    evidence = {
        "type": "metric",
        "signal": "http_5xx_spike",
        "max_value": result.max_value(),
        "timestamp": result.max_timestamp()
    }
```
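For orientation, the adapter call above presumably wraps the Prometheus HTTP endpoint `/api/v1/query_range`, which takes `query`, `start`, `end`, and `step` parameters and returns a matrix of `[timestamp, "value"]` pairs. The raw API expects Unix or RFC 3339 timestamps, so relative values such as `"-15m"` would have to be resolved by the adapter. The helper names below are hypothetical, and the sketch only builds the URL and inspects a response without making a network call.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step):
    """Build a Prometheus range-query URL from absolute timestamps."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def max_sample(response):
    """Return (timestamp, value) of the largest sample in a matrix result, or None."""
    samples = [
        (float(ts), float(value))  # Prometheus encodes sample values as strings
        for series in response["data"]["result"]
        for ts, value in series["values"]
    ]
    return max(samples, key=lambda s: s[1]) if samples else None


url = build_query_range_url("http://localhost:9090/", "up", 1700000000, 1700000900, "30s")

# Hand-written response in the documented matrix shape, used instead of a live query.
fake_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"service": "payment-service"},
             "values": [[1700000000, "1.5"], [1700000030, "45.2"]]},
        ],
    },
}
peak = max_sample(fake_response)
print(url)
print(peak)
```

Comparing `peak[1]` against a threshold such as `5.0` is one simple way to decide whether the window contains a spike.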
Loki Logs
```python
from integrations.observability.loki_adapter import LokiAdapter

loki = LokiAdapter(base_url="http://localhost:3100")

# Search error logs
logs = loki.query_range(
    query='{service="payment-service"} |= "error" | json',
    start="-15m",
    limit=100
)

# Extract patterns
error_patterns = logs.extract_patterns(min_frequency=5)
```
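The source does not say how `extract_patterns` groups log lines. One common approach, sketched here as an assumption, is to mask the variable parts of each line (digits in this case) and count the resulting templates. The standalone function below is hypothetical and operates on plain strings rather than adapter results.

```python
import re
from collections import Counter


def extract_patterns(lines, min_frequency=5):
    """Group log lines by a digit-masked template and keep the frequent ones."""
    templates = Counter(re.sub(r"\d+", "<n>", line) for line in lines)
    # most_common() yields templates in descending order of frequency
    return [(t, c) for t, c in templates.most_common() if c >= min_frequency]


log_lines = [f"timeout after {ms}ms calling coupon-service" for ms in (30, 45, 120)]
log_lines.append("connection reset by peer")

patterns = extract_patterns(log_lines, min_frequency=3)
print(patterns)
```

Masking only digits is crude; identifiers such as UUIDs or hostnames would need their own rules before lines collapse into useful templates.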
Jaeger Traces
```python
from integrations.observability.jaeger_adapter import JaegerAdapter

jaeger = JaegerAdapter(base_url="http://localhost:16686")

# Find slow traces
traces = jaeger.search_traces(
    service="payment-service",
    start="-15m",
    min_duration="500ms",
    limit=20
)

# Analyze error traces
for trace in traces.with_errors():
    root_cause_span = trace.find_slowest_span()
```
Running Hermes Agent with RunbookHermes Profile
Direct CLI Invocation
```bash
# Run incident triage
hermes run \
  --profile runbook-hermes \
  --input "Payment service p95 latency is 2.5s, normal is 200ms" \
  --verbose

# Run with specific tool selection
hermes run \
  --profile runbook-hermes \
  --input "Check payment service deployment history" \
  --tools query_metrics,get_deploy_history
```
Programmatic Invocation
```python
from agent.runtime import HermesRuntime
from agent.config import AgentConfig

config = AgentConfig(
    profile="runbook-hermes",
    tools=["runbook-hermes"],
    context_engine="evidence_stack",
    memory_provider="incident_memory"
)
runtime = HermesRuntime(config)
response = runtime.run(
    input_text="Investigate payment-service HTTP 503 spike",
    context={
        "service": "payment-service",
        "incident_id": "inc_001",
        "severity": "critical"
    }
)
print(response.final_answer)
print(response.evidence_chain)
print(response.recommended_actions)
```
Common Patterns
Pattern 1: Full Incident Response Workflow
```python
from runbook_hermes.workflow import IncidentResponseWorkflow

workflow = IncidentResponseWorkflow()

# Execute end-to-end
result = workflow.execute(
    service="payment-service",
    symptom="http_503_spike",
    severity="critical",
    auto_approve=False  # Require human approval
)
print(f"Root cause: {result.root_cause}")
print(f"Remediation: {result.remediation_action}")
print(f"Status: {result.status}")
```
Pattern 2: Evidence-Driven Diagnosis
```python
from runbook_hermes.diagnosis import EvidenceDiagnosis

diagnosis = EvidenceDiagnosis(service="payment-service")

# Collect all evidence types
diagnosis.collect_metrics(time_window="15m")
diagnosis.collect_logs(time_window="15m", error_only=True)
diagnosis.collect_traces(time_window="15m", min_duration="500ms")
diagnosis.collect_deploy_history(limit=10)

# Analyze
root_cause = diagnosis.analyze()
print(f"Most likely cause: {root_cause.hypothesis}")
print(f"Confidence: {root_cause.confidence}")
print(f"Supporting evidence: {root_cause.evidence_ids}")
```
Pattern 3: Safe Remediation with Approval
```python
from runbook_hermes.remediation import SafeRemediation

remediation = SafeRemediation(incident_id="inc_001")

# Plan action
plan = remediation.plan_rollback(
    service="payment-service",
    target_version="v1.2.3"
)

# Create checkpoint
checkpoint = remediation.create_checkpoint()

# Request approval (blocks until human decision)
approval = remediation.request_approval(
    action=plan,
    checkpoint=checkpoint,
    timeout_minutes=30
)

if approval.is_approved():
    # Execute with dry-run first
    dry_run_result = remediation.execute(dry_run=True)
    if dry_run_result.success:
        # Real execution
        result = remediation.execute(dry_run=False)
        # Verify recovery
        if remediation.verify_recovery():
            print("Remediation successful")
        else:
            # Auto-rollback to checkpoint
            remediation.restore_checkpoint(checkpoint.id)
```
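The control flow in this pattern (dry run, real run, verification, restore on failure) can be isolated from any particular backend. The sketch below expresses it with caller-supplied callables; `remediate_safely` and its three hooks are hypothetical and are not RunbookHermes APIs. Restoring when the real run itself fails is an added assumption beyond what the pattern shows.

```python
def remediate_safely(apply, verify, restore):
    """Dry-run, apply, verify, and restore the checkpoint if anything goes wrong.

    apply(dry_run: bool) -> bool, verify() -> bool and restore() -> None are
    hypothetical hooks supplied by the caller.
    """
    if not apply(dry_run=True):
        return "dry_run_failed"  # nothing was changed, so nothing to restore
    if not apply(dry_run=False):
        restore()
        return "apply_failed_restored"
    if verify():
        return "recovered"
    restore()
    return "verification_failed_restored"


calls = []


def fake_apply(dry_run):
    calls.append(("apply", dry_run))
    return True


outcome = remediate_safely(
    apply=fake_apply,
    verify=lambda: False,                      # simulate a failed recovery check
    restore=lambda: calls.append(("restore",)),
)
print(outcome, calls)
```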
Pattern 4: Multi-Service Impact Analysis
```python
from runbook_hermes.topology import ServiceTopology

topology = ServiceTopology()

# Build dependency graph
graph = topology.build_graph(
    root_service="payment-service",
    depth=2
)

# Analyze impact
impact = topology.analyze_impact(
    failing_service="payment-service",
    failure_type="http_503"
)
print(f"Directly impacted: {impact.direct}")
print(f"Indirectly impacted: {impact.indirect}")
print(f"Suggested investigation order: {impact.priority_list}")
```
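Splitting impact into direct and indirect is naturally a breadth-first walk over the graph of callers. The sketch below shows that technique on a hand-written graph; the `analyze_impact` function, the `dependents` mapping, and the service names `checkout-web`, `mobile-gateway`, and `admin-ui` are hypothetical and unrelated to how `ServiceTopology` builds its graph.

```python
from collections import deque


def analyze_impact(dependents, failing_service, depth=2):
    """Breadth-first walk over callers, split into direct and indirect impact.

    dependents maps a service to the services that call it.
    """
    direct, indirect = [], []
    seen = {failing_service}
    queue = deque([(failing_service, 0)])
    while queue:
        service, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for caller in dependents.get(service, []):
            if caller in seen:
                continue  # each service is reported once, at its closest level
            seen.add(caller)
            (direct if level == 0 else indirect).append(caller)
            queue.append((caller, level + 1))
    return direct, indirect


callers = {
    "payment-service": ["checkout-web", "mobile-gateway"],
    "checkout-web": ["mobile-gateway", "admin-ui"],
}
direct, indirect = analyze_impact(callers, "payment-service", depth=2)
print(direct, indirect)
```

Because breadth-first search visits closer services first, the concatenation of `direct` and `indirect` is also a reasonable default investigation order.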
Configuration Reference
RunbookHermes Config File
Create `config/runbook_hermes.yaml`:
```yaml
# Incident response settings
incident:
  auto_create_from_alert: true
  default_severity: high
  evidence_collection_timeout: 300  # seconds

# Evidence collection
evidence:
  metrics:
    enabled: true
    time_window: 15m
    retention_days: 30
  logs:
    enabled: true
    max_lines: 1000
    error_patterns_only: false
  traces:
    enabled: true
    sample_limit: 100
    min_duration: 200ms

# Approval settings
approval:
  required_for:
    - rollback
    - restart
    - config_change
    - scale_down
  auto_approve_on_critical: false
  approval_timeout_minutes: 30
  require_checkpoint: true

# Remediation
remediation:
  dry_run_first: true
  verify_recovery: true
  recovery_check_interval: 30  # seconds
  max_recovery_wait: 300  # seconds
  auto_rollback_on_failure: true

# Runbook skill generation
skills:
  auto_generate: true
  min_success_count: 1
  output_dir: skills/runbooks

# Model-assisted analysis (optional)
model:
  enabled: true
  provider: openai
  temperature: 0.3
  max_tokens: 2000
```
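As a reading aid for the `approval` settings, the sketch below shows one way `required_for` and `auto_approve_on_critical` could combine into a yes/no decision. The `needs_approval` function is hypothetical, and the interpretation that `auto_approve_on_critical` skips the gate only for critical incidents is an assumption, not documented behavior.

```python
def needs_approval(action, severity, approval_config):
    """Decide whether an action must wait for a human, given the approval section."""
    if action not in approval_config.get("required_for", []):
        return False  # action type is not gated at all
    if severity == "critical" and approval_config.get("auto_approve_on_critical", False):
        return False  # assumed meaning: critical incidents may bypass the gate
    return True


# The approval section from the config above, written as a plain dict.
approval_config = {
    "required_for": ["rollback", "restart", "config_change", "scale_down"],
    "auto_approve_on_critical": False,
    "approval_timeout_minutes": 30,
    "require_checkpoint": True,
}
print(needs_approval("rollback", "critical", approval_config))
print(needs_approval("scale_up", "high", approval_config))
```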
Tool Configuration
```yaml
# plugins/runbook_hermes/config.yaml
tools:
  query_metrics:
    timeout: 30
    max_results: 1000
  query_logs:
    timeout: 60
    max_lines: 5000
  query_traces:
    timeout: 45
    max_traces: 200
  execute_rollback:
    require_approval: true
    require_checkpoint: true
    dry_run_first: true
```
Troubleshooting
Issue: Evidence collection returns empty results
Cause: Observability backends are not reachable, or there is no data in the time window.
Solution:
```python
# Test backend connectivity
from integrations.observability.health import check_backends
from plugins.runbook_hermes.tools import query_metrics

health = check_backends()
print(f"Prometheus: {health['prometheus']}")
print(f"Loki: {health['loki']}")
print(f"Jaeger: {health['jaeger']}")

# Verify time window
# Ensure time_window matches your metric retention
evidence = query_metrics(
    service="payment-service",
    time_window="1h"  # Increase window
)
```
Issue: Approval requests time out
Cause: No operator reviews the approval in time.
Solution:
```yaml
# config/runbook_hermes.yaml
approval:
  approval_timeout_minutes: 60  # Increase timeout
  fallback_to_auto_reject: false  # Prevent auto-reject

# Or configure notification
notification:
  on_approval_request:
    - type: feishu
      webhook_url: ${FEISHU_APPROVAL_WEBHOOK}
```
Issue: Runbook skills not generating
Cause: The incident is not marked as resolved, or evidence is missing.
Solution:
```python
# Explicitly mark incident resolved
from runbook_hermes.incident import IncidentManager

mgr = IncidentManager()
mgr.mark_resolved(
    incident_id="inc_001",
    resolution="Rolled back to v1.2.3",
    root_cause="Bad deployment v1.2.4"
)

# Manually trigger skill generation
from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()
skill = generator.generate_from_incident("inc_001")
skill.save()
```
Issue: Model-assisted summaries failing
Cause: The model API key is not configured, or the endpoint is unreachable.
Solution:
```bash
# Verify environment variables
echo $OPENAI_API_KEY
echo $OPENAI_BASE_URL

# Test model connectivity
curl "$OPENAI_BASE_URL/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```
Disable the model if it is not needed:
```yaml
# config/runbook_hermes.yaml
model:
  enabled: false  # Fall back to evidence-only mode
```
Issue: Hermes profile not found
Cause: The profile directory is not in the Hermes search path.
Solution:
```bash
# Add RunbookHermes profiles to Hermes config
export HERMES_PROFILE_PATH="./profiles/runbook-hermes:$HERMES_PROFILE_PATH"

# Or copy profile to Hermes profiles directory
cp -r profiles/runbook-hermes ~/.hermes/profiles/
```
Debug Mode
Enable verbose logging:
```python
import logging

logging.basicConfig(level=logging.DEBUG)
```
```bash
# Or set environment variable
export RUNBOOK_HERMES_LOG_LEVEL=DEBUG

# Run with debug output
hermes run \
  --profile runbook-hermes \
  --input "Debug payment service issue" \
  --debug \
  --trace-tools
```
Advanced Usage
Custom Tool Integration
Add domain-specific tools:
```python
# plugins/runbook_hermes/custom_tools.py
from agent.tools import Tool, ToolParameter


class CheckDatabaseConnectionTool(Tool):
    name = "check_database_connection"
    description = "Verify database connectivity and connection pool status"
    parameters = [
        ToolParameter(name="service", type="string", required=True),
        ToolParameter(name="db_name", type="string", required=True)
    ]

    def execute(self, service: str, db_name: str) -> dict:
        # Your custom logic
        return {
            "status": "healthy",
            "active_connections": 25,
            "max_connections": 100
        }


# Register tool
from plugins.runbook_hermes.registry import register_tool

register_tool(CheckDatabaseConnectionTool())
```
Custom Evidence Type
```python
# runbook_hermes/evidence/custom_evidence.py
from runbook_hermes.evidence import EvidenceCollector


class CostEvidenceCollector(EvidenceCollector):
    def collect(self, service: str, time_window: str) -> dict:
        # Collect cost metrics from billing API
        return {
            "type": "cost_spike",
            "service": service,
            "cost_increase_pct": 150,
            "period": time_window
        }


# Register collector
from runbook_hermes.evidence import register_collector

register_collector("cost", CostEvidenceCollector())
```
This skill enables AI coding agents to help developers deploy, configure, and operate RunbookHermes for production incident response with Hermes Agent integration.