RunbookHermes AIOps Agent Skill


Skill by ara.so — Hermes Skills collection.
RunbookHermes is a Hermes-native AIOps agent that specializes in incident response workflows. It extends Hermes Agent's runtime with evidence collection from observability tools (Prometheus, Loki, Jaeger), approval-gated remediation, checkpoint/rollback capabilities, and automatic runbook skill generation from resolved incidents.

What RunbookHermes Does


  • Evidence-driven incident analysis: Collects metrics, logs, traces, and deployment history
  • Approval-gated remediation: Requires human approval before risky actions
  • Runbook learning: Converts successful incident resolutions into reusable skills
  • Multi-channel intake: Accepts incidents from Web UI, Alertmanager, Feishu, WeCom, API
  • EvidenceStack context engine: Compresses observability data for model reasoning
  • IncidentMemory: Remembers service profiles, incident patterns, team preferences

Installation

Prerequisites

  • Python 3.10+
  • Hermes Agent (included as the `agent/` subdirectory)
  • Docker and Docker Compose (for the local payment demo environment)

Clone and Install

```bash
git clone https://github.com/Tommy-yw/RunbookHermes.git
cd RunbookHermes

# Install dependencies
pip install -r requirements.txt

# Or use Poetry
poetry install
```

Environment Configuration

Create a `.env` file in the project root:

```bash
# Model provider (optional, for AI-assisted summaries)
OPENAI_API_KEY=${OPENAI_API_KEY}
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o

# Observability backends

# Deploy history backend
DEPLOY_BACKEND_TYPE=local_json
DEPLOY_HISTORY_PATH=./data/payment_demo/deploy_history.json

# Execution backend (for rollback/remediation)
EXECUTION_BACKEND_TYPE=local_reference
EXECUTION_CONFIG_PATH=./data/payment_demo/execution_config.json

# Feishu integration (optional)
FEISHU_APP_ID=${FEISHU_APP_ID}
FEISHU_APP_SECRET=${FEISHU_APP_SECRET}

# WeCom integration (optional)
WECOM_CORP_ID=${WECOM_CORP_ID}
WECOM_AGENT_SECRET=${WECOM_AGENT_SECRET}

# Web API
RUNBOOK_API_HOST=0.0.0.0
RUNBOOK_API_PORT=8000
```
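The variables above can be read at startup with the standard library alone. The sketch below is illustrative only: `ApiSettings` and `load_api_settings` are hypothetical names, not part of RunbookHermes, and it assumes the `.env` values have already been exported into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ApiSettings:
    """Hypothetical container for a few of the variables listed above."""
    host: str
    port: int
    deploy_backend: str
    openai_api_key: Optional[str]  # optional: only needed for AI-assisted summaries


def load_api_settings(env: Mapping[str, str] = os.environ) -> ApiSettings:
    """Read settings from a mapping, falling back to the values shown in this guide."""
    port_raw = env.get("RUNBOOK_API_PORT", "8000")
    if not port_raw.isdigit():
        raise ValueError(f"RUNBOOK_API_PORT must be an integer, got {port_raw!r}")
    return ApiSettings(
        host=env.get("RUNBOOK_API_HOST", "0.0.0.0"),
        port=int(port_raw),
        deploy_backend=env.get("DEPLOY_BACKEND_TYPE", "local_json"),
        # Treat an empty key the same as a missing one.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing the mapping as an argument keeps the loader easy to test without touching the real environment.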

Start Local Payment Demo Environment

```bash
cd demo/payment_system
docker-compose up -d
cd ../..

# Verify services are running
curl http://localhost:8001/health  # payment-service
curl http://localhost:8002/health  # coupon-service
curl http://localhost:8003/health  # order-service
```

Start RunbookHermes API Server

```bash
# From project root
python -m apps.runbook_api.main

# Or with uvicorn directly
uvicorn apps.runbook_api.main:app --host 0.0.0.0 --port 8000 --reload
```

Access the Web Console at `http://localhost:8000`.

Core Concepts

1. Hermes Profile Integration

RunbookHermes runs as a Hermes Agent profile located at `profiles/runbook-hermes/`:

```yaml
# profiles/runbook-hermes/profile.yaml
name: runbook-hermes
version: 1.0.0
description: AIOps agent for incident response
persona: incident_responder
tools:
  - runbook-hermes
context_engine: evidence_stack
memory_provider: incident_memory
```

2. Evidence Collection Tools

The `runbook-hermes` tool plugin provides incident-response capabilities:

```python
# Example: Query metrics evidence
from plugins.runbook_hermes.tools import query_metrics

evidence = query_metrics(
    service="payment-service",
    metric_type="http_5xx_rate",
    time_window="5m",
)
```

Available tools in the plugin:
- `query_metrics` - Prometheus metrics collection
- `query_logs` - Loki log search
- `query_traces` - Jaeger trace analysis
- `get_deploy_history` - Recent deployment records
- `create_checkpoint` - Save system state before remediation
- `request_approval` - Gate risky actions
- `execute_rollback` - Controlled rollback execution
- `verify_recovery` - Post-remediation health check

3. EvidenceStack Context Engine

The engine compresses observability data for model consumption:

```python
from plugins.context_engine.evidence_stack.engine import EvidenceStackEngine

engine = EvidenceStackEngine()

# Add evidence
engine.add_evidence({
    "type": "metric",
    "service": "payment-service",
    "signal": "http_503_rate_spike",
    "value": "45 req/s",
    "severity": "critical",
})

# Get compressed context
context = engine.get_context()
# Returns: alert summary, key evidence, hypotheses, action plan
```
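To make the idea of compression concrete, here is a minimal sketch of one way evidence could be reduced before it reaches the model: rank items by severity and keep only the top few as one-line summaries. This is not the `EvidenceStackEngine` implementation; `SEVERITY_RANK` and `compress_evidence` are hypothetical, and only the dictionary keys come from the example above.

```python
from typing import Dict, List

# Assumed ordering; the severity names match those used elsewhere in this guide.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def compress_evidence(items: List[Dict[str, str]], max_items: int = 3) -> List[str]:
    """Return one summary line per item, most severe first, capped at max_items."""
    ranked = sorted(
        items,
        # Unknown severities sort after every known one.
        key=lambda e: SEVERITY_RANK.get(e.get("severity", "low"), len(SEVERITY_RANK)),
    )
    return [
        f"[{e.get('severity', 'low')}] {e.get('service', '?')}: "
        f"{e.get('signal', '?')} = {e.get('value', '?')}"
        for e in ranked[:max_items]
    ]
```

Capping the item count is the simplest way to bound prompt size; a real engine would likely also deduplicate and summarize.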

4. IncidentMemory Provider

The provider stores operational knowledge:

```python
from plugins.memory.incident_memory.provider import IncidentMemoryProvider

memory = IncidentMemoryProvider()

# Remember service profile
memory.save_service_profile("payment-service", {
    "critical_metrics": ["http_5xx_rate", "p95_latency"],
    "dependencies": ["coupon-service", "order-service"],
    "rollback_safe": True,
})

# Recall incident patterns
similar = memory.recall_similar_incidents(
    service="payment-service",
    symptom="http_503_spike",
)
```

Creating and Managing Incidents

Via Web Console

Navigate to `http://localhost:8000/incidents/create` and fill in the form:
  • Service name
  • Severity (critical, high, medium, low)
  • Description
  • Alert data (optional)

Via API

```python
import requests

response = requests.post("http://localhost:8000/api/incidents", json={
    "service": "payment-service",
    "severity": "critical",
    "description": "HTTP 503 rate spike detected",
    "alert": {
        "metric": "http_5xx_rate",
        "value": 45.2,
        "threshold": 5.0
    },
    "metadata": {
        "source": "alertmanager",
        "runbook_url": "https://wiki.example.com/payment-503"
    }
})

incident_id = response.json()["incident_id"]
```
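Before posting, it can help to validate the payload on the client side. The helper below is a sketch: `build_incident_payload` is a hypothetical name, and it assumes the API accepts exactly the four severities offered by the Web Console form.

```python
from typing import Any, Dict, Optional

# Severities listed in the Web Console form.
ALLOWED_SEVERITIES = ("critical", "high", "medium", "low")


def build_incident_payload(
    service: str,
    severity: str,
    description: str,
    alert: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /api/incidents, rejecting obvious mistakes early."""
    if not service.strip():
        raise ValueError("service must not be empty")
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {ALLOWED_SEVERITIES}, got {severity!r}")
    payload: Dict[str, Any] = {
        "service": service,
        "severity": severity,
        "description": description,
    }
    if alert is not None:  # alert data is optional
        payload["alert"] = alert
    return payload
```

The result can be passed directly as the `json=` argument of `requests.post`.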

Via Hermes CLI

```bash
# Run incident response through Hermes profile
hermes run \
  --profile runbook-hermes \
  --input "Payment service showing HTTP 503 errors at 45 req/s" \
  --context '{"service": "payment-service", "severity": "critical"}'
```

Via Alertmanager Webhook

Configure Alertmanager to send webhooks:

```yaml
# alertmanager.yml
receivers:
```

Approval Workflow

RunbookHermes gates risky actions behind approval:

```python
# In your incident response logic
from runbook_hermes.approval import ApprovalManager
# Assumption: execute_rollback is exported by the plugin's tools module, like query_metrics
from plugins.runbook_hermes.tools import execute_rollback

approval_mgr = ApprovalManager()

# Request approval for rollback
approval_id = approval_mgr.request_approval(
    incident_id="inc_001",
    action_type="rollback",
    target_service="payment-service",
    target_version="v1.2.3",
    risk_level="high",
    reason="Rollback to last known good version due to 503 spike",
    checkpoint_id="chk_001",
)

# Check approval status
status = approval_mgr.get_status(approval_id)
if status == "approved":
    # Execute rollback
    execute_rollback(service="payment-service", version="v1.2.3")
```
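Because approval is asynchronous, the status check usually sits in a polling loop with a timeout. The sketch below takes the status lookup as a callable so it does not depend on `ApprovalManager` internals; `wait_for_approval`, the `"rejected"` status string, and the timing defaults are assumptions rather than documented behavior.

```python
import time
from typing import Callable


def wait_for_approval(
    get_status: Callable[[], str],
    timeout_seconds: float = 1800.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the request is approved or rejected, or the timeout elapses.

    Returns True only on an explicit "approved"; a timeout counts as not approved.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "approved":
            return True
        if status == "rejected":  # assumed status name
            return False
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
```

With the manager from the example, a call might look like `wait_for_approval(lambda: approval_mgr.get_status(approval_id))`. Treating a timeout as a refusal keeps the gate fail-safe.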

Approve via Web Console

Navigate to `http://localhost:8000/approvals` to review and approve or reject pending actions.

Approve via API

```python
import requests

# approval_id is the value returned by request_approval
requests.post(f"http://localhost:8000/api/approvals/{approval_id}/approve", json={
    "operator": "alice",
    "comment": "Approved after verifying checkpoint"
})
```

Checkpoint and Rollback

Create Checkpoint Before Remediation

```python
from runbook_hermes.checkpoint import CheckpointManager

checkpoint_mgr = CheckpointManager()

checkpoint = checkpoint_mgr.create(
    incident_id="inc_001",
    service="payment-service",
    snapshot_type="deployment",
    metadata={
        "current_version": "v1.2.4",
        "replica_count": 3,
        "config_hash": "abc123"
    }
)
```

Execute Rollback

```python
from runbook_hermes.remediation import RemediationExecutor

executor = RemediationExecutor()

result = executor.rollback(
    service="payment-service",
    target_version="v1.2.3",
    checkpoint_id=checkpoint.id,
    dry_run=False
)

# Verify recovery
recovery_status = executor.verify_recovery(
    service="payment-service",
    expected_metrics={"http_5xx_rate": "<5"},
)
```
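The `expected_metrics` argument pairs a metric name with a threshold expression such as `"<5"`. How RunbookHermes evaluates these expressions is not documented here; the sketch below shows one plausible interpretation, with `meets_expectation` as a hypothetical helper.

```python
import operator
import re
from typing import Callable, Dict

# Two-character operators are listed first in the regex so "<=" is not read as "<".
_OPS: Dict[str, Callable[[float, float], bool]] = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}
_EXPR = re.compile(r"^\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def meets_expectation(observed: float, expression: str) -> bool:
    """Evaluate a threshold expression like "<5" against an observed value."""
    match = _EXPR.match(expression)
    if match is None:
        raise ValueError(f"unsupported threshold expression: {expression!r}")
    op, limit = match.group(1), float(match.group(2))
    return _OPS[op](observed, limit)
```

Rejecting unparseable expressions outright avoids silently reporting a recovery that was never checked.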

Runbook Skill Generation

After resolving an incident, generate a reusable skill:

```python
from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()

skill = generator.generate_from_incident(
    incident_id="inc_001",
    skill_name="payment-http-503-rollback",
    trigger_conditions=["payment service 503 spike", "payment 5xx rate > 40"],
    steps=[
        "collect_evidence",
        "verify_deploy_change",
        "create_checkpoint",
        "request_approval",
        "rollback_deployment",
        "verify_recovery"
    ]
)

# Save to skills directory
skill.save("skills/runbooks/payment-http-503-rollback.yaml")
```

Generated skill format:

```yaml
# skills/runbooks/payment-http-503-rollback.yaml
name: payment-http-503-rollback
version: 1.0.0
triggers:
  - payment service 503 spike
  - payment 5xx rate > 40
steps:
  - name: collect_evidence
    tool: query_metrics
    params:
      service: payment-service
      metric: http_5xx_rate
  - name: verify_deploy_change
    tool: get_deploy_history
    params:
      service: payment-service
      limit: 5
  - name: create_checkpoint
    tool: create_checkpoint
  - name: request_approval
    tool: request_approval
    risk_level: high
  - name: rollback_deployment
    tool: execute_rollback
  - name: verify_recovery
    tool: verify_recovery
```
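A generated skill is only safe if the checkpoint and approval steps come before the rollback. The check below is a sketch of such a lint over a list of step names; `validate_step_order` is hypothetical, and the step names are taken from the example skill.

```python
from typing import List, Sequence


def validate_step_order(
    steps: Sequence[str],
    risky_step: str = "rollback_deployment",
    prerequisites: Sequence[str] = ("create_checkpoint", "request_approval"),
) -> List[str]:
    """Return a list of problems; an empty list means the ordering looks safe."""
    problems: List[str] = []
    ordered = list(steps)
    if risky_step not in ordered:
        return problems  # nothing risky to gate
    risky_index = ordered.index(risky_step)
    for prereq in prerequisites:
        if prereq not in ordered:
            problems.append(f"missing step: {prereq}")
        elif ordered.index(prereq) > risky_index:
            problems.append(f"{prereq} must run before {risky_step}")
    return problems
```

Running a check like this before saving a skill would catch a runbook that rolls back first and asks for approval afterwards.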

Observability Integration

Prometheus Metrics

```python
from integrations.observability.prometheus_adapter import PrometheusAdapter

prom = PrometheusAdapter(base_url="http://localhost:9090")

# Query current 5xx rate
result = prom.query_range(
    query='rate(http_requests_total{status=~"5..", service="payment-service"}[5m])',
    start="-15m",
    end="now",
    step="30s",
)

# Extract evidence
if result.has_spike(threshold=5.0):
    evidence = {
        "type": "metric",
        "signal": "http_5xx_spike",
        "max_value": result.max_value(),
        "timestamp": result.max_timestamp(),
    }
```
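For readers without the adapter at hand, a similar check can be run against the raw JSON returned by the Prometheus HTTP API range-query endpoint, whose matrix results carry `[timestamp, "value"]` pairs. `find_peak` and `is_spike` are hypothetical helpers, and whether `has_spike` works this way internally is an assumption.

```python
from typing import Any, Dict, Optional, Tuple


def find_peak(response: Dict[str, Any]) -> Optional[Tuple[float, float]]:
    """Return (timestamp, value) of the largest sample across all series, or None."""
    peak: Optional[Tuple[float, float]] = None
    for series in response.get("data", {}).get("result", []):
        for timestamp, raw_value in series.get("values", []):
            value = float(raw_value)  # Prometheus encodes sample values as strings
            if value != value:  # NaN never equals itself; skip such samples
                continue
            if peak is None or value > peak[1]:
                peak = (float(timestamp), value)
    return peak


def is_spike(response: Dict[str, Any], threshold: float) -> bool:
    """True when any sample exceeds the threshold."""
    peak = find_peak(response)
    return peak is not None and peak[1] > threshold
```

Returning the timestamp alongside the value makes it easy to line the peak up with the deployment history.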

Loki Logs

```python
from integrations.observability.loki_adapter import LokiAdapter

loki = LokiAdapter(base_url="http://localhost:3100")

# Search error logs
logs = loki.query_range(
    query='{service="payment-service"} |= "error" | json',
    start="-15m",
    limit=100,
)

# Extract patterns
error_patterns = logs.extract_patterns(min_frequency=5)
```
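Pattern extraction typically means grouping log lines that differ only in their variable parts. The sketch below masks numbers and long hex identifiers and counts what remains; it illustrates the idea behind a `min_frequency` cutoff and is not the `extract_patterns` implementation. `extract_error_patterns` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b")  # request IDs, trace IDs, hashes
_NUMBER = re.compile(r"\d+")


def extract_error_patterns(
    lines: Iterable[str], min_frequency: int = 5
) -> List[Tuple[str, int]]:
    """Mask variable tokens, then return (pattern, count) pairs seen often enough."""
    counts: Counter = Counter()
    for line in lines:
        pattern = _HEX_ID.sub("<id>", line.strip().lower())
        pattern = _NUMBER.sub("<n>", pattern)
        counts[pattern] += 1
    # most_common() orders patterns from most to least frequent.
    return [(p, c) for p, c in counts.most_common() if c >= min_frequency]
```

Masking before counting is what lets three timeouts with different durations collapse into one pattern.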

Jaeger Traces

```python
from integrations.observability.jaeger_adapter import JaegerAdapter

jaeger = JaegerAdapter(base_url="http://localhost:16686")

# Find slow traces
traces = jaeger.search_traces(
    service="payment-service",
    start="-15m",
    min_duration="500ms",
    limit=20,
)

# Analyze error traces
for trace in traces.with_errors():
    root_cause_span = trace.find_slowest_span()
```
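The two operations used above, filtering for errors and picking the slowest span, can be sketched over plain dictionaries. This assumes the span layout of Jaeger's JSON query output (`spans`, `operationName`, `duration` in microseconds, `tags` as key/value entries); `has_error` and `slowest_span` are hypothetical helpers, not the adapter's methods.

```python
from typing import Any, Dict, List, Optional


def has_error(span: Dict[str, Any]) -> bool:
    """A span counts as failed when it carries an `error` tag set to true."""
    return any(
        tag.get("key") == "error" and tag.get("value") in (True, "true")
        for tag in span.get("tags", [])
    )


def slowest_span(trace: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the span with the largest duration, or None for an empty trace."""
    spans: List[Dict[str, Any]] = trace.get("spans", [])
    return max(spans, key=lambda s: s.get("duration", 0), default=None)
```

The longest span is often a parent that merely waits on a child, so it is best treated as a starting point for investigation rather than a confirmed root cause.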

Running Hermes Agent with RunbookHermes Profile

Direct CLI Invocation

```bash
# Run incident triage
hermes run \
  --profile runbook-hermes \
  --input "Payment service p95 latency is 2.5s, normal is 200ms" \
  --verbose

# Run with specific tool selection
hermes run \
  --profile runbook-hermes \
  --input "Check payment service deployment history" \
  --tools query_metrics,get_deploy_history
```

Programmatic Invocation

```python
from agent.runtime import HermesRuntime
from agent.config import AgentConfig

config = AgentConfig(
    profile="runbook-hermes",
    tools=["runbook-hermes"],
    context_engine="evidence_stack",
    memory_provider="incident_memory"
)

runtime = HermesRuntime(config)

response = runtime.run(
    input_text="Investigate payment-service HTTP 503 spike",
    context={
        "service": "payment-service",
        "incident_id": "inc_001",
        "severity": "critical"
    }
)

print(response.final_answer)
print(response.evidence_chain)
print(response.recommended_actions)
```

Common Patterns

Pattern 1: Full Incident Response Workflow

```python
from runbook_hermes.workflow import IncidentResponseWorkflow

workflow = IncidentResponseWorkflow()

# Execute end-to-end
result = workflow.execute(
    service="payment-service",
    symptom="http_503_spike",
    severity="critical",
    auto_approve=False,  # Require human approval
)

print(f"Root cause: {result.root_cause}")
print(f"Remediation: {result.remediation_action}")
print(f"Status: {result.status}")
```

Pattern 2: Evidence-Driven Diagnosis

```python
from runbook_hermes.diagnosis import EvidenceDiagnosis

diagnosis = EvidenceDiagnosis(service="payment-service")

# Collect all evidence types
diagnosis.collect_metrics(time_window="15m")
diagnosis.collect_logs(time_window="15m", error_only=True)
diagnosis.collect_traces(time_window="15m", min_duration="500ms")
diagnosis.collect_deploy_history(limit=10)

# Analyze
root_cause = diagnosis.analyze()

print(f"Most likely cause: {root_cause.hypothesis}")
print(f"Confidence: {root_cause.confidence}")
print(f"Supporting evidence: {root_cause.evidence_ids}")
```

Pattern 3: Safe Remediation with Approval

```python
from runbook_hermes.remediation import SafeRemediation

remediation = SafeRemediation(incident_id="inc_001")

# Plan action
plan = remediation.plan_rollback(
    service="payment-service",
    target_version="v1.2.3",
)

# Create checkpoint
checkpoint = remediation.create_checkpoint()

# Request approval (blocks until human decision)
approval = remediation.request_approval(
    action=plan,
    checkpoint=checkpoint,
    timeout_minutes=30,
)

if approval.is_approved():
    # Execute with dry-run first
    dry_run_result = remediation.execute(dry_run=True)

    if dry_run_result.success:
        # Real execution
        result = remediation.execute(dry_run=False)

        # Verify recovery
        if remediation.verify_recovery():
            print("Remediation successful")
        else:
            # Auto-rollback to checkpoint
            remediation.restore_checkpoint(checkpoint.id)
```

Pattern 4: Multi-Service Impact Analysis

```python
from runbook_hermes.topology import ServiceTopology

topology = ServiceTopology()

# Build dependency graph
graph = topology.build_graph(
    root_service="payment-service",
    depth=2,
)

# Analyze impact
impact = topology.analyze_impact(
    failing_service="payment-service",
    failure_type="http_503",
)

print(f"Directly impacted: {impact.direct}")
print(f"Indirectly impacted: {impact.indirect}")
print(f"Suggested investigation order: {impact.priority_list}")
```
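The split between direct and indirect impact can be illustrated with a breadth-first walk over a map of callers. This is a sketch under stated assumptions, not the `ServiceTopology` implementation: `impacted_services` is hypothetical, and the caller map in the usage note uses made-up service names.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def impacted_services(
    dependents: Dict[str, List[str]], failing_service: str, depth: int = 2
) -> Tuple[Set[str], Set[str]]:
    """Breadth-first walk over a 'who calls whom' map.

    dependents[x] lists the services that call x. Returns (direct, indirect):
    direct callers are one hop away, indirect ones are 2..depth hops away.
    """
    direct: Set[str] = set()
    indirect: Set[str] = set()
    seen = {failing_service}
    queue = deque([(failing_service, 0)])
    while queue:
        service, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for caller in dependents.get(service, []):
            if caller in seen:
                continue  # also guards against cycles in the graph
            seen.add(caller)
            (direct if hops == 0 else indirect).add(caller)
            queue.append((caller, hops + 1))
    return direct, indirect
```

For example, with `{"payment-service": ["order-service"], "order-service": ["web-gateway"]}` as the caller map, a failure in `payment-service` reaches `order-service` directly and `web-gateway` indirectly.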

Configuration Reference

RunbookHermes Config File

Create `config/runbook_hermes.yaml`:

```yaml
# Incident response settings
incident:
  auto_create_from_alert: true
  default_severity: high
  evidence_collection_timeout: 300  # seconds

# Evidence collection
evidence:
  metrics:
    enabled: true
    time_window: 15m
    retention_days: 30
  logs:
    enabled: true
    max_lines: 1000
    error_patterns_only: false
  traces:
    enabled: true
    sample_limit: 100
    min_duration: 200ms

# Approval settings
approval:
  required_for:
    - rollback
    - restart
    - config_change
    - scale_down
  auto_approve_on_critical: false
  approval_timeout_minutes: 30
  require_checkpoint: true

# Remediation
remediation:
  dry_run_first: true
  verify_recovery: true
  recovery_check_interval: 30  # seconds
  max_recovery_wait: 300  # seconds
  auto_rollback_on_failure: true

# Runbook skill generation
skills:
  auto_generate: true
  min_success_count: 1
  output_dir: skills/runbooks

# Model-assisted analysis (optional)
model:
  enabled: true
  provider: openai
  temperature: 0.3
  max_tokens: 2000
```
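To show how the `approval` section might be consulted at runtime, the sketch below works on the configuration after it has been parsed into a dictionary. `requires_approval` is a hypothetical helper, and the meaning given to `auto_approve_on_critical` (critical incidents skip the gate only when it is enabled) is an assumption.

```python
from typing import Any, Dict


def requires_approval(action_type: str, severity: str, config: Dict[str, Any]) -> bool:
    """Decide whether an action must wait for a human, given the approval section."""
    approval = config.get("approval", {})
    if action_type not in approval.get("required_for", []):
        return False
    # Assumed semantics: critical incidents bypass the gate only when explicitly enabled.
    if severity == "critical" and approval.get("auto_approve_on_critical", False):
        return False
    return True
```

With the values shown above, a rollback on a critical incident still waits for approval because `auto_approve_on_critical` is `false`.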

Tool Configuration

```yaml
# plugins/runbook_hermes/config.yaml
tools:
  query_metrics:
    timeout: 30
    max_results: 1000
  query_logs:
    timeout: 60
    max_lines: 5000
  query_traces:
    timeout: 45
    max_traces: 200
  execute_rollback:
    require_approval: true
    require_checkpoint: true
    dry_run_first: true
```

Troubleshooting

Issue: Evidence collection returns empty results

Cause: Observability backends are not reachable, or there is no data in the time window
Solution:

```python
# Test backend connectivity
from integrations.observability.health import check_backends

health = check_backends()
print(f"Prometheus: {health['prometheus']}")
print(f"Loki: {health['loki']}")
print(f"Jaeger: {health['jaeger']}")

# Verify time window
# Ensure time_window matches your metric retention
from plugins.runbook_hermes.tools import query_metrics

evidence = query_metrics(
    service="payment-service",
    time_window="1h",  # Increase window
)
```

Issue: Approval requests time out

Cause: No operator reviews the approval request in time
Solution:

```yaml
# config/runbook_hermes.yaml
approval:
  approval_timeout_minutes: 60  # Increase timeout
  fallback_to_auto_reject: false  # Prevent auto-reject

# Or configure notification
notification:
  on_approval_request:
    - type: feishu
      webhook_url: ${FEISHU_APPROVAL_WEBHOOK}
```

Issue: Runbook skills not generating

Cause: Incident not marked as resolved, or evidence is missing
Solution:

```python
# Explicitly mark incident resolved
from runbook_hermes.incident import IncidentManager

mgr = IncidentManager()
mgr.mark_resolved(
    incident_id="inc_001",
    resolution="Rolled back to v1.2.3",
    root_cause="Bad deployment v1.2.4",
)

# Manually trigger skill generation
from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()
skill = generator.generate_from_incident("inc_001")
skill.save()
```

Issue: Model-assisted summaries failing

Cause: Model API key not configured or endpoint unreachable
Solution:

```bash
# Verify environment variables
echo $OPENAI_API_KEY
echo $OPENAI_BASE_URL

# Test model connectivity
curl $OPENAI_BASE_URL/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

Disable the model if it is not needed:

```yaml
# config/runbook_hermes.yaml
model:
  enabled: false  # Fall back to evidence-only mode
```

Issue: Hermes profile not found

Cause: Profile directory not in Hermes search path
Solution:

```bash
# Add RunbookHermes profiles to Hermes config
export HERMES_PROFILE_PATH="./profiles/runbook-hermes:$HERMES_PROFILE_PATH"

# Or copy profile to Hermes profiles directory
cp -r profiles/runbook-hermes ~/.hermes/profiles/
```

Debug Mode

Enable verbose logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

```bash
# Or set environment variable
export RUNBOOK_HERMES_LOG_LEVEL=DEBUG

# Run with debug output
hermes run \
  --profile runbook-hermes \
  --input "Debug payment service issue" \
  --debug \
  --trace-tools
```

Advanced Usage

Custom Tool Integration

Add domain-specific tools:

```python
# plugins/runbook_hermes/custom_tools.py
from agent.tools import Tool, ToolParameter


class CheckDatabaseConnectionTool(Tool):
    name = "check_database_connection"
    description = "Verify database connectivity and connection pool status"

    parameters = [
        ToolParameter(name="service", type="string", required=True),
        ToolParameter(name="db_name", type="string", required=True),
    ]

    def execute(self, service: str, db_name: str) -> dict:
        # Your custom logic
        return {
            "status": "healthy",
            "active_connections": 25,
            "max_connections": 100,
        }


# Register tool
from plugins.runbook_hermes.registry import register_tool

register_tool(CheckDatabaseConnectionTool())
```

Custom Evidence Type

```python
# runbook_hermes/evidence/custom_evidence.py
from runbook_hermes.evidence import EvidenceCollector


class CostEvidenceCollector(EvidenceCollector):
    def collect(self, service: str, time_window: str) -> dict:
        # Collect cost metrics from billing API
        return {
            "type": "cost_spike",
            "service": service,
            "cost_increase_pct": 150,
            "period": time_window,
        }


# Register collector
from runbook_hermes.evidence import register_collector

register_collector("cost", CostEvidenceCollector())
```

This skill enables AI coding agents to help developers deploy, configure, and operate RunbookHermes for production incident response with Hermes Agent integration.