RunbookHermes AIOps Agent Skill

Skill by ara.so — Hermes Skills collection.

RunbookHermes is a Hermes-native AIOps agent that specializes in incident response workflows. It extends Hermes Agent's runtime with evidence collection from observability tools (Prometheus, Loki, Jaeger), approval-gated remediation, checkpoint/rollback capabilities, and automatic runbook skill generation from resolved incidents.

What RunbookHermes Does

- Evidence-driven incident analysis: collects metrics, logs, traces, and deployment history
- Approval-gated remediation: requires human approval before risky actions
- Runbook learning: converts successful incident resolutions into reusable skills
- Multi-channel intake: accepts incidents from the Web UI, Alertmanager, Feishu, WeCom, and the API
- EvidenceStack context engine: compresses observability data for model reasoning
- IncidentMemory: remembers service profiles, incident patterns, and team preferences

Installation

Prerequisites

- Python 3.10+
- Hermes Agent (included as the `agent/` subdirectory)
- Docker and Docker Compose (for the local payment demo environment)

Clone and Install

bash

git clone https://github.com/Tommy-yw/RunbookHermes.git
cd RunbookHermes

# Install dependencies
pip install -r requirements.txt

# Or use Poetry
poetry install

Environment Configuration

Create a `.env` file in the project root:

bash

# Model provider (optional, for AI-assisted summaries)
OPENAI_API_KEY=${OPENAI_API_KEY}
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o

# Observability backends
PROMETHEUS_URL=http://localhost:9090
LOKI_URL=http://localhost:3100
JAEGER_URL=http://localhost:16686

# Deploy history backend
DEPLOY_BACKEND_TYPE=local_json
DEPLOY_HISTORY_PATH=./data/payment_demo/deploy_history.json

# Execution backend (for rollback/remediation)
EXECUTION_BACKEND_TYPE=local_reference
EXECUTION_CONFIG_PATH=./data/payment_demo/execution_config.json

# Feishu integration (optional)
FEISHU_APP_ID=${FEISHU_APP_ID}
FEISHU_APP_SECRET=${FEISHU_APP_SECRET}

# WeCom integration (optional)
WECOM_CORP_ID=${WECOM_CORP_ID}
WECOM_AGENT_SECRET=${WECOM_AGENT_SECRET}

# Web API
RUNBOOK_API_HOST=0.0.0.0
RUNBOOK_API_PORT=8000
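As a minimal sketch of how a process might consume these variables, the snippet below reads a few of them from a mapping (such as `os.environ`) into a small settings object. The `Settings` class and `load_settings` helper are hypothetical names used for illustration only and are not part of RunbookHermes; the fallback values simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the .env values."""
    prometheus_url: str
    loki_url: str
    jaeger_url: str
    api_host: str
    api_port: int
    openai_api_key: Optional[str]  # optional: only used for AI-assisted summaries


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the example values."""
    return Settings(
        prometheus_url=env.get("PROMETHEUS_URL", "http://localhost:9090"),
        loki_url=env.get("LOKI_URL", "http://localhost:3100"),
        jaeger_url=env.get("JAEGER_URL", "http://localhost:16686"),
        api_host=env.get("RUNBOOK_API_HOST", "0.0.0.0"),
        api_port=int(env.get("RUNBOOK_API_PORT", "8000")),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing the mapping explicitly keeps the loader easy to exercise without touching the real environment.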

Start Local Payment Demo Environment

bash

cd demo/payment_system
docker-compose up -d
cd ../..

# Verify services are running
curl http://localhost:8001/health  # payment-service
curl http://localhost:8002/health  # coupon-service
curl http://localhost:8003/health  # order-service

Start RunbookHermes API Server

bash

# From project root
python -m apps.runbook_api.main

# Or with uvicorn directly
uvicorn apps.runbook_api.main:app --host 0.0.0.0 --port 8000 --reload

Access the Web Console at `http://localhost:8000`.

Core Concepts

1. Hermes Profile Integration

RunbookHermes runs as a Hermes Agent profile located at `profiles/runbook-hermes/`:

yaml

# profiles/runbook-hermes/profile.yaml
name: runbook-hermes
version: 1.0.0
description: AIOps agent for incident response
persona: incident_responder
tools:
  - runbook-hermes
context_engine: evidence_stack
memory_provider: incident_memory

2. Evidence Collection Tools

The `runbook-hermes` tool plugin provides incident-response capabilities:

python

# Example: Query metrics evidence
from plugins.runbook_hermes.tools import query_metrics

evidence = query_metrics(
    service="payment-service",
    metric_type="http_5xx_rate",
    time_window="5m",
)

Available tools in the plugin:
- `query_metrics` - Prometheus metrics collection
- `query_logs` - Loki log search
- `query_traces` - Jaeger trace analysis
- `get_deploy_history` - Recent deployment records
- `create_checkpoint` - Save system state before remediation
- `request_approval` - Gate risky actions
- `execute_rollback` - Controlled rollback execution
- `verify_recovery` - Post-remediation health check

3. EvidenceStack Context Engine

Compresses observability data for model consumption:

python

from plugins.context_engine.evidence_stack.engine import EvidenceStackEngine

engine = EvidenceStackEngine()

# Add evidence
engine.add_evidence({
    "type": "metric",
    "service": "payment-service",
    "signal": "http_503_rate_spike",
    "value": "45 req/s",
    "severity": "critical",
})

# Get compressed context
context = engine.get_context()
# Returns: alert summary, key evidence, hypotheses, action plan
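One way to picture what "compression" could mean here is to rank evidence by severity and keep only the top few items as short summary strings. The sketch below does exactly that and nothing more; the severity ordering and the `compress_evidence` helper are assumptions made for illustration, and the actual EvidenceStack engine may work quite differently.

```python
from typing import Dict, List

# Assumed severity ordering; a lower rank means more important.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def compress_evidence(items: List[Dict[str, str]], max_items: int = 3) -> Dict[str, object]:
    """Keep the most severe evidence items and report how many were dropped."""
    ranked = sorted(
        items,
        key=lambda e: SEVERITY_RANK.get(e.get("severity", "low"), 3),
    )
    kept = ranked[:max_items]
    return {
        "key_evidence": [
            f'{e["service"]}: {e["signal"]} ({e.get("value", "n/a")})' for e in kept
        ],
        "dropped_count": len(ranked) - len(kept),
    }
```

Because `sorted` is stable, items with equal severity keep their insertion order.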

4. IncidentMemory Provider

Stores operational knowledge:

python

from plugins.memory.incident_memory.provider import IncidentMemoryProvider

memory = IncidentMemoryProvider()

# Remember service profile
memory.save_service_profile("payment-service", {
    "critical_metrics": ["http_5xx_rate", "p95_latency"],
    "dependencies": ["coupon-service", "order-service"],
    "rollback_safe": True,
})

# Recall incident patterns
similar = memory.recall_similar_incidents(
    service="payment-service",
    symptom="http_503_spike",
)

Creating and Managing Incidents

Via Web Console

Navigate to `http://localhost:8000/incidents/create` and fill in the form:

- Service name
- Severity (critical, high, medium, low)
- Description
- Alert data (optional)

Via API

python

import requests

response = requests.post("http://localhost:8000/api/incidents", json={
    "service": "payment-service",
    "severity": "critical",
    "description": "HTTP 503 rate spike detected",
    "alert": {
        "metric": "http_5xx_rate",
        "value": 45.2,
        "threshold": 5.0
    },
    "metadata": {
        "source": "alertmanager",
        "runbook_url": "https://wiki.example.com/payment-503"
    }
})

incident_id = response.json()["incident_id"]

Via Hermes CLI

bash

# Run incident response through the Hermes profile
hermes run \
  --profile runbook-hermes \
  --input "Payment service showing HTTP 503 errors at 45 req/s" \
  --context '{"service": "payment-service", "severity": "critical"}'

Via Alertmanager Webhook

Configure Alertmanager to send webhooks:

yaml

# alertmanager.yml
receivers:
  - name: runbook-hermes
    webhook_configs:
      - url: http://localhost:8000/gateway/alertmanager
        send_resolved: true
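To illustrate what a receiver on the other end of this webhook has to do, here is a minimal sketch that maps an Alertmanager webhook payload to incident dictionaries shaped like the API example above. The `alerts` list with per-alert `status`, `labels`, and `annotations` follows Alertmanager's webhook format; the `alerts_to_incidents` helper, the use of a `service` label, and the fallback severity are assumptions, not the actual gateway implementation.

```python
from typing import Any, Dict, List


def alerts_to_incidents(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map firing alerts in an Alertmanager webhook payload to incident dicts."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications do not open new incidents
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append({
            "service": labels.get("service", "unknown"),  # assumed label name
            "severity": labels.get("severity", "high"),   # assumed fallback
            "description": annotations.get("summary", labels.get("alertname", "")),
            "metadata": {
                "source": "alertmanager",
                "runbook_url": annotations.get("runbook_url"),
            },
        })
    return incidents
```

With `send_resolved: true`, the same endpoint also receives resolved notifications, which this sketch simply skips.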

Approval Workflow

RunbookHermes gates risky actions behind approval:

python

# In your incident response logic
from runbook_hermes.approval import ApprovalManager

approval_mgr = ApprovalManager()

# Request approval for rollback
approval_id = approval_mgr.request_approval(
    incident_id="inc_001",
    action_type="rollback",
    target_service="payment-service",
    target_version="v1.2.3",
    risk_level="high",
    reason="Rollback to last known good version due to 503 spike",
    checkpoint_id="chk_001",
)

# Check approval status
status = approval_mgr.get_status(approval_id)
if status == "approved":
    # Execute rollback
    execute_rollback(service="payment-service", version="v1.2.3")
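A single status check only works if the decision has already been made. The sketch below shows one way a caller might wait for a decision with a timeout. It is independent of the real `ApprovalManager`: the `wait_for_approval` helper and the `"pending"` and `"timed_out"` status strings are assumptions for illustration, and only `"approved"` appears in the example above.

```python
import time
from typing import Callable


def wait_for_approval(
    get_status: Callable[[], str],
    timeout_seconds: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status leaves "pending" or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status != "pending":
            return status  # e.g. "approved" or a rejection status
        if clock() >= deadline:
            return "timed_out"
        sleep(poll_interval)
```

In practice `get_status` could be something like `lambda: approval_mgr.get_status(approval_id)`. The injectable `sleep` and `clock` arguments exist so the loop can be exercised without real waiting.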

Approve via Web Console

Navigate to `http://localhost:8000/approvals` to review and approve or reject pending actions.

Approve via API

python

requests.post(f"http://localhost:8000/api/approvals/{approval_id}/approve", json={
    "operator": "alice",
    "comment": "Approved after verifying checkpoint"
})

Checkpoint and Rollback

Create Checkpoint Before Remediation

python

from runbook_hermes.checkpoint import CheckpointManager

checkpoint_mgr = CheckpointManager()

checkpoint = checkpoint_mgr.create(
    incident_id="inc_001",
    service="payment-service",
    snapshot_type="deployment",
    metadata={
        "current_version": "v1.2.4",
        "replica_count": 3,
        "config_hash": "abc123"
    }
)

Execute Rollback

python

from runbook_hermes.remediation import RemediationExecutor

executor = RemediationExecutor()

result = executor.rollback(
    service="payment-service",
    target_version="v1.2.3",
    checkpoint_id=checkpoint.id,
    dry_run=False
)

# Verify recovery
recovery_status = executor.verify_recovery(
    service="payment-service",
    expected_metrics={"http_5xx_rate": "<5"},
)
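The `expected_metrics` argument expresses bounds as strings such as `"<5"`. As a rough sketch of how such expressions could be evaluated against observed values, the helper below parses a comparison operator and a number and checks every metric. The `metrics_recovered` function and the set of supported operators are assumptions for illustration; the source does not document the expression grammar that `verify_recovery` accepts.

```python
import operator
import re
from typing import Dict

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
_EXPR = re.compile(r"^\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def metrics_recovered(observed: Dict[str, float], expected: Dict[str, str]) -> bool:
    """Return True only if every expected metric is present and within its bound."""
    for name, expr in expected.items():
        match = _EXPR.match(expr)
        if match is None:
            raise ValueError(f"unsupported expression for {name}: {expr!r}")
        op, bound = match.group(1), float(match.group(2))
        if name not in observed or not _OPS[op](observed[name], bound):
            return False
    return True
```

A missing metric counts as "not recovered" here, which is the conservative choice when a failed check triggers a restore.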

Runbook Skill Generation

After resolving an incident, generate a reusable skill:

python

from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()

skill = generator.generate_from_incident(
    incident_id="inc_001",
    skill_name="payment-http-503-rollback",
    trigger_conditions=["payment service 503 spike", "payment 5xx rate > 40"],
    steps=[
        "collect_evidence",
        "verify_deploy_change",
        "create_checkpoint",
        "request_approval",
        "rollback_deployment",
        "verify_recovery"
    ]
)

# Save to skills directory
skill.save("skills/runbooks/payment-http-503-rollback.yaml")

Generated skill format:

```yaml
# skills/runbooks/payment-http-503-rollback.yaml
name: payment-http-503-rollback
version: 1.0.0
triggers:
  - payment service 503 spike
  - payment 5xx rate > 40
steps:
  - name: collect_evidence
    tool: query_metrics
    params:
      service: payment-service
      metric: http_5xx_rate
  - name: verify_deploy_change
    tool: get_deploy_history
    params:
      service: payment-service
      limit: 5
  - name: create_checkpoint
    tool: create_checkpoint
  - name: request_approval
    tool: request_approval
    risk_level: high
  - name: rollback_deployment
    tool: execute_rollback
  - name: verify_recovery
    tool: verify_recovery
```

Observability Integration

Prometheus Metrics

python

from integrations.observability.prometheus_adapter import PrometheusAdapter

prom = PrometheusAdapter(base_url="http://localhost:9090")

# Query current 5xx rate
result = prom.query_range(
    query='rate(http_requests_total{status=~"5..", service="payment-service"}[5m])',
    start="-15m",
    end="now",
    step="30s",
)

# Extract evidence
if result.has_spike(threshold=5.0):
    evidence = {
        "type": "metric",
        "signal": "http_5xx_spike",
        "max_value": result.max_value(),
        "timestamp": result.max_timestamp(),
    }
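For readers who want to see what a spike check might look like underneath the adapter, here is a sketch that works directly on the JSON body of a Prometheus range query, where `data.result` is a list of series and each series has `values` as `[timestamp, "value"]` pairs. The `max_sample` and `has_spike` functions are hypothetical stand-ins and are not the adapter's own methods.

```python
from typing import Any, Dict, Optional, Tuple


def max_sample(response: Dict[str, Any]) -> Optional[Tuple[float, float]]:
    """Return (timestamp, value) of the largest sample across all series, or None."""
    best: Optional[Tuple[float, float]] = None
    for series in response.get("data", {}).get("result", []):
        for ts, raw in series.get("values", []):
            value = float(raw)  # Prometheus encodes sample values as strings
            if best is None or value > best[1]:
                best = (float(ts), value)
    return best


def has_spike(response: Dict[str, Any], threshold: float) -> bool:
    """True if any sample in the response exceeds the threshold."""
    peak = max_sample(response)
    return peak is not None and peak[1] > threshold
```

An empty result returns `None` rather than zero, so "no data" and "no spike" stay distinguishable.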

Loki Logs

python

from integrations.observability.loki_adapter import LokiAdapter

loki = LokiAdapter(base_url="http://localhost:3100")

# Search error logs
logs = loki.query_range(
    query='{service="payment-service"} |= "error" | json',
    start="-15m",
    limit=100,
)

# Extract patterns
error_patterns = logs.extract_patterns(min_frequency=5)
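Pattern extraction can be approximated by normalizing the variable parts of each log line and counting the resulting templates. The sketch below is one simple way to do that on plain strings; the `extract_patterns` function here is a hypothetical stand-in written for illustration, and the adapter's own method may use a different algorithm.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b")  # long hex tokens such as trace IDs
_NUMBER = re.compile(r"\d+")


def extract_patterns(lines: Iterable[str], min_frequency: int = 5) -> List[Tuple[str, int]]:
    """Group log lines by a normalized template and keep the frequent ones."""
    counts: Counter = Counter()
    for line in lines:
        template = _HEX_ID.sub("<id>", line.lower())
        template = _NUMBER.sub("<n>", template)
        counts[template] += 1
    return [(t, c) for t, c in counts.most_common() if c >= min_frequency]
```

Lines that differ only in numbers or IDs collapse into one template, so a single recurring failure is reported once with its count.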

Jaeger Traces

python

from integrations.observability.jaeger_adapter import JaegerAdapter

jaeger = JaegerAdapter(base_url="http://localhost:16686")

# Find slow traces
traces = jaeger.search_traces(
    service="payment-service",
    start="-15m",
    min_duration="500ms",
    limit=20,
)

# Analyze error traces
for trace in traces.with_errors():
    root_cause_span = trace.find_slowest_span()
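As a sketch of what "find the slowest span" could mean on raw data, the helper below works on a trace dictionary whose `spans` carry `operationName`, `duration` (in microseconds), and a `tags` list, as in Jaeger's JSON trace output. The `has_error` and `slowest_span` functions are hypothetical and are not the adapter's methods.

```python
from typing import Any, Dict, Optional


def has_error(span: Dict[str, Any]) -> bool:
    """True if the span carries an error tag set to true."""
    return any(
        tag.get("key") == "error" and tag.get("value") in (True, "true")
        for tag in span.get("tags", [])
    )


def slowest_span(trace: Dict[str, Any], errors_only: bool = False) -> Optional[Dict[str, Any]]:
    """Return the span with the largest duration, optionally among error spans only."""
    spans = [s for s in trace.get("spans", []) if has_error(s) or not errors_only]
    return max(spans, key=lambda s: s.get("duration", 0), default=None)
```

Note that a root span usually has the longest duration because it encloses its children, so restricting the search to error spans (or comparing self time) tends to point closer to the actual fault.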

Running Hermes Agent with RunbookHermes Profile

Direct CLI Invocation

bash

# Run incident triage
hermes run \
  --profile runbook-hermes \
  --input "Payment service p95 latency is 2.5s, normal is 200ms" \
  --verbose

# Run with specific tool selection
hermes run \
  --profile runbook-hermes \
  --input "Check payment service deployment history" \
  --tools query_metrics,get_deploy_history

Programmatic Invocation

python

from agent.runtime import HermesRuntime
from agent.config import AgentConfig

config = AgentConfig(
    profile="runbook-hermes",
    tools=["runbook-hermes"],
    context_engine="evidence_stack",
    memory_provider="incident_memory"
)

runtime = HermesRuntime(config)

response = runtime.run(
    input_text="Investigate payment-service HTTP 503 spike",
    context={
        "service": "payment-service",
        "incident_id": "inc_001",
        "severity": "critical"
    }
)

print(response.final_answer)
print(response.evidence_chain)
print(response.recommended_actions)

Common Patterns

Pattern 1: Full Incident Response Workflow

python

from runbook_hermes.workflow import IncidentResponseWorkflow

workflow = IncidentResponseWorkflow()

# Execute end-to-end
result = workflow.execute(
    service="payment-service",
    symptom="http_503_spike",
    severity="critical",
    auto_approve=False,  # Require human approval
)

print(f"Root cause: {result.root_cause}")
print(f"Remediation: {result.remediation_action}")
print(f"Status: {result.status}")

Pattern 2: Evidence-Driven Diagnosis

python

from runbook_hermes.diagnosis import EvidenceDiagnosis

diagnosis = EvidenceDiagnosis(service="payment-service")

# Collect all evidence types
diagnosis.collect_metrics(time_window="15m")
diagnosis.collect_logs(time_window="15m", error_only=True)
diagnosis.collect_traces(time_window="15m", min_duration="500ms")
diagnosis.collect_deploy_history(limit=10)

# Analyze
root_cause = diagnosis.analyze()

print(f"Most likely cause: {root_cause.hypothesis}")
print(f"Confidence: {root_cause.confidence}")
print(f"Supporting evidence: {root_cause.evidence_ids}")

Pattern 3: Safe Remediation with Approval

python

from runbook_hermes.remediation import SafeRemediation

remediation = SafeRemediation(incident_id="inc_001")

# Plan action
plan = remediation.plan_rollback(
    service="payment-service",
    target_version="v1.2.3",
)

# Create checkpoint
checkpoint = remediation.create_checkpoint()

# Request approval (blocks until human decision)
approval = remediation.request_approval(
    action=plan,
    checkpoint=checkpoint,
    timeout_minutes=30,
)

if approval.is_approved():
    # Execute with dry-run first
    dry_run_result = remediation.execute(dry_run=True)

    if dry_run_result.success:
        # Real execution
        result = remediation.execute(dry_run=False)

        # Verify recovery
        if remediation.verify_recovery():
            print("Remediation successful")
        else:
            # Auto-rollback to checkpoint
            remediation.restore_checkpoint(checkpoint.id)
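The control flow of this pattern (dry run, real execution, verification, restore on failure) can be isolated from the RunbookHermes classes. The sketch below takes the three operations as plain callables and returns the outcome as a string; `remediate_safely` and its result labels are hypothetical names chosen for illustration.

```python
from typing import Callable


def remediate_safely(
    execute: Callable[[bool], bool],
    verify: Callable[[], bool],
    restore: Callable[[], None],
) -> str:
    """Dry-run, execute, verify; restore the checkpoint if recovery is not confirmed."""
    if not execute(True):  # dry run: nothing is changed yet
        return "dry_run_failed"
    execute(False)  # real execution
    if verify():
        return "recovered"
    restore()  # fall back to the checkpoint
    return "restored"
```

Separating the flow from the side effects makes it straightforward to confirm that a restore happens only after a real execution fails verification.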

Pattern 4: Multi-Service Impact Analysis

python

from runbook_hermes.topology import ServiceTopology

topology = ServiceTopology()

# Build dependency graph
graph = topology.build_graph(
    root_service="payment-service",
    depth=2,
)

# Analyze impact
impact = topology.analyze_impact(
    failing_service="payment-service",
    failure_type="http_503",
)

print(f"Directly impacted: {impact.direct}")
print(f"Indirectly impacted: {impact.indirect}")
print(f"Suggested investigation order: {impact.priority_list}")
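Direct versus indirect impact can be read as graph distance from the failing service. The sketch below runs a breadth-first search over a "who depends on whom" mapping and splits the reachable services by depth. The `analyze_impact` function and the `checkout-api` and `web-frontend` service names in the usage note are hypothetical; this is not the `ServiceTopology` implementation.

```python
from collections import deque
from typing import Dict, List


def analyze_impact(dependents: Dict[str, List[str]], failing: str) -> Dict[str, List[str]]:
    """Split services that depend on `failing` into direct and indirect dependents."""
    depth = {failing: 0}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in depth:  # first visit gives the shortest distance
                depth[service] = depth[current] + 1
                queue.append(service)
    return {
        "direct": sorted(s for s, d in depth.items() if d == 1),
        "indirect": sorted(s for s, d in depth.items() if d > 1),
    }
```

For example, with `{"payment-service": ["checkout-api"], "checkout-api": ["web-frontend"]}` and `payment-service` failing, `checkout-api` is a direct dependent and `web-frontend` an indirect one.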

Configuration Reference

RunbookHermes Config File

Create `config/runbook_hermes.yaml`:

yaml

# Incident response settings
incident:
  auto_create_from_alert: true
  default_severity: high
  evidence_collection_timeout: 300  # seconds

# Evidence collection
evidence:
  metrics:
    enabled: true
    time_window: 15m
    retention_days: 30
  logs:
    enabled: true
    max_lines: 1000
    error_patterns_only: false
  traces:
    enabled: true
    sample_limit: 100
    min_duration: 200ms

# Approval settings
approval:
  required_for:
    - rollback
    - restart
    - config_change
    - scale_down
  auto_approve_on_critical: false
  approval_timeout_minutes: 30
  require_checkpoint: true

# Remediation
remediation:
  dry_run_first: true
  verify_recovery: true
  recovery_check_interval: 30  # seconds
  max_recovery_wait: 300  # seconds
  auto_rollback_on_failure: true

# Runbook skill generation
skills:
  auto_generate: true
  min_success_count: 1
  output_dir: skills/runbooks

# Model-assisted analysis (optional)
model:
  enabled: true
  provider: openai
  temperature: 0.3
  max_tokens: 2000

Tool Configuration

yaml

# plugins/runbook_hermes/config.yaml
tools:
  query_metrics:
    timeout: 30
    max_results: 1000
  query_logs:
    timeout: 60
    max_lines: 5000
  query_traces:
    timeout: 45
    max_traces: 200
  execute_rollback:
    require_approval: true
    require_checkpoint: true
    dry_run_first: true

Troubleshooting

Issue: Evidence collection returns empty results

Cause: The observability backends are not reachable, or there is no data in the time window.

Solution:

python

# Test backend connectivity
from integrations.observability.health import check_backends

health = check_backends()
print(f"Prometheus: {health['prometheus']}")
print(f"Loki: {health['loki']}")
print(f"Jaeger: {health['jaeger']}")

# Verify time window
# Ensure time_window matches your metric retention
from plugins.runbook_hermes.tools import query_metrics

evidence = query_metrics(
    service="payment-service",
    time_window="1h",  # Increase window
)

Issue: Approval requests time out

Cause: No operator reviews the approval request in time.

Solution:

yaml

# config/runbook_hermes.yaml
approval:
  approval_timeout_minutes: 60  # Increase timeout
  fallback_to_auto_reject: false  # Prevent auto-reject

# Or configure notification
notification:
  on_approval_request:
    - type: feishu
      webhook_url: ${FEISHU_APPROVAL_WEBHOOK}

Issue: Runbook skills not generating

Cause: The incident is not marked as resolved, or it is missing evidence.

Solution:

python

# Explicitly mark incident resolved
from runbook_hermes.incident import IncidentManager

mgr = IncidentManager()
mgr.mark_resolved(
    incident_id="inc_001",
    resolution="Rolled back to v1.2.3",
    root_cause="Bad deployment v1.2.4",
)

# Manually trigger skill generation
from runbook_hermes.skills import SkillGenerator

generator = SkillGenerator()
skill = generator.generate_from_incident("inc_001")
skill.save()

Issue: Model-assisted summaries failing

Cause: The model API key is not configured, or the endpoint is unreachable.

Solution:

bash

# Verify environment variables
echo $OPENAI_API_KEY
echo $OPENAI_BASE_URL

# Test model connectivity
curl $OPENAI_BASE_URL/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

Disable the model if it is not needed:

yaml

# config/runbook_hermes.yaml
model:
  enabled: false  # Fall back to evidence-only mode

Issue: Hermes profile not found

Cause: The profile directory is not in the Hermes search path.

Solution:

bash

# Add RunbookHermes profiles to Hermes config
export HERMES_PROFILE_PATH="./profiles/runbook-hermes:$HERMES_PROFILE_PATH"

# Or copy profile to Hermes profiles directory
cp -r profiles/runbook-hermes ~/.hermes/profiles/

Debug Mode

Enable verbose logging:

python

import logging
logging.basicConfig(level=logging.DEBUG)

Or set an environment variable and run with debug output:

bash

export RUNBOOK_HERMES_LOG_LEVEL=DEBUG

# Run with debug output
hermes run \
  --profile runbook-hermes \
  --input "Debug payment service issue" \
  --debug \
  --trace-tools
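If you need your own scripts to honor the same environment variable, one possible approach is to translate its value into a `logging` level before calling `basicConfig`. The `resolve_log_level` helper below is a hypothetical sketch; how RunbookHermes itself interprets `RUNBOOK_HERMES_LOG_LEVEL` is not documented here, so treat the fallback behavior as an assumption.

```python
import logging
import os
from typing import Mapping


def resolve_log_level(env: Mapping[str, str] = os.environ, default: int = logging.INFO) -> int:
    """Map RUNBOOK_HERMES_LOG_LEVEL to a logging level, ignoring unknown names."""
    name = env.get("RUNBOOK_HERMES_LOG_LEVEL", "").strip().upper()
    if not name:
        return default
    level = logging.getLevelName(name)  # returns an int for known level names
    return level if isinstance(level, int) else default


logging.basicConfig(level=resolve_log_level())
```

Unknown values fall back to the default instead of raising, so a typo in the variable does not prevent the process from starting.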

Advanced Usage

Custom Tool Integration

Add domain-specific tools:

python

# plugins/runbook_hermes/custom_tools.py
from agent.tools import Tool, ToolParameter

class CheckDatabaseConnectionTool(Tool):
    name = "check_database_connection"
    description = "Verify database connectivity and connection pool status"

    parameters = [
        ToolParameter(name="service", type="string", required=True),
        ToolParameter(name="db_name", type="string", required=True),
    ]

    def execute(self, service: str, db_name: str) -> dict:
        # Your custom logic
        return {
            "status": "healthy",
            "active_connections": 25,
            "max_connections": 100,
        }

# Register tool
from plugins.runbook_hermes.registry import register_tool
register_tool(CheckDatabaseConnectionTool())

Custom Evidence Type

python

# runbook_hermes/evidence/custom_evidence.py
from runbook_hermes.evidence import EvidenceCollector

class CostEvidenceCollector(EvidenceCollector):
    def collect(self, service: str, time_window: str) -> dict:
        # Collect cost metrics from billing API
        return {
            "type": "cost_spike",
            "service": service,
            "cost_increase_pct": 150,
            "period": time_window,
        }

# Register collector
from runbook_hermes.evidence import register_collector
register_collector("cost", CostEvidenceCollector())

This skill enables AI coding agents to help developers deploy, configure, and operate RunbookHermes for production incident response with Hermes Agent integration.