enterprise-agent-ops
Enterprise Agent Ops
Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.
Operational Domains
- runtime lifecycle (start, pause, stop, restart)
- observability (logs, metrics, traces)
- safety controls (scopes, permissions, kill switches)
- change management (rollout, rollback, audit)
Baseline Controls
- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions
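As one possible reading of "hard timeout and retry budgets", the Python sketch below caps both the number of attempts and the total wall-clock time for a task. `run_with_budget`, `BudgetExceeded`, and the default limits are hypothetical. Note the limitation: the deadline is only checked between attempts, so this does not interrupt an attempt that is already running. A truly hard timeout would need enforcement at the process or orchestrator level.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the attempt budget or the time budget is spent."""


def run_with_budget(task, max_attempts=3, deadline_seconds=30.0,
                    clock=time.monotonic):
    """Call task(attempt) until it succeeds or a budget runs out.

    The defaults are placeholders, not recommended values.
    """
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Checked before each attempt only; a running attempt is not preempted.
        if clock() - start >= deadline_seconds:
            break
        try:
            return task(attempt)
        except Exception as exc:  # a real system would retry only transient errors
            last_error = exc
    raise BudgetExceeded("retry or time budget exhausted") from last_error
```

Passing the clock in as a parameter is a design choice that makes the deadline logic testable without sleeping.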
Metrics to Track
- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution
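Most of these metrics can be derived from per-task records. The Python sketch below assumes a hypothetical `TaskRecord` shape and computes four of the five metrics. Two assumptions are worth flagging: "cost per successful task" is taken to mean total spend (including failed tasks) divided by the number of successes, and time to recovery is left out because it needs incident timestamps rather than task records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are illustrative."""
    succeeded: bool
    retries: int
    cost: float
    failure_class: Optional[str] = None


def summarize(records):
    """Aggregate task records into the metrics listed above."""
    total = len(records)
    if total == 0:
        raise ValueError("no task records to summarize")
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost for r in records)
    failures = Counter(r.failure_class for r in records if not r.succeeded)
    return {
        "success_rate": successes / total,
        "mean_retries_per_task": sum(r.retries for r in records) / total,
        # Assumption: failed tasks still count toward spend.
        "cost_per_successful_task": total_cost / successes if successes else None,
        "failure_class_distribution": dict(failures),
    }
```

Charging failed attempts to the successes makes wasted spend show up in the cost figure instead of disappearing.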
Incident Pattern
When failure spikes:
- freeze new rollouts
- capture representative traces
- isolate failing route
- patch with smallest safe change
- run regression + security checks
- resume gradually
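The freeze, check, and gradual-resume steps above can be sketched as a small gate object. Everything here is hypothetical: the `RolloutGate` name, the 20% spike threshold, and the traffic steps are placeholders chosen for illustration. "Frozen" is interpreted as routing no traffic to the new version. Trace capture, route isolation, and the patch itself are left to the operator.

```python
class RolloutGate:
    """Hypothetical gate that freezes a rollout on a failure spike and
    resumes it in steps once the checks pass."""

    def __init__(self, spike_threshold=0.2, resume_steps=(0.05, 0.25, 1.0)):
        self.spike_threshold = spike_threshold  # placeholder value
        self.resume_steps = resume_steps        # traffic fractions, placeholder
        self.frozen = False
        self.step = len(resume_steps) - 1       # start fully rolled out

    def traffic_fraction(self):
        # Assumption: a frozen rollout sends no traffic to the new version.
        return 0.0 if self.frozen else self.resume_steps[self.step]

    def observe(self, failure_rate):
        """Record a failure-rate sample and return the new traffic fraction."""
        if failure_rate > self.spike_threshold:
            self.frozen = True
        elif not self.frozen and self.step < len(self.resume_steps) - 1:
            self.step += 1  # healthy sample: widen exposure by one step
        return self.traffic_fraction()

    def resume(self, regression_passed, security_passed):
        """Unfreeze only when both check suites pass; restart at the smallest step."""
        if not (regression_passed and security_passed):
            return False
        self.frozen = False
        self.step = 0
        return True
```

Restarting at the smallest step after a fix, rather than jumping back to full traffic, limits the damage if the patch turns out to be incomplete.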
Deployment Integrations
This skill pairs with:
- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates