ai-operations
AI Operations
Configure AI-powered predictive failure analysis and intelligent alert correlation using Harness AIDA.
Instructions
Step 1: Establish Scope
Confirm the user's org, project, service, and observability stack.
Call MCP tool: harness_list
Parameters:
resource_type: "project"
org_id: "<organization>"
Step 2: Identify the AI Operations Task
Determine which workflow the user needs:
- Predictive Failure Analysis -- ML-based detection of impending failures before SLO breach
- Alert Correlation and Noise Reduction -- Group related alerts and suppress duplicates
Step 3: Configure Predictive Failure Analysis
Gather from the user:
- Service name and data sources (Datadog, Prometheus, CloudWatch)
- Prediction horizon (30 minutes, 1 hour, 4 hours, 24 hours ahead)
- Training data period (30 days, 90 days, 6 months)
- Model type preference (anomaly detection, time series forecasting, ensemble)
Configure failure prediction scenarios:
- Memory leak detection -- Flag services whose memory usage grows by more than a set threshold per time window
- Disk exhaustion -- Predict time-to-full and alert N hours in advance
- Connection pool saturation -- Alert when pool usage exceeds threshold for sustained duration
- Latency degradation -- Detect progressive slowdown before SLO breach
- Deployment-induced regression -- Correlate metric changes with recent deployments
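The disk exhaustion scenario above can be illustrated with a minimal sketch, assuming a plain least-squares linear trend over recent usage samples. This is not how Harness AIDA computes its predictions; the function names `hours_until_full` and `should_alert` and the `(hour, used_gb)` sample format are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity_gb: float
) -> Optional[float]:
    """Estimate hours until the disk is full from (hour, used_gb) samples.

    Fits a least-squares line through the samples and extrapolates it to
    capacity. Returns None when there are fewer than two distinct sample
    times or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    # Extrapolate from the fitted value at the most recent sample time.
    last_t = samples[-1][0]
    fitted_last = mean_u + slope * (last_t - mean_t)
    return max(0.0, (capacity_gb - fitted_last) / slope)


def should_alert(eta_hours: Optional[float], advance_hours: float) -> bool:
    """Alert when the predicted time-to-full is within the advance window."""
    return eta_hours is not None and eta_hours <= advance_hours
```

With usage growing 2 GB per hour from 50 GB on a 100 GB disk, the estimate falls inside a 24-hour advance window but outside a 4-hour one, so only the longer window would alert.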
Configure alerting:
- Set a prediction confidence threshold (predictions below it are suppressed to reduce noise)
- Route alerts to PagerDuty, Slack, or other channels
- Enable auto-generated runbook suggestions using AIDA
- Set up false positive feedback loop for model improvement
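The confidence threshold in the alerting settings above amounts to a filter over model output. The sketch below shows the idea, assuming each prediction carries a confidence score between 0.0 and 1.0; the `Prediction` type and `filter_by_confidence` function are hypothetical and do not reflect an AIDA API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    service: str
    scenario: str
    confidence: float  # assumed range: 0.0 to 1.0


def filter_by_confidence(
    predictions: Iterable[Prediction], threshold: float
) -> List[Prediction]:
    """Keep only predictions at or above the confidence threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [p for p in predictions if p.confidence >= threshold]
```

Raising the threshold suppresses more low-confidence predictions, at the cost of possibly missing real failures that the model scored low.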
Configure data sources:
- Metrics source (Prometheus, Datadog, CloudWatch)
- Log source (Elasticsearch, Splunk, CloudWatch Logs)
- Trace source (Jaeger, Datadog APM, AWS X-Ray)
- Model retraining frequency (daily, weekly, monthly, on data drift)
Step 4: Configure Alert Correlation and Noise Reduction
Gather from the user:
- Current alert volume (alerts/day) and target reduction percentage
- Alerting tools in use (PagerDuty, OpsGenie, Grafana, Datadog)
- Correlation preferences
Configure alert correlation:
- Correlation window: group alerts fired within N minutes
- Correlation method: topology-based, time-based, ML-based, or hybrid
- Service dependency mapping for topology-based correlation
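Time-based correlation, the simplest of the methods listed above, can be sketched as follows. The sketch assumes alerts are `(fired_at, name)` tuples and that a group stays open for N minutes after its first alert; both the data shape and the `correlate_by_time` name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Alert = Tuple[datetime, str]  # (fired_at, alert name)


def correlate_by_time(
    alerts: Iterable[Alert], window_minutes: float
) -> List[List[Alert]]:
    """Group alerts that fire within window_minutes of a group's first alert."""
    window = timedelta(minutes=window_minutes)
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        # groups[-1][0][0] is the timestamp of the current group's first alert.
        if groups and alert[0] - groups[-1][0][0] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups
```

A longer window yields fewer, larger groups, while a shorter window keeps groups small and notifies sooner.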
Configure noise reduction:
- Deduplication: merge identical alerts across sources
- Suppression: suppress known-noisy alerts during maintenance windows
- Aggregation: combine N similar alerts into a single incident
- Priority scoring: ML-based severity assignment using historical resolution data
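Deduplication across sources, as described above, can be sketched by keying alerts on a fingerprint that ignores the reporting tool. The example assumes alerts are dictionaries with `service`, `name`, and `source` keys and that `(service, name)` is a sufficient fingerprint; real alert payloads will differ.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(alerts: Iterable[dict]) -> List[dict]:
    """Merge alerts that share (service, name), regardless of source."""
    merged: Dict[Tuple[str, str], dict] = {}
    for alert in alerts:
        key = (alert["service"], alert["name"])
        if key not in merged:
            merged[key] = {
                "service": alert["service"],
                "name": alert["name"],
                "sources": [],
                "count": 0,
            }
        entry = merged[key]
        entry["count"] += 1
        if alert["source"] not in entry["sources"]:
            entry["sources"].append(alert["source"])
    # Dicts preserve insertion order, so results follow first-seen order.
    return list(merged.values())
```

Each merged entry records how many raw alerts it absorbed and which sources reported it, which is useful when measuring progress against a target reduction percentage.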
Configure intelligent routing:
- Route to the team that owns the affected service
- Escalation: auto-escalate if not acknowledged within SLA
- Context enrichment: attach recent deployments, related logs, and runbook links to alerts
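The ownership routing and escalation rules above can be sketched as two small functions. The sketch assumes a service-to-team ownership map and a fixed acknowledgement SLA; the names `route_alert` and `needs_escalation` and the `platform-oncall` fallback team are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Mapping, Optional


def route_alert(
    service: str,
    owners: Mapping[str, str],
    default_team: str = "platform-oncall",  # hypothetical fallback team
) -> str:
    """Return the team that owns the service, or the fallback team."""
    return owners.get(service, default_team)


def needs_escalation(
    fired_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    ack_sla: timedelta,
) -> bool:
    """Escalate only if the alert is still unacknowledged past the SLA."""
    if acknowledged_at is not None:
        return False
    return now - fired_at > ack_sla
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the escalation rule easy to test.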
Examples
- "Set up predictive failure analysis for our payment service" -- Configure ML models to detect memory leaks, disk exhaustion, and latency degradation
- "Reduce our alert noise by 50%" -- Configure alert correlation and deduplication to reduce daily alert volume
- "Alert us 4 hours before disk runs out" -- Configure disk exhaustion prediction with advance warning
- "Correlate alerts across our microservices" -- Set up topology-based alert correlation using service dependency map
- "Auto-generate runbook suggestions for alerts" -- Enable AIDA-powered runbook recommendations
Performance Notes
- ML models need 2-4 weeks of baseline data before predictions become reliable -- expect higher false positive rates initially.
- Topology-based correlation requires an accurate service dependency map -- stale maps cause missed correlations.
- Alert correlation windows should balance grouping (longer = fewer alerts) with response time (shorter = faster notification).
- Retraining frequency should match how fast the system changes -- fast-moving services need weekly retraining.
Troubleshooting
High False Positive Rate
- Increase the confidence threshold to suppress low-confidence predictions
- Provide false positive feedback to improve the model
- Check that training data period includes representative traffic patterns (weekday, weekend, peak)
Predictions Not Triggering
- Verify data sources are connected and sending metrics
- Check that the prediction horizon is appropriate for the failure mode
- Ensure the model has completed initial training (typically 2-4 weeks of baseline data)
Alert Correlation Missing Related Alerts
- Increase the correlation window to capture cascading failures
- Update the service dependency map if topology-based correlation is in use
- Check that all alert sources are integrated (missing sources cause orphaned alerts)