ai-operations

AI Operations

Configure AI-powered predictive failure analysis and intelligent alert correlation using Harness AIDA.

Instructions

Step 1: Establish Scope

Confirm the user's organization, project, service, and observability stack. List the organization's projects to verify the scope:
Call MCP tool: harness_list
Parameters:
  resource_type: "project"
  org_id: "<organization>"
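
As a rough illustration of what the call above looks like on the wire, the sketch below builds an MCP `tools/call` request in JSON-RPC 2.0 form. The helper `build_tool_call` is hypothetical, and an MCP client would normally construct this message itself; only the tool name and the two parameters come from the step above.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "<organization>" is the placeholder from the step above; substitute the real
# organization identifier before sending.
request = build_tool_call(
    "harness_list",
    {"resource_type": "project", "org_id": "<organization>"},
)
print(json.dumps(request, indent=2))
```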

Step 2: Identify the AI Operations Task

Determine which workflow the user needs:
  1. Predictive Failure Analysis -- ML-based detection of impending failures before an SLO breach
  2. Alert Correlation and Noise Reduction -- Group related alerts and suppress duplicates

Step 3: Configure Predictive Failure Analysis

Gather from the user:
  • Service name and data sources (Datadog, Prometheus, CloudWatch)
  • Prediction horizon (30 minutes, 1 hour, 4 hours, 24 hours ahead)
  • Training data period (30 days, 90 days, 6 months)
  • Model type preference (anomaly detection, time series forecasting, ensemble)
Configure failure prediction scenarios:
  1. Memory leak detection -- Flag services whose memory growth per window exceeds a threshold
  2. Disk exhaustion -- Predict time-to-full and alert N hours in advance
  3. Connection pool saturation -- Alert when pool usage exceeds a threshold for a sustained duration
  4. Latency degradation -- Detect progressive slowdown before an SLO breach
  5. Deployment-induced regression -- Correlate metric changes with recent deployments
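
To make the disk exhaustion scenario above concrete, here is a minimal sketch that estimates time-to-full with a least-squares trend line and compares it with the advance-warning window. It is an illustration under simple assumptions (linear growth, usage in percent, time in hours), not a description of how AIDA computes its predictions; all function names are hypothetical. The same slope calculation could serve as a crude memory growth check.

```python
from typing import Optional, Sequence


def fit_slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return cov / var_t


def hours_until_full(
    hours: Sequence[float],
    used_pct: Sequence[float],
    capacity_pct: float = 100.0,
) -> Optional[float]:
    """Extrapolate the fitted trend from the latest sample to capacity.

    Returns None when usage is flat or shrinking, since no exhaustion
    time can be projected from such a trend.
    """
    slope = fit_slope(hours, used_pct)  # percentage points per hour
    if slope <= 0:
        return None
    return (capacity_pct - used_pct[-1]) / slope


def should_alert(hours_left: Optional[float], advance_hours: float) -> bool:
    """Alert once the projected exhaustion falls inside the warning window."""
    return hours_left is not None and hours_left <= advance_hours


# Usage grows 2 percentage points per hour from 70% to 76%.
eta = hours_until_full([0, 1, 2, 3], [70.0, 72.0, 74.0, 76.0])
print(eta, should_alert(eta, advance_hours=4), should_alert(eta, advance_hours=24))
```

A linear fit is the simplest choice; real disk usage is often bursty, so a production model would likely need something more robust.
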
Configure alerting:
  • Set prediction confidence threshold (suppress below threshold to reduce noise)
  • Route alerts to PagerDuty, Slack, or other channels
  • Enable auto-generated runbook suggestions using AIDA
  • Set up a false positive feedback loop for model improvement
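
The confidence threshold in the alerting settings above amounts to a simple filter over model output. The sketch below shows one way to express it; the `Prediction` type, its fields, and the 0.0 to 1.0 confidence scale are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Prediction:
    service: str
    scenario: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale


def split_by_confidence(
    predictions: Sequence[Prediction], threshold: float
) -> Tuple[List[Prediction], List[Prediction]]:
    """Return (alerts, suppressed) using an inclusive confidence threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    alerts = [p for p in predictions if p.confidence >= threshold]
    suppressed = [p for p in predictions if p.confidence < threshold]
    return alerts, suppressed


predictions = [
    Prediction("payments", "memory_leak", 0.92),
    Prediction("payments", "disk_exhaustion", 0.41),
]
alerts, suppressed = split_by_confidence(predictions, threshold=0.8)
print([p.scenario for p in alerts], [p.scenario for p in suppressed])
```

Keeping the suppressed predictions, rather than discarding them, leaves room to review them later as part of the false positive feedback loop.
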
Configure data sources:
  • Metrics source (Prometheus, Datadog, CloudWatch)
  • Log source (Elasticsearch, Splunk, CloudWatch Logs)
  • Trace source (Jaeger, Datadog APM, AWS X-Ray)
  • Model retraining frequency (daily, weekly, monthly, on data drift)

Step 4: Configure Alert Correlation and Noise Reduction

Gather from the user:
  • Current alert volume (alerts/day) and target reduction percentage
  • Alerting tools in use (PagerDuty, OpsGenie, Grafana, Datadog)
  • Correlation preferences (correlation method and window size)
Configure alert correlation:
  • Correlation window: group alerts fired within N minutes
  • Correlation method: topology-based, time-based, ML-based, or hybrid
  • Service dependency mapping for topology-based correlation
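
A time-based correlation window, as described above, can be sketched as follows. This is a simplified illustration: the `Alert` type is hypothetical, timestamps are plain minutes, and the window is measured from the first alert in each group rather than sliding with each new alert.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    fired_at: float  # minutes since an arbitrary reference point


def correlate_by_time(alerts: Sequence[Alert], window_minutes: float) -> List[List[Alert]]:
    """Group alerts that fire within window_minutes of a group's first alert."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        if groups and alert.fired_at - groups[-1][0].fired_at <= window_minutes:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


alerts = [
    Alert("cpu_high", "checkout", 0),
    Alert("latency_high", "checkout", 3),
    Alert("errors_5xx", "payments", 4),
    Alert("disk_full", "search", 12),
]
groups = correlate_by_time(alerts, window_minutes=5)
print([[a.name for a in group] for group in groups])
```
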
Configure noise reduction:
  • Deduplication: merge identical alerts across sources
  • Suppression: suppress known-noisy alerts during maintenance windows
  • Aggregation: combine N similar alerts into a single incident
  • Priority scoring: ML-based severity assignment using historical resolution data
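
Deduplication and maintenance-window suppression from the list above can be sketched with a fingerprint per alert. The alert fields (`service`, `name`, `severity`, `source`) and the choice of fingerprint are assumptions for illustration; real alert payloads differ between tools.

```python
from typing import Any, Dict, Iterable, List, Tuple

Fingerprint = Tuple[str, str, str]


def fingerprint(alert: Dict[str, str]) -> Fingerprint:
    """Alerts with the same service, name, and severity count as identical."""
    return (alert["service"], alert["name"], alert["severity"])


def deduplicate(alerts: Iterable[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Merge identical alerts across sources, keeping a count and source list."""
    merged: Dict[Fingerprint, Dict[str, Any]] = {}
    for alert in alerts:
        entry = merged.setdefault(
            fingerprint(alert),
            {
                "service": alert["service"],
                "name": alert["name"],
                "severity": alert["severity"],
                "sources": [],
                "count": 0,
            },
        )
        entry["count"] += 1
        if alert["source"] not in entry["sources"]:
            entry["sources"].append(alert["source"])
    return list(merged.values())


def in_maintenance(fired_at: float, windows: Iterable[Tuple[float, float]]) -> bool:
    """True when the alert fired inside any (start, end) maintenance window."""
    return any(start <= fired_at <= end for start, end in windows)


raw = [
    {"service": "payments", "name": "latency_high", "severity": "warning", "source": "datadog"},
    {"service": "payments", "name": "latency_high", "severity": "warning", "source": "prometheus"},
    {"service": "search", "name": "disk_full", "severity": "critical", "source": "datadog"},
]
deduped = deduplicate(raw)
print(deduped)
print(in_maintenance(15, [(10, 20)]))
```
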
Configure intelligent routing:
  • Route to the team that owns the affected service
  • Escalation: auto-escalate if not acknowledged within SLA
  • Context enrichment: attach recent deployments, related logs, and runbook links to alerts
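
The routing and escalation rules above reduce to an ownership lookup and a timer check. The sketch below is hypothetical throughout: the ownership map, the `on-call` fallback team, and the use of minutes for the acknowledgement SLA are all assumptions.

```python
from typing import Dict


def route(service: str, owners: Dict[str, str], default_team: str = "on-call") -> str:
    """Return the owning team for a service, or a fallback team if unmapped."""
    return owners.get(service, default_team)


def needs_escalation(
    fired_at: float, now: float, acknowledged: bool, ack_sla_minutes: float
) -> bool:
    """Escalate when an alert is still unacknowledged after the SLA has elapsed."""
    return (not acknowledged) and (now - fired_at) > ack_sla_minutes


owners = {"payments": "team-payments", "search": "team-search"}
print(route("payments", owners), route("billing", owners))
print(needs_escalation(fired_at=0, now=20, acknowledged=False, ack_sla_minutes=15))
```
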

Examples

  • "Set up predictive failure analysis for our payment service" -- Configure ML models to detect memory leaks, disk exhaustion, and latency degradation
  • "Reduce our alert noise by 50%" -- Configure alert correlation and deduplication to reduce daily alert volume
  • "Alert us 4 hours before disk runs out" -- Configure disk exhaustion prediction with advance warning
  • "Correlate alerts across our microservices" -- Set up topology-based alert correlation using the service dependency map
  • "Auto-generate runbook suggestions for alerts" -- Enable AIDA-powered runbook recommendations

Performance Notes

  • ML models need 2-4 weeks of baseline data before predictions become reliable -- expect higher false positive rates initially.
  • Topology-based correlation requires an accurate service dependency map -- stale maps cause missed correlations.
  • Alert correlation windows should balance grouping (longer = fewer alerts) with response time (shorter = faster notification).
  • Retraining frequency should match how fast the system changes -- fast-moving services need weekly retraining.
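
The note on stale dependency maps can be illustrated with a small sketch of topology-based correlation: alerting services that are directly linked in the dependency map end up in one group. The function and the map format are hypothetical, and this simplified version only follows edges between services that are themselves alerting.

```python
from typing import Dict, List, Sequence, Set


def correlate_by_topology(
    alerting_services: Sequence[str], depends_on: Dict[str, List[str]]
) -> List[List[str]]:
    """Group alerting services connected through dependency edges."""
    alerting = list(dict.fromkeys(alerting_services))  # de-duplicate, keep order
    alerting_set = set(alerting)
    neighbours: Dict[str, Set[str]] = {s: set() for s in alerting}
    for service, deps in depends_on.items():
        for dep in deps:
            if service in alerting_set and dep in alerting_set:
                neighbours[service].add(dep)
                neighbours[dep].add(service)

    groups: List[List[str]] = []
    seen: Set[str] = set()
    for start in alerting:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


alerting = ["checkout", "payments", "db", "search"]
current_map = {"checkout": ["payments"], "payments": ["db"]}
stale_map = {"checkout": ["payments"]}  # missing the payments -> db edge

print(correlate_by_topology(alerting, current_map))
print(correlate_by_topology(alerting, stale_map))
```

With the stale map, the `db` alert is no longer grouped with `checkout` and `payments`, which is the missed correlation the note warns about.
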

Troubleshooting

High False Positive Rate

  • Increase the confidence threshold to suppress low-confidence predictions
  • Provide false positive feedback to improve the model
  • Check that the training data period includes representative traffic patterns (weekday, weekend, peak)

Predictions Not Triggering

  • Verify data sources are connected and sending metrics
  • Check that the prediction horizon is appropriate for the failure mode
  • Ensure the model has completed initial training (2-4 weeks minimum)

Alert Correlation Missing Related Alerts

  • Increase the correlation window to capture cascading failures
  • Update the service dependency map if topology-based correlation is in use
  • Check that all alert sources are integrated (missing sources cause orphaned alerts)