ai-operations

AI Operations

Configure AI-powered predictive failure analysis and intelligent alert correlation using Harness AIDA.

Instructions

Step 1: Establish Scope

Confirm the user's org, project, service, and observability stack.

Call MCP tool: harness_list
Parameters:
  resource_type: "project"
  org_id: "<organization>"
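For readers wiring this up outside an MCP-aware client, the call above can be sketched as a raw request. This is a minimal illustration that assumes the standard MCP JSON-RPC `tools/call` envelope; the tool name and argument names are taken from the step above, while `build_tool_call` is a hypothetical helper and the server's response shape is not shown because the source does not describe it.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Wrap a tool invocation in a JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# "<organization>" is the placeholder from the step above; replace it with a real org ID.
request = build_tool_call(
    request_id=1,
    tool="harness_list",
    arguments={"resource_type": "project", "org_id": "<organization>"},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library normally builds and sends this envelope for you; the sketch only makes the parameters explicit.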

Step 2: Identify the AI Operations Task

Determine which workflow the user needs:

Predictive Failure Analysis -- ML-based detection of impending failures before an SLO breach
Alert Correlation and Noise Reduction -- Group related alerts and suppress duplicates

Step 3: Configure Predictive Failure Analysis

Gather from the user:

Service name and data sources (Datadog, Prometheus, CloudWatch)
Prediction horizon (30 minutes, 1 hour, 4 hours, 24 hours ahead)
Training data period (30 days, 90 days, 6 months)
Model type preference (anomaly detection, time series forecasting, ensemble)

Configure failure prediction scenarios:

Memory leak detection -- Flag services whose memory growth per window exceeds a threshold
Disk exhaustion -- Predict time-to-full and alert N hours in advance
Connection pool saturation -- Alert when pool usage stays above a threshold for a sustained duration
Latency degradation -- Detect progressive slowdown before an SLO breach
Deployment-induced regression -- Correlate metric changes with recent deployments
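To make the disk exhaustion scenario concrete, here is a minimal sketch of a time-to-full estimate. It assumes a plain least-squares linear trend over recent usage samples, which is a stand-in chosen for illustration and not necessarily the model AIDA uses; `hours_until_full` and `should_alert` are hypothetical names.

```python
from typing import Optional, Sequence


def hours_until_full(hours: Sequence[float], used_pct: Sequence[float],
                     capacity_pct: float = 100.0) -> Optional[float]:
    """Fit a straight line to (hour, usage %) samples and extrapolate to capacity.

    Returns the hours remaining after the last sample, or None if usage is
    flat or shrinking (no exhaustion predicted by a linear trend).
    """
    n = len(hours)
    if n < 2 or n != len(used_pct):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(hours) / n
    mean_u = sum(used_pct) / n
    var_t = sum((t - mean_t) ** 2 for t in hours)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in zip(hours, used_pct)) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity_pct - intercept) / slope
    return max(0.0, t_full - hours[-1])


def should_alert(eta_hours: Optional[float], warn_hours: float) -> bool:
    """Alert when the predicted exhaustion falls inside the advance-warning window."""
    return eta_hours is not None and eta_hours <= warn_hours


# Usage grows 2 percentage points per hour, starting at 70%.
eta = hours_until_full([0, 1, 2, 3], [70, 72, 74, 76])
print(eta, should_alert(eta, warn_hours=24))
```

The same slope calculation can serve as a rough memory leak check: a persistently positive slope above a chosen threshold per window is the signal described in the list above.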

Configure alerting:

Set a prediction confidence threshold (predictions below it are suppressed to reduce noise)
Route alerts to PagerDuty, Slack, or other channels
Enable auto-generated runbook suggestions using AIDA
Set up false positive feedback loop for model improvement
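The confidence threshold and the feedback loop above can be sketched together. This is an illustrative outline only: the `Prediction` type, its fields, and `split_by_confidence` are assumptions made for the example and do not come from a Harness API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    service: str
    scenario: str
    confidence: float  # model confidence in the range 0.0 to 1.0


def split_by_confidence(predictions: List[Prediction],
                        threshold: float) -> Tuple[List[Prediction], List[Prediction]]:
    """Return (routed, suppressed): predictions at or above the threshold are routed."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    routed = [p for p in predictions if p.confidence >= threshold]
    suppressed = [p for p in predictions if p.confidence < threshold]
    return routed, suppressed


predictions = [
    Prediction("payments", "memory_leak", 0.92),
    Prediction("payments", "latency_degradation", 0.55),
    Prediction("search", "disk_exhaustion", 0.81),
]
routed, suppressed = split_by_confidence(predictions, threshold=0.8)
print([p.scenario for p in routed])      # sent to PagerDuty, Slack, etc.
print([p.scenario for p in suppressed])  # kept for later review, not paged
```

Keeping the suppressed list rather than discarding it gives the false positive feedback loop something to learn from if a suppressed prediction later turns out to be real.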

Configure data sources and retraining:

Metrics source (Prometheus, Datadog, CloudWatch)
Log source (Elasticsearch, Splunk, CloudWatch Logs)
Trace source (Jaeger, Datadog APM, AWS X-Ray)
Model retraining frequency (daily, weekly, monthly, on data drift)
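The "on data drift" retraining option implies some drift check. One simple way to approximate it, offered here as an assumption and not as the method the product uses, is to measure how far the recent mean of a metric has moved from the training mean, in units of the training standard deviation. The function names and the default `max_shift` are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_score(training: Sequence[float], recent: Sequence[float]) -> float:
    """Shift of the recent mean from the training mean, in training standard deviations."""
    sigma = stdev(training)  # requires at least two training samples
    shift = abs(mean(recent) - mean(training))
    if sigma == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / sigma


def needs_retraining(training: Sequence[float], recent: Sequence[float],
                     max_shift: float = 3.0) -> bool:
    """Flag retraining when the recent mean has drifted more than max_shift deviations."""
    return drift_score(training, recent) > max_shift


baseline_latency_ms = [100, 102, 98, 101, 99]
print(needs_retraining(baseline_latency_ms, [110, 112, 111]))  # large shift
print(needs_retraining(baseline_latency_ms, [100, 101, 99]))   # no meaningful shift
```

A mean-shift check is deliberately crude; it misses changes in variance or seasonality, so a scheduled retraining cadence is still a sensible fallback.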

Step 4: Configure Alert Correlation and Noise Reduction

Gather from the user:

Current alert volume (alerts/day) and target reduction percentage
Alerting tools in use (PagerDuty, OpsGenie, Grafana, Datadog)
Correlation preferences (window length and correlation method)

Configure alert correlation:

Correlation window: group alerts fired within N minutes
Correlation method: topology-based, time-based, ML-based, or hybrid
Service dependency mapping for topology-based correlation
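As an illustration of the correlation window, the sketch below implements only the time-based method: alerts are sorted by timestamp, and an alert joins the current group if it fired within N minutes of the previous alert in that group. The tuple layout and `correlate_by_time` are assumptions for the example; topology-based and ML-based correlation are not shown.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Alert = Tuple[datetime, str]  # (fired_at, alert name)


def correlate_by_time(alerts: List[Alert], window_minutes: int) -> List[List[Alert]]:
    """Group alerts so that each one fired within the window of the previous alert."""
    window = timedelta(minutes=window_minutes)
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if groups and alert[0] - groups[-1][-1][0] <= window:
            groups[-1].append(alert)  # close enough to the previous alert
        else:
            groups.append([alert])    # gap too large: start a new group
    return groups


t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    (t0, "db-connections-high"),
    (t0 + timedelta(minutes=3), "api-latency-high"),
    (t0 + timedelta(minutes=6), "checkout-errors"),
    (t0 + timedelta(minutes=30), "disk-usage-high"),
]
print(len(correlate_by_time(alerts, window_minutes=5)))  # cascade grouped together
print(len(correlate_by_time(alerts, window_minutes=2)))  # window too short to group
```

Measuring the gap from the previous alert, instead of from the first alert in the group, lets a slow cascade stay in one group; the cost is that a long chain of closely spaced alerts can stretch a group well beyond N minutes.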

Configure noise reduction:

Deduplication: merge identical alerts across sources
Suppression: suppress known-noisy alerts during maintenance windows
Aggregation: combine N similar alerts into a single incident
Priority scoring: ML-based severity assignment using historical resolution data
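Deduplication and maintenance-window suppression can be sketched in a few lines. The `Alert` fields, the fingerprint of (service, check), and the minute-based maintenance windows are simplifying assumptions for illustration; real alert payloads from PagerDuty, Datadog, and similar tools carry richer identifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str   # tool that raised the alert
    service: str
    check: str
    minute: int   # minutes since midnight, a simplification for the example


def reduce_noise(alerts: List[Alert],
                 maintenance: Dict[str, Tuple[int, int]]) -> List[Alert]:
    """Drop alerts raised during a service's maintenance window, then merge duplicates."""
    seen = set()
    kept: List[Alert] = []
    for alert in alerts:
        window = maintenance.get(alert.service)
        if window and window[0] <= alert.minute < window[1]:
            continue  # suppression: service is under maintenance
        fingerprint = (alert.service, alert.check)  # ignores source on purpose
        if fingerprint in seen:
            continue  # deduplication: same problem reported by another source
        seen.add(fingerprint)
        kept.append(alert)
    return kept


alerts = [
    Alert("Datadog", "payments", "high_latency", 10),
    Alert("PagerDuty", "payments", "high_latency", 11),
    Alert("Grafana", "search", "disk_usage", 20),
    Alert("Datadog", "checkout", "error_rate", 30),
]
kept = reduce_noise(alerts, maintenance={"search": (0, 60)})
print([(a.service, a.check) for a in kept])
```

Leaving the source out of the fingerprint is what allows identical alerts from different tools to merge; a stricter fingerprint would keep them separate.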

Configure intelligent routing:

Route to the team that owns the affected service
Escalation: auto-escalate if not acknowledged within SLA
Context enrichment: attach recent deployments, related logs, and runbook links to alerts
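Ownership-based routing and SLA-based escalation reduce to two small decisions, sketched below. The owner map, team names, and function names are hypothetical; in a real setup the ownership data would come from the service catalog and the acknowledgement state from the paging tool.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical ownership map for the example.
OWNERS: Dict[str, str] = {"payments": "team-payments", "search": "team-search"}


def route(service: str, owners: Dict[str, str], default: str = "on-call") -> str:
    """Pick the owning team for a service, falling back to a default rotation."""
    return owners.get(service, default)


def needs_escalation(fired_at: datetime, acknowledged_at: Optional[datetime],
                     now: datetime, ack_sla: timedelta) -> bool:
    """Escalate only if the alert is still unacknowledged after the SLA has elapsed."""
    if acknowledged_at is not None:
        return False
    return now - fired_at > ack_sla


fired = datetime(2024, 1, 1, 12, 0)
sla = timedelta(minutes=15)
print(route("payments", OWNERS))
print(route("unknown-service", OWNERS))
print(needs_escalation(fired, None, fired + timedelta(minutes=20), sla))
print(needs_escalation(fired, fired + timedelta(minutes=5), fired + timedelta(minutes=20), sla))
```

A default route matters here: an alert for a service with no recorded owner should still reach someone instead of being dropped.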

Examples

"Set up predictive failure analysis for our payment service" -- Configure ML models to detect memory leaks, disk exhaustion, and latency degradation
"Reduce our alert noise by 50%" -- Configure alert correlation and deduplication to reduce daily alert volume
"Alert us 4 hours before disk runs out" -- Configure disk exhaustion prediction with advance warning
"Correlate alerts across our microservices" -- Set up topology-based alert correlation using service dependency map
"Auto-generate runbook suggestions for alerts" -- Enable AIDA-powered runbook recommendations

Performance Notes

ML models need 2-4 weeks of baseline data before predictions become reliable -- expect higher false positive rates initially.
Topology-based correlation requires an accurate service dependency map -- stale maps cause missed correlations.
Alert correlation windows should balance grouping (longer = fewer alerts) with response time (shorter = faster notification).
Retraining frequency should match how fast the system changes -- fast-moving services need weekly retraining.

Troubleshooting

High False Positive Rate

Increase the confidence threshold to suppress low-confidence predictions
Provide false positive feedback to improve the model
Check that training data period includes representative traffic patterns (weekday, weekend, peak)

Predictions Not Triggering

Verify data sources are connected and sending metrics
Check that the prediction horizon is appropriate for the failure mode
Ensure the model has completed initial training (it needs at least 2-4 weeks of baseline data)

Alert Correlation Missing Related Alerts

Increase the correlation window to capture cascading failures
Update the service dependency map if topology-based correlation is in use
Check that all alert sources are integrated (missing sources cause orphaned alerts)
