Total: 50,530 skills. DevOps & Cloud Services has 3,052 skills.
Showing 12 of 3,052 skills
Diagnose Harness pipeline executions via MCP. Analyzes any execution (failed or successful) to produce structured reports with stage/step breakdown, timing, bottlenecks, failure details, chained pipeline drill-down, and execution logs. Use when asked to debug a pipeline, investigate a failure, find out why a build failed, analyze pipeline errors, check execution logs, review execution performance, or find bottlenecks. Trigger phrases: debug pipeline, pipeline failed, why did my build fail, analyze failure, pipeline error, execution logs, fix pipeline, execution bottleneck, slow pipeline.
Analyze cloud costs, find optimization opportunities, and track anomalies using Harness CCM via MCP. Use when user says "cloud costs", "analyze costs", "cost optimization", "reduce spending", "cost report", or asks about cloud bills.
Expert evaluator for Prometheus label strategy. Audits, designs, and improves label schemas using cardinality scoring, access-pattern alignment, static vs. dynamic label rules, histogram bucket discipline, instrumentation hygiene, and source-side prevention via relabel_config / metric_relabel_configs. Use when the user asks to evaluate, audit, design, or improve Prometheus labels — or asks how to prevent high cardinality at the source. For post-ingest aggregation, see the adaptive-metrics skill. For "why is my Prometheus slow / expensive right now" triage, see prometheus-cardinality-troubleshooter.
Use this spell when you need to see what is happening right now on a distant system rather than reading stale logs or cached reports.
Infrastructure-as-Code patterns for data engineering using Terraform to provision AWS resources (S3, EC2, IAM).
Monitor health and availability of systems, services, APIs, and infrastructure endpoints.
Use when managing IAM policies, users, and permissions in a Tigris organization.
Ubuntu Server 24.04 LTS: apt, user management, disk/filesystem, sysctl, log management.
Evaluate a workload's performance efficiency against the Well-Architected Performance Efficiency pillar, covering resource selection, scaling, monitoring, and optimization opportunities.
Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads; training or fine-tuning models on cloud GPUs; deploying inference servers (vLLM, TGI, etc.) with autoscaling; writing or debugging SkyPilot task YAML files; using spot/preemptible instances for cost savings; comparing GPU prices across clouds; managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them; troubleshooting resource availability or SkyPilot errors; or optimizing cost and GPU availability.
Creates comprehensive GitHub Actions CI/CD workflows for linting, testing, building, and deploying. Includes caching strategies, matrix builds, artifact handling, and failure diagnostics. Use for "GitHub Actions", "CI pipeline", "workflow automation", or "continuous integration".
Creates SLO-based alerts and operational dashboards with key charts, alert thresholds, and runbook links. Use for "alerting", "dashboards", "SLO", or "monitoring".