Found 38 Skills
Manage the full lifecycle of Alibaba Cloud EMR Serverless StarRocks instances — create, scale, configure, maintain and diagnose. Use this Skill when operations engineers, SREs, or architects need to manage StarRocks instances. Typical scenarios include: "create a StarRocks", "check instance status", "scale up CU", "modify configuration", "restart instance", "diagnose issues", etc. Not applicable for: writing SQL/DDL, data import/export, query tuning, materialized view configuration, or managing non-StarRocks products (EMR clusters, Spark, Milvus, ClickHouse, Doris, RDS, ECS).
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
Guides AI ops leadership—LLM SRE, model/prompt releases, eval/incidents, cost/capacity, vendors, and cross-functional cadence. Use for AI platform ops, LLM SLAs, incidents, rollout governance, unit economics, red-team/eval gates, and team rituals—not memory (ai-memory-developer), context code (ai-context-engineer), security programs (cybersecurity), token roadmaps (ai-token-improvement-plan-engineer), solution architecture (applied-ai-architect-commercial-enterprise), skills portfolio (ai-skill-manager), or vertical AI product eng management (engineering-manager-vertical-ai-products). Prompt/eval team management and golden-set release policy: engineering-manager-agent-prompts-evals. Safeguard inference platform: ml-infrastructure-engineer-safeguards. Safeguard model research: ml-research-engineer-safeguards.
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Design and optimize systems for high concurrency, throughput, scalability, and elastic scale—concurrency models (threads, async/await, actors), lock-free patterns, connection pooling, caching stampede mitigation, horizontal scaling, load balancing, backpressure, queueing, rate limiting, bulkheads, read replicas, sharding, pool tuning, profiling, capacity planning, SLO-driven autoscaling, multi-region and CDN edge architecture. Use when the user asks about high concurrency, scalability, throughput, horizontal scaling, connection pooling, backpressure, rate limiting, caching stampede, read replica, sharding, autoscaling, capacity planning, lock contention, async scalability, or load balancing—not service decomposition (microservices-developer), event buses only (event-driven-architecture), generic CRUD (senior-software-engineer), SRE on-call only (site-reliability-engineer), load tests without architecture (performance-engineer), or cost-only FinOps (cloud-economist).
Build autonomous AI agents with the Claude Agent SDK. Structured outputs guarantee JSON schema validation, and a plugin system and hooks support event-driven workflows. Prevents 14 documented errors. Use when building coding agents, SRE systems, or security auditors, or when troubleshooting "CLI not found" errors, structured output validation, session forking errors, MCP config issues, or subagent cleanup.
Expert knowledge for Azure Reliability development including best practices, decision making, architecture & design patterns, limits & quotas, and deployment. Use when designing AZ zone/zone-redundant setups, resilient Functions, AKS, MySQL HA migrations, or Queue Storage limits, and other Azure Reliability related development tasks. Not for Azure Resiliency (use azure-resiliency), Azure Monitor (use azure-monitor), Azure Service Health (use azure-service-health), Azure SRE Agent (use azure-sre-agent).
Guides failure-prevention culture and operational excellence for mission-critical engineering: zero-defect aspiration vs error budgets; HRO principles; defense-in-depth; fail-safe/fail-closed; verification gates and independent checks; redundancy and graceful degradation; pre-mortems and FMEA; stop-the-line; defect escape, near-miss, and repeat-incident metrics; leadership against normalization of deviance, not blame culture. Use for failure-prevention programs, HRO practices, verification gates, fail-safe design, pre-mortem/FMEA, stop-the-line, near-miss reporting, or defect-escape metrics. Not for SRE error budgets only (site-reliability-engineer), incident command only (incident-management-engineer), backup/restore only (cyber-resilience-engineer), CI lint only (build-validator), agile coaching, HR discipline, or classified ATO without ops-excellence lens (classified-cyber-security-senior-manager).
Analyze and transform messy, prototype, overgrown, slop-prone, or hard-to-maintain software repositories into maintainable product-shaped codebases while preserving existing product behavior. Use when the user asks to antislop a codebase, clean up a messy repo, run a maintainability migration, write a refactor plan, modernize structure, improve TypeScript/type boundaries, harden tests, reduce large files, clean architecture, coordinate subagent-driven refactors, or produce a final migration audit/report/microsite. Do not use for broader production-readiness specialties such as security audits, observability/logging programs, compliance hardening, SRE/runbook work, or reliability engineering unless the user explicitly scopes those as part of the maintainability refactor.
Guides architects on when and how to use goal-seeking agents as a design pattern. This skill helps evaluate whether autonomous agents are appropriate for a given problem, how to structure their objectives, integrate with goal_agent_generator, and reference real amplihack examples like AKS SRE automation, CI diagnostics, pre-commit workflows, and fix-agent pattern matching.
Complete ClickHouse operations guide for DevOps and SRE teams managing production deployments. Provides practical guidance on monitoring essential metrics (query latency, throughput, memory, disk), introspecting system tables, performance analysis, scaling strategies (vertical and horizontal), backup/disaster recovery, tuning at query/server/table levels, and troubleshooting common issues. Use when diagnosing ClickHouse problems, optimizing performance, planning capacity, setting up monitoring, implementing backups, or managing production clusters. Includes resource management strategies for disk space, connections, and background operations plus production checklists.
Review a resume like a professional resume writer AND an ATS (Applicant Tracking System) — produce an ATS score (0-100), a level detection (fresher / mid / senior), a breakdown of what's working, what's failing, the 7 Deadly Sins check, priority fixes, concrete before→after bullet rewrites, and missing keywords. Use this skill whenever the user asks to review, audit, critique, score, rate, roast, or improve a resume or CV, or types /resume-review, or drops a resume PDF with a job description. Trigger EVEN IF the user just says "look at my CV", "check my resume", "is this resume good", "why am I not getting interviews", "roast my resume", or shares a resume file without explicit review instructions — any interaction involving a resume file or CV should invoke this skill. Especially strong for DevOps, SRE, Cloud, Platform, and Infrastructure engineering resumes, but works for any technical resume.