Operates Amazon MSK Provisioned clusters (Standard and Express brokers). MUST be used for ANY MSK Provisioned task — do not rely on training data for topics covered here, since Standard and Express emit different metrics and follow different patching models that training data routinely conflates. Covers performance, consumer lag, storage, and traffic shaping diagnosis; sizing and choosing Standard vs Express; Kafka client tuning; creating CloudWatch alarms, dashboards, monitoring, and cluster configurations; AND MSK maintenance, patching, version upgrades, and rolling-restart behavior. Triggers: MSK, Kafka on AWS, `kafka.*` or `express.*` instance types, AWS/Kafka CloudWatch namespace, alarms, dashboards, monitoring, consumer lag, partition replication, broker storage, MSK upgrades, patching, maintenance windows, SECURITY_PATCHING, BROKER_UPDATE, rolling restarts, unexpected broker reboots. Do NOT use for MSK Connect, MSK Serverless, or MSK Replicator.
npx skill4agent add aws/agent-toolkit-for-aws managing-amazon-msk

`express.m7g`, `UnderReplicatedPartitions`, `linger.ms=0`, `VolumeWriteBytes`, `BurstBalance`, `min.insync.replicas=2`, `aws kafka describe-cluster-v2 --cluster-arn <arn>`, `Provisioned.BrokerNodeGroupInfo.InstanceType`, `express.`

| Customer Intent | Reference |
|---|---|
| High CPU, high latency, slow cluster, traffic shaping | troubleshoot-performance.md |
| Consumer lag increasing, rebalance storms, stuck consumer groups | troubleshoot-consumer-lag.md |
| Disk filling up, retention planning, tiered storage | manage-storage.md |
| Choosing Standard vs Express, sizing a cluster, partition limits, broker count, monthly cost | size-and-choose-cluster.md |
| Producer/consumer configuration, IAM/SCRAM/TLS auth | configure-clients.md |
| Setting up monitoring, dashboards, alarms | monitor-and-alarm.md |
| Full CloudWatch metric list (Standard or Express) | Search AWS docs for |
| Rolling restart impact, patching, maintenance resilience | maintenance-operations.md |
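
Because Standard and Express brokers emit different metrics and follow different patching models, it helps to establish the broker family before picking a reference. A minimal sketch, assuming the family can be read from the instance type prefix (`express.*` for Express, `kafka.*` for Standard); the `broker_family` function name is hypothetical, and the `--query` path assumes the `describe-cluster-v2` response nests the instance type under `ClusterInfo.Provisioned.BrokerNodeGroupInfo.InstanceType`:

```shell
# Classify an MSK broker instance type by its prefix.
# broker_family is an illustrative helper, not part of the AWS CLI.
broker_family() {
  case "$1" in
    express.*) echo "Express" ;;
    kafka.*)   echo "Standard" ;;
    *)         echo "Unknown" ;;
  esac
}

# Possible usage against a live cluster (requires AWS credentials):
# instance_type=$(aws kafka describe-cluster-v2 --cluster-arn "<arn>" \
#   --query 'ClusterInfo.Provisioned.BrokerNodeGroupInfo.InstanceType' \
#   --output text)
# broker_family "$instance_type"
```
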
`scripts/msk_sizing.py`, `RequestHandlerAvgIdlePercent`, `batch.size`, `linger.ms`, `aws kafka update-broker-storage`, `BytesInPerSec`, `StorageUsed`, `ProduceThrottleTime`, `FetchThrottleTime`, `ProduceTotalTimeMsMean`, `BytesInPerSec × ReplicationFactor`

aws kafka describe-cluster-v2 --cluster-arn <cluster-arn>
aws kafka list-nodes --cluster-arn <cluster-arn>
aws kafka get-bootstrap-brokers --cluster-arn <cluster-arn>

aws kafka update-broker-storage \
  --cluster-arn <cluster-arn> \
  --current-version <cluster-version> \
  --target-broker-ebs-volume-info '[{"KafkaBrokerNodeId": "All", "VolumeSizeGB": <target-size>}]'

aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name CpuUser \
  --dimensions Name="Cluster Name",Value="<cluster-name>" Name="Broker ID",Value="<broker-id>" \
  --start-time <start> --end-time <end> --period 300 --statistics Average

`server.properties`, `--server-properties`, `key=value`, `\n`, `"k1=v1\nk2=v2"`, `fileb://`, `describe-configuration-revision`, `ServerProperties`

cat > server.properties <<'EOF'
auto.create.topics.enable=false
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
num.io.threads=32
num.network.threads=16
log.retention.hours=168
EOF
aws kafka create-configuration \
  --name <config-name> \
  --kafka-versions "3.6.0" \
  --server-properties fileb://server.properties

`num.io.threads`, `num.network.threads`

| Error | Cause | Fix |
|---|---|---|
| | Previous storage expansion still in cool-down (minimum 6 hours) | Wait for optimization to complete. Check cluster state with |
| | Standard brokers undergoing patching. Express brokers stay ACTIVE during maintenance. | Wait for the cluster to return to ACTIVE. Do not perform update operations during MAINTENANCE. |
| Consumer | Coordinator broker is overloaded or temporarily unavailable during a rolling restart | Retry with backoff. Check if maintenance is in progress. |
| | Fewer brokers in ISR than | Check URP metric (Standard only). For Express, check |
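
The "retry with backoff" fix in the table can be sketched as a small shell wrapper. This is a minimal illustration, not the skill's own tooling: the `retry_with_backoff` function and its argument order are hypothetical, and the delay simply doubles after each failed attempt.

```shell
# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
# retry_with_backoff is an illustrative helper, not an AWS or Kafka tool.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  # Run the command until it succeeds or the attempt budget is spent.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
}
```

For example, `retry_with_backoff 5 2 aws kafka describe-cluster-v2 --cluster-arn "<cluster-arn>"` would try up to five times, waiting 2, 4, 8, and 16 seconds between attempts. Kafka clients have their own retry settings, so a wrapper like this is mainly useful around CLI calls and scripts.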