managing-amazon-msk
Amazon MSK
Overview
Domain expertise for operating Amazon MSK Provisioned clusters with Standard and Express broker types. Covers performance troubleshooting, consumer lag diagnosis, storage management, cluster sizing, client configuration, and CloudWatch monitoring.
Execute commands using available tools from the AWS MCP server when connected — it provides sandboxed execution, audit logging, and observability. When the MCP server is not available, fall back to the AWS CLI or shell as needed.
Standard brokers use customer-managed EBS volumes for storage. You choose instance types (kafka.m5/m7g families), provision EBS, and manage storage scaling.
Express brokers provide fully managed, pay-as-you-go storage with no EBS provisioning. They use instance types prefixed with `express.m7g`, offer up to 3x more throughput per broker, and have no maintenance windows.
Critical Warnings
- NEVER reboot brokers while `UnderReplicatedPartitions` > 0 (Standard only; Express brokers do not emit URP). This risks data loss and extended outages.
- NEVER recommend partition reassignment without first checking replication status. Reassignment during URP compounds the problem.
- `linger.ms=0` is the #1 cause of "high CPU" on MSK. ALWAYS check client batch configuration before recommending broker scaling.
- EBS throughput ceilings are invisible in Kafka metrics. ALWAYS check EBS volume metrics (`VolumeWriteBytes`, `BurstBalance`) when diagnosing Standard broker latency.
- Express brokers have NO customer-managed EBS. Do NOT recommend EBS expansion or provisioned throughput for Express clusters.
- Express brokers enforce a fixed replication factor of 3 and `min.insync.replicas=2`. Do NOT attempt to create topics with RF=1 on Express. If RF=1 is needed, use Standard brokers.
Which Workflow Do You Need?
Determine the broker type first: run `aws kafka describe-cluster-v2 --cluster-arn <arn>`. Check `Provisioned.BrokerNodeGroupInfo.InstanceType`; if it starts with `express.`, it is an Express cluster.

| Customer Intent | Reference |
|---|---|
| High CPU, high latency, slow cluster, traffic shaping | troubleshoot-performance.md |
| Consumer lag increasing, rebalance storms, stuck consumer groups | troubleshoot-consumer-lag.md |
| Disk filling up, retention planning, tiered storage | manage-storage.md |
| Choosing Standard vs Express, sizing a cluster, partition limits, broker count, monthly cost | size-and-choose-cluster.md |
| Producer/consumer configuration, IAM/SCRAM/TLS auth | configure-clients.md |
| Setting up monitoring, dashboards, alarms | monitor-and-alarm.md |
| Full CloudWatch metric list (Standard or Express) | Search AWS docs for |
| Rolling restart impact, patching, maintenance resilience | maintenance-operations.md |
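The broker-type check described before the table can be sketched in Python as follows. This is a minimal illustration, not part of the skill's scripts: the function name `broker_type` is hypothetical, and it assumes the parsed JSON output of `aws kafka describe-cluster-v2` nests the cluster under a top-level `ClusterInfo` key (it falls back to the bare object if that key is absent).

```python
def broker_type(describe_output: dict) -> str:
    """Classify a Provisioned cluster as "express" or "standard".

    `describe_output` is the parsed JSON from `aws kafka describe-cluster-v2`.
    """
    # Assumption: the CLI wraps the cluster under "ClusterInfo";
    # accept the bare cluster object as well.
    info = describe_output.get("ClusterInfo", describe_output)

    provisioned = info.get("Provisioned")
    if provisioned is None:
        raise ValueError("no 'Provisioned' block: not a Provisioned cluster")

    # Path named in the text: Provisioned.BrokerNodeGroupInfo.InstanceType
    instance_type = provisioned["BrokerNodeGroupInfo"]["InstanceType"]

    # Express instance types start with "express."; anything else is Standard.
    return "express" if instance_type.startswith("express.") else "standard"
```

One way to use it is to load the CLI output with `json.load` and pass the resulting dict to `broker_type` before choosing a reference file, since several checks apply to Standard brokers only.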
Available scripts
- `scripts/msk_sizing.py`: MUST be run for any sizing question (broker count, instance choice, cost). See size-and-choose-cluster.md for the required workflow and script reference.
Quick Diagnostics
These 5 checks cover the most common MSK issues. Use them before loading a reference file.
- CpuUser + CpuSystem > 60%: Check `RequestHandlerAvgIdlePercent` (PER_BROKER level). If < 30%, request threads are saturated. Check client `batch.size` and `linger.ms` before recommending scaling.
- KafkaDataLogsDiskUsed > 85% (Standard only): Expand EBS immediately via `aws kafka update-broker-storage`. Identify high-growth topics via per-topic `BytesInPerSec`. Express clusters use the `StorageUsed` metric instead, and storage is fully managed.
- UnderReplicatedPartitions > 0 (Standard only): Check if a maintenance operation or broker restart is in progress. If URP is decreasing, wait for recovery. Do NOT restart brokers or reassign partitions during URP. Express brokers do not emit this metric; monitor `ProduceThrottleTime`, `FetchThrottleTime`, and consumer lag instead.
- Consumer OffsetLag increasing: Determine if the cause is broker-side (high `ProduceTotalTimeMsMean`, CPU saturation) or client-side (slow processing, insufficient consumers). Per-partition lag from the PER_TOPIC_PER_PARTITION monitoring level helps isolate hot partitions.
- BytesInPerSec near throughput ceiling: For Standard, check the EBS volume type and calculate `BytesInPerSec × ReplicationFactor` vs the volume throughput limit. For Express, check against the per-broker sustained performance limits in the quotas.
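The threshold-based checks above can be expressed as a small triage function. This is a sketch under stated assumptions, not the skill's own tooling: the function name, the finding strings, and the `near_ratio` cutoff for "near the ceiling" are hypothetical, and all percentage metrics are assumed to be supplied on a 0 to 100 scale. The consumer lag check is omitted because it needs a trend rather than a single value.

```python
from typing import Dict, List, Optional


def quick_diagnostics(
    metrics: Dict[str, float],
    is_express: bool,
    replication_factor: int = 3,
    volume_throughput_limit: Optional[float] = None,  # bytes/sec, Standard only
    near_ratio: float = 0.8,  # assumption: fraction of the limit that counts as "near"
) -> List[str]:
    """Return findings for the threshold-based quick checks."""
    findings: List[str] = []

    # CPU check: CpuUser + CpuSystem above 60%, then look at request handler idle.
    cpu = metrics.get("CpuUser", 0.0) + metrics.get("CpuSystem", 0.0)
    if cpu > 60:
        idle = metrics.get("RequestHandlerAvgIdlePercent")
        if idle is not None and idle < 30:
            findings.append("cpu: request threads saturated; review client batch.size and linger.ms")
        else:
            findings.append("cpu: above 60%; check RequestHandlerAvgIdlePercent")

    if not is_express:
        # Disk check applies to Standard brokers only.
        if metrics.get("KafkaDataLogsDiskUsed", 0.0) > 85:
            findings.append("disk: above 85%; expand EBS storage")

        # URP is emitted by Standard brokers only.
        if metrics.get("UnderReplicatedPartitions", 0.0) > 0:
            findings.append("urp: do not restart brokers or reassign partitions")

        # Throughput check: BytesInPerSec x ReplicationFactor vs the volume limit.
        if volume_throughput_limit:
            write_load = metrics.get("BytesInPerSec", 0.0) * replication_factor
            if write_load >= near_ratio * volume_throughput_limit:
                findings.append("throughput: replicated write load near EBS volume limit")

    return findings
```

For an Express cluster only the CPU check fires here; throttle-time metrics and the Express quotas would need separate handling.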
Common Workflows
Describe cluster:
```bash
aws kafka describe-cluster-v2 --cluster-arn <cluster-arn>
```
List brokers:
```bash
aws kafka list-nodes --cluster-arn <cluster-arn>
```
Get bootstrap brokers:
```bash
aws kafka get-bootstrap-brokers --cluster-arn <cluster-arn>
```
Expand Standard broker storage:
```bash
aws kafka update-broker-storage \
  --cluster-arn <cluster-arn> \
  --current-version <cluster-version> \
  --target-broker-ebs-volume-info '[{"KafkaBrokerNodeId": "All", "VolumeSizeGB": <target-size>}]'
```
Get CloudWatch metrics (example: CpuUser per broker):
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name CpuUser \
  --dimensions Name="Cluster Name",Value="<cluster-name>" Name="Broker ID",Value="<broker-id>" \
  --start-time <start> --end-time <end> --period 300 --statistics Average
```
Create cluster configuration (`server.properties`):

The `--server-properties` argument MUST be a real Kafka properties file with one `key=value` per line, separated by actual newline (`\n`) characters, NOT the literal two-character escape sequence `\n`. The MSK API accepts the bytes as-is; if you pass `"k1=v1\nk2=v2"` as a single string with escaped newlines, MSK stores ONE invalid property line and the cluster will fail to apply it.

Recommended pattern: write the properties to a local file with real newlines, then pass it via `fileb://` so the CLI uploads the raw bytes verbatim. Verify by reading the revision back with `describe-configuration-revision` and base64-decoding `ServerProperties`; you should see one property per line.
```bash
cat > server.properties <<'EOF'
auto.create.topics.enable=false
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
num.io.threads=32
num.network.threads=16
log.retention.hours=168
EOF
aws kafka create-configuration \
  --name <config-name> \
  --kafka-versions "3.6.0" \
  --server-properties fileb://server.properties
```
For per-instance-size thread tuning (`num.io.threads`, `num.network.threads`) and durability defaults, see size-and-choose-cluster.md and configure-clients.md.
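The escaped-newline pitfall described above can be caught before upload with a small local check. The sketch below is illustrative only: `validate_server_properties` is a hypothetical helper, and its rule that any backslash followed by `n` is suspicious is a heuristic, since a legitimate property value could in principle contain that sequence.

```python
from typing import List


def validate_server_properties(raw: bytes) -> List[str]:
    """Return a list of problems; an empty list means the bytes look well formed."""
    text = raw.decode("utf-8")
    problems: List[str] = []

    # Heuristic: a literal backslash followed by "n" usually means the
    # newlines were escaped instead of written as real line breaks.
    if "\\n" in text:
        problems.append("contains a literal backslash-n escape instead of a real newline")

    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are allowed
        key, sep, _value = stripped.partition("=")
        if not sep or not key.strip():
            problems.append(f"line {number}: expected key=value")

    return problems
```

The same function can be applied to the base64-decoded `ServerProperties` value read back from `describe-configuration-revision`, for example `validate_server_properties(base64.b64decode(encoded))`, to confirm the stored revision has one property per line.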
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
|  | Previous storage expansion still in cool-down (minimum 6 hours) | Wait for optimization to complete. Check cluster state with |
|  | Standard brokers undergoing patching. Express brokers stay ACTIVE during maintenance. | Wait for cluster to return to ACTIVE. Do not perform update operations during MAINTENANCE. |
| Consumer | Coordinator broker is temporarily unavailable during rolling restart or overloaded | Retry with backoff. Check if maintenance is in progress. |
|  | Fewer brokers in ISR than | Check URP metric (Standard only). For Express, check |
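The "retry with backoff" advice in the table can be sketched as an application-level wrapper. This is a generic illustration, not an MSK or Kafka client API: `retry_with_backoff` and its parameters are hypothetical, and it uses exponential backoff with full jitter as one reasonable choice. Kafka clients also expose their own retry settings, which may be sufficient on their own.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 0.5,   # seconds before the first retry (upper bound)
    max_delay: float = 30.0,   # cap on any single wait
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> T:
    """Call `operation`, retrying on `retryable` errors with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential growth, capped, with full jitter to avoid
            # many clients retrying in lockstep after a broker restart.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")
```

The caller decides which exception types are transient; during a rolling restart, coordinator errors typically clear once the broker is back, so a small number of attempts is usually enough.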