managing-amazon-msk

Amazon MSK

Overview

Domain expertise for operating Amazon MSK Provisioned clusters with Standard and Express broker types. Covers performance troubleshooting, consumer lag diagnosis, storage management, cluster sizing, client configuration, and CloudWatch monitoring.
Execute commands using available tools from the AWS MCP server when connected — it provides sandboxed execution, audit logging, and observability. When the MCP server is not available, fall back to the AWS CLI or shell as needed.
Standard brokers use customer-managed EBS volumes for storage. You choose instance types (kafka.m5/m7g families), provision EBS, and manage storage scaling.
Express brokers provide fully managed, pay-as-you-go storage with no EBS provisioning. They use instance types prefixed with express.m7g, offer up to 3x more throughput per broker than Standard brokers, and have no maintenance windows.

Critical Warnings

  • NEVER reboot brokers while UnderReplicatedPartitions > 0 (Standard only; Express brokers do not emit URP). This risks data loss and extended outages.
  • NEVER recommend partition reassignment without first checking replication status. Reassignment during URP compounds the problem.
  • linger.ms=0 is the #1 cause of "high CPU" on MSK. ALWAYS check client batch configuration before recommending broker scaling.
  • EBS throughput ceilings are invisible in Kafka metrics. ALWAYS check EBS volume metrics (VolumeWriteBytes, BurstBalance) when diagnosing Standard broker latency.
  • Express brokers have NO customer-managed EBS. Do NOT recommend EBS expansion or provisioned throughput for Express clusters.
  • Express brokers enforce a fixed replication factor of 3 and min.insync.replicas=2. Do NOT attempt to create topics with RF=1 on Express. If RF=1 is needed, use Standard brokers.

Which Workflow Do You Need?

Determine the broker type first:
aws kafka describe-cluster-v2 --cluster-arn <arn>
Check Provisioned.BrokerNodeGroupInfo.InstanceType in the output. If it starts with "express.", it is an Express cluster.
| Customer Intent | Reference |
| --- | --- |
| High CPU, high latency, slow cluster, traffic shaping | troubleshoot-performance.md |
| Consumer lag increasing, rebalance storms, stuck consumer groups | troubleshoot-consumer-lag.md |
| Disk filling up, retention planning, tiered storage | manage-storage.md |
| Choosing Standard vs Express, sizing a cluster, partition limits, broker count, monthly cost | size-and-choose-cluster.md |
| Producer/consumer configuration, IAM/SCRAM/TLS auth | configure-clients.md |
| Setting up monitoring, dashboards, alarms | monitor-and-alarm.md |
| Full CloudWatch metric list (Standard or Express) | Search AWS docs for "MSK CloudWatch metrics Standard brokers" or "MSK CloudWatch metrics Express brokers" |
| Rolling restart impact, patching, maintenance resilience | maintenance-operations.md |
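
The broker-type check described at the top of this section can be sketched in Python. This is a minimal illustration, not part of the skill's scripts: the function names `broker_type` and `broker_type_from_cluster` are hypothetical, and the code assumes the parsed JSON output of describe-cluster-v2 wraps the cluster under a top-level ClusterInfo key.

```python
from typing import Dict


def broker_type(instance_type: str) -> str:
    """Classify an instance type string by its prefix, as the text describes."""
    return "Express" if instance_type.startswith("express.") else "Standard"


def broker_type_from_cluster(response: Dict) -> str:
    """Read the instance type from a parsed describe-cluster-v2 response.

    Assumes the response has the shape
    {"ClusterInfo": {"Provisioned": {"BrokerNodeGroupInfo": {"InstanceType": ...}}}}.
    """
    info = response["ClusterInfo"]
    if "Provisioned" not in info:
        # Non-provisioned clusters have no broker instance type to inspect.
        raise ValueError("cluster is not an MSK Provisioned cluster")
    return broker_type(info["Provisioned"]["BrokerNodeGroupInfo"]["InstanceType"])
```

Feeding it the output of the CLI call (for example via `json.load`) gives a single value to branch on before picking a workflow from the table.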

Available scripts

  • scripts/msk_sizing.py MUST be run for any sizing question (broker count, instance choice, cost). See size-and-choose-cluster.md for the required workflow and script reference.

Quick Diagnostics

These 5 checks cover the most common MSK issues. Use them before loading a reference file.
  1. CpuUser + CpuSystem > 60%: Check RequestHandlerAvgIdlePercent (PER_BROKER level). If < 30%, request threads are saturated. Check client batch.size and linger.ms before recommending scaling.
  2. KafkaDataLogsDiskUsed > 85% (Standard only): Expand EBS immediately via aws kafka update-broker-storage. Identify high-growth topics via per-topic BytesInPerSec. Express clusters use the StorageUsed metric instead, and their storage is fully managed.
  3. UnderReplicatedPartitions > 0 (Standard only): Check if a maintenance operation or broker restart is in progress. If URP is decreasing, wait for recovery. Do NOT restart brokers or reassign partitions during URP. Express brokers do not emit this metric; monitor ProduceThrottleTime, FetchThrottleTime, and consumer lag instead.
  4. Consumer OffsetLag increasing: Determine whether the cause is broker-side (high ProduceTotalTimeMsMean, CPU saturation) or client-side (slow processing, insufficient consumers). Per-partition lag from the PER_TOPIC_PER_PARTITION monitoring level helps isolate hot partitions.
  5. BytesInPerSec near throughput ceiling: For Standard, check the EBS volume type and compare BytesInPerSec × ReplicationFactor against the volume throughput limit. For Express, check against the per-broker sustained performance limits in the quotas.
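
The arithmetic in check 5 can be written out as a short sketch. The function name `ebs_write_utilization` is hypothetical, and the numbers in the example are illustrative only; the volume throughput limit must come from the actual EBS volume type or provisioned throughput setting of the cluster.

```python
MIB = 1024 * 1024


def ebs_write_utilization(bytes_in_per_sec: float,
                          replication_factor: int,
                          volume_limit_bytes_per_sec: float) -> float:
    """Return BytesInPerSec x ReplicationFactor as a fraction of the volume limit."""
    if volume_limit_bytes_per_sec <= 0:
        raise ValueError("volume throughput limit must be positive")
    return bytes_in_per_sec * replication_factor / volume_limit_bytes_per_sec


# Illustrative inputs (assumptions, not measurements):
# 60 MiB/s ingress, RF=3, and a 250 MiB/s volume throughput limit.
utilization = ebs_write_utilization(60 * MIB, 3, 250 * MIB)
print(f"{utilization:.0%}")  # prints 72%
```

A result approaching 100% suggests the EBS ceiling, not the broker CPU, is the constraint to examine first on a Standard cluster.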

Common Workflows

Describe cluster:
aws kafka describe-cluster-v2 --cluster-arn <cluster-arn>
List brokers:
aws kafka list-nodes --cluster-arn <cluster-arn>
Get bootstrap brokers:
aws kafka get-bootstrap-brokers --cluster-arn <cluster-arn>
Expand Standard broker storage:
aws kafka update-broker-storage \
  --cluster-arn <cluster-arn> \
  --current-version <cluster-version> \
  --target-broker-ebs-volume-info '[{"KafkaBrokerNodeId": "All", "VolumeSizeGB": <target-size>}]'
Get CloudWatch metrics (example: CpuUser per broker):
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name CpuUser \
  --dimensions Name="Cluster Name",Value="<cluster-name>" Name="Broker ID",Value="<broker-id>" \
  --start-time <start> --end-time <end> --period 300 --statistics Average
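
Once datapoints have been fetched, quick-diagnostic check 1 can be applied to them. The following is a rough sketch under stated assumptions: the function names and return labels are hypothetical, the input is the parsed JSON from get-metric-statistics with the Average statistic, and all three metrics are assumed to be on a 0-100 percentage scale (normalize first if a metric is reported as a 0-1 fraction).

```python
from statistics import mean
from typing import Dict, List


def datapoint_averages(response: Dict) -> List[float]:
    """Extract the Average values from a parsed get-metric-statistics response."""
    return [dp["Average"] for dp in response.get("Datapoints", [])]


def cpu_triage(cpu_user: List[float],
               cpu_system: List[float],
               handler_idle: List[float]) -> str:
    """Apply the 60% CPU and 30% request-handler-idle thresholds from check 1.

    Each list must be non-empty; statistics.mean raises on empty input.
    """
    total_cpu = mean(cpu_user) + mean(cpu_system)
    if total_cpu <= 60.0:
        return "cpu-ok"
    if mean(handler_idle) < 30.0:
        # Request threads saturated: review client batch.size and linger.ms
        # before recommending broker scaling.
        return "handlers-saturated"
    return "cpu-high-handlers-ok"
```

The third label is an addition for completeness; the source only describes the saturated case.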
Create cluster configuration (server.properties):
The --server-properties argument MUST be a real Kafka properties file with one key=value per line, separated by actual newline characters (byte 0x0A), NOT the literal two-character escape sequence \n (a backslash followed by the letter n). The MSK API accepts the bytes as-is; if you pass "k1=v1\nk2=v2" as a single string with escaped newlines, MSK stores ONE invalid property line and the cluster will fail to apply it.
Recommended pattern: write the properties to a local file with real newlines, then pass it via fileb:// so the CLI uploads the raw bytes verbatim. Verify by reading the revision back with describe-configuration-revision and base64-decoding ServerProperties; you should see one property per line.
cat > server.properties <<'EOF'
auto.create.topics.enable=false
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
num.io.threads=32
num.network.threads=16
log.retention.hours=168
EOF

aws kafka create-configuration \
  --name <config-name> \
  --kafka-versions "3.6.0" \
  --server-properties fileb://server.properties
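
The difference between a real newline and the two-character escape sequence can be checked mechanically before or after upload. This is a minimal sketch with hypothetical helper names; it treats any literal backslash-n inside a line as a sign of the escaping mistake, which is a heuristic and an assumption, and it expects the base64 text of ServerProperties as returned by describe-configuration-revision.

```python
import base64
from typing import Dict


def parse_server_properties(raw: bytes) -> Dict[str, str]:
    """Parse key=value lines, rejecting content that was escaped instead of split."""
    props: Dict[str, str] = {}
    for line in raw.decode("utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "\\n" in line:
            raise ValueError("literal backslash-n found; use real newlines")
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"not a key=value line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def decode_revision(server_properties_b64: str) -> Dict[str, str]:
    """Decode the base64 ServerProperties field and parse it."""
    return parse_server_properties(base64.b64decode(server_properties_b64))


good = b"auto.create.topics.enable=false\nmin.insync.replicas=2\n"  # real newlines
bad = rb"auto.create.topics.enable=false\nmin.insync.replicas=2"    # escaped newline
```

Here `parse_server_properties(good)` yields two properties, while `parse_server_properties(bad)` raises because the whole payload is a single line.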
For per-instance-size thread tuning (num.io.threads, num.network.threads) and durability defaults, see size-and-choose-cluster.md and configure-clients.md.

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| aws kafka update-broker-storage returns "storage is optimizing" | Previous storage expansion still in cool-down (minimum 6 hours) | Wait for optimization to complete. Check cluster state with describe-cluster-v2. |
| ClusterState is MAINTENANCE | Standard brokers undergoing patching. Express brokers stay ACTIVE during maintenance. | Wait for cluster to return to ACTIVE. Do not perform update operations during MAINTENANCE. |
| Consumer GROUP_COORDINATOR_NOT_AVAILABLE | Coordinator broker is temporarily unavailable during a rolling restart, or is overloaded | Retry with backoff. Check if maintenance is in progress. |
| NotEnoughReplicasException on produce | Fewer brokers in ISR than min.insync.replicas (default: 2) | Check URP metric (Standard only). For Express, check ProduceThrottleTime and broker health instead; URP is not available. If a broker is down for maintenance, this is transient. Do not lower min.insync.replicas. |
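
The "retry with backoff" fix for transient coordinator errors can be sketched as a capped exponential schedule. The helper names, the base delay, and the cap are all illustrative assumptions; client libraries often retry coordinator discovery internally, so a wrapper like this is only relevant for application-level operations.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(count: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Return `count` delays that double each time, capped at `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(count)]


def retry_with_backoff(op: Callable[[], T],
                       is_retriable: Callable[[Exception], bool],
                       attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `op` up to `attempts` times, sleeping between retriable failures."""
    for delay in backoff_delays(attempts - 1):
        try:
            return op()
        except Exception as exc:
            if not is_retriable(exc):
                raise  # non-transient errors surface immediately
            sleep(delay)
    return op()  # final attempt; any exception propagates to the caller
```

Passing `sleep` as a parameter keeps the function testable without real waiting; adding random jitter to each delay is a common refinement when many consumers retry at once.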
Additional Resources
