managing-amazon-msk

Amazon MSK

Overview

Domain expertise for operating Amazon MSK Provisioned clusters with Standard and Express broker types. Covers performance troubleshooting, consumer lag diagnosis, storage management, cluster sizing, client configuration, and CloudWatch monitoring.

Execute commands using available tools from the AWS MCP server when connected — it provides sandboxed execution, audit logging, and observability. When the MCP server is not available, fall back to the AWS CLI or shell as needed.

Standard brokers use customer-managed EBS volumes for storage. You choose instance types (kafka.m5/m7g families), provision EBS, and manage storage scaling.

Express brokers provide fully managed, pay-as-you-go storage with no EBS provisioning. They use instance types prefixed with `express.m7g`, offer up to 3x more throughput per broker, and have no maintenance windows.

Critical Warnings

- NEVER reboot brokers while `UnderReplicatedPartitions` (URP) > 0 (Standard only; Express brokers do not emit URP). Rebooting in this state risks data loss and extended outages.
- NEVER recommend partition reassignment without first checking replication status. Reassignment during URP compounds the problem.
- `linger.ms=0` is the #1 cause of "high CPU" on MSK. ALWAYS check client batch configuration before recommending broker scaling.
- EBS throughput ceilings are invisible in Kafka metrics. ALWAYS check EBS volume metrics (`VolumeWriteBytes`, `BurstBalance`) when diagnosing Standard broker latency.
- Express brokers have NO customer-managed EBS. Do NOT recommend EBS expansion or provisioned throughput for Express clusters.
- Express brokers enforce a fixed replication factor of 3 and `min.insync.replicas=2`. Do NOT attempt to create topics with RF=1 on Express. If RF=1 is needed, use Standard brokers.

Which Workflow Do You Need?

Determine the broker type first: run `aws kafka describe-cluster-v2 --cluster-arn <arn>` and check `Provisioned.BrokerNodeGroupInfo.InstanceType`. If it starts with `express.`, it is an Express cluster.
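
The prefix check can be scripted. The following is a minimal sketch, not part of the original skill: `broker_type` is a hypothetical helper name, and treating every `kafka.`-prefixed instance type as Standard is an assumption based on the `kafka.m5/m7g` families mentioned in the overview.

```shell
# Hypothetical helper: classify an MSK broker type from its instance type string.
# Prints "Express", "Standard", or "Unknown".
broker_type() {
  case "$1" in
    express.*) echo "Express" ;;
    kafka.*)   echo "Standard" ;;   # assumption: kafka.* means Standard
    *)         echo "Unknown" ;;
  esac
}

# Example wiring (needs AWS credentials; <arn> is a placeholder):
# instance_type=$(aws kafka describe-cluster-v2 --cluster-arn <arn> \
#   --query 'ClusterInfo.Provisioned.BrokerNodeGroupInfo.InstanceType' \
#   --output text)
# broker_type "$instance_type"
```

Keeping the classification separate from the AWS call lets the function be checked offline with sample strings such as `express.m7g.large` or `kafka.m5.large`.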

| Customer Intent | Reference |
| --- | --- |
| High CPU, high latency, slow cluster, traffic shaping | troubleshoot-performance.md |
| Consumer lag increasing, rebalance storms, stuck consumer groups | troubleshoot-consumer-lag.md |
| Disk filling up, retention planning, tiered storage | manage-storage.md |
| Choosing Standard vs Express, sizing a cluster, partition limits, broker count, monthly cost | size-and-choose-cluster.md |
| Producer/consumer configuration, IAM/SCRAM/TLS auth | configure-clients.md |
| Setting up monitoring, dashboards, alarms | monitor-and-alarm.md |
| Full CloudWatch metric list (Standard or Express) | Search AWS docs for `"MSK CloudWatch metrics Standard brokers"` or `"MSK CloudWatch metrics Express brokers"` |
| Rolling restart impact, patching, maintenance resilience | maintenance-operations.md |

Available scripts

- `scripts/msk_sizing.py`: MUST be run for any sizing question (broker count, instance choice, cost). See size-and-choose-cluster.md for the required workflow and script reference.

Quick Diagnostics

These 5 checks cover the most common MSK issues. Use them before loading a reference file.

1. `CpuUser` + `CpuSystem` > 60%: Check `RequestHandlerAvgIdlePercent` (PER_BROKER level). If < 30%, request threads are saturated. Check client `batch.size` and `linger.ms` before recommending scaling.
2. `KafkaDataLogsDiskUsed` > 85% (Standard only): Expand EBS immediately via `aws kafka update-broker-storage`. Identify high-growth topics via per-topic `BytesInPerSec`. Express clusters use the `StorageUsed` metric instead, and their storage is fully managed.
3. `UnderReplicatedPartitions` > 0 (Standard only): Check if a maintenance operation or broker restart is in progress. If URP is decreasing, wait for recovery. Do NOT restart brokers or reassign partitions during URP. Express brokers do not emit this metric; monitor `ProduceThrottleTime`, `FetchThrottleTime`, and consumer lag instead.
4. Consumer `OffsetLag` increasing: Determine whether the cause is broker-side (high `ProduceTotalTimeMsMean`, CPU saturation) or client-side (slow processing, insufficient consumers). Per-partition lag from the PER_TOPIC_PER_PARTITION monitoring level helps isolate hot partitions.
5. `BytesInPerSec` near throughput ceiling: For Standard, check the EBS volume type and compare `BytesInPerSec × ReplicationFactor` against the volume throughput limit. For Express, check against the per-broker sustained performance limits in the quotas.
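
The Standard-broker arithmetic in check 5 can be sketched in shell as below. The function names are hypothetical, inputs are integer bytes per second, and the 250 MiB/s limit in the usage comment is an illustrative figure, not a value taken from this guide; look up the real limit for the cluster's volume configuration.

```shell
# Hypothetical helpers for the BytesInPerSec x ReplicationFactor comparison.

# Effective write load in bytes/sec.
# $1 = BytesInPerSec, $2 = replication factor
effective_write_bps() {
  echo $(( $1 * $2 ))
}

# Share of the volume throughput limit consumed, as an integer percentage
# (rounded down).
# $1 = BytesInPerSec, $2 = replication factor, $3 = volume limit in bytes/sec
write_utilization_pct() {
  echo $(( $1 * $2 * 100 / $3 ))
}

# Usage (illustrative numbers): 40 MiB/s ingress, RF=3, assumed 250 MiB/s limit
# write_utilization_pct 41943040 3 262144000
```

A result approaching 100 suggests the EBS volume, rather than the broker CPU, is the constraint, which is why the EBS metrics need checking alongside the Kafka ones.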

Common Workflows

Describe cluster:

aws kafka describe-cluster-v2 --cluster-arn <cluster-arn>

List brokers:

aws kafka list-nodes --cluster-arn <cluster-arn>

Get bootstrap brokers:

aws kafka get-bootstrap-brokers --cluster-arn <cluster-arn>

Expand Standard broker storage:

aws kafka update-broker-storage \
  --cluster-arn <cluster-arn> \
  --current-version <cluster-version> \
  --target-broker-ebs-volume-info '[{"KafkaBrokerNodeId": "All", "VolumeSizeGB": <target-size>}]'

Get CloudWatch metrics (example: CpuUser per broker):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name CpuUser \
  --dimensions Name="Cluster Name",Value="<cluster-name>" Name="Broker ID",Value="<broker-id>" \
  --start-time <start> --end-time <end> --period 300 --statistics Average
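
Once `CpuUser` and `CpuSystem` averages have been fetched this way, the 60% quick-diagnostic threshold can be applied with a small helper. This is a sketch only: `cpu_over_threshold` is a hypothetical name, and it expects the two averages as plain numbers already extracted from the CloudWatch output.

```shell
# Hypothetical helper: succeeds (exit 0) when CpuUser + CpuSystem exceeds
# the threshold, fails (exit 1) otherwise.
# $1 = CpuUser average (%), $2 = CpuSystem average (%), $3 = threshold (default 60)
cpu_over_threshold() {
  # awk handles the floating-point comparison that shell arithmetic cannot.
  awk -v u="$1" -v s="$2" -v t="${3:-60}" 'BEGIN { exit !((u + s) > t) }'
}

# Usage:
# if cpu_over_threshold 45.5 18.2; then
#   echo "Check RequestHandlerAvgIdlePercent and client batching first"
# fi
```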

Create cluster configuration (`server.properties`):

The `--server-properties` argument MUST be a real Kafka properties file with one `key=value` per line, separated by actual newline characters (LF), NOT the literal two-character sequence `\n` (a backslash followed by `n`). The MSK API accepts the bytes as-is; if you pass `"k1=v1\nk2=v2"` as a single string with escaped newlines, MSK stores ONE invalid property line and the cluster will fail to apply it.

Recommended pattern: write the properties to a local file with real newlines, then pass it via `fileb://` so the CLI uploads the raw bytes verbatim. Verify by reading the revision back with `describe-configuration-revision` and base64-decoding `ServerProperties`; you should see one property per line.

cat > server.properties <<'EOF'
auto.create.topics.enable=false
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
num.io.threads=32
num.network.threads=16
log.retention.hours=168
EOF

aws kafka create-configuration \
  --name <config-name> \
  --kafka-versions "3.6.0" \
  --server-properties fileb://server.properties
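
Before uploading, the file can be checked locally for the escaped-newline mistake described above. The sketch below is an assumption-laden illustration: `validate_server_properties` is a hypothetical helper, and its rule (every non-blank, non-comment line must look like `key=value`) is a simplification of the Kafka properties format.

```shell
# Hypothetical pre-upload check for a server.properties file.
# Fails (non-zero) if the file contains a literal backslash-n sequence or
# any non-blank, non-comment line that is not key=value.
validate_server_properties() {
  # A literal "\n" means newlines were escaped instead of written as real LFs.
  if grep -q '\\n' "$1"; then
    echo "escaped newline sequence found in $1" >&2
    return 1
  fi
  # Drop blank lines and comments, then look for anything that is not key=value.
  if grep -v -E '^[[:space:]]*(#.*)?$' "$1" | grep -q -v -E '^[^=[:space:]]+=.*$'; then
    echo "line without key=value found in $1" >&2
    return 1
  fi
  return 0
}

# After create-configuration, read the revision back (placeholders in <>):
# aws kafka describe-configuration-revision --arn <config-arn> --revision 1 \
#   --query ServerProperties --output text | base64 --decode
```

Running `validate_server_properties server.properties` before `create-configuration` catches the single-line failure mode without a round trip to the API.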

For per-instance-size thread tuning (`num.io.threads`, `num.network.threads`) and durability defaults, see size-and-choose-cluster.md and configure-clients.md.

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| `aws kafka update-broker-storage` returns "storage is optimizing" | Previous storage expansion still in cool-down (minimum 6 hours) | Wait for optimization to complete. Check cluster state with `describe-cluster-v2`. |
| `ClusterState` is `MAINTENANCE` | Standard brokers undergoing patching. Express brokers stay ACTIVE during maintenance. | Wait for cluster to return to ACTIVE. Do not perform update operations during MAINTENANCE. |
| Consumer `GROUP_COORDINATOR_NOT_AVAILABLE` | Coordinator broker is overloaded or temporarily unavailable during a rolling restart | Retry with backoff. Check if maintenance is in progress. |
| `NotEnoughReplicasException` on produce | Fewer brokers in ISR than `min.insync.replicas` (default: 2) | Check URP metric (Standard only). For Express, check `ProduceThrottleTime` and broker health instead, since URP is not available. If a broker is down for maintenance, this is transient. Do not lower `min.insync.replicas`. |
