Migrating to MSK Express
Overview
This skill helps customers migrate self-managed Apache Kafka workloads to Amazon MSK
Express. It provides two independent phases — Discovery and Assessment —
that can be run end-to-end or individually depending on the customer's needs.
Scope
This skill covers migrations from self-managed Apache Kafka (on-premises, EC2,
Docker, Kubernetes, or other non-MSK deployments) to MSK Express. Migrations from
MSK Standard (Provisioned) to MSK Express are out of scope.
Prerequisites
The AWS MCP server is recommended for documentation lookups and informational
questions, but is not required. The assessment scripts are pure file processors
with no AWS API calls.
Intent Routing
Route the customer's request based on their intent:
1. Open/exploratory question ("How do I migrate to MSK?")
Explain what this skill offers:
This skill helps you migrate to MSK Express in two phases:
Phase 1 — Discovery: Inventory your source Kafka cluster — brokers, topics,
partition counts, configs, authentication, and workload metrics.
I can discover this from IaC files (Terraform, CDK, Docker Compose, Kubernetes
manifests), provide commands for you to run on your cluster, or you can provide the
information manually. Output:
migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json.
Phase 2 — Assessment: Validate your cluster against MSK Express across 5
compatibility pillars (topology, Kafka version, configs, auth, quotas) and produce
a target Express specification using the AWS-published MSK Sizing/Pricing workbook.
I'll flag what Express will refuse vs what Express will silently convert. Outputs:
compatibility.<cluster_name>.json, the filled MSK_Sizing_Pricing.<cluster_name>.xlsx,
and msk-sizing-inputs.<cluster_name>.json.
Data replication: For migrating data to your Express cluster, you can use
MSK Replicator. I can provide guidance on setup and configuration.
Where would you like to start? I can begin with discovery if you point me to your
infrastructure code or describe your cluster, or jump to assessment if you already have a
cluster-config.json file.
Guardrails for this overview response:
- This response is an overview and a routing question only. Do NOT begin, simulate, or pre-empt any phase.
- Do NOT produce or estimate assessment output here: no verdicts, pillar findings, compatibility conclusions, broker counts, instance recommendations, or cost figures. Those values exist only after you run the Phase 2 scripts against a real cluster-config.json.
- Do NOT open, read, or summarize the internals of the assessment scripts (compatibility.py, sizing.py) or the reference files to explain how a phase works. Describe the phases at the level shown above; do not walk the customer through the implementation.
- When the customer chooses a phase, run that phase's scripts or flow to produce real results. Always operate the skill to answer — never answer from having read its source. For the exact commands, see "Running the assessment" in references/assessment-compatibility.md for Phase 2.
2. Discovery intent (DEFAULT when IaC files are provided)
If the customer provides a directory path, IaC files, or says "here's our infra" —
this is discovery intent. Run ONLY Phase 1 (Discovery). Do NOT run assessment,
do NOT suggest migration steps, do NOT mention blockers or compatibility.
Produce the migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json
file and stop.
3. Assessment intent
The customer explicitly asks for an assessment, or already has a
migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json file.
Run Phase 2 (Assessment) only.
4. Informational questions
Customer asks about Express capabilities, constraints, configuration differences,
authentication support, pricing, or compaction behavior without providing
cluster-specific data. Use AWS documentation tools (for example,
aws___search_documentation) if available to look up the answer from the MSK Express
documentation. If MCP tools are not available, reference the MSK Express
documentation and answer based on knowledge of AWS MSK.
5. Migration strategy questions
Customer asks about MSK Replicator compatibility, version upgrade paths, MirrorMaker 2,
or migration strategies. MSK Replicator is the native AWS-supported solution for data
replication and works for both MSK-to-MSK and non-MSK-to-MSK migrations. Use AWS
documentation tools (for example, aws___search_documentation) if available to
retrieve current requirements and supported configurations. If MCP tools are not
available, reference the MSK Replicator documentation and answer based on knowledge
of AWS MSK.
Phase 1 — Discovery
Purpose: Inventory the source cluster to build a migration profile.
Input: One of:
- A directory path containing IaC files (CDK, CloudFormation, Docker Compose, Kubernetes manifests, Terraform)
- Output from Kafka CLI commands the customer runs on their cluster
- Manual information provided by the customer in conversation
Output: migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json,
saved to the working directory.
MANDATORY first step for discovery
Before doing ANYTHING else in discovery, you MUST read the discovery reference file
(located at the skill path shown above) in full. This file contains the REQUIRED
response template and JSON schema. You MUST follow the template exactly: your
response format, forbidden content, and JSON structure are all defined there.
Do NOT respond until you have read this file.
Discovery methods
- IaC analysis: Read infrastructure files and extract cluster metadata.
- Kafka CLI commands: Display standard Kafka CLI commands for the customer to run on
their cluster (kafka-topics.sh, kafka-configs.sh, kafka-broker-api-versions.sh).
Do NOT generate or offer Python scripts.
- Runtime metrics intake: Ingest metrics provided by the customer.
- Manual conversation: Ask the customer for cluster details.
Discovery rules
- You MUST read the discovery reference file before responding.
- Follow the response template from that file EXACTLY.
- ALWAYS save migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json
in the working directory.
- Do NOT proceed to Phase 2 without explicit customer confirmation.
Phase 2 — Assessment
Purpose: Assess the cluster against MSK Express requirements and produce a target
Express specification (instance type, broker count, monthly cost projection).
Input: migrate-to-msk-skill-artifacts/<cluster_name>/cluster-config.json from Phase 1.
Outputs:
- migrate-to-msk-skill-artifacts/<cluster_name>/compatibility.<cluster_name>.json: the five-pillar verdict.
- migrate-to-msk-skill-artifacts/<cluster_name>/MSK_Sizing_Pricing.<cluster_name>.xlsx: the AWS-published MSK Sizing/Pricing workbook (downloaded by the agent) with the six workload inputs filled into the sheet. Open it to read the broker count and cost recommendations.
- migrate-to-msk-skill-artifacts/<cluster_name>/msk-sizing-inputs.<cluster_name>.json: a record of the six input values and the cell each maps to.
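The artifact layout above can be sketched as a small path helper. This is only an illustration of the naming convention described in this document; the function name `artifact_paths` and the dictionary keys are hypothetical and are not part of the skill's scripts.

```python
from pathlib import Path

# Directory name taken from the artifact paths listed in this document.
ARTIFACT_ROOT = "migrate-to-msk-skill-artifacts"


def artifact_paths(cluster_name: str, workdir: Path = Path(".")) -> dict[str, Path]:
    """Return the expected artifact locations for one cluster (hypothetical helper)."""
    base = workdir / ARTIFACT_ROOT / cluster_name
    return {
        # Phase 1 output, Phase 2 input.
        "cluster_config": base / "cluster-config.json",
        # Phase 2 outputs.
        "compatibility": base / f"compatibility.{cluster_name}.json",
        "workbook": base / f"MSK_Sizing_Pricing.{cluster_name}.xlsx",
        "sizing_inputs": base / f"msk-sizing-inputs.{cluster_name}.json",
    }
```

For a cluster named `orders`, the compatibility report would land at `migrate-to-msk-skill-artifacts/orders/compatibility.orders.json` relative to the working directory.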
Assessment is implemented as two file processors (no live AWS API calls):
- compatibility.py: five-pillar compatibility assessment.
- sizing.py: computes the six workbook inputs from the discovery contract and fills them into the AWS-published workbook the agent downloads.
Both declare their dependencies inline (PEP 723). For the exact invocation
commands, see "Running the assessment" in references/assessment-compatibility.md.
Compatibility pillars
compatibility.py validates the source against MSK Express across five pillars:
- Topology — AZ count, broker count, KRaft vs ZooKeeper, per-cluster broker quota.
- Kafka version — source version against the Express supported set (3.6, 3.8, 3.9).
- Configs — broker- and topic-level configs against Express's editable, read-only,
range-restricted, and enforced-value sets (sourced from the Express broker configuration
documentation).
- Auth — checks the source's authentication mechanism against those MSK Express supports and surfaces any incompatibilities.
- Quotas — peak workload against absolute Express ceilings (per-broker ingress/
egress, partition count, IAM connection cap, per-partition throughput).
See references/assessment-compatibility.md
for the full pseudocode, evidence codes, and verdict mapping.
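As an illustration of the Kafka version pillar, the check below compares a source version against the supported set listed above (3.6, 3.8, 3.9). It is a minimal sketch: comparing on major.minor only is an assumption, the function names are hypothetical, and the actual logic in compatibility.py may differ.

```python
# Supported set as stated in this document.
SUPPORTED_EXPRESS_VERSIONS = {"3.6", "3.8", "3.9"}


def major_minor(version: str) -> str:
    """Reduce a version such as '3.6.0' to '3.6' (assumed comparison granularity)."""
    parts = version.strip().split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"unrecognized Kafka version: {version!r}")
    return f"{parts[0]}.{parts[1]}"


def is_express_supported(version: str) -> bool:
    """True when the source version's major.minor is in the supported set."""
    return major_minor(version) in SUPPORTED_EXPRESS_VERSIONS
```

Under these assumptions, `is_express_supported("3.6.0")` is true and `is_express_supported("3.7.1")` is false.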
Verdict vocabulary
Each pillar emits one of three verdicts; the overall verdict is the worst across pillars.
| Verdict | Meaning |
|---|---|
| | Your source cluster already lines up with MSK Express here. Surfaced for informational purposes. No action needed. |
| | Your source cluster differs from MSK Express here, but Express handles this for you at the target by adjusting or replacing the setting. Migration can proceed; review it so the resulting behavior change is expected. |
| | Identifies a configuration or condition that MSK Express is not expected to accept in its current form. Remediation on the source prior to migration is recommended. |
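The "worst across pillars" rule can be sketched as a maximum over an ordered severity scale. The member names below are illustrative stand-ins for the three meanings in the table, not the skill's verdict strings; use the verbatim strings from the reference file in any real response.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Stand-in names for the three verdict meanings, ordered from best to worst."""
    ALIGNED = 0        # source already lines up with Express
    AUTO_ADJUSTED = 1  # Express adjusts or replaces the setting at the target
    NOT_ACCEPTED = 2   # Express is not expected to accept it as-is


def overall_verdict(pillars: dict[str, Severity]) -> Severity:
    """Return the worst (highest-severity) verdict across all pillars."""
    if not pillars:
        raise ValueError("at least one pillar verdict is required")
    return max(pillars.values())
```

With one pillar at `NOT_ACCEPTED` and the other four at `ALIGNED`, the overall result is `NOT_ACCEPTED`.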
Sizing
sizing.py computes the six workbook inputs from the source workload (peak ingress
and egress, total partitions, retention). The agent downloads the AWS-published
workbook by reading the Express best-practices page and following its workbook
hyperlink, then runs sizing.py --workbook <downloaded.xlsx>, which fills the sheet
and writes the filled MSK_Sizing_Pricing.<cluster_name>.xlsx (plus a
msk-sizing-inputs.<cluster_name>.json record). Open the filled workbook to
read the per-instance broker count and monthly cost; its formulas recalculate
on open. The workbook is downloaded at assessment time, not packaged with the
skill, and the script itself performs no network access (it fills a workbook
the agent already downloaded, using the Python standard library). See
references/assessment-sizing.md for the cell mapping, the download flow, and caveats.
Assessment rules
- Run compatibility.py and sizing.py independently; neither blocks the other.
- Surface any evidence to the user for awareness, but do not gate further phases on it. Express may still accept the workload with mitigations.
- Do NOT pivot back into discovery. Assessment operates on the existing
cluster-config.json as-is. Partial data is fine: the scripts emit
ADVISORY evidence for missing fields; surface those findings and stop.
Do not propose Kafka CLI commands, IaC walks, scripts, or questionnaires
to fill the gaps. The full forbidden-behavior list is in
references/assessment-compatibility.md.
- Your response MUST follow the assessment response template in
references/assessment-compatibility.md
(section "Response Template"). One template covers both artifacts. Do
not freestyle the post-script summary — the template defines required
sections, mandatory vocabulary (use the verdict strings verbatim), and
forbidden content (no scores, no narrative editorializing, no in-prose
cost / instance recommendations — the user reads those from the filled workbook).
Execution model
Scripts run on the customer's local machine. They declare their own
dependencies (PEP 723) and are pure file processors — no AWS API calls, no
network access, and no third-party dependencies (standard library only).
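For readers unfamiliar with PEP 723, the sketch below shows what a standard-library-only file processor with inline script metadata can look like. It is not one of the skill's scripts: the `summarize` function and the `topics` field are hypothetical, and the Python version floor is an assumption.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
"""Hypothetical file processor: reads a JSON file and prints a summary.

The comment block above is PEP 723 inline metadata; an empty dependency
list means only the standard library is needed.
"""
import json
import sys
from pathlib import Path


def summarize(config: dict) -> dict:
    """Count topics in a config dict ('topics' is an assumed field name)."""
    return {"topic_count": len(config.get("topics", []))}


# Run only when a file path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    data = json.loads(Path(sys.argv[1]).read_text(encoding="utf-8"))
    print(json.dumps(summarize(data), indent=2))
```

Because the metadata lives in comments, the file remains a valid plain Python script, and a PEP 723-aware runner can read the same block to prepare an environment before executing it.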
Security Considerations
Apply these controls at every phase. For additional detail, see
MSK Security best practices
and
MSK IAM access control.
- Encryption in transit (mandatory). Enforce TLS for client-broker traffic
on the MSK Express target (EncryptionInTransit.ClientBroker = TLS).
- Encryption at rest (mandatory). Provision the target cluster with a
customer-managed KMS key (or AWS-managed if your compliance posture allows).
- Authentication: prefer IAM over long-lived credentials. Configure the
MSK Express target with IAM authentication as the sole client auth method.
This gives ephemeral, role-based credentials with full CloudTrail coverage.
- Credential storage: use AWS Secrets Manager. Store SASL/SCRAM and TLS
credentials for source cluster access in Secrets Manager. Never pass passwords
as CLI arguments.
- Network isolation. Deploy MSK clusters in private subnets. Use security
groups scoped to specific CIDR ranges or security group references. Do NOT use
0.0.0.0/0 ingress rules.
- CloudTrail logging and CloudWatch alarms. Ensure CloudTrail is enabled in
the target account and covers API calls. Configure alarms:
  - ClientAuthenticationFailure: a surge indicates credential problems or an attack
  - Connection metrics: an abnormal spike may indicate connection flooding
  - CloudTrail metric filters for denied actions
  - Connection-rate alarms approaching the 100 conn/sec/broker IAM limit
- Sensitive data handling. Discovery and assessment outputs contain broker
addresses, auth hints, and broker config values. Treat these as sensitive: do
not paste them into public channels or ticketing systems without redaction.
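The network isolation control above can be checked mechanically. The sketch below flags ingress rules open to every address, using the standard `ipaddress` module. The rule shape (a dict with a `cidr` key) is a hypothetical simplification, not the format of any AWS API response.

```python
import ipaddress


def is_open_to_world(cidr: str) -> bool:
    """True for 0.0.0.0/0 or ::/0, that is, a network with prefix length zero."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def open_ingress_rules(rules: list[dict]) -> list[dict]:
    """Return the rules whose source CIDR admits every address.

    Each rule is assumed to be a dict with a 'cidr' key (hypothetical shape).
    """
    return [rule for rule in rules if is_open_to_world(rule["cidr"])]
```

A rule scoped to `10.0.0.0/16` passes this check, while `0.0.0.0/0` and its IPv6 equivalent `::/0` are both flagged.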
Troubleshooting
Single-broker / single-AZ source. The topology pillar emits ADVISORY evidence:
Express auto-fixes this at the target by deploying across 3 AZs with ≥3 brokers
regardless of source.
Out-of-range topic configs. max.compaction.lag.ms < 1 day is the only
Express-rejected topic-config bound encoded in compatibility.py. Adjust on the
source before migration.
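The one-day bound works out to 24 × 60 × 60 × 1000 = 86,400,000 ms. A minimal sketch of the check follows, assuming topic configs arrive as a dict of string values and that an unset key is not flagged; the function name is hypothetical and compatibility.py may handle these cases differently.

```python
# One day expressed in milliseconds: 24 h * 60 min * 60 s * 1000 ms.
ONE_DAY_MS = 24 * 60 * 60 * 1000


def compaction_lag_out_of_range(topic_configs: dict[str, str]) -> bool:
    """True when max.compaction.lag.ms is set and is below one day.

    An unset key is treated as in range (an assumption for this sketch).
    """
    raw = topic_configs.get("max.compaction.lag.ms")
    if raw is None:
        return False
    return int(raw) < ONE_DAY_MS
```

A topic with `max.compaction.lag.ms=3600000` (one hour) would be flagged, while a value of exactly `86400000` would not.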
Workbook recommendations look blank or stale. The recommendation and cost
cells are workbook formulas; they populate once the filled workbook is opened
in Excel / LibreOffice / Sheets and its formulas recalculate. The script marks
the workbook to recalculate on open, so this happens automatically. If your
spreadsheet app has automatic recalculation disabled, trigger a manual recalculation.