hyperpod-issue-report
HyperPod Issue Report
Collect diagnostic logs from HyperPod cluster nodes via SSM and store the results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled `scripts/hyperpod_issue_report.py` for reliable parallel collection.
Prerequisites
- AWS CLI configured with permissions: `sagemaker:DescribeCluster`, `sagemaker:ListClusterNodes`, `ssm:StartSession`, `s3:PutObject`, `s3:GetObject`, `eks:DescribeCluster`
- Python 3.8+ and uv (see uv installation docs for install options)
- SSM Agent running on target nodes; node IAM roles need `s3:GetObject`/`s3:PutObject` on the report bucket
- For EKS clusters: kubectl installed and configured (see Workflow step 2)
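One way to confirm the local tooling before starting is to look each command up on `PATH`. This is a minimal sketch and not part of the bundled script: the function name `missing_tools` is hypothetical, and it only checks that the executables exist, not that credentials or IAM permissions are valid.

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the commands from `required` that are not found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment.
    """
    return [tool for tool in required if which(tool) is None]


if __name__ == "__main__":
    # kubectl is only needed for EKS-orchestrated clusters.
    missing = missing_tools(["aws", "uv", "kubectl"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Permissions still need to be verified separately, for example with the commands in Workflow step 2.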
Workflow
1. Gather Information
Collect from the user:
- Cluster identifier (required): accepts cluster name or full cluster ARN (e.g., `arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123`)
- AWS region (required unless extractable from ARN)
- S3 path for report storage (required, e.g. `s3://bucket/prefix`). If the user doesn't have a bucket, create one (e.g., `s3://hyperpod-diagnostics-<account-id>-<region>`)
- Issue description (optional)
- Target scope: all nodes, specific instance groups, or specific node IDs (optional)
- Additional commands to run on nodes (optional)
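The inputs above can be validated before any AWS call is made. The sketch below shows one way to take the region from a cluster ARN and to split the S3 path into bucket and prefix. The helper names `resolve_region` and `split_s3_path` are hypothetical, the ARN layout is assumed from the example ARN in the list, and the bundled script may handle these inputs differently.

```python
import re

# Assumed layout, matching the example ARN above:
# arn:<partition>:sagemaker:<region>:<account-id>:cluster/<cluster-id>
_CLUSTER_ARN = re.compile(
    r"^arn:[^:]+:sagemaker:(?P<region>[^:]+):\d{12}:cluster/[^/]+$"
)


def resolve_region(cluster, region=None):
    """Return the explicit region if given, else the one embedded in the ARN."""
    if region:
        return region
    match = _CLUSTER_ARN.match(cluster)
    if match:
        return match.group("region")
    raise ValueError("region is required when the cluster is given by name")


def split_s3_path(s3_path):
    """Split s3://bucket[/prefix] into (bucket, prefix); prefix may be empty."""
    if not s3_path.startswith("s3://"):
        raise ValueError("S3 path must start with s3://")
    bucket, _, prefix = s3_path[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError("S3 path is missing a bucket name")
    return bucket, prefix.strip("/")
```

In this sketch an explicit region takes precedence over the one in the ARN; that ordering is a design choice here, not documented behavior of the script.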
2. Verify Environment
```bash
aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
```
If the S3 bucket doesn't exist, create it:
```bash
aws s3 mb s3://<bucket-name> --region <region>
```
For EKS clusters (check `Orchestrator.Eks` in describe-cluster output):
- Ensure kubectl is installed (`which kubectl`). If missing, install it for the current platform.
- Configure kubeconfig using the EKS cluster name from the describe-cluster response:
```bash
aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
```
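The `Orchestrator.Eks` check can also be done programmatically on the parsed describe-cluster JSON. The sketch below is illustrative only: `detect_orchestrator` is a hypothetical helper, it assumes the `Eks` block carries a `ClusterArn` ending in `cluster/<name>`, and it treats a missing `Orchestrator.Eks` as Slurm. The bundled script's own auto-detection may work differently.

```python
def detect_orchestrator(describe_cluster_response):
    """Return ("eks", eks_cluster_name) or ("slurm", None).

    `describe_cluster_response` is the describe-cluster output parsed
    into a dict (for example with json.loads).
    """
    eks = (describe_cluster_response.get("Orchestrator") or {}).get("Eks")
    if not eks:
        # Assumption: no Orchestrator.Eks block means a Slurm cluster.
        return "slurm", None
    # Assumption: the EKS cluster name is the last segment of ClusterArn.
    cluster_arn = eks.get("ClusterArn", "")
    name = cluster_arn.rsplit("/", 1)[-1] if "/" in cluster_arn else None
    return "eks", name
```

The returned name is what would be passed to `aws eks update-kubeconfig --name`.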
3. Run the Collection Script
```bash
uv run scripts/hyperpod_issue_report.py \
  --cluster <cluster-name-or-arn> \
  --region <region> \
  --s3-path s3://<bucket>[/prefix]
```
Use `--help` for all options including `--instance-groups`, `--nodes`, `--command`, `--max-workers`, and `--debug`. Note: `--instance-groups` and `--nodes` are mutually exclusive. Node identifiers accept instance IDs (`i-*`), EKS names (`hyperpod-i-*`), or Slurm names (`ip-*`).
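The three node identifier forms can be told apart by their prefix. The sketch below classifies an identifier using the glob patterns quoted above. `classify_node` and the category labels are hypothetical names chosen for illustration, and how the bundled script maps each form to a node is not shown here.

```python
import fnmatch

# Patterns taken from the text above, checked in this order.
_NODE_PATTERNS = [
    ("eks_name", "hyperpod-i-*"),
    ("instance_id", "i-*"),
    ("slurm_name", "ip-*"),
]


def classify_node(identifier):
    """Return which of the three accepted forms `identifier` uses."""
    for kind, pattern in _NODE_PATTERNS:
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatch.fnmatchcase(identifier, pattern):
            return kind
    raise ValueError(f"unrecognized node identifier: {identifier!r}")
```

Rejecting unknown forms early gives a clearer error than letting a malformed `--nodes` value fail during collection.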
4. Present Results
After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
- Download the report locally
- Help analyze collected diagnostics (see references/collection-details.md for what's in each file)
- Prepare a summary for AWS Support
Troubleshooting
See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.