hyperpod-issue-report


HyperPod Issue Report

Collect diagnostic logs from HyperPod cluster nodes via SSM and store the results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled scripts/hyperpod_issue_report.py for reliable parallel collection.

Prerequisites

  • AWS CLI configured with permissions: sagemaker:DescribeCluster, sagemaker:ListClusterNodes, ssm:StartSession, s3:PutObject, s3:GetObject, eks:DescribeCluster
  • Python 3.8+ and uv (see the uv installation docs for install options)
  • SSM Agent running on target nodes; node IAM roles need s3:GetObject / s3:PutObject on the report bucket
  • For EKS clusters: kubectl installed and configured (see Workflow step 2)

Workflow

1. Gather Information

Collect from the user:
  • Cluster identifier (required): accepts cluster name or full cluster ARN (e.g., arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123)
  • AWS region (required unless extractable from ARN)
  • S3 path for report storage (required, e.g., s3://bucket/prefix). If the user doesn't have a bucket, create one (e.g., s3://hyperpod-diagnostics-<account-id>-<region>)
  • Issue description (optional)
  • Target scope: all nodes, specific instance groups, or specific node IDs (optional)
  • Additional commands to run on nodes (optional)
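
The "required unless extractable from ARN" rule for the region can be sketched as below. This is a minimal illustration, not the bundled script's implementation: the function name region_from_cluster_id is hypothetical, and it assumes the standard arn:partition:service:region:account-id:resource ARN layout.

```python
from typing import Optional


def region_from_cluster_id(cluster_id: str) -> Optional[str]:
    """Return the region embedded in a SageMaker cluster ARN.

    Returns None when the identifier is a plain cluster name (or not a
    SageMaker ARN), in which case the user must supply the region.
    Hypothetical helper for illustration only.
    """
    if not cluster_id.startswith("arn:"):
        return None
    # arn : partition : service : region : account-id : resource
    parts = cluster_id.split(":", 5)
    if len(parts) < 6 or parts[2] != "sagemaker" or not parts[3]:
        return None
    return parts[3]
```

If the helper returns None, ask the user for the region explicitly rather than guessing from the local AWS CLI profile.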

2. Verify Environment

bash
aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
If the S3 bucket doesn't exist, create it:
bash
aws s3 mb s3://<bucket-name> --region <region>
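
The bucket name passed to aws s3 mb is the first component of the S3 path the user supplied; anything after it is a key prefix. A minimal sketch of that split follows, with a hypothetical helper name (split_s3_path) that is not taken from the bundled script:

```python
from typing import Tuple


def split_s3_path(s3_path: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix).

    The prefix is returned without leading or trailing slashes and is an
    empty string when the path names only a bucket.
    Hypothetical helper for illustration only.
    """
    scheme = "s3://"
    if not s3_path.startswith(scheme):
        raise ValueError(f"expected a path starting with {scheme!r}: {s3_path!r}")
    bucket, _, prefix = s3_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"no bucket name in {s3_path!r}")
    return bucket, prefix.strip("/")
```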
For EKS clusters (check Orchestrator.Eks in describe-cluster output):
  1. Ensure kubectl is installed (which kubectl). If missing, install it for the current platform.
  2. Configure kubeconfig using the EKS cluster name from the describe-cluster response:
    bash
    aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
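
The Orchestrator.Eks check can be sketched against the parsed JSON from describe-cluster, as below. Two assumptions are made here and are not confirmed by this document: that the EKS entry carries a ClusterArn field whose last path segment is the EKS cluster name, and that a response without Orchestrator.Eks should be treated as a Slurm cluster. The function name detect_orchestrator is hypothetical.

```python
from typing import Optional, Tuple


def detect_orchestrator(describe_cluster: dict) -> Tuple[str, Optional[str]]:
    """Return ("eks", <eks-cluster-name>) or ("slurm", None).

    `describe_cluster` is the parsed JSON output of
    `aws sagemaker describe-cluster`. Hypothetical helper for illustration.
    """
    eks = (describe_cluster.get("Orchestrator") or {}).get("Eks")
    if not eks:
        # Assumption: no Orchestrator.Eks entry means a Slurm cluster.
        return "slurm", None
    # Assumption: the EKS cluster is referenced by ARN, e.g.
    # arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>
    cluster_arn = eks.get("ClusterArn", "")
    name = cluster_arn.rsplit("/", 1)[-1] if "/" in cluster_arn else None
    return "eks", name
```

The returned name is what would be passed to aws eks update-kubeconfig --name.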

3. Run the Collection Script

bash
uv run scripts/hyperpod_issue_report.py \
  --cluster <cluster-name-or-arn> \
  --region <region> \
  --s3-path s3://<bucket>[/prefix]
Use --help for all options, including --instance-groups, --nodes, --command, --max-workers, and --debug. Note: --instance-groups and --nodes are mutually exclusive. Node identifiers accept instance IDs (i-*), EKS names (hyperpod-i-*), or Slurm names (ip-*).
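
The three accepted node identifier forms can be told apart by their prefixes. The sketch below classifies an identifier using the glob patterns quoted above; it illustrates the naming rule only and is not the script's actual parsing logic. The function name classify_node_id and the returned labels are hypothetical.

```python
import fnmatch

# Glob patterns for the three identifier forms. fnmatch matches the whole
# string, so "hyperpod-i-..." does not match "i-*" and "ip-..." does not
# match "i-*" either.
_NODE_ID_PATTERNS = (
    ("eks_name", "hyperpod-i-*"),
    ("instance_id", "i-*"),
    ("slurm_name", "ip-*"),
)


def classify_node_id(node_id: str) -> str:
    """Return which identifier form `node_id` uses, or raise ValueError."""
    for kind, pattern in _NODE_ID_PATTERNS:
        if fnmatch.fnmatchcase(node_id, pattern):
            return kind
    raise ValueError(f"unrecognized node identifier: {node_id!r}")
```

Checking identifiers before invoking the script gives an early, readable error for typos instead of a failure partway through collection.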

4. Present Results

After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
  • Download the report locally
  • Help analyze collected diagnostics (see references/collection-details.md for what's in each file)
  • Prepare a summary for AWS Support

Troubleshooting

See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.