gpu-container-setup-flagos


GPU Container Setup Skill

This skill automates multi-vendor GPU container setup for PyTorch workloads.

Supported GPU Vendors

| Vendor | PyTorch Backend | Detection |
| --- | --- | --- |
| NVIDIA | CUDA | `nvidia-smi` |
| AMD | ROCm (HIP) | `rocm-smi`, `/opt/rocm` |
| Ascend | `torch_npu` | `npu-smi`, `/usr/local/Ascend` |
| Metax | `torch_musa` | `mx-smi`, `/opt/metax` |
| Iluvatar | `torch_corex` | `ixsmi`, `/opt/iluvatar` |

Execution Flow

When invoked, follow these steps:

Step 1: Parse Arguments

Check whether the user provided:

- `--vendor <name>`: force a specific vendor (skip detection)
- `--image <image>`: force a specific container image
- `--data <path>`: force a specific data mount path
- `--name <name>`: container name (default: `pytorch-gpu`)
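For reference, the flag set above can be modeled with a minimal parser. This is a sketch only; the skill itself parses slash-command arguments, not a process argv:

```python
import argparse

# Sketch of Step 1 parsing; flag names and the default match the list above.
parser = argparse.ArgumentParser(prog="gpu-container-setup")
parser.add_argument("--vendor", help="force a specific vendor (skip detection)")
parser.add_argument("--image", help="force a specific container image")
parser.add_argument("--data", help="force a specific data mount path")
parser.add_argument("--name", default="pytorch-gpu", help="container name")

args = parser.parse_args(["--vendor", "ascend"])
print(args.vendor, args.name)  # prints "ascend pytorch-gpu"
```

Unspecified flags fall back to their defaults, so only `--name` needs one.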

Step 2: Detect GPU Vendor

Run the detection script:

```bash
python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py
```

Expected output:

```json
{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}
```

If detection fails and no `--vendor` flag was provided, ask the user which vendor to use.
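The detection logic can be approximated as follows. This is a hedged sketch keyed off the CLI tools and install paths from the vendor table; the actual contents of `detect_gpu.py` are not reproduced here:

```python
import json
import os
import shutil

# Vendor -> (detection CLI, characteristic install path), per the vendor table.
VENDOR_CHECKS = [
    ("nvidia", "nvidia-smi", None),
    ("amd", "rocm-smi", "/opt/rocm"),
    ("ascend", "npu-smi", "/usr/local/Ascend"),
    ("metax", "mx-smi", "/opt/metax"),
    ("iluvatar", "ixsmi", "/opt/iluvatar"),
]

def detect_vendor():
    """Return the first vendor whose CLI tool or install path is present."""
    for vendor, tool, path in VENDOR_CHECKS:
        if shutil.which(tool) or (path and os.path.isdir(path)):
            return vendor
    return None

if __name__ == "__main__":
    print(json.dumps({"vendor": detect_vendor()}))
```

The real script additionally enumerates devices and a count, which requires invoking the vendor CLI.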

Step 3: Find Data Disk

Run the data-disk detection script:

```bash
python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py
```

Expected output:

```json
{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}
```

If no suitable disk is found, ask the user for a data mount path.
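A minimal sketch of such a heuristic is below. The candidate mount points are assumptions for illustration; the real `find_data_disk.py` may scan `/proc/mounts` or `df` output instead:

```python
import json
import shutil

def find_data_disk(candidates=("/mnt/data", "/data", "/mnt")):
    """Pick the candidate mount point with the most free space.

    Candidate paths are illustrative assumptions, not the script's real list.
    """
    best, best_free = None, 0
    for path in candidates:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue  # path does not exist on this host
        if usage.free > best_free:
            best, best_free = path, usage.free
    return {"data_disk": best, "found": best is not None}

print(json.dumps(find_data_disk()))
```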

Step 4: Find Container Image

步骤4:查找容器镜像

Follow strict priority order (only proceed to next if current fails):
1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User
遵循严格的优先级顺序(仅当前一步失败时才进行下一步):
1. 厂商官方镜像仓库(硬编码)→ 2. BAAI Harbor → 3. 网页搜索 → 4. 本地镜像 → 5. 询问用户

Step 4.1: Primary Vendor Hub (hardcoded URLs)

| Vendor | Registry | API/Query |
| --- | --- | --- |
| NVIDIA | `nvcr.io` | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | `ascendhub.huawei.com` | Portal: https://ascendhub.huawei.com |
| Metax | `registry.metax-tech.com` | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | `hub.iluvatar.com` | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | `docker.io` (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |

Example: Query NGC for latest NVIDIA PyTorch

```bash
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" \
  | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
```

Step 4.2: BAAI Harbor (fallback)

Only if Step 4.1 fails (registry unreachable, no suitable image, or pull fails).

Query BAAI Harbor

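The original query snippet was lost in extraction. A hedged reconstruction using the Docker Registry v2 tag-list endpoint is shown below; the registry host and project path are assumptions taken from the image name in the Example Usage section (`harbor.baai.ac.cn/flagrelease-public/ngctorch`):

```python
import json
import urllib.request

def harbor_tags_url(repo, registry="harbor.baai.ac.cn"):
    """Build a Docker Registry v2 tag-list URL (host and repo layout assumed)."""
    return f"https://{registry}/v2/{repo}/tags/list"

def list_tags(repo):
    """Fetch the tag list; a live network call, so the registry must be reachable."""
    with urllib.request.urlopen(harbor_tags_url(repo), timeout=10) as resp:
        return json.load(resp).get("tags", [])
```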

Step 4.3: Web Search (fallback)

Only if Steps 4.1 and 4.2 fail. Search for `"<vendor> pytorch docker official"`.
Step 4.4: Local Images (fallback)

Only if Steps 4.1-4.3 fail. Check `docker images | grep pytorch`.

Test Before Use

```bash
docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"
```

If the test fails, try the next source. If all sources fail, ask the user for an image.
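The strict priority order plus test-before-use amounts to a small loop. In this sketch the `works` callback stands in for the docker pull and import-torch test above:

```python
def first_working(candidates, works):
    """Return the first candidate image that passes the `works` check,
    mirroring the strict priority order of Step 4; None if all fail."""
    for image in candidates:
        if works(image):
            return image
    return None

# Example source list (Step 4.1 hub first, then the BAAI Harbor fallback);
# in practice `works` would shell out to docker as shown above.
sources = [
    "nvcr.io/nvidia/pytorch:24.01-py3",
    "harbor.baai.ac.cn/flagrelease-public/ngctorch:2601",
]
```

Returning `None` maps to the final "Ask User" step of the chain.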

Step 4.5: Update Skill (self-improvement)

IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update `references/image-sources.md` to add the newly discovered vendor hub as a primary source. This makes future lookups faster.

After successful web search discovery:

1. Verify the image works (pull + PyTorch test + GPU test)
2. Extract the registry URL pattern
3. Update the Step 1 section of references/image-sources.md with the new vendor hub
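The update itself could be as simple as appending an entry to the reference file. The markdown-row format here is an assumption mirroring the Step 4.1 table; adapt it to the file's actual layout:

```python
def record_hub(path, vendor, registry, query_url):
    """Append a newly verified vendor hub to image-sources.md.

    The table-row format is an illustrative assumption, not the file's
    documented schema.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"| {vendor} | `{registry}` | {query_url} |\n")
```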

Step 5: Build Docker Command

Refer to `references/mount-requirements.md` for vendor-specific requirements.
NVIDIA:

```bash
docker run -d --gpus all \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

AMD/ROCm:

```bash
docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Ascend:

```bash
docker run -d \
  --device=/dev/davinci0 --device=/dev/davinci1 ... \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend:ro \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Metax:

```bash
docker run -d \
  --device=/dev/mx0 --device=/dev/mx1 ... \
  -v /opt/metax:/opt/metax:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Iluvatar:

```bash
docker run -d \
  --device=/dev/bi0 --device=/dev/bi1 ... \
  -v /opt/iluvatar:/opt/iluvatar:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```
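The per-vendor recipes above can also be assembled programmatically. This is a sketch only: the per-device enumeration for Ascend, Metax, and Iluvatar is elided (matching the `...` in the recipes), so only the fixed flags are covered:

```python
# Fixed vendor flags from the recipes above; per-index device flags
# (davinci0/mx0/bi0, ...) are intentionally omitted, as in the recipes.
VENDOR_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
    "ascend": ["--device=/dev/davinci_manager", "--device=/dev/devmm_svm",
               "--device=/dev/hisi_hdc",
               "-v", "/usr/local/Ascend:/usr/local/Ascend:ro"],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}

def docker_run_cmd(vendor, image, data_disk, name="pytorch-gpu"):
    """Return the argv list for the vendor-specific docker run command."""
    return ["docker", "run", "-d", *VENDOR_FLAGS.get(vendor, []),
            "--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]
```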

Step 6: Start Container

Execute the docker run command. If a container with the same name already exists:

1. If it is running, offer to use the existing container or replace it.
2. If it is stopped, offer to restart it or replace it.

Step 7: Validate PyTorch GPU

Copy the validation script into the container and run it:

```bash
docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py
```

Expected output:

```json
{
  "status": "PASS",
  "backend": "npu",
  "device_count": 8,
  "device_names": ["Ascend 910B", ...],
  "tests": {
    "device_detection": true,
    "tensor_creation": true,
    "matrix_multiply": true,
    "gpu_to_cpu_transfer": true
  }
}
```
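The shape of that report suggests validation logic along these lines. This is a hedged sketch, not the contents of `validate_pytorch.py`; the backend probe order and attribute names are assumptions:

```python
# Assumed probe order; each name is the torch attribute exposing the backend
# (torch.cuda covers both CUDA and ROCm builds, torch.npu appears after
# importing torch_npu, and so on).
BACKENDS = ["cuda", "npu", "musa", "corex"]

def pick_backend(torch_module):
    """Return the first accelerator backend reporting availability, else 'cpu'."""
    for name in BACKENDS:
        backend = getattr(torch_module, name, None)
        if backend is not None and backend.is_available():
            return name
    return "cpu"

def make_report(backend, device_names, tests):
    """Aggregate per-test booleans into the report shape shown above."""
    return {
        "status": "PASS" if all(tests.values()) else "FAIL",
        "backend": backend,
        "device_count": len(device_names),
        "device_names": device_names,
        "tests": tests,
    }
```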

Step 8: Report Results

Summarize for the user:

- GPU vendor and devices detected
- Container name and image used
- Data mount path
- Validation status
- How to access: `docker exec -it pytorch-gpu bash`

Error Handling

| Error | Action |
| --- | --- |
| No GPU detected | Ask the user for a vendor, or check drivers |
| Image pull fails | Try an alternative registry or web search |
| Container start fails | Check device permissions, show the error |
| Validation fails | Show the detailed error, suggest fixes |

Reference Files

- `references/gpu-detection.md`: detection methods by vendor
- `references/image-sources.md`: image discovery guide (registry APIs, priority order, selection criteria)
- `references/mount-requirements.md`: vendor mount specifications

Example Usage

User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601