gpu-container-setup-flagos
GPU Container Setup Skill
This skill automates multi-vendor GPU container setup for PyTorch workloads.
Supported GPU Vendors
| Vendor | PyTorch Backend | Detection |
|---|---|---|
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi |
| Ascend | torch_npu | npu-smi |
| Metax | torch_musa | mx-smi |
| Iluvatar | torch_corex | ixsmi |
Execution Flow
When invoked, follow these steps:
Step 1: Parse Arguments
Check if the user provided:

- --vendor <name> - Force a specific vendor (skip detection)
- --image <image> - Force a specific container image
- --data <path> - Force a specific data mount path
- --name <name> - Container name (default: pytorch-gpu)
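The flag handling above can be sketched with argparse; this is a minimal illustration of the defaulting behavior, not the skill's actual parser.

```python
import argparse

def parse_args(argv):
    """Parse the skill's optional flags; unset flags fall back to detection."""
    parser = argparse.ArgumentParser(prog="gpu-container-setup")
    parser.add_argument("--vendor", help="force a specific vendor (skip detection)")
    parser.add_argument("--image", help="force a specific container image")
    parser.add_argument("--data", help="force a specific data mount path")
    parser.add_argument("--name", default="pytorch-gpu", help="container name")
    return parser.parse_args(argv)

args = parse_args(["--vendor", "ascend"])
print(args.vendor, args.name)  # → ascend pytorch-gpu
```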
Step 2: Detect GPU Vendor

Run the detection script:

```bash
python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py
```

Expected output:

```json
{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}
```

If detection fails and no --vendor flag was provided, ask the user which vendor to use.
Step 3: Find Data Disk
Run the data disk detection:

```bash
python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py
```

Expected output:

```json
{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}
```

If no suitable disk is found, ask the user for a data mount path.
Step 4: Find Container Image
Follow strict priority order (only proceed to the next source if the current one fails):

1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User

Step 4.1: Primary Vendor Hub (hardcoded URLs)
| Vendor | Registry | API/Query |
|---|---|---|
| NVIDIA | nvcr.io/nvidia/pytorch | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | | Portal: https://ascendhub.huawei.com |
| Metax | | |
| Iluvatar | | |
| AMD | docker.io/rocm/pytorch | Docker Hub tags API |

Example: Query NGC for latest NVIDIA PyTorch
```bash
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
```

Step 4.2: BAAI Harbor (fallback)
Only if Step 4.1 fails (unreachable, no image, pull fails).

```bash
# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-<vendor>"
```

Step 4.3: Web Search (fallback)
Only if Steps 4.1 and 4.2 fail. Search for "<vendor> pytorch docker official".

Step 4.4: Local Images (fallback)
Only if Steps 4.1-4.3 fail. Check `docker images | grep pytorch`.

Test Before Use
```bash
docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"
```

If the test fails, try the next source. If all sources fail, ask the user for an image.
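The priority order plus the test-before-use rule amounts to a fallback chain; a minimal sketch, with hypothetical resolver names standing in for Steps 4.1-4.4.

```python
def resolve_image(resolvers, test):
    """Try each image source in priority order; return (source, image) for the
    first image that both resolves and passes the test, else (None, None)."""
    for name, resolver in resolvers:
        image = resolver()
        if image is not None and test(image):
            return name, image  # first working source wins
    return None, None

# Hypothetical resolvers; a real test would docker-pull and import torch.
resolvers = [
    ("vendor-hub", lambda: None),                       # unreachable
    ("baai-harbor", lambda: "harbor.example/pytorch"),  # resolves, fails test
    ("web-search", lambda: "vendor.example/pytorch"),   # resolves, passes
]
working = {"vendor.example/pytorch"}
source, image = resolve_image(resolvers, test=lambda img: img in working)
print(source, image)  # → web-search vendor.example/pytorch
```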
Step 4.5: Update Skill (self-improvement)
IMPORTANT: If an image found via Web Search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source. This makes future lookups faster.

After a successful web search discovery:

1. Verify the image works (pull + PyTorch test + GPU test)
2. Extract the registry URL pattern
3. Update the Step 1 section of references/image-sources.md with the new vendor hub

Step 5: Build Docker Command
Refer to references/mount-requirements.md for vendor-specific requirements.

NVIDIA:

```bash
docker run -d --gpus all \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

AMD/ROCm:

```bash
docker run -d \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Ascend:

```bash
docker run -d \
  --device=/dev/davinci0 --device=/dev/davinci1 ... \
  --device=/dev/davinci_manager \
  --device=/dev/devmm_svm \
  --device=/dev/hisi_hdc \
  -v /usr/local/Ascend:/usr/local/Ascend:ro \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Metax:

```bash
docker run -d \
  --device=/dev/mx0 --device=/dev/mx1 ... \
  -v /opt/metax:/opt/metax:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Iluvatar:

```bash
docker run -d \
  --device=/dev/bi0 --device=/dev/bi1 ... \
  -v /opt/iluvatar:/opt/iluvatar:ro \
  --name pytorch-gpu \
  --shm-size=16g \
  -v <data_disk>:/data \
  <image> sleep infinity
```

Step 6: Start Container
Execute the docker run command. If a container with the same name exists:

- If it is running, offer to use the existing container or replace it
- If it is stopped, offer to restart it or replace it
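The vendor-specific docker run commands from Step 5 can be assembled from a per-vendor flag table; a minimal sketch, with the "..." device lists from the originals reduced to the NVIDIA and AMD cases only.

```python
# Runtime flags distilled from the Step 5 commands (NVIDIA and AMD shown;
# the other vendors would add their --device/-v entries the same way).
VENDOR_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
}

def build_docker_cmd(vendor, image, data_disk, name="pytorch-gpu"):
    """Assemble the docker run argv for one vendor."""
    return ["docker", "run", "-d", *VENDOR_FLAGS.get(vendor, []),
            "--name", name, "--shm-size=16g",
            "-v", f"{data_disk}:/data", image, "sleep", "infinity"]

print(" ".join(build_docker_cmd("nvidia", "nvcr.io/nvidia/pytorch:24.01-py3", "/mnt/data")))
```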
Step 7: Validate PyTorch GPU
Copy the validation script into the container and run it:

```bash
docker cp .claude/skills/gpu-container-setup/scripts/validate_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/validate_pytorch.py
```

Expected output:

```json
{
  "status": "PASS",
  "backend": "npu",
  "device_count": 8,
  "device_names": ["Ascend 910B", ...],
  "tests": {
    "device_detection": true,
    "tensor_creation": true,
    "matrix_multiply": true,
    "gpu_to_cpu_transfer": true
  }
}
```
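The shape of such a validation script can be sketched as below; this is an assumption about its structure, covers only the CUDA device check, and degrades to SKIP when torch is absent so the sketch runs anywhere.

```python
import importlib.util
import json

def validate():
    """Sketch of the validation flow: report SKIP on hosts without PyTorch;
    the real script also exercises the vendor backend and GPU transfers."""
    if importlib.util.find_spec("torch") is None:
        return {"status": "SKIP", "reason": "torch not installed"}
    import torch
    tests = {}
    tests["device_detection"] = torch.cuda.is_available()  # CUDA path only
    x = torch.ones(2, 2)
    tests["tensor_creation"] = bool(x.sum().item() == 4.0)
    tests["matrix_multiply"] = bool((x @ x).shape == (2, 2))
    status = "PASS" if all(tests.values()) else "FAIL"
    return {"status": status, "tests": tests}

print(json.dumps(validate()))
```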
Step 8: Report Results
Summarize to user:
- GPU vendor and devices detected
- Container name and image used
- Data mount path
- Validation status
- How to access: `docker exec -it pytorch-gpu bash`
Error Handling
| Error | Action |
|---|---|
| No GPU detected | Ask user for vendor or check drivers |
| Image pull fails | Try alternative registry or web search |
| Container start fails | Check device permissions, show error |
| Validation fails | Show detailed error, suggest fixes |
Reference Files
- references/gpu-detection.md - Detection methods by vendor
- references/image-sources.md - Image discovery guide (registry APIs, priority order, selection criteria)
- references/mount-requirements.md - Vendor mount specifications
Example Usage
User: /gpu-container-setup
User: setup a pytorch container
User: start container with ascend GPU
User: /gpu-container-setup --image nvcr.io/nvidia/pytorch:24.01-py3
User: /gpu-container-setup --image harbor.baai.ac.cn/flagrelease-public/ngctorch:2601