vast-gpu
Vast.ai GPU Management

Manage vast.ai GPU instance: $ARGUMENTS
Overview
Rent cheap, capable GPUs from vast.ai on demand. This skill analyzes the training task to determine GPU requirements, searches for the best-value offers, presents options with estimated total cost, and handles the full lifecycle: rent → setup → run → destroy.
Users do NOT specify GPU models or hardware. They describe the task — the skill figures out what to rent.
Prerequisites: The `vastai` CLI must be installed (requires Python ≥ 3.10) and authenticated:

```bash
pip install vastai
vastai set api-key YOUR_API_KEY
```

If your system Python is < 3.10, create a virtual environment with Python ≥ 3.10 (e.g., `uv venv`, `conda create`, `pyenv`, etc.) and install `vastai` there.
SSH public key must be uploaded at https://cloud.vast.ai/manage-keys/ BEFORE creating any instance. Keys are baked into instances at creation time — if you add a key after renting, you must destroy and re-create the instance.
State File
All active vast.ai instances are tracked in `vast-instances.json` at the project root:

```json
[
  {
    "instance_id": 33799165,
    "offer_id": 25831376,
    "gpu_name": "RTX_3060",
    "num_gpus": 1,
    "dph": 0.0414,
    "ssh_url": "ssh://root@1.208.108.242:58955",
    "ssh_host": "1.208.108.242",
    "ssh_port": 58955,
    "created_at": "2026-03-29T21:12:00Z",
    "status": "running",
    "experiment": "exp01_baseline",
    "estimated_hours": 4.0,
    "estimated_cost": 0.17
  }
]
```

This file is the source of truth for `/run-experiment` and `/monitor-experiment` to connect to vast.ai instances.
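As a sketch of how another skill might consume this file (the helper name is ours, not part of the spec):

```python
import json
from pathlib import Path

def active_instances(path="vast-instances.json"):
    """Return entries from the state file that are still running."""
    p = Path(path)
    if not p.exists():
        return []
    return [i for i in json.loads(p.read_text()) if i.get("status") == "running"]

# Print a one-line summary per active instance
for inst in active_instances():
    print(inst["instance_id"], inst["gpu_name"], inst["ssh_url"])
```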
Workflow
Action: Provision (default)
Analyze the task, find the best GPU, and present cost-optimized options. This is the main entry point — called directly, or automatically by `/run-experiment` when `gpu: vast` is set.

Step 1: Analyze Task Requirements
Read available context to determine what the task needs:

- From the experiment plan (`refine-logs/EXPERIMENT_PLAN.md`):
  - Compute budget (total GPU-hours)
  - Hardware hints (e.g., "4x RTX 3090")
  - Model architecture and dataset size
  - Run order and per-milestone cost estimates
- From experiment scripts (if already written):
  - Model size — scan for model class, `num_parameters`, config files
  - Batch size, sequence length — estimate VRAM from these
  - Dataset — estimate training time from dataset size + epochs
  - Multi-GPU — check for `DataParallel`, `DistributedDataParallel`, `accelerate`, `deepspeed`
- From user description (if no plan/scripts exist):
  - Model name/size (e.g., "fine-tune LLaMA-7B", "train ResNet-50")
  - Dataset scale (e.g., "ImageNet", "10k samples")
  - Estimated duration (e.g., "about 2 hours")
Step 2: Determine GPU Requirements
Based on the task analysis, determine:
| Factor | How to estimate |
|---|---|
| Min VRAM | Model params × 4 bytes (fp32) or × 2 (fp16/bf16) + optimizer states + activations. Rules of thumb: 7B model ≈ 16 GB (fp16), 13B ≈ 28 GB, 70B ≈ 140 GB (needs multi-GPU). ResNet/ViT ≈ 4-8 GB. Add 20% headroom. |
| Num GPUs | 1 unless: model doesn't fit in single GPU VRAM, or scripts use DDP/FSDP/DeepSpeed, or plan specifies multi-GPU |
| Est. hours | From experiment plan's cost column, or: (dataset_size × epochs) / (throughput × batch_size). Default to user estimate if available. Add 30% buffer for setup + unexpected slowdowns |
| Min disk | 20 GB base + model checkpoint size + dataset size. Default: 50 GB |
| CUDA version | Match PyTorch version. PyTorch 2.x needs CUDA ≥ 11.8. Default: 12.1 |
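As a rough illustration of these rules (not part of the skill itself; the constants mirror the rules of thumb above and the function name is our own):

```python
def estimate_requirements(params_billions, precision="fp16", user_hours=None):
    """Rough GPU needs from model size, per the rules of thumb above."""
    bytes_per_param = 4 if precision == "fp32" else 2
    # Weights + 20% headroom (optimizer states/activations vary; see table)
    vram_gb = params_billions * bytes_per_param * 1.2
    # Default to the user's estimate, then add a 30% buffer for setup/slowdowns
    hours = (user_hours or 1.0) * 1.3
    # 20 GB base + checkpoint size, never below the 50 GB default
    disk_gb = max(50, 20 + round(params_billions * bytes_per_param))
    return {
        "min_vram_gb": round(vram_gb, 1),
        "est_hours": round(hours, 1),
        "min_disk_gb": disk_gb,
    }

# A 7B model in fp16 with a user estimate of 4 hours:
print(estimate_requirements(7, "fp16", user_hours=4))
# → {'min_vram_gb': 16.8, 'est_hours': 5.2, 'min_disk_gb': 50}
```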
Step 3: Search Offers
Search across multiple GPU tiers to find the best value. Always search broadly — do NOT limit to one GPU model:
```bash
# Tier 1: Budget GPUs (good for small models, fine-tuning, ablations)
vastai search offers "gpu_ram>=<MIN_VRAM> num_gpus>=<N> reliability>0.95 inet_down>100" -o 'dph+' --storage <DISK> --limit 10

# Tier 2: If VRAM > 24 GB, also search high-VRAM cards specifically
vastai search offers "gpu_ram>=48 num_gpus>=<N> reliability>0.95" -o 'dph+' --storage <DISK> --limit 5
```
The output is a table with columns: `ID`, `CUDA`, `N` (GPU count), `Model`, `PCIE`, `cpu_ghz`, `vCPUs`, `RAM`, `Disk`, `$/hr`, `DLP` (deep learning perf), `score`, `NV Driver`, `Net_up`, `Net_down`, `R` (reliability %), `Max_Days`, `mach_id`, `status`, `host_id`, `ports`, `country`.
The **first column (`ID`)** is the offer ID needed for `vastai create instance`.
**Step 4: Present Cost-Optimized Options**
Present **3 options** to the user, ranked by estimated total cost:
Task analysis:
- Model: [model name/size] → estimated VRAM: ~[X] GB
- Training: ~[Y] hours estimated
- Requirements: [N] GPU(s), ≥[X] GB VRAM, ~[Z] GB disk
Recommended options (sorted by estimated total cost):
| # | GPU | VRAM | $/hr | Est. Hours | Est. Total | Reliability | Offer ID |
|---|---|---|---|---|---|---|---|
| 1 | RTX 3060 | 12 GB | $0.04 | ~6h | ~$0.25 | 99.4% | 25831376 |
| 2 | RTX 4090 | 24 GB | $0.28 | ~4h | ~$1.12 | 99.2% | 6995713 |
| 3 | A100 SXM | 80 GB | $0.95 | ~2h | ~$1.90 | 99.5% | 7023456 |
Option 1 is cheapest overall. Option 3 finishes fastest.
Pick a number (or type a different offer ID):
**Key presentation rules:**
- Always show **estimated total cost** ($/hr × estimated hours), not just $/hr
- Faster GPUs have shorter estimated hours (scale by relative FLOPS)
- Flag if a cheap option has reliability < 0.97 ("budget pick — 3% chance of interruption")
- If task is small (<1 hour), recommend interruptible pricing for even lower cost
- If no offers meet VRAM requirements, explain why and suggest alternatives (e.g., multi-GPU, quantization)
**Relative speed scaling (approximate, for estimating hours across GPU tiers):**
| GPU | Relative Speed (FP16) |
|-----|-----------------------:|
| RTX 3060 | 0.5× |
| RTX 3090 | 1.0× |
| RTX 4090 | 1.6× |
| A5000 | 0.9× |
| A6000 | 1.1× |
| L40S | 1.5× |
| A100 SXM | 2.0× |
| H100 SXM | 3.3× |
Use these to scale the base estimated hours across offers.

Action: Rent
Create an instance from a user-selected offer.
Step 1: Create Instance
```bash
vastai create instance <OFFER_ID> \
  --image <DOCKER_IMAGE> \
  --disk <DISK_GB> \
  --ssh \
  --direct \
  --onstart-cmd "apt-get update && apt-get install -y git screen rsync"
```

Default Docker image: `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` (override via the `image:` field in `CLAUDE.md` if set).

The output looks like:

```
Started. {'success': True, 'new_contract': 33799165, 'instance_api_key': '...'}
```

The `new_contract` value is the instance ID — save this for all subsequent commands.

Step 2: Wait for Instance Ready
Poll instance status every 20 seconds until it's running (typically takes 30-60 seconds, max ~5 minutes):
```bash
vastai show instances --raw | python3 -c "
import sys, json
instances = json.load(sys.stdin)
for inst in instances:
    if inst['id'] == <INSTANCE_ID>:
        print(inst['actual_status'])
"
```

Wait states: `loading` → `running`. If stuck in `loading` for >5 minutes, warn the user — the host may be slow or the image may be large.

Step 3: Get SSH Connection Details
```bash
vastai ssh-url <INSTANCE_ID>
```

This returns a URL in the format `ssh://root@<HOST>:<PORT>`. Parse out host and port from this URL. Example:

- Input: `ssh://root@1.208.108.242:58955`
- Host: `1.208.108.242`, Port: `58955`

Important: Always use `vastai ssh-url` to get connection details — do NOT rely on `ssh_host`/`ssh_port` from `vastai show instances`, as those may point to proxy servers that differ from the direct connection endpoint.
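One way to do the parsing with the Python standard library (a sketch; the function name is ours):

```python
from urllib.parse import urlparse

def parse_ssh_url(url):
    """Split an ssh:// URL into (user, host, port)."""
    u = urlparse(url)
    return u.username, u.hostname, u.port

print(parse_ssh_url("ssh://root@1.208.108.242:58955"))
# → ('root', '1.208.108.242', 58955)
```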
Step 4: Verify SSH Connectivity
```bash
ssh -o StrictHostKeyChecking=no -o ConnectTimeout=15 -p <PORT> root@<HOST> "nvidia-smi && echo 'CONNECTION_OK'"
```

If SSH fails with "Permission denied (publickey)":
- The user's SSH key was not uploaded to https://cloud.vast.ai/manage-keys/ before the instance was created
- Fix: Destroy this instance, have user upload their key, then create a new instance. Keys are baked in at creation time — there is no way to add keys to a running instance.
If SSH fails with "Connection refused":
- The instance may still be initializing. Retry up to 3 times with 15-second intervals.
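The retry logic can be wrapped in a small helper (an illustrative sketch; the `retry` function is our own, not part of the vastai CLI):

```shell
# retry <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds, up to max_attempts times.
retry() {
  local max=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example, using the placeholders above:
# retry 3 15 ssh -o StrictHostKeyChecking=no -o ConnectTimeout=15 -p "$PORT" "root@$HOST" "echo CONNECTION_OK"
```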
Step 5: Update State File
Write/update `vast-instances.json` with the new instance details, including the `ssh_url` from Step 3 and the estimated hours and cost.

Step 6: Report
Vast.ai instance ready:
- Instance ID: <ID>
- GPU: <GPU_NAME> x <NUM_GPUS>
- Cost: $<DPH>/hr (estimated total: ~$<TOTAL>)
- SSH: ssh -p <PORT> root@<HOST>
- Docker: <IMAGE>
To deploy: /run-experiment (will auto-detect this instance)
To destroy when done: /vast-gpu destroy <ID>

Action: Setup
Set up the rented instance for a specific experiment. Called automatically by `/run-experiment` when targeting a vast.ai instance.

Step 1: Install Dependencies
```bash
ssh -p <PORT> root@<HOST> "pip install -q wandb tensorboard scipy scikit-learn pandas"
```

If a `requirements.txt` exists in the project, install that instead:

```bash
scp -P <PORT> requirements.txt root@<HOST>:/workspace/
ssh -p <PORT> root@<HOST> "pip install -q -r /workspace/requirements.txt"
```

Note: `scp` uses uppercase `-P` for the port, while `ssh` uses lowercase `-p`.
Step 2: Sync Code
```bash
rsync -avz -e "ssh -p <PORT>" \
  --include='*.py' --include='*.yaml' --include='*.yml' --include='*.json' \
  --include='*.txt' --include='*.sh' --include='*/' \
  --exclude='*.pt' --exclude='*.pth' --exclude='*.ckpt' \
  --exclude='__pycache__' --exclude='.git' --exclude='data/' \
  --exclude='wandb/' --exclude='outputs/' \
  ./ root@<HOST>:/workspace/project/
```

Step 3: Verify Setup
```bash
ssh -p <PORT> root@<HOST> "cd /workspace/project && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}\")'"
```

Expected output: `PyTorch 2.1.0, CUDA: True, GPUs: 1` (or more GPUs if multi-GPU instance).

Action: Destroy
Tear down a vast.ai instance to stop billing.
Step 1: Confirm Results Collected
Before destroying, check if there are experiment results to download:
```bash
ssh -p <PORT> root@<HOST> "ls /workspace/project/results/ 2>/dev/null || echo 'NO_RESULTS_DIR'"
```

If results exist, download them first:
```bash
rsync -avz -e "ssh -p <PORT>" root@<HOST>:/workspace/project/results/ ./results/
```

Also download logs:
```bash
scp -P <PORT> root@<HOST>:/workspace/*.log ./logs/ 2>/dev/null
```

Step 2: Destroy Instance
```bash
vastai destroy instance <INSTANCE_ID>
```

Output: `destroying instance <INSTANCE_ID>.`

Destruction is irreversible — all data on the instance is permanently deleted.
Step 3: Update State File
Remove the instance from `vast-instances.json` or mark its status as `destroyed`.

Step 4: Report Cost
Calculate actual cost based on creation time and $/hr:
Instance <ID> destroyed.
- Duration: ~X.X hours
- Actual cost: ~$X.XX (estimated was $Y.YY)
- Results downloaded to: ./results/
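The duration and cost arithmetic can be sketched from the state-file fields (illustrative; the function name is ours):

```python
from datetime import datetime, timezone

def actual_cost(created_at, dph, now=None):
    """Hours since creation, and total $ at dph ($/hr), rounded for the report."""
    start = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    hours = (now - start).total_seconds() / 3600
    return round(hours, 1), round(hours * dph, 2)

# Using the example instance from the state file above:
# actual_cost("2026-03-29T21:12:00Z", 0.0414)
```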
Action: List
Show all active vast.ai instances:
```bash
vastai show instances
```

Cross-reference with `vast-instances.json` for experiment associations.

Action: Destroy All
Tear down all active instances (use after all experiments complete):
- Download results from each instance
- Destroy all instances
- Clear `vast-instances.json`
- Report total cost
Key Rules
- Task-driven selection — NEVER ask users to pick GPU models. Analyze the task, estimate requirements, present cost-optimized options with total price
- ALWAYS destroy instances when experiments are done — vast.ai bills per second, leaving instances running wastes money
- Download results before destroying — data is lost permanently on destroy
- Prefer on-demand pricing for short experiments (<2 hours). Suggest interruptible/bid pricing for long runs (>4 hours) with checkpointing
- Check reliability > 0.95 — unreliable hosts may crash mid-training
- Use `--direct` SSH when creating instances — faster than proxy SSH
- Always use `vastai ssh-url <ID>` to get connection details — the host/port from `show instances` may differ
- SSH keys must be uploaded BEFORE creating instances — keys are baked in at creation time. If SSH fails with "Permission denied", destroy and recreate after adding the key
- Default Docker image: `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` unless user specifies otherwise
- Working directory on instance: `/workspace/` (Docker default). Code syncs to `/workspace/project/`
- State file `vast-instances.json` must stay up to date — other skills depend on it
- Show estimated total cost, not just $/hr — a $0.90/hr GPU that finishes in 2h ($1.80) beats a $0.30/hr GPU that takes 8h ($2.40)
- The `vastai` CLI requires Python ≥ 3.10 — if system Python is older, use a conda env
CLAUDE.md Example
Users only need to set `gpu: vast` — no hardware preferences required:

```markdown
Vast.ai

- gpu: vast # tells run-experiment to use vast.ai
- auto_destroy: true # auto-destroy after experiment completes (default: true)
- max_budget: 5.00 # optional: max total $ to spend (skill warns if estimate exceeds this)
- image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel # optional: override Docker image
```

The skill analyzes experiment scripts and plans to determine what GPU to rent. No need to specify GPU model, VRAM, or instance count.

Composing with Other Skills
/run-experiment "train model" ← detects gpu: vast, calls /vast-gpu provision
↳ /vast-gpu provision ← analyzes task, presents options with cost
↳ user picks option ← rent + setup + deploy
↳ /vast-gpu destroy ← auto-destroy when done (if auto_destroy: true)
/vast-gpu provision ← manual: analyze task + show options
/vast-gpu rent <offer_id> ← manual: rent a specific offer
/vast-gpu list ← show active instances
/vast-gpu destroy <instance_id> ← tear down, stop billing
/vast-gpu destroy-all ← tear down everything