vast-gpu

# Vast.ai GPU Management

Manage vast.ai GPU instance: $ARGUMENTS

## Overview

Rent cheap, capable GPUs from vast.ai on demand. This skill analyzes the training task to determine GPU requirements, searches for the best-value offers, presents options with estimated total cost, and handles the full lifecycle: rent → setup → run → destroy.
Users do NOT specify GPU models or hardware. They describe the task — the skill figures out what to rent.
**Prerequisites:** The `vastai` CLI must be installed (requires Python ≥ 3.10) and authenticated:

```bash
pip install vastai
vastai set api-key YOUR_API_KEY
```

If your system Python is < 3.10, create a virtual environment with Python ≥ 3.10 (e.g., `conda create`, `pyenv`, `uv venv`) and install `vastai` there.
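
For example, a minimal sketch using conda (the environment name `vast` is an arbitrary choice):

```bash
# Create an isolated Python 3.10 environment for the vastai CLI
conda create -n vast python=3.10 -y
conda activate vast
pip install vastai
vastai set api-key YOUR_API_KEY   # paste the key from your vast.ai account page
```
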
SSH public key must be uploaded at https://cloud.vast.ai/manage-keys/ BEFORE creating any instance. Keys are baked into instances at creation time — if you add a key after renting, you must destroy and re-create the instance.

## State File

All active vast.ai instances are tracked in `vast-instances.json` at the project root:

```json
[
  {
    "instance_id": 33799165,
    "offer_id": 25831376,
    "gpu_name": "RTX_3060",
    "num_gpus": 1,
    "dph": 0.0414,
    "ssh_url": "ssh://root@1.208.108.242:58955",
    "ssh_host": "1.208.108.242",
    "ssh_port": 58955,
    "created_at": "2026-03-29T21:12:00Z",
    "status": "running",
    "experiment": "exp01_baseline",
    "estimated_hours": 4.0,
    "estimated_cost": 0.17
  }
]
```

This file is the source of truth for `/run-experiment` and `/monitor-experiment` to connect to vast.ai instances.
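
As an illustration, a minimal sketch (assuming `python3` is available locally) that prints the SSH command for the first running instance in the state file:

```bash
# Extract the SSH endpoint of the first running instance
python3 - <<'EOF'
import json

with open("vast-instances.json") as f:
    instances = json.load(f)

for inst in instances:
    if inst["status"] == "running":
        print(f"ssh -p {inst['ssh_port']} root@{inst['ssh_host']}")
        break
EOF
```
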

## Workflow

### Action: Provision (default)

Analyze the task, find the best GPU, and present cost-optimized options. This is the main entry point — called directly, or automatically by `/run-experiment` when `gpu: vast` is set.

**Step 1: Analyze Task Requirements**

Read available context to determine what the task needs:

1. From the experiment plan (`refine-logs/EXPERIMENT_PLAN.md`):
   - Compute budget (total GPU-hours)
   - Hardware hints (e.g., "4x RTX 3090")
   - Model architecture and dataset size
   - Run order and per-milestone cost estimates
2. From experiment scripts (if already written):
   - Model size — scan for the model class, `num_parameters`, config files
   - Batch size, sequence length — estimate VRAM from these
   - Dataset — estimate training time from dataset size + epochs
   - Multi-GPU — check for `DataParallel`, `DistributedDataParallel`, `accelerate`, `deepspeed`
3. From user description (if no plan/scripts exist):
   - Model name/size (e.g., "fine-tune LLaMA-7B", "train ResNet-50")
   - Dataset scale (e.g., "ImageNet", "10k samples")
   - Estimated duration (e.g., "about 2 hours")

**Step 2: Determine GPU Requirements**

Based on the task analysis, determine:

| Factor | How to estimate |
|--------|-----------------|
| Min VRAM | Model params × 4 bytes (fp32) or × 2 (fp16/bf16) + optimizer states + activations. Rules of thumb: 7B model ≈ 16 GB (fp16), 13B ≈ 28 GB, 70B ≈ 140 GB (needs multi-GPU). ResNet/ViT ≈ 4-8 GB. Add 20% headroom. |
| Num GPUs | 1, unless the model doesn't fit in single-GPU VRAM, the scripts use DDP/FSDP/DeepSpeed, or the plan specifies multi-GPU |
| Est. hours | From the experiment plan's cost column, or: (dataset_size × epochs) / (throughput × batch_size). Default to the user's estimate if available. Add a 30% buffer for setup + unexpected slowdowns |
| Min disk | 20 GB base + model checkpoint size + dataset size. Default: 50 GB |
| CUDA version | Match the PyTorch version. PyTorch 2.x needs CUDA ≥ 11.8. Default: 12.1 |
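
As a worked example, here is the arithmetic behind the "7B ≈ 16 GB (fp16)" rule of thumb (a sketch; full fine-tuning with Adam adds optimizer states on top of this):

```bash
# Rough minimum-VRAM estimate for a 7B model in fp16
python3 - <<'EOF'
params = 7e9                     # 7B parameters
weights_gb = params * 2 / 1e9    # fp16: 2 bytes per parameter -> 14 GB
min_vram = weights_gb * 1.2      # add 20% headroom -> ~17 GB
print(f"weights: {weights_gb:.0f} GB, with 20% headroom: ~{min_vram:.0f} GB")
EOF
```
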
**Step 3: Search Offers**

Search across multiple GPU tiers to find the best value. Always search broadly — do NOT limit to one GPU model:

```bash
# Tier 1: Budget GPUs (good for small models, fine-tuning, ablations)
vastai search offers "gpu_ram>=<MIN_VRAM> num_gpus>=<N> reliability>0.95 inet_down>100" -o 'dph+' --storage <DISK> --limit 10

# Tier 2: If VRAM > 24 GB, also search high-VRAM cards specifically
vastai search offers "gpu_ram>=48 num_gpus>=<N> reliability>0.95" -o 'dph+' --storage <DISK> --limit 5
```

The output is a table with columns: `ID`, `CUDA`, `N` (GPU count), `Model`, `PCIE`, `cpu_ghz`, `vCPUs`, `RAM`, `Disk`, `$/hr`, `DLP` (deep learning perf), `score`, `NV Driver`, `Net_up`, `Net_down`, `R` (reliability %), `Max_Days`, `mach_id`, `status`, `host_id`, `ports`, `country`.

The **first column (`ID`)** is the offer ID needed for `vastai create instance`.

**Step 4: Present Cost-Optimized Options**

Present **3 options** to the user, ranked by estimated total cost:
Task analysis:
- Model: [model name/size] → estimated VRAM: ~[X] GB
- Training: ~[Y] hours estimated
- Requirements: [N] GPU(s), ≥[X] GB VRAM, ~[Z] GB disk

Recommended options (sorted by estimated total cost):

| # | GPU | VRAM | $/hr | Est. Hours | Est. Total | Reliability | Offer ID |
|---|-----|------|------|------------|------------|-------------|----------|
| 1 | RTX 3060 | 12 GB | $0.04 | ~6h | ~$0.25 | 99.4% | 25831376 |
| 2 | RTX 4090 | 24 GB | $0.28 | ~4h | ~$1.12 | 99.2% | 6995713 |
| 3 | A100 SXM | 80 GB | $0.95 | ~2h | ~$1.90 | 99.5% | 7023456 |

Option 1 is cheapest overall. Option 3 finishes fastest. Pick a number (or type a different offer ID):

**Key presentation rules:**
- Always show **estimated total cost** ($/hr × estimated hours), not just $/hr
- Faster GPUs have shorter estimated hours (scale by relative FLOPS)
- Flag if a cheap option has reliability < 0.97 ("budget pick — 3% chance of interruption")
- If task is small (<1 hour), recommend interruptible pricing for even lower cost
- If no offers meet VRAM requirements, explain why and suggest alternatives (e.g., multi-GPU, quantization)

**Relative speed scaling (approximate, for estimating hours across GPU tiers):**

| GPU | Relative Speed (FP16) |
|-----|-----------------------:|
| RTX 3060 | 0.5× |
| RTX 3090 | 1.0× |
| RTX 4090 | 1.6× |
| A5000 | 0.9× |
| A6000 | 1.1× |
| L40S | 1.5× |
| A100 SXM | 2.0× |
| H100 SXM | 3.3× |

Use these to scale the base estimated hours across offers.
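
A sketch of that scaling, assuming a 3-hour baseline on an RTX 3090 (1.0×) and the example $/hr figures from the options table above:

```bash
# Scale estimated hours across GPU tiers and compute estimated total cost
python3 - <<'EOF'
BASE_HOURS_3090 = 3.0  # assumed baseline: estimated hours on an RTX 3090
speed = {"RTX 3060": 0.5, "RTX 4090": 1.6, "A100 SXM": 2.0}    # relative FP16 speed
dph = {"RTX 3060": 0.04, "RTX 4090": 0.28, "A100 SXM": 0.95}   # example offer prices

for gpu, s in speed.items():
    hours = BASE_HOURS_3090 / s  # faster GPU -> fewer hours
    print(f"{gpu}: ~{hours:.1f}h, est. total ~${hours * dph[gpu]:.2f}")
EOF
```
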

### Action: Rent

Create an instance from a user-selected offer.

**Step 1: Create Instance**

```bash
vastai create instance <OFFER_ID> \
  --image <DOCKER_IMAGE> \
  --disk <DISK_GB> \
  --ssh \
  --direct \
  --onstart-cmd "apt-get update && apt-get install -y git screen rsync"
```

Default Docker image: `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` (override via the `CLAUDE.md` `image:` field if set).

The output looks like:

```
Started. {'success': True, 'new_contract': 33799165, 'instance_api_key': '...'}
```

The `new_contract` value is the **instance ID** — save this for all subsequent commands.
**Step 2: Wait for Instance Ready**

Poll instance status every 20 seconds until it's running (typically 30-60 seconds, max ~5 minutes):

```bash
vastai show instances --raw | python3 -c "
import sys, json
instances = json.load(sys.stdin)
for inst in instances:
    if inst['id'] == <INSTANCE_ID>:
        print(inst['actual_status'])
"
```

Wait states: `loading` → `running`. If stuck in `loading` for >5 minutes, warn the user — the host may be slow or the image may be large.
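
A minimal polling-loop sketch (20-second interval, ~5-minute timeout, as above; `INSTANCE_ID` is a placeholder):

```bash
INSTANCE_ID=33799165       # placeholder
for i in $(seq 1 15); do   # 15 polls x 20 s = 5 minutes
  STATUS=$(vastai show instances --raw | python3 -c "
import sys, json
for inst in json.load(sys.stdin):
    if inst['id'] == int('$INSTANCE_ID'):
        print(inst.get('actual_status') or 'loading')
")
  if [ "$STATUS" = "running" ]; then echo "Instance ready"; break; fi
  echo "Status: ${STATUS:-unknown}, waiting 20s..."
  sleep 20
done
```
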
**Step 3: Get SSH Connection Details**

```bash
vastai ssh-url <INSTANCE_ID>
```

This returns a URL in the format `ssh://root@<HOST>:<PORT>`. Parse out the host and port. Example:

- Input: `ssh://root@1.208.108.242:58955`
- Host: `1.208.108.242`, Port: `58955`

**Important:** Always use `vastai ssh-url` to get connection details — do NOT rely on `ssh_host`/`ssh_port` from `vastai show instances`, as those may point to proxy servers that differ from the direct connection endpoint.
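
Parsing the URL needs nothing beyond bash parameter expansion, for example:

```bash
# Split ssh://root@<HOST>:<PORT> into host and port
SSH_URL=$(vastai ssh-url "$INSTANCE_ID")   # e.g. ssh://root@1.208.108.242:58955
HOSTPORT=${SSH_URL#ssh://root@}            # strip scheme and user
HOST=${HOSTPORT%:*}                        # everything before the last colon
PORT=${HOSTPORT##*:}                       # everything after the last colon
echo "host=$HOST port=$PORT"
```
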
**Step 4: Verify SSH Connectivity**

```bash
ssh -o StrictHostKeyChecking=no -o ConnectTimeout=15 -p <PORT> root@<HOST> "nvidia-smi && echo 'CONNECTION_OK'"
```

If SSH fails with "Permission denied (publickey)":

- The user's SSH key was not uploaded to https://cloud.vast.ai/manage-keys/ before the instance was created.
- Fix: Destroy this instance, have the user upload their key, then create a new instance. Keys are baked in at creation time — there is no way to add keys to a running instance.

If SSH fails with "Connection refused":

- The instance may still be initializing. Retry up to 3 times with 15-second intervals.
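
A retry-loop sketch for the "Connection refused" case (reusing `HOST`/`PORT` from Step 3):

```bash
# Retry the connectivity check up to 3 times, 15 seconds apart
for attempt in 1 2 3; do
  if ssh -o StrictHostKeyChecking=no -o ConnectTimeout=15 -p "$PORT" root@"$HOST" \
       "nvidia-smi && echo CONNECTION_OK"; then
    break
  fi
  echo "Attempt $attempt failed, retrying in 15s..."
  sleep 15
done
```
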
**Step 5: Update State File**

Write/update `vast-instances.json` with the new instance details, including the `ssh_url` from Step 3 and the estimated hours and cost.
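
A sketch of the update (the field values are the examples from the State File section):

```bash
# Append the new instance record to vast-instances.json
python3 - <<'EOF'
import json, os

path = "vast-instances.json"
instances = json.load(open(path)) if os.path.exists(path) else []
instances.append({
    "instance_id": 33799165, "offer_id": 25831376,
    "gpu_name": "RTX_3060", "num_gpus": 1, "dph": 0.0414,
    "ssh_url": "ssh://root@1.208.108.242:58955",
    "ssh_host": "1.208.108.242", "ssh_port": 58955,
    "created_at": "2026-03-29T21:12:00Z", "status": "running",
    "experiment": "exp01_baseline",
    "estimated_hours": 4.0, "estimated_cost": 0.17,
})
with open(path, "w") as f:
    json.dump(instances, f, indent=2)
EOF
```
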
**Step 6: Report**

```
Vast.ai instance ready:
- Instance ID: <ID>
- GPU: <GPU_NAME> x <NUM_GPUS>
- Cost: $<DPH>/hr (estimated total: ~$<TOTAL>)
- SSH: ssh -p <PORT> root@<HOST>
- Docker: <IMAGE>

To deploy: /run-experiment (will auto-detect this instance)
To destroy when done: /vast-gpu destroy <ID>
```

### Action: Setup

Set up the rented instance for a specific experiment. Called automatically by `/run-experiment` when targeting a vast.ai instance.

**Step 1: Install Dependencies**

```bash
ssh -p <PORT> root@<HOST> "pip install -q wandb tensorboard scipy scikit-learn pandas"
```

If a `requirements.txt` exists in the project, install that instead:

```bash
scp -P <PORT> requirements.txt root@<HOST>:/workspace/
ssh -p <PORT> root@<HOST> "pip install -q -r /workspace/requirements.txt"
```

Note: `scp` uses uppercase `-P` for the port, while `ssh` uses lowercase `-p`.
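
A sketch that picks the right install path automatically (`HOST`/`PORT` come from the Rent action):

```bash
# Install from requirements.txt when present, otherwise install the defaults
if [ -f requirements.txt ]; then
  scp -P "$PORT" requirements.txt root@"$HOST":/workspace/
  ssh -p "$PORT" root@"$HOST" "pip install -q -r /workspace/requirements.txt"
else
  ssh -p "$PORT" root@"$HOST" "pip install -q wandb tensorboard scipy scikit-learn pandas"
fi
```
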
**Step 2: Sync Code**

Sync only source and config files, skipping checkpoints, data, and run artifacts (excludes first, then includes, then a catch-all exclude, so the filters apply in the intended order):

```bash
rsync -avz -e "ssh -p <PORT>" \
  --exclude='__pycache__' --exclude='.git' --exclude='data/' \
  --exclude='wandb/' --exclude='outputs/' \
  --exclude='*.pt' --exclude='*.pth' --exclude='*.ckpt' \
  --include='*/' --include='*.py' --include='*.yaml' --include='*.yml' \
  --include='*.json' --include='*.txt' --include='*.sh' \
  --exclude='*' \
  ./ root@<HOST>:/workspace/project/
```

**Step 3: Verify Setup**

```bash
ssh -p <PORT> root@<HOST> "cd /workspace/project && python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}\")'"
```

Expected output: `PyTorch 2.1.0, CUDA: True, GPUs: 1` (or more GPUs on a multi-GPU instance).

### Action: Destroy

Tear down a vast.ai instance to stop billing.

**Step 1: Confirm Results Collected**

Before destroying, check whether there are experiment results to download:

```bash
ssh -p <PORT> root@<HOST> "ls /workspace/project/results/ 2>/dev/null || echo 'NO_RESULTS_DIR'"
```

If results exist, download them first:

```bash
rsync -avz -e "ssh -p <PORT>" root@<HOST>:/workspace/project/results/ ./results/
```

Also download logs:

```bash
scp -P <PORT> root@<HOST>:/workspace/*.log ./logs/ 2>/dev/null
```

**Step 2: Destroy Instance**

```bash
vastai destroy instance <INSTANCE_ID>
```

Output: `destroying instance <INSTANCE_ID>.`

Destruction is irreversible — all data on the instance is permanently deleted.

**Step 3: Update State File**

Remove the instance from `vast-instances.json` or mark its status as `destroyed`.

**Step 4: Report Cost**

Calculate the actual cost from the creation time and $/hr:

```
Instance <ID> destroyed.
- Duration: ~X.X hours
- Actual cost: ~$X.XX (estimated was $Y.YY)
- Results downloaded to: ./results/
```
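
A sketch of the cost math, reading `created_at` and `dph` from the state file:

```bash
# Actual cost = elapsed hours since created_at x $/hr
python3 - <<'EOF'
import json
from datetime import datetime, timezone

inst = json.load(open("vast-instances.json"))[0]  # pick the relevant entry
created = datetime.fromisoformat(inst["created_at"].replace("Z", "+00:00"))
hours = (datetime.now(timezone.utc) - created).total_seconds() / 3600
print(f"Duration: ~{hours:.1f} hours, actual cost: ~${hours * inst['dph']:.2f}")
EOF
```
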

### Action: List

Show all active vast.ai instances:

```bash
vastai show instances
```

Cross-reference with `vast-instances.json` for experiment associations.
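
For example, a sketch that prints each tracked instance's experiment association next to the live list:

```bash
vastai show instances
python3 -c "
import json
for inst in json.load(open('vast-instances.json')):
    print(inst['instance_id'], inst.get('experiment', '-'), inst['status'])
"
```
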

### Action: Destroy All

Tear down all active instances (use after all experiments complete; see the sketch after this list):

1. Download results from each instance
2. Destroy all instances
3. Clear `vast-instances.json`
4. Report total cost
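
A minimal destroy-all loop (download results first, as in the Destroy action; assumes the IDs in `vast-instances.json` are current):

```bash
# Destroy every tracked instance, then clear the state file
for ID in $(python3 -c "
import json
print(' '.join(str(i['instance_id']) for i in json.load(open('vast-instances.json'))))
"); do
  vastai destroy instance "$ID"
done
echo '[]' > vast-instances.json
```
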

## Key Rules

- **Task-driven selection** — NEVER ask users to pick GPU models. Analyze the task, estimate requirements, present cost-optimized options with total price
- **ALWAYS destroy instances when experiments are done** — vast.ai bills per second; leaving instances running wastes money
- **Download results before destroying** — data is lost permanently on destroy
- Prefer on-demand pricing for short experiments (<2 hours). Suggest interruptible/bid pricing for long runs (>4 hours) with checkpointing
- Check reliability > 0.95 — unreliable hosts may crash mid-training
- Use **`--direct` SSH** when creating instances — faster than proxy SSH
- Always use `vastai ssh-url <ID>` to get connection details — the host/port from `show instances` may differ
- SSH keys must be uploaded BEFORE creating instances — keys are baked in at creation time. If SSH fails with "Permission denied", destroy and recreate after adding the key
- Default Docker image: `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel` unless the user specifies otherwise
- Working directory on the instance: `/workspace/` (Docker default). Code syncs to `/workspace/project/`
- State file `vast-instances.json` **must stay up to date** — other skills depend on it
- Show estimated total cost, not just $/hr — a $0.90/hr GPU that finishes in 2h ($1.80) beats a $0.30/hr GPU that takes 8h ($2.40)
- `vastai` CLI **requires Python ≥ 3.10** — if system Python is older, use a conda env

## CLAUDE.md Example

Users only need to set `gpu: vast` — no hardware preferences required:

```markdown
## Vast.ai
- gpu: vast  # tells run-experiment to use vast.ai
- auto_destroy: true  # auto-destroy after experiment completes (default: true)
- max_budget: 5.00  # optional: max total $ to spend (skill warns if estimate exceeds this)
- image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel  # optional: override Docker image
```

The skill analyzes experiment scripts and plans to determine what GPU to rent. No need to specify GPU model, VRAM, or instance count.

## Composing with Other Skills

```
/run-experiment "train model"       ← detects gpu: vast, calls /vast-gpu provision
  ↳ /vast-gpu provision             ← analyzes task, presents options with cost
  ↳ user picks option               ← rent + setup + deploy
  ↳ /vast-gpu destroy               ← auto-destroy when done (if auto_destroy: true)

/vast-gpu provision                 ← manual: analyze task + show options
/vast-gpu rent <offer_id>           ← manual: rent a specific offer
/vast-gpu list                      ← show active instances
/vast-gpu destroy <instance_id>     ← tear down, stop billing
/vast-gpu destroy-all               ← tear down everything
```