skypilot-multi-cloud-orchestration

SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.

When to use SkyPilot

Use SkyPilot when:
  • Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
  • Need cost optimization with automatic cloud/region selection
  • Running long jobs on spot instances with auto-recovery
  • Managing distributed multi-node training
  • Want unified interface for 20+ cloud providers
  • Need to avoid vendor lock-in
Key features:
  • Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
  • Cost optimization: Automatic cheapest cloud/region selection
  • Spot instances: 3-6x cost savings with automatic recovery
  • Distributed training: Multi-node jobs with gang scheduling
  • Managed jobs: Auto-recovery, checkpointing, fault tolerance
  • Sky Serve: Model serving with autoscaling
Use alternatives instead:
  • Modal: For simpler serverless GPU with Python-native API
  • RunPod: For single-cloud persistent pods
  • Kubernetes: For existing K8s infrastructure
  • Ray: For pure Ray-based orchestration

Quick start

Installation

bash
pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check

Hello World

Create `hello.yaml`:
yaml
resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
Launch:
bash
sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello

Core concepts

Task YAML structure

yaml
# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws              # Optional: auto-select if omitted
  region: us-west-2       # Optional: auto-select if omitted
  accelerators: A100:4    # GPU type and count
  cpus: 8+                # Minimum CPUs
  memory: 32+             # Minimum memory (GB)
  use_spot: true          # Use spot instances
  disk_size: 256          # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py
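
The `accelerators` field pairs a GPU type with a count, as in `A100:4`. As a rough illustration of how such a string breaks down, the Python sketch below splits it into its two parts. `parse_accelerator` is a hypothetical helper written for this guide, not part of SkyPilot, and treating a bare name as a count of 1 is an assumption.

```python
def parse_accelerator(spec: str) -> tuple[str, int]:
    """Split an accelerator spec such as 'A100:4' into (name, count)."""
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"missing accelerator name in {spec!r}")
    # Assumption: a bare name such as 'T4' means a single device.
    return name, int(count) if sep else 1


print(parse_accelerator("A100:4"))
print(parse_accelerator("A100-80GB:8"))
```

Multiplying the parsed count by `num_nodes` gives the total number of GPUs a task requests.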

Key commands

| Command | Purpose |
|---|---|
| `sky launch` | Launch cluster and run task |
| `sky exec` | Run task on existing cluster |
| `sky status` | Show cluster status |
| `sky stop` | Stop cluster (preserve state) |
| `sky down` | Terminate cluster |
| `sky logs` | View task logs |
| `sky queue` | Show job queue |
| `sky jobs launch` | Launch managed job |
| `sky serve up` | Deploy serving endpoint |

GPU configuration

Available accelerators

yaml
# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4     # AWS/GCP
accelerators: TPU-v4-8   # GCP TPUs

GPU fallbacks

yaml
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure

Spot instances

yaml
resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER  # Auto-recover on preemption (newer SkyPilot releases rename this field to job_recovery)
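
To make the 3-6x savings figure concrete, here is a small Python calculation. The on-demand hourly price is a made-up placeholder, not a quoted cloud price, and `spot_cost_range` is a hypothetical helper; substitute your own numbers.

```python
def spot_cost_range(on_demand_hourly: float, hours: float,
                    min_saving: float = 3.0, max_saving: float = 6.0):
    """Return (lowest, highest) estimated spot cost for a job.

    A saving factor of 3x means the spot price is one third of on-demand.
    """
    on_demand_total = on_demand_hourly * hours
    return on_demand_total / max_saving, on_demand_total / min_saving


# Hypothetical numbers: $30/hour on-demand for a 24-hour job.
low, high = spot_cost_range(on_demand_hourly=30.0, hours=24.0)
print(f"on-demand: ${30.0 * 24.0:.2f}, spot: ${low:.2f} to ${high:.2f}")
```

The estimate ignores time lost to preemptions, which is why checkpointing matters for long spot jobs.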

Cluster management

Launch and execute

bash
# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster

Autostop

yaml
resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop
bash
# Set autostop via CLI
sky autostop mycluster -i 30 --down

Cluster status

bash
# All clusters
sky status

# Detailed view
sky status -a

Distributed training

Multi-node setup

yaml
resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

Environment variables

| Variable | Description |
|---|---|
| `SKYPILOT_NODE_RANK` | Node index (0 to num_nodes-1) |
| `SKYPILOT_NODE_IPS` | Newline-separated IP addresses |
| `SKYPILOT_NUM_NODES` | Total number of nodes |
| `SKYPILOT_NUM_GPUS_PER_NODE` | GPUs per node |
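
Inside `train.py` or a launcher script, the same variables can be read from Python. The sketch below mirrors the `torchrun` arguments in the multi-node example: the first IP is taken as the master address, and the world size is nodes times GPUs per node. `read_cluster_layout` is a hypothetical helper, and the sample values passed to it are invented for illustration.

```python
import os


def read_cluster_layout(env=None):
    """Derive distributed-training settings from SkyPilot's environment variables."""
    env = os.environ if env is None else env
    ips = env["SKYPILOT_NODE_IPS"].split()  # newline-separated IPs
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    return {
        "master_addr": ips[0],  # same result as `head -n1`
        "node_rank": int(env["SKYPILOT_NODE_RANK"]),
        "num_nodes": num_nodes,
        "world_size": num_nodes * gpus_per_node,
    }


# Invented values for a 4-node cluster with 8 GPUs per node.
sample = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
    "SKYPILOT_NUM_NODES": "4",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "2",
}
layout = read_cluster_layout(sample)
print(layout)
```

Passing the environment as a dictionary keeps the helper testable on a laptop; on a cluster, call it with no arguments to read the real variables.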

Head-node-only execution

yaml
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi
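
The same guard can live inside a Python entry point instead of the shell. This is a minimal sketch: `is_head_node` is a hypothetical helper, and treating a missing `SKYPILOT_NODE_RANK` as rank 0 (so the script also works in local runs) is an assumption.

```python
import os


def is_head_node(env=None) -> bool:
    """True when this process runs on the node with rank 0."""
    env = os.environ if env is None else env
    # Assumption: outside SkyPilot the variable is unset, so default to rank 0.
    return int(env.get("SKYPILOT_NODE_RANK", "0")) == 0


if is_head_node({"SKYPILOT_NODE_RANK": "0"}):
    print("head node: run orchestration here")
```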

Managed jobs

Spot recovery

bash
# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml

Checkpointing

yaml
name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
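
The YAML assumes `train.py` can resume from the newest checkpoint in `/checkpoints`. One way that lookup might work is sketched below. The `step_<N>.pt` file naming and the `find_latest_checkpoint` helper are assumptions for illustration; the guide does not show how `--resume-from-latest` is implemented.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CKPT_PATTERN = re.compile(r"^step_(\d+)\.pt$")  # assumed naming scheme


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(checkpoint_dir).glob("step_*.pt"):
        match = CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstrate on a temporary directory instead of the real /checkpoints mount.
with tempfile.TemporaryDirectory() as tmp:
    for step in (100, 900, 1000):
        (Path(tmp) / f"step_{step}.pt").touch()
    latest_name = find_latest_checkpoint(tmp).name
    print(latest_name)
```

Comparing step numbers as integers matters here: a plain string sort would rank `step_900.pt` after `step_1000.pt`.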

Job management

bash
# List jobs
sky jobs queue

# View logs (select the job by name with -n)
sky jobs logs -n my-job

# Cancel job (select the job by name with -n)
sky jobs cancel -n my-job

File mounts and storage

Local file sync

yaml
workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc

Cloud storage

yaml
file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT  # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY  # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED

Storage modes

| Mode | Description | Best For |
|---|---|---|
| `MOUNT` | Stream from cloud | Large datasets, read-heavy |
| `COPY` | Pre-fetch to disk | Small files, random access |
| `MOUNT_CACHED` | Cache with async upload | Checkpoints, outputs |

Sky Serve (Model Serving)

Basic service

yaml
# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000
bash
# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status my-service

Autoscaling policies

yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
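
As a rough mental model of `target_qps_per_replica`, the desired replica count is the observed QPS divided by the per-replica target, rounded up and clamped to the configured bounds. The sketch below is a simplification that ignores `upscale_delay_seconds` and `downscale_delay_seconds`, and `desired_replicas` is a hypothetical function, not SkyServe's actual autoscaler.

```python
import math


def desired_replicas(observed_qps: float, target_qps_per_replica: float = 2.0,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed to keep each replica at or below its QPS target."""
    needed = math.ceil(observed_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


for qps in (0.0, 5.0, 50.0):
    print(qps, "->", desired_replicas(qps))
```

With the settings above, idle traffic stays at the one-replica floor, 5 QPS needs three replicas, and 50 QPS is capped at ten.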

Cost optimization

Automatic cloud selection

yaml
# SkyPilot finds cheapest option
resources:
  accelerators: A100:8

# No cloud specified - auto-select cheapest
bash
# Show optimizer decision
sky launch task.yaml --dryrun
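
Conceptually, the optimizer compares the cloud and region offerings that satisfy the request and picks the cheapest. The toy Python sketch below shows only that selection step. The price table is invented for illustration and `pick_cheapest` is a hypothetical helper; SkyPilot's real optimizer works from its own catalog and may weigh more than price.

```python
from typing import NamedTuple, Optional


class Offering(NamedTuple):
    cloud: str
    region: str
    hourly_price: float  # USD per hour; the values below are invented


def pick_cheapest(offerings, allowed_clouds: Optional[set] = None) -> Offering:
    """Return the lowest-priced offering, optionally limited to some clouds."""
    candidates = [o for o in offerings
                  if allowed_clouds is None or o.cloud in allowed_clouds]
    if not candidates:
        raise ValueError("no offering satisfies the request")
    return min(candidates, key=lambda o: o.hourly_price)


catalog = [
    Offering("aws", "us-east-1", 32.0),
    Offering("gcp", "us-central1", 29.0),
    Offering("azure", "eastus", 30.5),
]
print(pick_cheapest(catalog))
print(pick_cheapest(catalog, allowed_clouds={"aws", "azure"}))
```

Restricting `allowed_clouds` plays the same role as listing clouds under `any_of`: the search narrows to the named candidates.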

Cloud preferences

yaml
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure

Environment variables

yaml
envs:
  HF_TOKEN: $HF_TOKEN  # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
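
Inside the task, these values arrive as ordinary environment variables. A training script might fail fast when one is missing, as in this sketch. `require_env` is a hypothetical helper, and the credential values are placeholders.

```python
import os


def require_env(names, env=None) -> dict:
    """Return the requested variables, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Placeholder values stand in for real credentials.
creds = require_env(
    ["HF_TOKEN", "WANDB_API_KEY"],
    env={"HF_TOKEN": "hf_placeholder", "WANDB_API_KEY": "wandb_placeholder"},
)
print(sorted(creds))
```

Checking at startup surfaces a missing token before any GPU time is spent on setup or data loading.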

Common workflows

Workflow 1: Fine-tuning with checkpoints

yaml
name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume

Workflow 2: Hyperparameter sweep

yaml
name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID
bash
# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
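
The shell loop draws each learning rate log-uniformly between 1e-5 and 1e-3. For a reproducible sweep, the same commands can be generated from Python with a fixed seed. This sketch only builds the command strings and does not run them; `build_sweep_commands` and the seed value are assumptions.

```python
import random
import shlex


def build_sweep_commands(num_runs: int = 10, seed: int = 0,
                         yaml_path: str = "sweep.yaml"):
    """Build one `sky jobs launch` command per run with a log-uniform learning rate."""
    rng = random.Random(seed)  # fixed seed makes the sweep repeatable
    commands = []
    for run_id in range(1, num_runs + 1):
        lr = 10 ** rng.uniform(-5, -3)  # log-uniform in [1e-5, 1e-3]
        args = ["sky", "jobs", "launch", yaml_path,
                "--env", f"RUN_ID={run_id}",
                "--env", f"LEARNING_RATE={lr:.3e}"]
        commands.append(shlex.join(args))
    return commands


commands = build_sweep_commands()
for command in commands:
    print(command)
```

Sampling the exponent rather than the value itself spreads trials evenly across orders of magnitude, which is usually what a learning-rate search wants.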

Debugging

bash
# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs (select the job by name with -n)
sky jobs logs -n my-job

Common issues

| Issue | Solution |
|---|---|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use `sky jobs launch` for auto-recovery |
| Slow file sync | Use `MOUNT_CACHED` mode for outputs |
| GPU not available | Use `any_of` for fallback clouds |

References

  • Advanced Usage - Multi-cloud, optimization, production patterns
  • Troubleshooting - Common issues and solutions

Resources
