SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.

When to use SkyPilot

Use SkyPilot when:

Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
Optimizing cost through automatic cloud/region selection
Running long jobs on spot instances with auto-recovery
Managing distributed multi-node training
Working through one unified interface for 20+ cloud providers
Avoiding vendor lock-in

Key features:

Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
Cost optimization: Automatic cheapest cloud/region selection
Spot instances: 3-6x cost savings with automatic recovery
Distributed training: Multi-node jobs with gang scheduling
Managed jobs: Auto-recovery, checkpointing, fault tolerance
Sky Serve: Model serving with autoscaling

Use alternatives instead:

Modal: For simpler serverless GPU with Python-native API
RunPod: For single-cloud persistent pods
Kubernetes: For existing K8s infrastructure
Ray: For pure Ray-based orchestration

Quick start

Installation

bash

pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check

Hello World

Create hello.yaml:

yaml

resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"

Launch:

bash

sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Terminate
sky down hello

Core concepts

Task YAML structure

yaml

# Task name (optional)
name: my-task

# Resource requirements
resources:
  cloud: aws              # Optional: auto-select if omitted
  region: us-west-2       # Optional: auto-select if omitted
  accelerators: A100:4    # GPU type and count
  cpus: 8+                # Minimum CPUs
  memory: 32+             # Minimum memory (GB)
  use_spot: true          # Use spot instances
  disk_size: 256          # Disk size (GB)

# Number of nodes for distributed training
num_nodes: 2

# Working directory (synced to ~/sky_workdir)
workdir: .

# Setup commands (run once)
setup: |
  pip install -r requirements.txt

# Run commands
run: |
  python train.py

Key commands

Command	Purpose
`sky launch`	Launch cluster and run task
`sky exec`	Run task on existing cluster
`sky status`	Show cluster status
`sky stop`	Stop cluster (preserve state)
`sky down`	Terminate cluster
`sky logs`	View task logs
`sky queue`	Show job queue
`sky jobs launch`	Launch managed job
`sky serve up`	Deploy serving endpoint

GPU configuration

Available accelerators

yaml

# NVIDIA GPUs
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8

# Cloud-specific
accelerators: V100:4      # AWS/GCP
accelerators: TPU-v4-8    # GCP TPUs

GPU fallbacks

yaml

resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure

Spot instances

yaml

resources:
  accelerators: A100:8
  use_spot: true
  spot_recovery: FAILOVER  # Auto-recover on preemption
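
As a rough illustration of what spot pricing can mean for a long job, the sketch below compares on-demand and spot cost for the same run. The hourly prices and the overhead fraction (extra runtime spent redoing work lost to preemptions) are made-up placeholders, not quotes from any cloud, and `job_cost` is a hypothetical helper rather than anything SkyPilot provides.

```python
def job_cost(hourly_price: float, hours: float, overhead_fraction: float = 0.0) -> float:
    """Total cost of a job, inflating runtime by the fraction of work redone after preemptions."""
    return hourly_price * hours * (1.0 + overhead_fraction)


# Hypothetical prices for one 8-GPU node (placeholders, not real quotes).
ON_DEMAND_PER_HOUR = 32.0
SPOT_PER_HOUR = 8.0

on_demand = job_cost(ON_DEMAND_PER_HOUR, hours=100)
# Assume 10% of the work is repeated because of preemptions.
spot = job_cost(SPOT_PER_HOUR, hours=100, overhead_fraction=0.10)

print(f"on-demand: ${on_demand:,.0f}")
print(f"spot:      ${spot:,.0f}")
print(f"ratio:     {on_demand / spot:.1f}x")
```

The point of the overhead term is that the effective saving is always somewhat lower than the raw price ratio; frequent checkpointing keeps that gap small.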

Cluster management

Launch and execute

bash

# Launch new cluster
sky launch -c mycluster task.yaml

# Run on existing cluster (skip setup)
sky exec mycluster another_task.yaml

# Interactive SSH
ssh mycluster

# Stream logs
sky logs mycluster

Autostop

yaml

resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true  # Terminate instead of stop

bash

# Set autostop via CLI
sky autostop mycluster -i 30 --down

Cluster status

bash

# All clusters
sky status

# Detailed view
sky status -a

Distributed training

Multi-node setup

yaml

resources:
  accelerators: A100:8

num_nodes: 4  # 4 nodes × 8 GPUs = 32 GPUs total

setup: |
  pip install torch torchvision

run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py

Environment variables

Variable	Description
`SKYPILOT_NODE_RANK`	Node index (0 to num_nodes-1)
`SKYPILOT_NODE_IPS`	Newline-separated IP addresses
`SKYPILOT_NUM_NODES`	Total number of nodes
`SKYPILOT_NUM_GPUS_PER_NODE`	GPUs per node
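
When the training script needs these values itself, rather than receiving them through torchrun flags, they can be read from the environment. The following is a minimal sketch: `ClusterInfo` and `read_cluster_info` are hypothetical names introduced for illustration, and treating the first IP as the head node mirrors the `head -n1` convention in the torchrun example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ClusterInfo:
    node_rank: int
    num_nodes: int
    gpus_per_node: int
    head_ip: str

    @property
    def world_size(self) -> int:
        """Total number of GPU processes across all nodes."""
        return self.num_nodes * self.gpus_per_node

    @property
    def is_head(self) -> bool:
        """True on the node with rank 0."""
        return self.node_rank == 0


def read_cluster_info(env: Mapping[str, str] = os.environ) -> ClusterInfo:
    """Build a ClusterInfo from the SKYPILOT_* variables listed in the table."""
    node_ips = env["SKYPILOT_NODE_IPS"].split()  # newline-separated list
    return ClusterInfo(
        node_rank=int(env["SKYPILOT_NODE_RANK"]),
        num_nodes=int(env["SKYPILOT_NUM_NODES"]),
        gpus_per_node=int(env["SKYPILOT_NUM_GPUS_PER_NODE"]),
        head_ip=node_ips[0],  # first IP, as in the torchrun --master_addr example
    )


if __name__ == "__main__":
    # Fake values for a 4-node x 8-GPU cluster; on a real cluster SkyPilot sets these.
    demo_env = {
        "SKYPILOT_NODE_RANK": "0",
        "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2\n10.0.0.3\n10.0.0.4",
        "SKYPILOT_NUM_NODES": "4",
        "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    }
    info = read_cluster_info(demo_env)
    print(info.world_size, info.is_head, info.head_ip)
```

Passing the environment in as a mapping keeps the function testable off-cluster, where the `SKYPILOT_*` variables are not set.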

Head-node-only execution

bash

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi

Managed jobs

Spot recovery

bash

# Launch managed job with spot recovery
sky jobs launch -n my-job train.yaml

Checkpointing

yaml

name: training-job

file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT

resources:
  accelerators: A100:8
  use_spot: true

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
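
The `--resume-from-latest` flag belongs to the user's own `train.py`, so the resume logic has to be written by hand. One possible shape for it is sketched below, assuming checkpoints are saved as `step_<number>.pt` under the mounted directory; that naming scheme and the `latest_checkpoint` helper are hypothetical and should be adapted to whatever the trainer actually writes.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: step_<number>.pt (hypothetical; adapt to your trainer).
_CKPT_PATTERN = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        # Compare numerically so that step_200 beats step_9.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

On a fresh start the function returns `None` and training begins from scratch; after a preemption, the relaunched job finds the newest file in `/checkpoints` and continues from there.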

Job management

bash

# List jobs
sky jobs queue

# View logs (select the job by name with -n, or pass its numeric job ID)
sky jobs logs -n my-job

# Cancel job (by name with -n, or by numeric job ID)
sky jobs cancel -n my-job

File mounts and storage

Local file sync

yaml

workdir: ./my-project  # Synced to ~/sky_workdir

file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc

Cloud storage

yaml

file_mounts:
  # Mount S3 bucket
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT  # Stream from S3

  # Copy GCS bucket
  /models:
    source: gs://my-bucket/models
    mode: COPY  # Pre-fetch to disk

  # Cached mount (fast writes)
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED

Storage modes

Mode	Description	Best For
`MOUNT`	Stream from cloud	Large datasets, read-heavy
`COPY`	Pre-fetch to disk	Small files, random access
`MOUNT_CACHED`	Cache with async upload	Checkpoints, outputs

Sky Serve (Model Serving)

Basic service

yaml

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0

resources:
  accelerators: A100:1

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

bash

# Deploy
sky serve up -n my-service service.yaml

# Check status
sky serve status

# Get endpoint
sky serve status my-service

Autoscaling policies

yaml

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
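
To build intuition for how `target_qps_per_replica` interacts with the replica bounds, the sketch below computes a replica count from an observed request rate. It is a simplified model written for this guide, not Sky Serve's actual autoscaler: the `target_replicas` function is hypothetical, and it ignores the upscale and downscale delays, which presumably require the load change to persist before the replica count moves.

```python
import math


def target_replicas(current_qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep each one at or below the target QPS, clamped to the policy bounds."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(current_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Policy values taken from the YAML above; the QPS figures are made up.
    for qps in (0.0, 7.0, 100.0):
        print(qps, target_replicas(qps, 2.0, min_replicas=1, max_replicas=10))
```

With the policy above, an idle service stays at the minimum of one replica, and sustained load beyond 20 QPS saturates at the maximum of ten.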

Cost optimization

Automatic cloud selection

yaml

# SkyPilot finds cheapest option
resources:
  accelerators: A100:8
  # No cloud specified - auto-select cheapest

bash

# Show optimizer decision
sky launch task.yaml --dryrun

Cloud preferences

yaml

resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
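
Conceptually, an `any_of` list narrows the set of cloud/region pairs that are considered, and the cheapest remaining option is chosen. The sketch below illustrates that idea only. It is not SkyPilot's optimizer, the price table is invented, and real selection also depends on factors such as availability and quota that are not modeled here.

```python
from typing import Optional

# Hypothetical hourly prices for an 8x A100 node; placeholders, not real quotes.
PRICES = {
    ("gcp", "us-central1"): 29.4,
    ("gcp", "europe-west4"): 32.1,
    ("aws", "us-east-1"): 32.8,
    ("aws", "us-west-2"): 32.8,
    ("azure", "eastus"): 27.2,
}


def matches(candidate: dict, cloud: str, region: str) -> bool:
    """A candidate matches if its cloud agrees and its region, when given, agrees too."""
    return candidate["cloud"] == cloud and candidate.get("region", region) == region


def cheapest(any_of: list[dict]) -> Optional[tuple[str, str, float]]:
    """Return the cheapest (cloud, region, price) allowed by any candidate, or None."""
    allowed = [
        (cloud, region, price)
        for (cloud, region), price in PRICES.items()
        if any(matches(c, cloud, region) for c in any_of)
    ]
    return min(allowed, key=lambda entry: entry[2], default=None)


if __name__ == "__main__":
    # Same candidates as the YAML above.
    preferences = [
        {"cloud": "gcp", "region": "us-central1"},
        {"cloud": "aws", "region": "us-east-1"},
        {"cloud": "azure"},
    ]
    print(cheapest(preferences))
```

A candidate without a `region` key, like the Azure entry, admits every region of that cloud, which is why it behaves as the least restrictive option in the list.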

Environment variables

yaml

envs:
  HF_TOKEN: $HF_TOKEN  # Inherited from local env
  WANDB_API_KEY: $WANDB_API_KEY

# Or use secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY

Common workflows

Workflow 1: Fine-tuning with checkpoints

yaml

name: llm-finetune

file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED

resources:
  accelerators: A100:8
  use_spot: true

setup: |
  pip install transformers accelerate

run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume

Workflow 2: Hyperparameter sweep

yaml

name: hp-sweep-${RUN_ID}

envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32

resources:
  accelerators: A100:1
  use_spot: true

run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID

bash

# Launch multiple jobs
for i in {1..10}; do
  sky jobs launch sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
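
The shell loop draws each learning rate from an unseeded random generator, so a sweep cannot be reproduced exactly. A seeded variant is sketched below under a few assumptions: `sample_learning_rate` and `sweep_commands` are hypothetical helpers, the commands are only built and printed (nothing is launched), and the `sky jobs launch ... --env` form is the one used in the loop above.

```python
import random
import shlex


def sample_learning_rate(rng: random.Random, low_exp: float = -5.0,
                         high_exp: float = -3.0) -> float:
    """Log-uniform sample between 10**low_exp and 10**high_exp, as in the shell one-liner."""
    return 10 ** rng.uniform(low_exp, high_exp)


def sweep_commands(num_runs: int, seed: int = 0) -> list[list[str]]:
    """Build one `sky jobs launch` argv per run; nothing is executed here."""
    rng = random.Random(seed)  # fixed seed so the same sweep can be regenerated
    commands = []
    for run_id in range(1, num_runs + 1):
        lr = sample_learning_rate(rng)
        commands.append([
            "sky", "jobs", "launch", "sweep.yaml",
            "--env", f"RUN_ID={run_id}",
            "--env", f"LEARNING_RATE={lr:.6g}",
        ])
    return commands


if __name__ == "__main__":
    for argv in sweep_commands(10):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review the sampled values before spending money; each argv list could then be handed to `subprocess.run` to launch for real.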

Debugging

bash

# SSH to cluster
ssh mycluster

# View logs
sky logs mycluster

# Check job queue
sky queue mycluster

# View managed job logs (select the job by name with -n, or pass its numeric job ID)
sky jobs logs -n my-job

Common issues

Issue	Solution
Quota exceeded	Request quota increase, try different region
Spot preemption	Use `sky jobs launch` for auto-recovery
Slow file sync	Use `MOUNT_CACHED` mode for outputs
GPU not available	Use `any_of` for fallback clouds

References

Advanced Usage - Multi-cloud, optimization, production patterns
Troubleshooting - Common issues and solutions

Resources

Documentation: https://docs.skypilot.co
GitHub: https://github.com/skypilot-org/skypilot
Slack: https://slack.skypilot.co
Examples: https://github.com/skypilot-org/skypilot/tree/master/examples
