together-gpu-clusters

Together GPU Clusters

Overview

Use Together AI GPU clusters when the user needs infrastructure control instead of a managed inference product.
Typical fits:
  • distributed training
  • multi-node inference
  • HPC or Slurm workloads
  • custom Kubernetes jobs
  • attached shared storage and cluster lifecycle management

When This Skill Wins

  • Provision a cluster and manage it over time
  • Choose between on-demand and reserved capacity
  • Choose Kubernetes or Slurm as the orchestration layer
  • Manage shared volumes and credentials
  • Scale up, scale down, or troubleshoot node health

Hand Off To Another Skill

  • Use together-dedicated-endpoints for managed single-model hosting
  • Use together-dedicated-containers for containerized inference without owning the full cluster
  • Use together-sandboxes for short-lived remote Python execution
  • Use together-fine-tuning for managed training jobs instead of raw cluster operations

Quick Routing

  • Cluster creation, scaling, credentials, deletion
    • Start with scripts/manage_cluster.py or scripts/manage_cluster.ts
    • Read references/api-reference.md
  • Shared storage lifecycle
    • Use scripts/manage_storage.py
    • Read references/api-reference.md
  • Kubernetes vs Slurm operations
    • Read references/cluster-management.md
  • Troubleshooting node health, PVCs, or scheduling
    • Read references/cluster-management.md
  • tcloud CLI workflows
    • Read references/tcloud-cli.md

Workflow

  1. Decide whether the workload really needs cluster-level control.
  2. Choose on-demand vs reserved billing based on run duration and baseline utilization.
  3. Choose Kubernetes vs Slurm based on orchestration requirements and team tooling.
  4. Select region, GPU type, driver version, and shared storage plan.
  5. Provision first, then layer in access credentials, workload deployment, scaling, and health checks.
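
The steps above can be sketched as a single provisioning helper. This is a minimal sketch, not the real SDK surface: the method names (list_regions, create_cluster, get_credentials) and response fields are assumptions for illustration; the actual request shapes live in scripts/manage_cluster.py and references/api-reference.md.

```python
def provision_cluster(client, gpu_type, preferred_regions):
    """Walk the workflow: pick a region with stock, provision there,
    then immediately fetch credentials so workloads can run right away.

    `client` is any object exposing the (assumed) methods below; the
    real call shapes are in scripts/manage_cluster.py.
    """
    # Check stock up front rather than discovering 409s mid-provision.
    available = set(client.list_regions())  # assumed helper method
    for region in preferred_regions:
        if region not in available:
            continue
        cluster = client.create_cluster(gpu_type=gpu_type, region=region)
        # Credentials retrieval is part of provisioning, not an afterthought.
        credentials = client.get_credentials(cluster["id"])
        return cluster, credentials
    raise RuntimeError(f"No stock for {gpu_type} in any of {preferred_regions}")
```

The region loop matters more than it looks: stock-outs are common enough that a fixed single-region request is the most frequent provisioning failure mode.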

High-Signal Rules

  • Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
  • Prefer managed products unless the user explicitly needs raw infrastructure control.
  • Treat the storage lifecycle separately from the cluster lifecycle; volumes can outlive clusters.
  • When creating a cluster with new shared storage, prefer an inline shared_volume over creating a volume separately and attaching it via volume_id. Separately created volumes may land in a different datacenter partition than the cluster, causing a "does not exist in the datacenter" error even when the volume shows as available.
  • GPU stock-outs (409 "Out of stock") are common. Always call list_regions() first and be prepared to try multiple regions.
  • The API requires cuda_version and nvidia_driver_version as separate fields in addition to the combined driver_version string. Pass them via extra_body in the Python SDK.
  • Credentials retrieval is part of provisioning. Do not stop at cluster creation if the user needs to run workloads immediately.
  • Slurm and Kubernetes operational patterns differ materially; read the cluster-management reference before improvising.
  • For repeated cluster operations, start from the scripts instead of rebuilding request shapes.
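
The driver-version and shared-volume rules above can be captured in one request builder. The field names shared_volume, volume_id, driver_version, cuda_version, nvidia_driver_version, and extra_body come from the rules themselves; cluster_name, size_gib, and the overall kwarg layout are assumptions — verify against references/api-reference.md before use.

```python
def build_cluster_request(name, gpu_type, region,
                          driver_version, cuda_version, nvidia_driver_version,
                          shared_volume=None):
    """Assemble create-cluster kwargs that follow the rules above:
    the combined driver_version string plus the separate cuda_version /
    nvidia_driver_version fields passed through extra_body, and an
    inline shared_volume rather than a separately created volume_id.
    """
    request = {
        "cluster_name": name,  # field name is an assumption
        "gpu_type": gpu_type,
        "region": region,
        "driver_version": driver_version,
        # The API also wants the components as separate fields; the
        # Python SDK forwards unmodeled fields via extra_body.
        "extra_body": {
            "cuda_version": cuda_version,
            "nvidia_driver_version": nvidia_driver_version,
        },
    }
    if shared_volume is not None:
        # Inline shared_volume keeps the volume in the same datacenter
        # partition as the cluster, avoiding "does not exist in the
        # datacenter" errors at attach time.
        request["shared_volume"] = shared_volume
    return request
```

Note what the builder deliberately never emits: a volume_id. That is the one request shape the rules above warn against for new shared storage.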

Resource Map

  • Cluster API reference: references/api-reference.md
  • Operational guide: references/cluster-management.md
  • Operational troubleshooting: references/cluster-management.md
  • CLI guide: references/tcloud-cli.md
  • Python cluster management: scripts/manage_cluster.py
  • TypeScript cluster management: scripts/manage_cluster.ts
  • Python storage management: scripts/manage_storage.py

Official Docs
