together-gpu-clusters

Together GPU Clusters

Overview

Use Together AI GPU clusters when the user needs infrastructure control instead of a managed inference product.
Typical fits:
  • distributed training
  • multi-node inference
  • HPC or Slurm workloads
  • custom Kubernetes jobs
  • attached shared storage and cluster lifecycle management

When This Skill Wins

  • Provision a cluster and manage it over time
  • Choose between on-demand and reserved capacity
  • Choose Kubernetes or Slurm as the orchestration layer
  • Manage shared volumes and credentials
  • Scale up, scale down, or troubleshoot node health

Hand Off To Another Skill

  • Use together-dedicated-endpoints for managed single-model hosting
  • Use together-dedicated-containers for containerized inference without owning the full cluster
  • Use together-sandboxes for short-lived remote Python execution
  • Use together-fine-tuning for managed training jobs instead of raw cluster operations

Quick Routing

  • Cluster creation, scaling, credentials, deletion
    • Start with scripts/manage_cluster.py or scripts/manage_cluster.ts
    • Read references/api-reference.md
  • Shared storage lifecycle
    • Use scripts/manage_storage.py
    • Read references/api-reference.md
  • Kubernetes vs Slurm operations
    • Read references/cluster-management.md
  • Troubleshooting node health, PVCs, or scheduling
    • Read references/cluster-management.md
  • tcloud CLI workflows
    • Read references/tcloud-cli.md

Workflow

  1. Decide whether the workload really needs cluster-level control.
  2. Choose on-demand vs reserved billing based on run duration and baseline utilization.
  3. Choose Kubernetes vs Slurm based on orchestration requirements and team tooling.
  4. Select region, GPU type, driver version, and shared storage plan.
  5. Provision first, then layer in access credentials, workload deployment, scaling, and health checks.
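
The steps above can be sketched as a single provisioning helper. This is a minimal sketch, not the real SDK surface: the method names (list_regions, create_cluster, get_credentials) and response fields are assumptions for illustration; the actual request shapes live in scripts/manage_cluster.py and references/api-reference.md.

```python
def provision_cluster(client, gpu_type, preferred_regions):
    """Walk the workflow: pick a region with stock, provision there,
    then immediately fetch credentials so workloads can run right away.

    `client` is any object exposing the (assumed) methods below; the
    real call shapes are in scripts/manage_cluster.py.
    """
    # Check stock up front rather than discovering 409s mid-provision.
    available = set(client.list_regions())  # assumed helper method
    for region in preferred_regions:
        if region not in available:
            continue
        cluster = client.create_cluster(gpu_type=gpu_type, region=region)
        # Credentials retrieval is part of provisioning, not an afterthought.
        credentials = client.get_credentials(cluster["id"])
        return cluster, credentials
    raise RuntimeError(f"No stock for {gpu_type} in any of {preferred_regions}")
```

The region loop matters more than it looks: stock-outs are common enough that a fixed single-region request is the most frequent provisioning failure mode.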

High-Signal Rules

  • Python scripts require the Together v2 SDK (together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
  • Prefer managed products unless the user explicitly needs raw infrastructure control.
  • Treat the storage lifecycle separately from the cluster lifecycle; volumes can outlive clusters.
  • When creating a cluster with new shared storage, prefer an inline shared_volume over creating a volume separately and attaching it via volume_id. Separately created volumes may land in a different datacenter partition than the cluster, causing a "does not exist in the datacenter" error even when the volume shows as available.
  • GPU stock-outs (409 "Out of stock") are common. Always call list_regions() first and be prepared to try multiple regions.
  • The API requires cuda_version and nvidia_driver_version as separate fields in addition to the combined driver_version string. Pass them via extra_body in the Python SDK.
  • Credentials retrieval is part of provisioning. Do not stop at cluster creation if the user needs to run workloads immediately.
  • Slurm and Kubernetes operational patterns differ materially; read the cluster-management reference before improvising.
  • For repeated cluster operations, start from the scripts instead of rebuilding request shapes.
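
The driver-version and shared-volume rules above can be captured in one request builder. The field names shared_volume, volume_id, driver_version, cuda_version, nvidia_driver_version, and extra_body come from the rules themselves; cluster_name, size_gib, and the overall kwarg layout are assumptions — verify against references/api-reference.md before use.

```python
def build_cluster_request(name, gpu_type, region,
                          driver_version, cuda_version, nvidia_driver_version,
                          shared_volume=None):
    """Assemble create-cluster kwargs that follow the rules above:
    the combined driver_version string plus the separate cuda_version /
    nvidia_driver_version fields passed through extra_body, and an
    inline shared_volume rather than a separately created volume_id.
    """
    request = {
        "cluster_name": name,  # field name is an assumption
        "gpu_type": gpu_type,
        "region": region,
        "driver_version": driver_version,
        # The API also wants the components as separate fields; the
        # Python SDK forwards unmodeled fields via extra_body.
        "extra_body": {
            "cuda_version": cuda_version,
            "nvidia_driver_version": nvidia_driver_version,
        },
    }
    if shared_volume is not None:
        # Inline shared_volume keeps the volume in the same datacenter
        # partition as the cluster, avoiding "does not exist in the
        # datacenter" errors at attach time.
        request["shared_volume"] = shared_volume
    return request
```

Note what the builder deliberately never emits: a volume_id. That is the one request shape the rules above warn against for new shared storage.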

Resource Map

  • Cluster API reference: references/api-reference.md
  • Operational guide: references/cluster-management.md
  • Operational troubleshooting: references/cluster-management.md
  • CLI guide: references/tcloud-cli.md
  • Python cluster management: scripts/manage_cluster.py
  • TypeScript cluster management: scripts/manage_cluster.ts
  • Python storage management: scripts/manage_storage.py

Official Docs
