together-gpu-clusters
Together GPU Clusters
Overview
Use Together AI GPU clusters when the user needs infrastructure control instead of a managed
inference product.
Typical fits:
- distributed training
- multi-node inference
- HPC or Slurm workloads
- custom Kubernetes jobs
- attached shared storage and cluster lifecycle management
When This Skill Wins
- Provision a cluster and manage it over time
- Choose between on-demand and reserved capacity
- Choose Kubernetes or Slurm as the orchestration layer
- Manage shared volumes and credentials
- Scale up, scale down, or troubleshoot node health
Hand Off To Another Skill
- together-dedicated-endpoints: use for managed single-model hosting
- together-dedicated-containers: use for containerized inference without owning the full cluster
- together-sandboxes: use for short-lived remote Python execution
- together-fine-tuning: use for managed training jobs instead of raw cluster operations
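The hand-off rules above amount to a small routing table with a fall-through to this skill. The sketch below is purely illustrative; the need descriptions are paraphrases, not API strings.

```python
# Illustrative routing map for the hand-offs above; the keys are
# paraphrased needs, not identifiers from any Together API.
HANDOFFS = {
    "managed single-model hosting": "together-dedicated-endpoints",
    "containerized inference without the full cluster": "together-dedicated-containers",
    "short-lived remote Python execution": "together-sandboxes",
    "managed training jobs": "together-fine-tuning",
}

def route(need: str) -> str:
    """Return the skill to hand off to; raw cluster work stays here."""
    return HANDOFFS.get(need, "together-gpu-clusters")
```

Anything that is not one of the four hand-off cases (distributed training, multi-node inference, HPC/Slurm, custom Kubernetes jobs) falls through to together-gpu-clusters.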
Quick Routing
- Cluster creation, scaling, credentials, deletion
  - Start with scripts/manage_cluster.py or scripts/manage_cluster.ts
  - Read references/api-reference.md
- Shared storage lifecycle
  - Use scripts/manage_storage.py
  - Read references/api-reference.md
- Kubernetes vs Slurm operations
  - Read references/cluster-management.md
- Troubleshooting node health, PVCs, or scheduling
  - Read references/cluster-management.md
- tcloud CLI workflows
  - Read references/tcloud-cli.md
Workflow
- Decide whether the workload really needs cluster-level control.
- Choose on-demand vs reserved billing based on run duration and baseline utilization.
- Choose Kubernetes vs Slurm based on orchestration requirements and team tooling.
- Select region, GPU type, driver version, and shared storage plan.
- Provision first, then layer in access credentials, workload deployment, scaling, and health checks.
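The on-demand vs reserved choice in the workflow above comes down to baseline utilization. The sketch below frames that breakeven; the hourly prices are made-up placeholders, not Together AI rates.

```python
def cheaper_plan(on_demand_hourly: float, reserved_hourly: float,
                 utilization: float) -> str:
    """Pick the cheaper billing mode for one node over a billing term.

    on_demand_hourly: price per hour, billed only while the node runs.
    reserved_hourly:  price per hour, billed for every hour of the term.
    utilization:      fraction of the term the node is actually busy (0..1).
    """
    on_demand_cost = on_demand_hourly * utilization
    reserved_cost = reserved_hourly  # paid regardless of usage
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"

# With these placeholder prices, reserved wins once baseline utilization
# exceeds 3.0 / 5.0 = 60% of the term.
print(cheaper_plan(5.0, 3.0, 0.8))  # steady training run -> "reserved"
print(cheaper_plan(5.0, 3.0, 0.3))  # bursty workload -> "on-demand"
```

Long steady training runs usually clear the breakeven; bursty or exploratory work usually does not.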
High-Signal Rules
- Python scripts require the Together v2 SDK (`together>=2.0.0`). If the user is on an older version, they must upgrade first: `uv pip install --upgrade "together>=2.0.0"`.
- Prefer managed products unless the user explicitly needs raw infrastructure control.
- Treat storage lifecycle separately from cluster lifecycle; volumes can outlive clusters.
- When creating a cluster with new shared storage, prefer an inline `shared_volume` over creating a volume separately and attaching it via `volume_id`. Separately created volumes may land in a different datacenter partition than the cluster, causing a "does not exist in the datacenter" error even when the volume shows as available.
- GPU stock-outs (409 "Out of stock") are common. Always call `list_regions()` first and be prepared to try multiple regions.
- The API requires `cuda_version` and `nvidia_driver_version` as separate fields in addition to the combined `driver_version` string. Pass them via `extra_body` in the Python SDK.
- Credentials retrieval is part of provisioning. Do not stop at cluster creation if the user needs to run workloads immediately.
- Slurm and Kubernetes operational patterns differ materially; read the cluster-management reference before improvising.
- For repeated cluster operations, start from the scripts instead of rebuilding request shapes.
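Several of the rules above combine into one creation flow. The sketch below is a hedged illustration, not a verified SDK surface: the request field names `driver_version`, `cuda_version`, `nvidia_driver_version`, `extra_body`, and `shared_volume` come from the rules in this section, while the GPU type, node count, and error type are placeholder assumptions; scripts/manage_cluster.py holds the authoritative request shapes.

```python
# Sketch only: field names follow the rules above; gpu_type, num_nodes,
# and the RuntimeError stand-in for a 409 response are assumptions.

def build_cluster_request(region: str) -> dict:
    return {
        "region": region,
        "gpu_type": "h100-80gb",  # placeholder GPU type
        "num_nodes": 2,           # placeholder node count
        # The combined string plus the two separate fields the API also
        # requires; in the Python SDK the separate fields go via extra_body.
        "driver_version": "535.129.03-cuda12.2",  # placeholder combined string
        "extra_body": {
            "nvidia_driver_version": "535.129.03",
            "cuda_version": "12.2",
        },
        # Inline shared storage: preferred over creating a volume separately
        # and attaching it via volume_id, which can land in a different
        # datacenter partition than the cluster.
        "shared_volume": {"name": "training-data", "size_tib": 2},
    }

def create_with_fallback(create, regions):
    """Try regions in order; 409 'Out of stock' is common, so keep going."""
    for region in regions:
        try:
            return create(build_cluster_request(region))
        except RuntimeError as err:  # stand-in for the SDK's 409 error
            if "Out of stock" not in str(err):
                raise
    raise RuntimeError("all candidate regions are out of stock")
```

In a real run, the candidate regions would come from `list_regions()` before the first attempt, and `create` would be the SDK's cluster-creation call, followed immediately by credentials retrieval.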
Resource Map
- Cluster API reference: references/api-reference.md
- Operational guide: references/cluster-management.md
- Operational troubleshooting: references/cluster-management.md
- CLI guide: references/tcloud-cli.md
- Python cluster management: scripts/manage_cluster.py
- TypeScript cluster management: scripts/manage_cluster.ts
- Python storage management: scripts/manage_storage.py