Together Audio (TTS & STT)

Overview

Together AI provides text-to-speech and speech-to-text capabilities.
TTS — Generate speech from text via REST, streaming, or WebSocket:
  • Endpoint:
    /v1/audio/speech
  • WebSocket:
    wss://api.together.xyz/v1/audio/speech/websocket
STT — Transcribe audio to text:
  • Endpoint:
    /v1/audio/transcriptions

TTS Quick Start

Basic Speech Generation

python
from together import Together
client = Together()

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",
)
response.stream_to_file("speech.mp3")
typescript
import Together from "together-ai";
import { Readable } from "stream";
import { createWriteStream } from "fs";

const together = new Together();

async function generateAudio() {
  const res = await together.audio.create({
    input: "Today is a wonderful day to build something people love!",
    voice: "tara",
    response_format: "mp3",
    sample_rate: 44100,
    stream: false,
    model: "canopylabs/orpheus-3b-0.1-ft",
  });

  if (res.body) {
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream("./speech.mp3");
    nodeStream.pipe(fileStream);
  }
}

generateAudio();
shell
curl -X POST "https://api.together.xyz/v1/audio/speech" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"canopylabs/orpheus-3b-0.1-ft","input":"Hello world","voice":"tara","response_format":"mp3"}' \
  --output speech.mp3

Streaming Audio (Low Latency)

python
response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="The quick brown fox jumps over the lazy dog",
    voice="tara",
    stream=True,
    response_format="raw",
    response_encoding="pcm_s16le",
)
response.stream_to_file("speech.wav", response_format="wav")
typescript
import Together from "together-ai";

const together = new Together();

async function streamAudio() {
  const response = await together.audio.speech.create({
    model: "canopylabs/orpheus-3b-0.1-ft",
    input: "The quick brown fox jumps over the lazy dog",
    voice: "tara",
    stream: true,
    response_format: "raw",
    response_encoding: "pcm_s16le",
  });

  const chunks = [];
  for await (const chunk of response) {
    chunks.push(chunk);
  }

  console.log("Streaming complete!");
}

streamAudio();
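If you need a playable file from the raw `pcm_s16le` stream, the chunks can be wrapped in a WAV container with Python's standard `wave` module — a minimal sketch (the default sample rate here is an assumption; use whatever rate you requested):

```python
import wave

def pcm_chunks_to_wav(chunks, path, sample_rate=24000, channels=1):
    """Wrap raw pcm_s16le audio chunks in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # pcm_s16le: 2 bytes per sample
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
```

Pass it the iterable of byte chunks from the streaming response to get a file any player can open.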

WebSocket (Lowest Latency)

python
import asyncio, base64, json, os
import websockets

api_key = os.environ["TOGETHER_API_KEY"]

async def generate_speech():
    url = "wss://api.together.xyz/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy"
    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        session = json.loads(await ws.recv())  # first message describes the session
        await ws.send(json.dumps({"type": "input_text_buffer.append", "text": "Hello!"}))
        await ws.send(json.dumps({"type": "input_text_buffer.commit"}))

        audio_data = bytearray()
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "conversation.item.audio_output.delta":
                audio_data.extend(base64.b64decode(data["delta"]))
            elif data["type"] == "conversation.item.audio_output.done":
                break
        return bytes(audio_data)

asyncio.run(generate_speech())

TTS Models

| Model | API String | Endpoints | Price |
|---|---|---|---|
| Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket | $15/1M chars |
| Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket | $4/1M chars |
| Cartesia Sonic 2 | cartesia/sonic-2 | REST | $65/1M chars |
| Cartesia Sonic | cartesia/sonic | REST | - |
| Rime Arcana v3 Turbo | rime-labs/rime-arcana-v3-turbo | REST, Streaming, WebSocket | DE only |
| MiniMax Speech 2.6 | minimax/speech-2.6-turbo | REST, Streaming, WebSocket | DE only |
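Since TTS is billed per character, cost is easy to estimate up front — a quick sketch using the per-million-character rates listed above (treat them as indicative and check current pricing before relying on it):

```python
# $ per 1M input characters, copied from the table above
PRICE_PER_MILLION_CHARS = {
    "canopylabs/orpheus-3b-0.1-ft": 15.00,
    "hexgrad/Kokoro-82M": 4.00,
    "cartesia/sonic-2": 65.00,
}

def estimate_tts_cost(text: str, model: str) -> float:
    """Estimated dollar cost of synthesizing `text` with `model`."""
    return len(text) * PRICE_PER_MILLION_CHARS[model] / 1_000_000
```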

TTS Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| model | string | TTS model (required) | - |
| input | string | Text to synthesize (required) | - |
| voice | string | Voice ID (required) | - |
| response_format | string | mp3, wav, raw, mulaw | wav |
| stream | bool | Enable streaming (raw format only) | false |
| response_encoding | string | Encoding for raw format: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw | - |
| language | string | Language of input text: en, de, fr, es, hi, it, ja, ko, nl, pl, pt, ru, sv, tr, zh | "en" |
| sample_rate | int | Audio sample rate (e.g., 44100) | - |
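When consuming raw output, the response_encoding determines the byte width per sample, which together with sample_rate gives playback duration — a small helper (the byte widths are inferred from the encoding names, not taken from an official mapping):

```python
# bytes per sample for each raw encoding (inferred from the names)
BYTES_PER_SAMPLE = {
    "pcm_f32le": 4,  # 32-bit float, little-endian
    "pcm_s16le": 2,  # 16-bit signed int, little-endian
    "pcm_mulaw": 1,  # 8-bit mu-law
    "pcm_alaw": 1,   # 8-bit A-law
}

def raw_duration_seconds(num_bytes: int, encoding: str,
                         sample_rate: int, channels: int = 1) -> float:
    """Playback duration of a raw audio buffer, in seconds."""
    return num_bytes / (BYTES_PER_SAMPLE[encoding] * sample_rate * channels)
```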

List Available Voices

python
response = client.audio.voices.list()
for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - {voice.name}")
Key voices — Orpheus: `tara`, `leah`, `leo`, `dan`, `mia`, `zac`. Kokoro: `af_alloy`, `af_bella`, `am_adam`, `am_echo`. See references/tts-models.md for complete voice lists.
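To fail fast on typos before making a request, the known voices can be kept in a local lookup — a sketch seeded with only the key voices listed above (the full sets are longer; see references/tts-models.md):

```python
# partial voice lists, copied from the "Key voices" note above
KNOWN_VOICES = {
    "canopylabs/orpheus-3b-0.1-ft": {"tara", "leah", "leo", "dan", "mia", "zac"},
    "hexgrad/Kokoro-82M": {"af_alloy", "af_bella", "am_adam", "am_echo"},
}

def check_voice(model: str, voice: str) -> None:
    """Raise if `voice` is not in the locally known set for `model`."""
    known = KNOWN_VOICES.get(model)
    if known is not None and voice not in known:
        raise ValueError(f"voice {voice!r} not known for model {model!r}")
```

Unlisted models pass through unchecked, so the helper never blocks a model it doesn't know about.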

STT Quick Start

Transcribe Audio

python
response = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=open("audio.mp3", "rb"),
)
print(response.text)
typescript
import Together from "together-ai";

const together = new Together();

const transcription = await together.audio.transcriptions.create({
  file: "path/to/audio.mp3",
  model: "openai/whisper-large-v3",
  language: "en",
});
console.log(transcription.text);
shell
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -F model="openai/whisper-large-v3" \
  -F file=@audio.mp3

STT Models

| Model | API String |
|---|---|
| Whisper Large v3 | openai/whisper-large-v3 |
| Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 |

Delivery Method Guide

  • REST: Batch processing, complete audio files
  • Streaming: Real-time apps where TTFB matters
  • WebSocket: Interactive/conversational apps, lowest latency

Resources

  • Complete voice lists: See references/tts-models.md
  • STT details: See references/stt-models.md
  • TTS script: See scripts/tts_generate.py — REST, streaming, and WebSocket TTS (v2 SDK)
  • STT script: See scripts/stt_transcribe.py — transcribe, translate, diarize with CLI flags (v2 SDK)
  • Official docs: Text-to-Speech
  • Official docs: Speech-to-Text
  • API reference: TTS API
  • API reference: STT API