Together Audio (TTS & STT)

Overview

Together AI provides text-to-speech and speech-to-text capabilities.
TTS — Generate speech from text via REST, streaming, or WebSocket:
  • Endpoint:
    /v1/audio/speech
  • WebSocket:
    wss://api.together.xyz/v1/audio/speech/websocket
STT — Transcribe audio to text:
  • Endpoint:
    /v1/audio/transcriptions

TTS Quick Start

Basic Speech Generation

python
from together import Together
client = Together()

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",
)
response.stream_to_file("speech.mp3")
typescript
import Together from "together-ai";
import { Readable } from "stream";
import { createWriteStream } from "fs";

const together = new Together();

async function generateAudio() {
  const res = await together.audio.create({
    input: "Today is a wonderful day to build something people love!",
    voice: "tara",
    response_format: "mp3",
    sample_rate: 44100,
    stream: false,
    model: "canopylabs/orpheus-3b-0.1-ft",
  });

  if (res.body) {
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream("./speech.mp3");
    nodeStream.pipe(fileStream);
  }
}

generateAudio();
shell
curl -X POST "https://api.together.xyz/v1/audio/speech" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"canopylabs/orpheus-3b-0.1-ft","input":"Hello world","voice":"tara","response_format":"mp3"}' \
  --output speech.mp3

Streaming Audio (Low Latency)

python
response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="The quick brown fox jumps over the lazy dog",
    voice="tara",
    stream=True,
    response_format="raw",
    response_encoding="pcm_s16le",
)
response.stream_to_file("speech.wav", response_format="wav")
typescript
import Together from "together-ai";

const together = new Together();

async function streamAudio() {
  const response = await together.audio.speech.create({
    model: "canopylabs/orpheus-3b-0.1-ft",
    input: "The quick brown fox jumps over the lazy dog",
    voice: "tara",
    stream: true,
    response_format: "raw",
    response_encoding: "pcm_s16le",
  });

  const chunks = [];
  for await (const chunk of response) {
    chunks.push(chunk);
  }

  console.log("Streaming complete!");
}

streamAudio();
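If you need a playable file from the raw `pcm_s16le` stream, the chunks can be wrapped in a WAV container with Python's standard `wave` module — a minimal sketch (the default sample rate here is an assumption; use whatever rate you requested):

```python
import wave

def pcm_chunks_to_wav(chunks, path, sample_rate=24000, channels=1):
    """Wrap raw pcm_s16le audio chunks in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # pcm_s16le: 2 bytes per sample
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
```

Pass it the iterable of byte chunks from the streaming response to get a file any player can open.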

WebSocket (Lowest Latency)

python
import asyncio, base64, json, os
import websockets

api_key = os.environ["TOGETHER_API_KEY"]

async def generate_speech():
    url = "wss://api.together.xyz/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy"
    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        session = json.loads(await ws.recv())  # first message describes the session
        await ws.send(json.dumps({"type": "input_text_buffer.append", "text": "Hello!"}))
        await ws.send(json.dumps({"type": "input_text_buffer.commit"}))

        audio_data = bytearray()
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "conversation.item.audio_output.delta":
                audio_data.extend(base64.b64decode(data["delta"]))
            elif data["type"] == "conversation.item.audio_output.done":
                break
        return bytes(audio_data)

asyncio.run(generate_speech())

TTS Models

| Model | API String | Endpoints | Price |
|---|---|---|---|
| Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket | $15/1M chars |
| Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket | $4/1M chars |
| Cartesia Sonic 2 | cartesia/sonic-2 | REST | $65/1M chars |
| Cartesia Sonic | cartesia/sonic | REST | - |
| Rime Arcana v3 Turbo | rime-labs/rime-arcana-v3-turbo | REST, Streaming, WebSocket | DE only |
| MiniMax Speech 2.6 | minimax/speech-2.6-turbo | REST, Streaming, WebSocket | DE only |
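Since TTS is billed per character, cost is easy to estimate up front — a quick sketch using the per-million-character rates listed above (treat them as indicative and check current pricing before relying on it):

```python
# $ per 1M input characters, copied from the table above
PRICE_PER_MILLION_CHARS = {
    "canopylabs/orpheus-3b-0.1-ft": 15.00,
    "hexgrad/Kokoro-82M": 4.00,
    "cartesia/sonic-2": 65.00,
}

def estimate_tts_cost(text: str, model: str) -> float:
    """Estimated dollar cost of synthesizing `text` with `model`."""
    return len(text) * PRICE_PER_MILLION_CHARS[model] / 1_000_000
```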

TTS Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| model | string | TTS model (required) | - |
| input | string | Text to synthesize (required) | - |
| voice | string | Voice ID (required) | - |
| response_format | string | mp3, wav, raw, mulaw | wav |
| stream | bool | Enable streaming (raw format only) | false |
| response_encoding | string | Encoding for raw format: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw | - |
| language | string | Language of input text: en, de, fr, es, hi, it, ja, ko, nl, pl, pt, ru, sv, tr, zh | "en" |
| sample_rate | int | Audio sample rate (e.g., 44100) | - |
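When consuming raw output, the response_encoding determines the byte width per sample, which together with sample_rate gives playback duration — a small helper (the byte widths are inferred from the encoding names, not taken from an official mapping):

```python
# bytes per sample for each raw encoding (inferred from the names)
BYTES_PER_SAMPLE = {
    "pcm_f32le": 4,  # 32-bit float, little-endian
    "pcm_s16le": 2,  # 16-bit signed int, little-endian
    "pcm_mulaw": 1,  # 8-bit mu-law
    "pcm_alaw": 1,   # 8-bit A-law
}

def raw_duration_seconds(num_bytes: int, encoding: str,
                         sample_rate: int, channels: int = 1) -> float:
    """Playback duration of a raw audio buffer, in seconds."""
    return num_bytes / (BYTES_PER_SAMPLE[encoding] * sample_rate * channels)
```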

List Available Voices

python
response = client.audio.voices.list()
for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - {voice.name}")
Key voices — Orpheus: `tara`, `leah`, `leo`, `dan`, `mia`, `zac`. Kokoro: `af_alloy`, `af_bella`, `am_adam`, `am_echo`. See references/tts-models.md for complete voice lists.
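To fail fast on typos before making a request, the known voices can be kept in a local lookup — a sketch seeded with only the key voices listed above (the full sets are longer; see references/tts-models.md):

```python
# partial voice lists, copied from the "Key voices" note above
KNOWN_VOICES = {
    "canopylabs/orpheus-3b-0.1-ft": {"tara", "leah", "leo", "dan", "mia", "zac"},
    "hexgrad/Kokoro-82M": {"af_alloy", "af_bella", "am_adam", "am_echo"},
}

def check_voice(model: str, voice: str) -> None:
    """Raise if `voice` is not in the locally known set for `model`."""
    known = KNOWN_VOICES.get(model)
    if known is not None and voice not in known:
        raise ValueError(f"voice {voice!r} not known for model {model!r}")
```

Unlisted models pass through unchecked, so the helper never blocks a model it doesn't know about.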

STT Quick Start

Transcribe Audio

python
response = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=open("audio.mp3", "rb"),
)
print(response.text)
typescript
import Together from "together-ai";

const together = new Together();

const transcription = await together.audio.transcriptions.create({
  file: "path/to/audio.mp3",
  model: "openai/whisper-large-v3",
  language: "en",
});
console.log(transcription.text);
shell
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -F model="openai/whisper-large-v3" \
  -F file=@audio.mp3

STT Models

| Model | API String |
|---|---|
| Whisper Large v3 | openai/whisper-large-v3 |
| Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 |

Delivery Method Guide

  • REST: Batch processing, complete audio files
  • Streaming: Real-time apps where TTFB matters
  • WebSocket: Interactive/conversational apps, lowest latency

Resources

  • Complete voice lists: See references/tts-models.md
  • STT details: See references/stt-models.md
  • TTS script: See scripts/tts_generate.py — REST, streaming, and WebSocket TTS (v2 SDK)
  • STT script: See scripts/stt_transcribe.py — transcribe, translate, diarize with CLI flags (v2 SDK)
  • Official docs: Text-to-Speech
  • Official docs: Speech-to-Text
  • API reference: TTS API
  • API reference: STT API