diarization

Speaker Diarization

Use this skill when the agent needs to track who is speaking across a multi-speaker audio stream. Supports two modes: provider-delegated (extracts speaker labels from Deepgram word-level results) and local clustering (spectral-centroid agglomerative clustering, fully offline).

Enable this when transcribing meetings, interviews, podcast recordings, or any scenario where knowing which speaker said what is important for downstream tasks (summaries, action items, CRM updates).
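
To make the provider-delegated mode concrete, here is a minimal sketch of turning word-level speaker labels into speaker turns. It assumes each word carries a numeric `speaker` index alongside `word`, `start`, and `end`, as Deepgram's diarized word objects do; the `DiarizedWord` and `SpeakerTurn` types and the `wordsToTurns` helper are hypothetical names for illustration, not part of this skill's API.

```typescript
// Only the fields used here are modelled; real provider results carry more.
interface DiarizedWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker: number; // provider-assigned speaker index
}

// Hypothetical output shape for one uninterrupted stretch of speech.
interface SpeakerTurn {
  speaker: string; // generic label such as "Speaker_0"
  start: number;
  end: number;
  text: string;
}

// Merge consecutive words that share a speaker index into a single turn.
function wordsToTurns(words: DiarizedWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const w of words) {
    const label = `Speaker_${w.speaker}`;
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === label) {
      // Same speaker keeps talking: extend the current turn.
      last.end = w.end;
      last.text += ` ${w.word}`;
    } else {
      // Speaker changed (or first word): open a new turn.
      turns.push({ speaker: label, start: w.start, end: w.end, text: w.word });
    }
  }
  return turns;
}
```

Turns in this form are convenient input for the downstream tasks mentioned above, since each one pairs a speaker label with a time range and its text.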

Setup

No API key required for local mode. For provider-delegated mode, enable diarization on the STT provider:

```json
{
  "voice": {
    "stt": "deepgram",
    "diarization": "provider",
    "providerOptions": { "diarize": true }
  }
}
```

For fully offline local clustering:

```json
{
  "voice": {
    "diarization": "local"
  }
}
```
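
The two config fragments above can be checked before a session starts. The following is a sketch only: the `VoiceConfig` type and `checkVoiceConfig` helper are hypothetical and simply encode the rules stated in this section (provider mode needs Deepgram as the STT provider with `diarize` set to `true`; local mode needs nothing else).

```typescript
// Hypothetical typing of the "voice" config block used in this section.
type DiarizationMode = "provider" | "local";

interface VoiceConfig {
  stt?: string;
  diarization: DiarizationMode;
  providerOptions?: { diarize?: boolean };
}

// Returns a list of problems; an empty list means the config looks usable.
function checkVoiceConfig(voice: VoiceConfig): string[] {
  const problems: string[] = [];
  if (voice.diarization === "provider") {
    if (voice.stt !== "deepgram") {
      problems.push('provider mode expects "stt": "deepgram"');
    }
    if (voice.providerOptions?.diarize !== true) {
      problems.push('provider mode expects "providerOptions": { "diarize": true }');
    }
  }
  // Local mode needs no provider settings and no API key.
  return problems;
}
```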

Speaker Enrollment

Pre-register known speakers so the engine labels them by name instead of `Speaker_0`, `Speaker_1`:

```typescript
await session.enrollSpeaker('Alice', aliceVoiceprintFloat32Array);
await session.enrollSpeaker('Bob', bobVoiceprintFloat32Array);
```
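
How an enrolled voiceprint might turn into a named label can be sketched as follows. The source does not say how the engine compares voiceprints, so cosine similarity, the `0.8` threshold, and the `EnrolledSpeaker`, `cosineSimilarity`, and `labelFor` names are all assumptions made for illustration.

```typescript
// Hypothetical record of one enrolled speaker.
interface EnrolledSpeaker {
  name: string;
  voiceprint: Float32Array;
}

// Cosine similarity between two equal-length voiceprints, in [-1, 1].
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("voiceprint length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best-matching enrolled name, or the generic fallback label
// when no enrolled voiceprint is similar enough (threshold is an assumption).
function labelFor(
  embedding: Float32Array,
  enrolled: EnrolledSpeaker[],
  fallback: string,
  threshold = 0.8,
): string {
  let bestName = fallback;
  let bestScore = threshold;
  for (const speaker of enrolled) {
    const score = cosineSimilarity(embedding, speaker.voiceprint);
    if (score >= bestScore) {
      bestScore = score;
      bestName = speaker.name;
    }
  }
  return bestName;
}
```

A segment whose embedding matches nobody keeps its generic `Speaker_N` label, which is the behaviour enrollment is meant to improve on for known participants.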

Provider Rules

- Prefer provider-delegated mode with Deepgram when speaker accuracy is critical and `DEEPGRAM_API_KEY` is available. Word-level speaker labels are more reliable than local clustering.
- Use local mode when privacy, offline operation, or zero additional API cost is required.
- The local backend extracts 16-dimensional spectral feature vectors per 1.5 s window (0.5 s overlap), which is suitable for clear audio. Replace with an ONNX x-vector model for noisy environments.
- Enroll known speakers when participant names are known in advance to get named labels instead of generic `Speaker_N` identifiers.
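
The windowing used by the local backend can be sketched as below. This assumes "1.5 s window (0.5 s overlap)" means consecutive windows start 1.0 s apart and that a trailing partial window is dropped; the `frameWindows` helper and `FrameWindow` type are hypothetical, and the feature extraction itself is not shown.

```typescript
// Sample range covered by one analysis window (end is exclusive).
interface FrameWindow {
  startSample: number;
  endSample: number;
}

// Split a recording into fixed-length, overlapping windows.
function frameWindows(
  totalSamples: number,
  sampleRate: number,
  windowSec = 1.5,
  overlapSec = 0.5,
): FrameWindow[] {
  const windowLen = Math.round(windowSec * sampleRate);
  const hop = Math.round((windowSec - overlapSec) * sampleRate);
  if (windowLen <= 0 || hop <= 0) {
    throw new Error("window must be positive and longer than the overlap");
  }
  const windows: FrameWindow[] = [];
  // Only emit windows that fit entirely inside the recording.
  for (let start = 0; start + windowLen <= totalSamples; start += hop) {
    windows.push({ startSample: start, endSample: start + windowLen });
  }
  return windows;
}
```

Under these assumptions, 4 s of 16 kHz audio (64,000 samples) yields three windows starting at samples 0, 16,000, and 32,000, each of which would produce one 16-dimensional feature vector.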

Events

| Event | Description |
| --- | --- |
| `speaker_identified` | Active speaker label has changed |
| `segment_ready` | A labelled audio or transcript segment is ready |
| `error` | Unrecoverable diarization error |
| `close` | Session fully terminated |
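
A consumer of these events might look like the sketch below. The source names the events but not their payloads or the registration method, so the payload shapes, the `on`/`emit` methods, and the `StubSession` class are assumptions; the stub exists only so the example is self-contained.

```typescript
// Hypothetical payload shapes; only the event names come from the table.
interface DiarizationEvents {
  speaker_identified: { speaker: string };
  segment_ready: { speaker: string; text: string };
  error: Error;
  close: undefined;
}

type EventName = keyof DiarizationEvents;
type Handler<K extends EventName> = (payload: DiarizationEvents[K]) => void;

// Tiny stand-in for a session: registers handlers and fires them on emit.
class StubSession {
  private handlers: Record<string, Array<(payload: unknown) => void>> = {};

  on<K extends EventName>(event: K, handler: Handler<K>): this {
    const list = this.handlers[event] ?? [];
    list.push(handler as (payload: unknown) => void);
    this.handlers[event] = list;
    return this;
  }

  emit<K extends EventName>(event: K, payload: DiarizationEvents[K]): void {
    for (const handler of this.handlers[event] ?? []) handler(payload);
  }
}

// Collect a speaker-attributed transcript and a simple lifecycle log.
const transcript: string[] = [];
const log: string[] = [];
let activeSpeaker = "unknown";

const session = new StubSession();
session
  .on("speaker_identified", ({ speaker }) => {
    activeSpeaker = speaker;
  })
  .on("segment_ready", ({ speaker, text }) => {
    transcript.push(`${speaker}: ${text}`);
  })
  .on("error", (err) => {
    log.push(`error: ${err.message}`);
  })
  .on("close", () => {
    log.push("closed");
  });
```

Because `error` is described as unrecoverable, a real handler would typically stop feeding audio and wait for `close` rather than retry on the same session.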

Examples

"Transcribe this meeting and label each speaker's turns."
"Use local diarization for a private interview recording."
"Enable Deepgram diarization and label Alice and Bob by voice."

Constraints

- Local mode accuracy depends on audio clarity and spectral separation between speakers.
- Provider mode requires `diarize: true` support on the active STT provider (currently Deepgram).
- Speaker enrollment voiceprints must be Float32Arrays computed from clean reference audio.
- The built-in feature extractor is intentionally lightweight; replace `extractSimpleEmbedding()` with an ONNX x-vector model for production-quality voiceprints.
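
Why local accuracy depends on separation between speakers can be illustrated with a small agglomerative clustering sketch. This is not the built-in implementation: it assumes centroid linkage with Euclidean distance and a hypothetical `maxDistance` stopping threshold, and the `agglomerate` function and `Cluster` type are names introduced here.

```typescript
interface Cluster {
  members: number[]; // indices into the input embeddings
  centroid: number[];
}

function euclidean(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Size-weighted mean of two cluster centroids.
function mergedCentroid(x: Cluster, y: Cluster): number[] {
  const nx = x.members.length;
  const ny = y.members.length;
  return x.centroid.map((v, i) => (v * nx + y.centroid[i] * ny) / (nx + ny));
}

// Repeatedly merge the two closest clusters until the closest pair is
// farther apart than maxDistance. Returns one cluster id per embedding.
function agglomerate(embeddings: number[][], maxDistance: number): number[] {
  let clusters: Cluster[] = embeddings.map((e, i) => ({
    members: [i],
    centroid: e.slice(),
  }));
  while (clusters.length > 1) {
    let bestI = 0;
    let bestJ = 1;
    let bestDist = Infinity;
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        const d = euclidean(clusters[i].centroid, clusters[j].centroid);
        if (d < bestDist) {
          bestDist = d;
          bestI = i;
          bestJ = j;
        }
      }
    }
    if (bestDist > maxDistance) break; // remaining clusters are distinct speakers
    const a = clusters[bestI];
    const b = clusters[bestJ];
    const merged: Cluster = {
      members: a.members.concat(b.members),
      centroid: mergedCentroid(a, b),
    };
    clusters = clusters.filter((_, k) => k !== bestI && k !== bestJ);
    clusters.push(merged);
  }
  const labels: number[] = embeddings.map(() => 0);
  clusters.forEach((c, id) => {
    c.members.forEach((m) => {
      labels[m] = id;
    });
  });
  return labels;
}
```

When two speakers produce feature vectors closer together than the threshold, their windows collapse into one cluster, which is the failure mode a stronger embedding model is meant to reduce.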
