audio-transcribe

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

6 installs

Source: agntswrm/agent-media

Added on 2026-02-09

NPX Install

npx skill4agent add agntswrm/agent-media audio-transcribe

SKILL.md Content

Audio Transcribe

Transcribes audio files to text with timestamps. It supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

bash

agent-media audio transcribe --in <path> [options]

Inputs

| Option | Required | Description |
|---|---|---|
| `--in` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language` | No | Language code (auto-detected if not provided) |
| `--speakers` | No | Number of speakers hint for diarization |
| `--out` | No | Output path, filename or directory (default: ./) |
| `--provider` | No | Provider to use (local, fal, replicate, runpod) |
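
When the command is launched from a script rather than typed by hand, the options above can be assembled into an argument list. The following is a minimal Python sketch; the helper name `build_transcribe_args` is hypothetical and not part of agent-media, and it uses only the flags listed in the table.

```python
from typing import List, Optional


def build_transcribe_args(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    # --in is the only required option.
    args = ["agent-media", "audio", "transcribe", "--in", in_path]
    if diarize:
        args.append("--diarize")
    if language is not None:
        args += ["--language", language]
    if speakers is not None:
        # The CLI receives every value as a string.
        args += ["--speakers", str(speakers)]
    if out is not None:
        args += ["--out", out]
    if provider is not None:
        args += ["--provider", provider]
    return args
```

Passing the list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.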

Output

Returns a JSON object with transcription data:

json

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
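
Since each segment carries `start` and `end` times in seconds, the `segments` array maps directly onto subtitle cues. The sketch below converts segments shaped like the example above into SRT text. The function names are hypothetical, and prefixing the speaker label to the cue text is an illustrative choice, not something agent-media does itself.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Turn transcription segments into numbered SRT cues (hypothetical helper)."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        # "speaker" is only expected when diarization was enabled.
        speaker = seg.get("speaker")
        text = f"{speaker}: {seg['text']}" if speaker else seg["text"]
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

To use it, load the file named in `output_path` (or the returned JSON) with `json.load` and pass `data["transcription"]["segments"]` to `segments_to_srt`; adjust the key path if the saved file has a different shape.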

Examples

Basic transcription (auto-detect language):

bash

agent-media audio transcribe --in interview.mp3

Transcription with speaker identification:

bash

agent-media audio transcribe --in meeting.wav --diarize
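
With `--diarize`, segments carry a `speaker` label such as `SPEAKER_0`. For meeting notes it is often handy to merge consecutive segments from the same speaker into one turn. The following is a small Python sketch of that post-processing step; `merge_speaker_turns` is a hypothetical helper and assumes segments shaped like those in the Output section.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge adjacent segments spoken by the same speaker into single turns."""
    turns: List[Dict] = []
    for seg in segments:
        speaker = seg.get("speaker")
        same_speaker = bool(turns) and speaker is not None and turns[-1]["speaker"] == speaker
        if same_speaker:
            # Extend the current turn instead of starting a new one.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return turns
```

Segments without a speaker label are never merged, so the function is also safe to call on non-diarized output.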

Transcription with specific language and speaker count:

bash

agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use specific provider:

bash

agent-media audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

bash

# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
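
Because the extracted file name is generated, a script has to discover it before running step 2. The Python sketch below chains the two steps under two assumptions that the text above does not confirm: that the CLI prints its JSON result on stdout, and that `audio extract` returns `ok` and `output_path` fields like the transcribe response does. The names `run_cli` and `transcribe_video` are hypothetical.

```python
import json
import subprocess
from typing import Callable, Dict, List


def run_cli(args: List[str]) -> Dict:
    """Run a command and parse its stdout as JSON (assumed output channel)."""
    completed = subprocess.run(args, capture_output=True, text=True, check=False)
    return json.loads(completed.stdout)


def transcribe_video(
    video_path: str,
    run: Callable[[List[str]], Dict] = run_cli,
) -> Dict:
    """Extract audio from a video, then transcribe the extracted file."""
    # Step 1: extract audio from the video.
    extracted = run(["agent-media", "audio", "extract", "--in", video_path, "--format", "mp3"])
    if not extracted.get("ok"):
        raise RuntimeError(f"audio extract failed: {extracted}")
    # Assumption: the extract response reports the generated file in "output_path".
    audio_path = extracted["output_path"]

    # Step 2: transcribe the extracted audio.
    result = run(["agent-media", "audio", "transcribe", "--in", audio_path])
    if not result.get("ok"):
        raise RuntimeError(f"audio transcribe failed: {result}")
    return result
```

The `run` parameter is injectable so the flow can be exercised without the CLI installed, for example with a stub that returns canned responses.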

Providers

local

Runs locally on CPU using Transformers.js; no API key required.

Uses the Moonshine model (5x faster than Whisper)
Models are downloaded on first use (~100 MB)
Does NOT support diarization; use fal or replicate for speaker identification
You may see a `mutex lock failed` error; ignore it, the output is correct if `"ok": true`

bash

agent-media audio transcribe --in audio.mp3 --provider local

fal

Requires `FAL_API_KEY`
Uses `wizper` model for fast transcription (2x faster) when diarization is disabled
Uses `whisper` model when diarization is enabled (native support)

replicate

Requires `REPLICATE_API_TOKEN`
Uses `whisper-diarization` model with Whisper Large V3 Turbo
Native diarization support with word-level timestamps

runpod

Requires `RUNPOD_API_KEY`
Uses `pruna/whisper-v3-large` model (Whisper Large V3)
Does NOT support diarization (speaker identification) - use fal or replicate for diarization

bash

agent-media audio transcribe --in audio.mp3 --provider runpod
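
The provider notes above come down to two constraints: which environment variable a provider needs, and whether it supports diarization. The Python sketch below encodes them in a lookup table and picks the first usable provider. The helper `pick_provider` and its preference order (local first) are illustrative assumptions, not the CLI's own default selection logic.

```python
from typing import Mapping, Optional

# Requirements as described in the provider sections above.
PROVIDERS = {
    "local": {"env": None, "diarization": False},
    "fal": {"env": "FAL_API_KEY", "diarization": True},
    "replicate": {"env": "REPLICATE_API_TOKEN", "diarization": True},
    "runpod": {"env": "RUNPOD_API_KEY", "diarization": False},
}


def pick_provider(diarize: bool, env: Mapping[str, str]) -> Optional[str]:
    """Return the first provider that satisfies the request, or None."""
    for name, info in PROVIDERS.items():
        if diarize and not info["diarization"]:
            continue  # local and runpod cannot identify speakers
        if info["env"] is not None and not env.get(info["env"]):
            continue  # required API key is missing or empty
        return name
    return None
```

Calling `pick_provider(True, os.environ)` before building the command lets a script fail early when diarization is requested but neither `FAL_API_KEY` nor `REPLICATE_API_TOKEN` is set.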
