video-transcript

Original: 🇨🇳 Chinese (translated)

4 scripts

Video Transcript Extraction Expert (based on the Doubao Video Understanding Model). Supports links from Bilibili, Douyin, Xiaohongshu, and YouTube, as well as local video files. Runs entirely in the background (headless) on the user's computer: no pop-ups, and no need to log in to any video platform. Outputs a strict verbatim transcript with semantic segmentation and paragraph-level timestamps, retaining filler words, internet memes, and pauses. Long videos are sliced automatically so that the model does not summarize them.

Trigger scenarios:

- The user says "generate transcript", "extract transcript", "convert to text", or "video to text"
- The user says "dictate video", "extract video copy", or "video subtitles"
- The user runs the /video-transcript command
- The user pastes a video link (Bilibili/Douyin/Xiaohongshu/YouTube) with the intention of getting a text version

3 installs

Source: backtthefuture/huangshu

Added on 2026-04-26

NPX Install

npx skill4agent add backtthefuture/huangshu video-transcript

SKILL.md Content (Chinese)

Video Transcript Extraction Expert

Input video link or local path → Auto detection → Background download → Slicing → Compression → Doubao transcription → Strict verbatim transcript (stdout + local save)

Phase 0 · Locate Skill Root Directory (First Priority)

When triggered, first locate the skill's root directory by running:

```bash
VT_HOME="$(
  for d in "$HOME/.claude/skills/video-transcript" \
           "$(pwd)/.claude/skills/video-transcript" \
           "$(pwd)/skills/video-transcript" \
           "$HOME/.claude/plugins/video-transcript/video-transcript"; do
    [ -f "$d/SKILL.md" ] && echo "$d" && break
  done
)"
export VT_HOME
echo "VT_HOME=$VT_HOME"
```

If the output is empty, the skill is installed in a non-standard location. Ask the user for the path, then run `export VT_HOME=<path>` manually. After that, every command must use `"$VT_HOME/scripts/transcript.py"`; do not hardcode relative or absolute paths.

Phase 1 · Dependency Check (First Run / When Suspected)

On the first run, or whenever errors occur, run the dependency check first:

```bash
python3 "$VT_HOME/scripts/transcript.py" --doctor
```

If any item is marked ✗, tell the user what is missing (ffmpeg, playwright, chromium, API Key, etc.) and run the one-click installation wizard:

```bash
bash "$VT_HOME/install.sh"
```

Proceed to Phase 2 only when every check is ✓.

Phase 2 · Trigger Methods and Execution

Run directly when the user provides a video link (URL or local path):

```bash
python3 "$VT_HOME/scripts/transcript.py" "<URL or local path>"
```

Supported inputs:

- Bilibili: `https://www.bilibili.com/video/BVxxx` or a `b23.tv/xxx` short link
- Douyin: `https://www.douyin.com/video/xxx`, `v.douyin.com/xxx`, `douyin.com/jingxuan?modal_id=xxx`
- Xiaohongshu: `xiaohongshu.com/discovery/item/xxx`, `xiaohongshu.com/explore/xxx`, `xhslink.com/xxx`
- YouTube: `youtube.com/watch?v=xxx`, `youtu.be/xxx`
- Local video file path

The script then automatically runs these steps:

0. Detection: launch a headless browser and retrieve the title, duration, and direct video/audio links; print the 📊 evaluation table with the estimated time
1. Download: reuse the direct links obtained during detection (no browser restart); Bilibili uses a DASH stream (video and audio are downloaded separately, then merged with ffmpeg); other platforms download directly as mp4
2. Long video slicing: videos longer than 8 minutes are automatically sliced into 6-minute segments so that the model does not summarize them
3. Compression: ffmpeg with a target size of 30MB; the resolution is selected automatically based on duration, in a single pass
4. Transcription: Doubao Video Understanding API with a strict verbatim prompt (retains filler words, internet memes, and pauses)
5. Merge: for long videos, merge the segments and automatically adjust the timestamp offsets in the segment headers
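
The slicing and merge steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `plan_segments` and `shift_header_timestamps` are hypothetical, and only the 8-minute threshold, the 6-minute segment length, and the `[MM:SS - MM:SS]` header format come from this document.

```python
import math
import re

SLICE_THRESHOLD_S = 8 * 60  # videos longer than this are sliced (from the text)
SEGMENT_LENGTH_S = 6 * 60   # each slice is at most this long (from the text)

# Matches a paragraph-level timestamp such as "[00:42 - 02:15]".
STAMP = re.compile(r"\[(\d+):(\d{2}) - (\d+):(\d{2})\]")


def plan_segments(duration_s):
    """Return (start, end) offsets in seconds for each segment (hypothetical helper)."""
    if duration_s <= SLICE_THRESHOLD_S:
        return [(0, duration_s)]
    count = math.ceil(duration_s / SEGMENT_LENGTH_S)
    return [
        (i * SEGMENT_LENGTH_S, min((i + 1) * SEGMENT_LENGTH_S, duration_s))
        for i in range(count)
    ]


def shift_header_timestamps(markdown, offset_s):
    """Add offset_s to timestamps on '## ' header lines only (hypothetical helper)."""

    def fmt(total_s):
        return f"{total_s // 60:02d}:{total_s % 60:02d}"

    def shift(match):
        start = int(match.group(1)) * 60 + int(match.group(2)) + offset_s
        end = int(match.group(3)) * 60 + int(match.group(4)) + offset_s
        return f"[{fmt(start)} - {fmt(end)}]"

    shifted = []
    for line in markdown.splitlines():
        if line.startswith("## "):
            line = STAMP.sub(shift, line)
        shifted.append(line)
    return "\n".join(shifted)
```

Under these assumptions, a video of 17 minutes 12 seconds (1032 seconds) yields three segments, and a header from the second segment is shifted by 360 seconds before merging.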

⚠️ Two Mandatory Tasks for You (Agent)

(1) Repeat Evaluation Table — Set User Expectations

Right after the script starts, the first block of stderr output is the 📊 Video Detection evaluation table. Relay it to the user immediately, including:

- Video title, duration, and number of segments
- Estimated time (the most important item, because it tells the user how long to wait)

Do not wait until transcription finishes before speaking: relay the evaluation table first, then wait for the transcript. If users cannot see the duration and estimated time, they may think the program has frozen.

Example:

Video detection completed:

Platform: Bilibili / Title: 《xxx》

Duration: 17 minutes 12 seconds, will be sliced into 3 segments for independent transcription

Estimated time: 3 minutes 20 seconds ~ 5 minutes 25 seconds, running now, please wait...

(2) Output Full Verbatim Transcript After Completion

At the end, the script prints the complete Markdown transcript to stdout. Display the entire transcript in the conversation exactly as printed. Do not just say "saved to xxx.md"; that leaves the user with nothing to read.

Correct approach:

1. Capture all Markdown content in stdout, starting from the `# <Title>` line
2. Relay it to the user in full (title, duration, all paragraphs)
3. Add a final line giving the local save path so the user can open the .md file later
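
As a rough sketch of the capture step, assuming the transcript begins at the first stdout line that starts with `# ` and runs to the end of the output (the helper name `extract_transcript` is hypothetical):

```python
def extract_transcript(stdout_text):
    """Return everything from the first '# <Title>' line to the end of stdout."""
    lines = stdout_text.splitlines()
    for index, line in enumerate(lines):
        # "## 1. ..." section headers do not match, because they start with "##".
        if line.startswith("# "):
            return "\n".join(lines[index:]).strip() + "\n"
    raise ValueError("no '# <Title>' heading found in stdout")
```

The input text could come from `subprocess.run([...], capture_output=True, text=True).stdout` when the agent launches the script itself.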

Incorrect approaches:

❌ "Transcript saved to outputs/xxx.md, please check the file"
❌ Only show the first few paragraphs and omit the rest
❌ Summarize/simplify content (transcripts must be verbatim)

Phase 3 · Output Destinations

The script will output to two destinations simultaneously:

1. The full Markdown transcript goes directly to stdout, for the agent to relay to the user (see mandatory task 2 in Phase 2)
2. A copy is saved locally to `$VT_HOME/outputs/<title>_transcript.md`, so all outputs are managed by the skill and stored in one place

Disable local save: `--no-save`

Change save path: `--output-dir <path>`

Evaluation Table Example

═══════════════════════════════════════════════════════
  📊 Video Detection
───────────────────────────────────────────────────────
  Platform:      Bilibili
  Title:      A city of 100,000 people disappeared between Zhejiang and Anhui
  Duration:      17 minutes 12 seconds
  Segments:      3 segments (each ≤ 6 minutes)
  Estimated Time:  3 minutes 20 seconds ~ 5 minutes 25 seconds
═══════════════════════════════════════════════════════

Output Format Example

```markdown
# Video Title

> Duration 5:32 | Source: <URL or filename>

## 1. Introduce Topic [00:00 - 00:42]
Hello everyone, today we're going to talk about...

## 2. Core Viewpoint [00:42 - 02:15]
Well, my opinion is this, first of all...
```

Features:

- Semantic segmentation: automatically divided into 3-8 segments by topic, each with a short subtitle
- Paragraph-level timestamps: each segment starts with `[MM:SS - MM:SS]` for easy navigation
- Verbatim transcription: retains filler words ("uh", "well", "ah", "you know"), internet memes, and pauses, with no summarization or rewriting
- Silent segments: marked with `_(No voice here, XX seconds)_`
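
For downstream use, the output format above can be parsed back into structured sections. The sketch below assumes every section header follows the `## N. Subtitle [MM:SS - MM:SS]` pattern shown in the example; the `Section` class and `parse_sections` function are hypothetical names, not part of the skill.

```python
import re
from dataclasses import dataclass

# Matches headers such as "## 2. Core Viewpoint [00:42 - 02:15]".
HEADER = re.compile(r"^## (\d+)\. (.+?) \[(\d+):(\d{2}) - (\d+):(\d{2})\]\s*$")


@dataclass
class Section:
    index: int
    title: str
    start_s: int
    end_s: int
    text: str = ""


def parse_sections(markdown):
    """Split a transcript into sections keyed by their header timestamps."""
    sections = []
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            number, title, m1, s1, m2, s2 = match.groups()
            sections.append(
                Section(int(number), title, int(m1) * 60 + int(s1), int(m2) * 60 + int(s2))
            )
        elif sections:
            # Body lines belong to the most recent header.
            sections[-1].text += line + "\n"
    for section in sections:
        section.text = section.text.strip()
    return sections
```

Lines before the first section header (the title and the duration line) are ignored by this sketch.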

Phase 4 · Exception Handling

| Scenario | Handling |
| --- | --- |
| `--doctor` reports missing dependencies | Run `bash "$VT_HOME/install.sh"` |
| `Doubao API Key not found` | Check `$VT_HOME/.env`; if it is not configured, ask the user to run install.sh, which guides Key setup |
| API 401 / "Model not authorized" | Volcano Ark Console → Model Plaza → click "Activate" for Doubao-Seed-2.0-pro |
| Douyin image-and-text note (note link) | Tell the user that image-and-text content is not supported; only videos are |
| Crawling fails after a platform frontend revision | Check `$VT_HOME/FALLBACK.md` for a manual workaround |
| Long video is summarized instead of transcribed verbatim | Already handled automatically (videos over 8 minutes are sliced); report if the issue persists |
| Compressed video is still > 50MB | The script automatically iterates, reducing bitrate/resolution (max 4 rounds) |
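
To make the compression retry concrete, here is a hedged sketch of a bitrate budget. Only the 30MB default target and the 4-round limit come from this document; the size-to-bitrate formula is the standard one, while the 0.8 shrink factor and both function names are assumptions chosen for illustration.

```python
def total_bitrate_kbps(duration_s, target_mb=30.0):
    """Combined video+audio bitrate that fits target_mb into duration_s.

    Treats 1 MB as 1,000,000 bytes and 1 kbit as 1,000 bits.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return target_mb * 8 * 1000 / duration_s


def retry_bitrates(duration_s, target_mb=30.0, max_rounds=4, shrink=0.8):
    """Bitrate to try in each round; shrink=0.8 is an assumed factor."""
    first = total_bitrate_kbps(duration_s, target_mb)
    return [first * shrink**round_index for round_index in range(max_rounds)]
```

For a full 6-minute segment (360 seconds) and the default target, the first round budgets roughly 667 kbit/s in total, and each later round tries a lower value.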

Command Line Options

| Parameter | Description |
| --- | --- |
| `input` | Video URL or local file path (required, except with `--doctor`) |
| `--title` | Video title (uses the detected title by default) |
| `--target-size` | Compression target size in MB, default 30 |
| `--no-save` | Do not save the .md file (saved to `$VT_HOME/outputs/` by default) |
| `--output-dir` | Change the save path |
| `--doctor` | Check mode: verify dependencies and configuration |
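
The options table maps naturally onto Python's `argparse`. The sketch below is one way such a parser could look; it mirrors the table but is not taken from `transcript.py`, and the `build_parser` and `parse_args` names are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="transcript.py")
    # nargs="?" makes the positional optional so that --doctor can run without it.
    parser.add_argument("input", nargs="?", help="video URL or local file path")
    parser.add_argument("--title", default=None, help="override the detected title")
    parser.add_argument("--target-size", type=float, default=30,
                        help="compression target size in MB")
    parser.add_argument("--no-save", action="store_true",
                        help="do not save the .md file")
    parser.add_argument("--output-dir", default=None, help="change the save path")
    parser.add_argument("--doctor", action="store_true",
                        help="verify dependencies and configuration")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce "required, except with --doctor" from the table.
    if not args.doctor and args.input is None:
        parser.error("input is required unless --doctor is given")
    return args
```

Calling `parse_args(["--doctor"])` succeeds without an input, while `parse_args([])` exits with a usage error.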

Notes

- Uses Doubao's native video understanding: the model processes the video's audio directly, so no separate ASR step is required
- Timestamp precision is paragraph-level (not word- or sentence-level) and is intended for chapter navigation
- By default, the full Markdown transcript is printed to stdout (for the calling agent to display) and saved locally at the same time
- Estimated time model (based on actual measurements): 10s headless startup + download (0.4s per second of video) + 12s compression per segment + 30s API processing per segment, with a ±20% range
- The API Key configuration is stored in `$VT_HOME/.env` (permission 600, gitignored) and is not distributed with the repository
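
A literal reading of the estimated time model can be written as below. The function name `estimate_range_s` is hypothetical, and the script's real estimate may differ: the sample evaluation table shows a shorter range for a 17-minute video than this formula gives, so treat the constants as the documented approximation only.

```python
def estimate_range_s(duration_s, segments):
    """Return (low, high) estimated processing time in seconds.

    Literal reading of the note: 10s startup + 0.4s download per video second
    + 12s compression per segment + 30s API time per segment, then +/-20%.
    """
    base = 10 + 0.4 * duration_s + 12 * segments + 30 * segments
    return 0.8 * base, 1.2 * base
```

For a single-segment video of 5 minutes 32 seconds (332 seconds), the base estimate is 184.8 seconds, giving a range of about 148 to 222 seconds.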
