digital-health-clinical-asr-build

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0 -->

Clinical ASR Flywheel — Stage 2 (Build the benchmark)

⚠ Agent: read this entire SKILL.md before answering. This stage is conversational and gated. Specifically: ask the user 1–2 specialty-aware clarifying questions before proposing terms (Step 2a), walk them through the two-tier IPA pipeline (override → merriam-webster → magpie_g2p) in Step 2c, hit the explicit QA-mode audition gate in Step 2d before full Cartesian synthesis, and name KER as the headline metric they'll see in Stage 3. Skipping any of these defeats the methodology.
You are the curate-and-synthesize stage. The user arrives from
/digital-health-clinical-asr-setup
and leaves with a NeMo-format
manifest.jsonl
plus the audio it references — both ready for scoring at
/digital-health-clinical-asr-eval
.
Be conversational. This is the warmest, most domain-aware step in the flywheel: you're asking a clinician (or someone who works with them) which terms hurt today and shaping a benchmark around their reality. Ask short, focused questions. Show the user what's being added. Don't lecture.

Data leaves your environment — disclose this to the user before any term is sent

This stage transmits user-curated content to two external services. Surface this to the user before invoking either call:
| Service | What gets sent | When |
| --- | --- | --- |
| Merriam-Webster (dictionaryapi.com API or merriam-webster.com public site) | One HTTP request per term in the seed list; the term goes in the URL path | Step 2c (see the MW lookup paths there) |
| NVIDIA NVCF Magpie TTS (grpc.nvcf.nvidia.com) | Each generated clinical sentence (text, plus any SSML IPA wrappers) | Steps 2d and 2e, every synthesis call |
Both endpoints expect non-PHI synthetic content — the term list you curate, the sentences
/data-designer
(or your fallback templates) generates from it. Do not pass real patient records, real ASR transcripts, or any PHI through this skill. If the term list itself is sensitive (proprietary drug codenames, unreleased product names, customer-confidential indications), confirm with the user that external-API transmission is acceptable under their organization's data-governance policy before proceeding.
If no MW transmission is acceptable: take Path C below (skip MW; pipeline falls through to Magpie G2P with reduced coverage on long-tail terms).

Purpose

Curate a clinical-specialty term list, generate eval audio for it through Magpie TTS with a two-tier IPA pipeline, and write a NeMo-format manifest tagged with the clinical-extension fields (
term
,
entity_category
,
ipa_source
,
voice_id
,
noise_level
,
context_type
). The output is the input to Stage 3.
By the end the user has:
$EVAL_DIR/cycle<N>/
├── audio/<slug>.wav        synthesized clips
├── manifest.jsonl          NeMo format + clinical extension
├── term_seed.csv           the curated input
└── pronunciation_overrides.csv   appendable across cycles
(
$EVAL_DIR
is the user's own choice — this skill does not impose a layout. The structure above is a recommendation, not a requirement.)

When to use this skill

Activate on user phrases like:
  • "Build a clinical ASR benchmark"
  • "Curate drug names / procedure names for ASR eval"
  • "Generate eval audio for medical terms"
  • "Create a NeMo manifest from clinical terms"
  • "Add oncology / cardiology / ortho terms to my benchmark"
  • "Audition the TTS pronunciation for these drug names"
  • "Make me a cycle-N manifest"
Do not activate when (also: if the message mentions
auth
,
API key
,
gRPC
,
streaming
,
riva-build
,
NIM deploy
,
NGC
, or
Docker
, route per the bullets below and stop):
  • The user already has a manifest and wants to score it →
    /digital-health-clinical-asr-eval
  • The user wants to fine-tune on an existing manifest →
    /digital-health-clinical-asr-finetune
  • The user is asking generic TTS / SSML / voice-cloning / voice-catalog questions →
    /read-aloud
    (or
    /riva-tts
    )
  • TTS/ASR auth / API keys / gRPC / streaming →
    /riva-tts
    or
    /riva-asr
  • NIM deploy or
    riva-build
    /
    riva-deploy
    flags →
    /riva-asr-custom
    or
    /riva-tts-custom
  • NGC / Docker / NVIDIA Container Toolkit →
    /riva-nim-setup
  • The user is asking generic synthetic-data questions →
    /data-designer

Prerequisites

  • /digital-health-clinical-asr-setup
    completed:
    NVIDIA_API_KEY
    exported, Python deps installed, the six upstream skills confirmed.
  • /read-aloud
    (or
    /riva-tts
    ) reachable. Hosted Magpie via NVCF is the default. Self-hosted Magpie NIM works but adds
    /riva-nim-setup
    to the prerequisite chain.
  • /data-designer
    reachable. Template fallback is acceptable for a first cycle if
    /data-designer
    is unavailable, but tag those rows so future cycles can re-generate.
  • A working directory the user owns. The skill recommends
    $EVAL_DIR/cycle<N>/
    but does not enforce it.

Instructions

2a. Specialty interview →
term_seed.csv

Ask one question at a time. The goal is to surface 4–10 candidate terms with the right
entity_category
, not to write a textbook.
Questions, in order:
  1. What specialty / workflow is this for? (oncology dictation, ICU handoff, psych intake, ortho post-op, …)
  2. What ASR failure modes have you seen? — drug names, multi-word procedures, abbreviations, compound conditions.
  3. Which terms come up daily vs which are the hard ones? — daily-common terms become the sanity baseline; daily-hard terms become the signal.
Propose 4–10 candidate terms with
entity_category
. Confirm with the user before writing. Then write
term_seed.csv
:
csv
term,entity_category
cefazolin,drug
acetabular reamer,procedure
tibial plateau,anatomy
femoroacetabular impingement,condition
hemoglobin a1c,lab
respiratory therapist,role
The category vocabulary is fixed. KER keys off it. Allowed values:
drug | procedure | anatomy | condition | lab | role
If the user proposes a new category, push back: either it maps to one of the six, or the methodology needs a deliberate extension (which is a future cycle's job, not a one-off ad-hoc add).
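Because KER keys off the fixed vocabulary, it can help to check term_seed.csv before moving on to Step 2b. The following is a minimal sketch, not part of the skill's reference recipes: the function name `validate_term_seed` and the duplicate-term check are assumptions added for illustration.

```python
import csv
import io

# The six allowed values, copied from the vocabulary stated above.
ALLOWED_CATEGORIES = {"drug", "procedure", "anatomy", "condition", "lab", "role"}


def validate_term_seed(csv_text):
    """Return a list of (line_number, message) problems found in term_seed.csv text.

    Hypothetical helper: an empty list means the seed file looks usable.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["term", "entity_category"]:
        return [(1, f"unexpected header: {reader.fieldnames}")]

    problems = []
    seen = set()
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        term = (row["term"] or "").strip()
        category = (row["entity_category"] or "").strip()
        if not term:
            problems.append((line_no, "empty term"))
        elif term.lower() in seen:
            problems.append((line_no, f"duplicate term: {term}"))
        seen.add(term.lower())
        if category not in ALLOWED_CATEGORIES:
            problems.append((line_no, f"unknown entity_category: {category!r}"))
    return problems
```

Returning problems instead of raising keeps the step conversational: the agent can show the user every issue at once and ask how to remap an unknown category.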

2b. Sentence generation via
/data-designer

Brief
/data-designer
with:
For each row in
term_seed.csv
, generate one or more natural English sentences embedding
term
in a way that fits the row's
entity_category
. Output schema:
{term, entity_category, sentence, context_type}
. Generate 3–5
context_type
variants per term. Initial
context_type
vocabulary:
dictation
,
handoff
,
chart_note
,
history
. Sentence length 10–30 words.
The output of this step is a per-term sentence variants file. Any filename is fine — pick one and use it consistently across the cycle directory.
Template fallback. If
/data-designer
is unavailable, use a 4-template fallback (one per
context_type
) and substitute
term
mechanically. Tag those rows in the manifest (
context_type
is set, the sentence is just less natural) so a future cycle can regenerate.
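The template fallback can be sketched as below. The four template strings, the function name `fallback_rows`, and the extra `sentence_source` tag are all assumptions made for illustration; the skill only requires one template per context_type, mechanical substitution, and some way to find these rows again later. The templates are deliberately category-neutral so they read acceptably for a drug, a procedure, or a role.

```python
# Hypothetical category-neutral templates, one per context_type.
FALLBACK_TEMPLATES = {
    "dictation": "For the record, the team discussed {term} during rounds "
                 "this morning and updated the plan accordingly.",
    "handoff": "At handoff, please note that {term} was reviewed overnight "
               "and needs follow-up on the day shift.",
    "chart_note": "Chart note: {term} is documented in the assessment, with "
                  "details recorded in the plan section below.",
    "history": "The patient's history is notable for {term}, which was first "
               "documented at an outside facility last year.",
}


def fallback_rows(term, entity_category):
    """Build one sentence row per context_type by mechanical substitution."""
    return [
        {
            "term": term,
            "entity_category": entity_category,
            "sentence": template.format(term=term),
            "context_type": context_type,
            # Assumed tag so a later cycle can find and regenerate these rows.
            "sentence_source": "template_fallback",
        }
        for context_type, template in FALLBACK_TEMPLATES.items()
    ]
```

Calling `fallback_rows("cefazolin", "drug")` yields four rows that match the `{term, entity_category, sentence, context_type}` schema plus the assumed tag.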

2c. Two-tier IPA tagging (the load-bearing quality lever)

Every term passes through three stages, in order (two IPA tiers, then a G2P fall-through):
  1. Override
    pronunciation_overrides.csv
    carries verified IPA the team has audited. If
    term
    matches a row here, the override wins.
  2. Merriam-Webster — for un-overridden terms, fetch the MW respelling, convert to IPA, validate against Magpie's en-US phoneme set. If both succeed, the term is tagged
    merriam-webster
    .
  3. Magpie G2P (fall-through) — if neither override nor MW produces a valid IPA, the plain text is passed to Magpie's neural G2P at synthesis time. The row is tagged
    magpie_g2p
    .
Every manifest row carries the
ipa_source
tag (
override | merriam-webster | magpie_g2p
). The delta between
merriam-webster
and
magpie_g2p
rows in the Stage 3 leaderboard is the proof the pronunciation strategy is working — call it out explicitly when you produce the leaderboard.
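The resolution order described above can be sketched as a single function. This is an illustration under stated assumptions, not the recipe from references/pronunciation-pipeline.md: `resolve_ipa`, `mw_lookup`, and `is_valid_ipa` are hypothetical names, and the two callables stand in for the MW fetch-and-convert step and the Magpie phoneme-set check.

```python
def resolve_ipa(term, overrides, mw_lookup, is_valid_ipa):
    """Return (ipa_or_None, ipa_source) following override, then MW, then G2P.

    overrides    -- dict of lowercased term -> verified IPA
    mw_lookup    -- callable(term) -> IPA string or None; pass None to skip MW
    is_valid_ipa -- callable(ipa) -> bool, the phoneme-set validation
    """
    key = term.strip().lower()

    # Tier 1: a verified override always wins.
    if key in overrides:
        return overrides[key], "override"

    # Tier 2: Merriam-Webster, only if the lookup AND the validation succeed.
    if mw_lookup is not None:
        try:
            candidate = mw_lookup(term)
        except Exception:
            candidate = None  # treat lookup failures as a miss, not a crash
        if candidate and is_valid_ipa(candidate):
            return candidate, "merriam-webster"

    # Fall-through: no IPA; plain text goes to Magpie's G2P at synthesis time.
    return None, "magpie_g2p"
```

Returning the source tag together with the IPA keeps the `ipa_source` field in lockstep with what was actually used, which is what the Stage 3 comparison depends on.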
Three MW lookup paths. A and B tag
merriam-webster
; C skips MW, so un-overridden terms tag
magpie_g2p
. A:
dictionaryapi.com
JSON API +
DICTIONARY_API_KEY
(free at dictionaryapi.com) — recommended for standalone use. B: HTML scrape of
merriam-webster.com
— no key, brittle to site HTML changes; recipe inlined in
references/pronunciation-pipeline.md
. C: skip MW, fall through to Magpie G2P with weaker long-tail coverage. Both recipes + the full respelling→IPA table live in
references/pronunciation-pipeline.md
. The Path A function takes
api_key
as an arg (never reads
os.environ
); pass
None
to skip MW.
pronunciation_overrides.csv
schema:
csv
term,ipa,verified_by,verified_at,notes
cefazolin,sɛfəˈzoʊlɪn,brandoing,2026-05-13,confirmed against MW respelling + ear test
Append-only across cycles. Re-running the build later picks up new entries automatically.
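A minimal sketch of reading and appending to pronunciation_overrides.csv with the schema above. The helper names are hypothetical, and the "later row wins" rule for repeated terms is an assumption that fits an append-only file; the skill itself does not state how repeats are resolved.

```python
import csv
import os
from datetime import date

FIELDS = ["term", "ipa", "verified_by", "verified_at", "notes"]


def load_overrides(path):
    """Return {lowercased term: ipa}. A later row for the same term wins (assumed)."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row["term"].strip().lower(): row["ipa"].strip()
            for row in csv.DictReader(handle)
        }


def append_override(path, term, ipa, verified_by, notes=""):
    """Append one verified row; write the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "term": term,
            "ipa": ipa,
            "verified_by": verified_by,
            "verified_at": date.today().isoformat(),
            "notes": notes,
        })
```

Using the csv module rather than string concatenation matters here because the notes column is free text and may contain commas.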

2d. QA-mode synthesis (do not skip this gate)

Before running the full Cartesian product, synthesize one wav per term with: first voice, clean noise, default context. Audition each clip with the user.
For every term tagged
magpie_g2p
, propose an IPA candidate using clinical suffix patterns and validate against Magpie's en-US phoneme set before suggesting:
| Suffix | Stress pattern (example) |
| --- | --- |
| -mycin | …ˈmaɪsɪn (vancomycin, gentamicin) |
| -prazole | …ˈpreɪzoʊl (esomeprazole, omeprazole) |
| -statin | …ˈstætɪn (atorvastatin, rosuvastatin) |
| -sartan | …ˈsɑːrtən (losartan, valsartan) |
| -azole | …ˈeɪzoʊl (fluconazole, ketoconazole) |
| -cillin | …ˈsɪlɪn (amoxicillin, piperacillin) |
| -parin | …ˈpɛərɪn (enoxaparin, heparin) |
Phoneme-validation pattern — live-probe Magpie's en-US neural G2P with a candidate IPA. If Magpie accepts the SSML, the IPA is in its inventory. Use the suffix patterns above as a pre-filter (cheap heuristic) and the live probe to confirm before committing to an override. The
magpie_validates_ipa(ipa, api_key, voice_id)
recipe — a minimal NVCF gRPC synthesis call that returns
True
/
False
fail-closed — is in
references/pronunciation-pipeline.md
.
Call it once per candidate IPA before showing it to the user. On user approval, append the verified IPA to
pronunciation_overrides.csv
. The row's
ipa_source
flips from
magpie_g2p
to
override
on the next manifest generation.
HITL audition gate before Step 2e — fail-closed. Do not synthesize the full Cartesian product, do not promote any staged IPA candidate to
pronunciation_overrides.csv
, and do not advance to Stage 3 until one of the following has happened explicitly in conversation:
  1. The user confirms they have auditioned the QA clips and reports their verdict per clip (or per bucket: "the MW set sounds fine", "fix
    pembrolizumab
    ", etc.). Provide the
    afplay
    (macOS) or
    paplay
    /
    aplay
    (Linux) commands so the user can play them — then halt and wait for their reply after listening. Paper-only approval via an AskUserQuestion prompt — clicking "Promote all" or "Lock in" without auditioning — does not satisfy this gate. Magpie-validating an IPA proves it's in the phoneme inventory; it does not prove it matches the intended pronunciation. Only the user's ears do that.
  2. The user explicitly opts to skip audition for this cycle, in deliberate language (e.g. "skip audition, accept the risk that mispronunciations may dilute the Stage 3 KER signal — log it as a cycle-N caveat"), not as a side-effect of a single click-through. Record the skip in a cycle-level note (e.g.
    eval/cycle<N>/cycle_notes.md
    ) so a future operator can see the audition was deferred.
Magpie NVCF rate-limits aggressively on >100-row jobs, and a do-over costs both API credits and clock time — but the larger risk is shipping a manifest with mispronounced reference audio that quietly corrupts the Stage 3 KER signal. Time spent auditioning is cheaper than re-running the cycle.
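To make the audition step concrete, the playback commands can be generated from the QA clip paths and printed for the user to run. This is a small sketch with a hypothetical helper name; it picks `afplay` on macOS and `aplay` elsewhere, and `paplay` could be substituted on Linux systems that use PulseAudio.

```python
import platform
import shlex


def audition_commands(wav_paths, system=None):
    """Return one shell playback command per QA clip.

    system defaults to platform.system(); "Darwin" means macOS.
    """
    system = system or platform.system()
    player = "afplay" if system == "Darwin" else "aplay"
    # shlex.quote protects paths that contain spaces, such as multi-word terms.
    return [f"{player} {shlex.quote(str(path))}" for path in wav_paths]
```

After printing the commands, the agent still has to halt and wait for the user's verdict; generating the commands does not satisfy the gate by itself.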

2e. Full benchmark generation

After pronunciations are locked, generate the full Cartesian product
|terms| × |voices| × |noise_levels| × |context_types|
. Defaults: 2–4 Magpie en-US voices (Mia/Jason/Ray),
[clean, snr_15db, snr_5db]
,
[dictation, handoff, chart_note, history]
.
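The row count grows multiplicatively, so it is worth computing before any synthesis call. A minimal sketch, assuming a hypothetical `build_rows` helper and placeholder voice identifiers:

```python
from itertools import product


def build_rows(terms, voices, noise_levels, context_types, warn_above=100):
    """Enumerate the full Cartesian product as a list of row dicts."""
    rows = [
        {"term": t, "voice_id": v, "noise_level": n, "context_type": c}
        for t, v, n, c in product(terms, voices, noise_levels, context_types)
    ]
    if len(rows) > warn_above:
        # Large jobs are where rate-limit drops are expected; plan a re-run pass.
        print(f"warning: {len(rows)} rows exceeds {warn_above}; expect some dropped rows")
    return rows


if __name__ == "__main__":
    terms = [f"term_{i}" for i in range(10)]   # placeholder terms
    voices = ["voice_a", "voice_b"]            # placeholder voice ids
    rows = build_rows(terms, voices, ["clean", "snr_15db"],
                      ["dictation", "handoff", "chart_note"])
    print(len(rows))  # 10 * 2 * 2 * 3 = 120
```

Ten terms with two voices, two noise levels, and three contexts already gives 120 rows, which crosses the 100-row warning threshold.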
Self-contained synthesis — no
/read-aloud
required. The
synthesize_row(row, all_overrides, out_dir, api_key)
recipe — opens an NVCF gRPC stream, wraps overrides into SSML via
render_sentence_with_overrides
, writes 16-bit mono PCM to
<out_dir>/audio/<slug>.wav
— is in
references/pronunciation-pipeline.md
(§Synthesis call). Key invariant:
all_overrides
carries every entry from
pronunciation_overrides.csv
(including context-word overrides like
intravenously
) so the renderer wraps any override whose verbatim text appears in
row['text']
. Wrapping only
row['term']
silently drops context-word overrides.
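The invariant can be illustrated with a simplified renderer. This is not the `render_sentence_with_overrides` recipe from the reference file; it is a sketch that assumes the standard SSML `<phoneme alphabet="ipa" ph="...">` element, and the exact markup Magpie expects should be taken from references/pronunciation-pipeline.md.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def render_with_overrides_sketch(text, overrides):
    """Wrap every override term that appears in text in an SSML phoneme tag.

    overrides -- dict of term -> IPA, covering ALL rows of the overrides CSV,
                 not only the row's own term.
    """
    if not overrides:
        return f"<speak>{escape(text)}</speak>"

    lookup = {term.lower(): ipa for term, ipa in overrides.items()}
    # Longest terms first so multi-word overrides beat their own substrings.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )

    parts, cursor = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[cursor:match.start()]))
        ipa = lookup[match.group(1).lower()]
        parts.append(
            f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>"
            f"{escape(match.group(1))}</phoneme>"
        )
        cursor = match.end()
    parts.append(escape(text[cursor:]))
    return "<speak>" + "".join(parts) + "</speak>"
```

Because the pattern is built from the whole override table, a sentence about cefazolin that also contains intravenously gets both words wrapped, which is the behavior the invariant asks for.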
Noise-injection (clean →
snr_15db
→
snr_5db
) and the manifest schema (NeMo canonical fields + clinical extension, plus pre-flight schema and audio-existence checks) all live in
references/manifest-schema.md
.
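For orientation, mixing noise at a target signal-to-noise ratio reduces to scaling the noise so that 10·log10(P_signal / P_noise) equals the target. The sketch below is a generic pure-Python illustration with a hypothetical function name; the recipe the skill actually uses, including the noise source, is the one in references/manifest-schema.md.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Return signal + scaled noise so the mix has the requested SNR in dB.

    signal, noise -- sequences of float samples; noise must be at least as long.
    """
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[: len(signal)]

    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both be non-silent")

    # Noise power needed for the target ratio, then the amplitude gain for it.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(signal, noise)]
```

The sketch does not clip or rescale the result, so a real pipeline writing 16-bit PCM would still need to guard against overflow after mixing.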
Warn when product > 100 rows. Magpie NVCF rate-limits with ~5–10%
RESOURCE_EXHAUSTED
drops on big runs. Re-run the dropped rows.
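Re-running only the dropped rows can be sketched as a retry loop with exponential backoff between passes. Everything here is hypothetical scaffolding: `synthesize` stands in for the per-row synthesis call, and `is_rate_limited` for whatever test identifies a rate-limit error (with the grpc package, that would typically be comparing the error's `code()` to `grpc.StatusCode.RESOURCE_EXHAUSTED`).

```python
import time


def synthesize_with_retry(rows, synthesize, is_rate_limited,
                          max_passes=3, base_delay=1.0, sleep=time.sleep):
    """Run synthesize(row) for each row, retrying only rate-limited rows.

    Returns the rows that were still dropped after max_passes.
    """
    pending = list(rows)
    for attempt in range(max_passes):
        dropped = []
        for row in pending:
            try:
                synthesize(row)
            except Exception as error:
                if not is_rate_limited(error):
                    raise  # anything else is a real failure; surface it
                dropped.append(row)
        if not dropped:
            return []
        pending = dropped
        if attempt < max_passes - 1:
            # Back off 1x, 2x, 4x ... the base delay before the next pass.
            sleep(base_delay * 2 ** attempt)
    return pending
```

Rows returned by the function are the ones to report to the user as gaps, so the manifest writer and the audio directory stay consistent.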

Stage 2 completion checklist

Don't consider Stage 2 done until all five sub-steps ran. Agents commonly stop after 2a or 2b; the goal is a synthesized manifest plus a hand-off:
  • 2a
    term_seed.csv
    , 4–10 terms,
    entity_category ∈ {drug, procedure, anatomy, condition, lab, role}
  • 2b — 3–5
    context_type
    sentence variants per term
  • 2c — every term tagged
    ipa_source ∈ {override, merriam-webster, magpie_g2p}
  • 2d — QA wavs auditioned, IPA overrides locked with explicit user approval
  • 2e
    manifest.jsonl
    + per-row audio for the Cartesian product
  • Hand-off — name
    /digital-health-clinical-asr-eval
    as the next skill and KER as its headline metric
Writes go only into the user-chosen
$EVAL_DIR/cycle<N>/
. Don't write elsewhere, modify env, or install packages — those belong to
/digital-health-clinical-asr-setup
.
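The 2c and 2e items of the checklist can be spot-checked mechanically before hand-off. The sketch below is an assumption-laden illustration, not the pre-flight check from references/manifest-schema.md: it assumes the NeMo fields `audio_filepath`, `duration`, and `text` alongside the clinical extension fields, and it resolves relative audio paths against the manifest's directory.

```python
import json
import os

REQUIRED_FIELDS = (
    "audio_filepath", "duration", "text",           # NeMo fields (assumed set)
    "term", "entity_category", "ipa_source",        # clinical extension
    "voice_id", "noise_level", "context_type",
)
IPA_SOURCES = {"override", "merriam-webster", "magpie_g2p"}
CATEGORIES = {"drug", "procedure", "anatomy", "condition", "lab", "role"}


def check_manifest(manifest_path):
    """Return a list of (line_number, message) problems for a manifest.jsonl."""
    problems = []
    base = os.path.dirname(os.path.abspath(manifest_path))
    with open(manifest_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((line_no, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_no, "row is not a JSON object"))
                continue
            missing = [f for f in REQUIRED_FIELDS if f not in row]
            if missing:
                problems.append((line_no, f"missing fields: {missing}"))
            if row.get("ipa_source") not in IPA_SOURCES:
                problems.append((line_no, "invalid ipa_source"))
            if row.get("entity_category") not in CATEGORIES:
                problems.append((line_no, "invalid entity_category"))
            audio = row.get("audio_filepath")
            if audio and not os.path.exists(os.path.join(base, audio)):
                problems.append((line_no, f"audio not found: {audio}"))
    return problems
```

An empty result does not replace the audition gate in Step 2d; it only confirms that the files Stage 3 will read are structurally complete.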

Examples

Scenario A — fresh oncology benchmark. User: "We're seeing chemo drug names mistranscribed. Where do I start?" → Step 2a: confirm specialty is oncology, ask about which drugs (immunotherapy biologics, platinum agents, taxanes). Propose ~10 candidates:
cisplatin
,
paclitaxel
,
pembrolizumab
,
nivolumab
,
carboplatin
,
docetaxel
,
bevacizumab
,
trastuzumab
,
cetuximab
,
pemetrexed
. Write
term_seed.csv
with all
entity_category=drug
. Step 2b: brief
/data-designer
for 4 context variants each = 40 sentences. Step 2c: MW lookup for each — biologics like
pembrolizumab
will likely fall to
magpie_g2p
; platinum agents likely hit MW. Step 2d: synthesize one QA wav per term, walk the user through the
pembrolizumab
etc. clips, propose IPA candidates with
-mab
suffix stress patterns. Step 2e: on approval, run 10 terms × 2 voices × 2 noise levels × 3 contexts = 120 rows.
Scenario B — appending to an existing cycle. User: "I have a cycle-1 manifest and I want to add 5 more procedures." → Re-run only Steps 2a (specialty interview just for the new terms), 2b (sentence gen for the additions), 2c (IPA pipeline for the additions), 2d (audition the new terms), and 2e (synthesize only the new term rows). Append to the existing
manifest.jsonl
. Do not regenerate audio for existing terms — cycle isolation is intentional so leaderboards diff cycle N vs cycle N+1 cleanly.

Artifacts produced

  • term_seed.csv
    — curated terms with
    entity_category
  • pronunciation_overrides.csv
    — verified IPA, appendable across cycles
  • manifest.jsonl
    — NeMo format with clinical extension fields (one JSON object per line)
  • audio/<slug>.wav
    — synthesized clips, one per manifest row

Troubleshooting

  • TTS rate-limit drops (
    RESOURCE_EXHAUSTED
    )
    on >100-row generation → expected on Magpie NVCF. Confirm exponential backoff is active in
    /read-aloud
    ; expect ~5–10% drops on big runs and re-run for the gaps.
  • Every row's
    ipa_source
    is
    magpie_g2p
    → MW lookup is failing across the board, or candidate IPAs are failing phoneme validation. Re-verify whichever MW path you configured (
    DICTIONARY_API_KEY
    for A; HTTPS reachability + parser for B), then check candidates against Magpie's en-US phoneme inventory.
  • Magpie mispronounces a term even with the IPA override → first verify the IPA is in the Magpie en-US phoneme inventory and the SSML wrapping is syntactically valid. If both check out, the underlying TTS bug is owned by
    /read-aloud
    (
    /riva-tts
    ) — route there for diagnosis. This skill provides the override mechanism but does not own the neural G2P or SSML parser.
  • Sentence variants from
    /data-designer
    are bland / template-like
    → check the brief; the schema-only prompt sometimes produces stereotyped output. Add 1–2 in-context examples to the brief and re-run.
  • Audio files exist but
    manifest.jsonl
    is short
    → manifest writer skipped rows whose synthesis returned an NVCF error. Re-run the build with only the missing rows.
For anything not in this list, identify which upstream skill is implicated and route there. The
digital-health-clinical-asr-build
skill owns the methodology, not the TTS or DataDesigner internals.

Limitations

  • English-only by default. Magpie's en-US phoneme inventory is what the two-tier IPA pipeline validates against. Other locales need a different upstream phoneme set + override CSV format.
  • Six fixed entity categories. Extending
    entity_category
    is a deliberate methodology change, not a one-off tweak — KER breakdowns, leaderboard sections, and downstream finetune scripts all key off the vocabulary.
  • Tiny first cycles. Below ~20 terms, the by-
    ipa_source
    leaderboard split won't have enough rows in each bucket to be statistically meaningful. Build a meaningful cycle even if it costs a session.
  • Magpie NVCF rate-limits. ~5–10% drops on large jobs; budget a re-run pass.

Next steps

  • Forward:
    /digital-health-clinical-asr-eval
    — transcribe the manifest, score WER/CER/KER/SER, produce the five-section leaderboard.
  • Back to setup (if anything in the env is broken):
    /digital-health-clinical-asr-setup
    .
  • Lateral for TTS-specific debugging:
    /read-aloud
    or
    /riva-tts
    .
References

  • references/manifest-schema.md
    — NeMo canonical fields + clinical extension; pre-flight schema and audio-existence checks; cross-cycle stability rules