genvarloader
GenVarLoader Public API
GenVarLoader (`gvl`) reconstructs personalized haplotypes and re-aligns functional genomic tracks on the fly from a reference + variants + BigWig/Table tracks, without writing personalized genomes to disk. Variable-length output is the norm (indels make lengths region- and sample-dependent).

This skill is a pointer-dense overview. Symbol names link to where to find the authoritative docstring or source.
End-to-end shape

```python
import genvarloader as gvl

# 1. Preprocess variants outside Python (see "Variant preprocessing requirements")

# 2. Write the dataset
gvl.write(
    path="ds.gvl",
    bed="rois.bed",
    variants="normed.bcf",  # or .pgen, or .svar directory
    tracks=[gvl.BigWigs.from_table("signal", "bw_table.tsv")],
    max_jitter=128,
)

# 3. Open and configure (chainable fluent API)
ds = (
    gvl.Dataset.open("ds.gvl", reference="ref.fa")
    .with_seqs("haplotypes")
    .with_tracks(["signal"])
    .with_insertion_fill(gvl.Repeat5pNormalized())
    .with_len(2048)  # or "ragged" / "variable"
    .with_settings(jitter=32, deterministic=False)
)

# 4. Eager indexing: dataset[region_idx, sample_idx]
batch = ds[0:8, :]  # shape depends on with_* state; see "Output modes"
```

Variant preprocessing requirements
Variants passed to `gvl.write` must be left-aligned, bi-allelic, and atomized (no MNPs or compound MNP-indels). VCFs must be indexed.

```bash
# VCF/BCF
bcftools norm -f ref.fa \
    -a --atom-overlaps . \
    -m -any --multi-overlaps . \
    -O b -o normed.bcf in.vcf.gz
bcftools index normed.bcf

# PGEN
plink2 --make-bpgen --pfile in --out tmp
plink2 --make-pgen --normalize --ref-from-fa --fa ref.fa --bpfile tmp --out normed
```

See `docs/source/write.md` for the canonical recipe and BED/BigWig table layouts.
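
As a rough illustration of what "bi-allelic and atomized" means for a single record, the sketch below classifies one REF/ALT pair. The helper name `is_atomized_biallelic` is hypothetical and is not part of GenVarLoader, bcftools, or plink2. It assumes VCF-style alleles in which an indel shares its first (anchor) base between REF and ALT, and it cannot check left-alignment, since that requires the reference sequence.

```python
def is_atomized_biallelic(ref: str, alt: str) -> bool:
    """Return True for a single-ALT SNP or a simple anchored indel (illustrative only)."""
    ref, alt = ref.upper(), alt.upper()
    # Reject missing alleles, multi-allelic ALT ("G,T"), and symbolic alleles ("<DEL>", "*").
    if not ref or not alt or not set(ref + alt) <= set("ACGTN"):
        return False
    if len(ref) == 1 and len(alt) == 1:
        return ref != alt  # SNP
    # Simple indel: the shorter allele is exactly the anchor base of the longer one.
    shorter, longer = sorted((ref, alt), key=len)
    return len(shorter) == 1 and len(longer) > 1 and longer[0] == shorter
```

A check like this could be run over a handful of records as a sanity test after normalization; it does not replace the `bcftools norm` or `plink2 --normalize` step.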
When to use SVAR vs BCF/PGEN
An `.svar` directory is written by `genoray` and can be passed directly as `gvl.write(variants="x.svar")`.

Use SVAR when:
- You need allele-frequency filtering at read time (`Dataset.open(min_af=..., max_af=...)` requires SVAR-backed genotypes and will raise otherwise).
- Many datasets share the same variant source: SVAR avoids duplicating `variant_idxs.npy`/`dosages.npy`/`variants.arrow` into each `.gvl` directory.
- You're working at population scale and want compact on-disk variant storage.

Use BCF/PGEN directly when you have a one-off dataset and don't need AF filtering.

Create an SVAR from a normalized VCF/PGEN with `genoray`:

```python
from genoray._svar import dense2sparse
from genoray import VCF

dense2sparse(VCF("normed.bcf"), "normed.svar")  # writes a .svar/ directory
```

SVARs are resolved at `Dataset.open` time via `metadata.json` → `svar=` caller arg → recorded relative path → recorded absolute path → sibling `*.svar`. See `docs/source/format.md` ("SVAR resolution at open time") and `_dataset/_svar_link.py`. Legacy symlink-based SVAR layouts: run `gvl.migrate_svar_link(path)` once to upgrade.
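
The fallback chain for locating an SVAR can be pictured as "first existing candidate wins". The sketch below is not the code in `_dataset/_svar_link.py`. The function name `resolve_svar`, its parameters, and the `exists` hook are hypothetical, and it assumes that recorded relative paths are relative to the dataset directory and that "sibling" means a `*.svar` entry next to the `.gvl` directory.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_svar(
    ds_dir,
    svar=None,
    recorded_rel=None,
    recorded_abs=None,
    exists: Callable[[Path], bool] = Path.is_dir,
) -> Optional[Path]:
    """Return the first candidate SVAR location that exists, or None (illustrative only)."""
    ds_dir = Path(ds_dir)
    candidates = []
    if svar is not None:  # explicit caller argument
        candidates.append(Path(svar))
    if recorded_rel is not None:  # assumed relative to the dataset directory
        candidates.append(ds_dir / recorded_rel)
    if recorded_abs is not None:
        candidates.append(Path(recorded_abs))
    if ds_dir.parent.is_dir():  # assumed meaning of "sibling *.svar"
        candidates.extend(sorted(ds_dir.parent.glob("*.svar")))
    for candidate in candidates:
        if exists(candidate):
            return candidate
    return None
```

Whether the real resolver falls through or raises when an explicit `svar=` path is missing is not stated here; this sketch simply falls through.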
gvl.write: key arguments

```python
gvl.write(
    path, bed, variants=None, tracks=None,
    samples=None, max_jitter=None, overwrite=False,
    max_mem="4g", extend_to_length=True,
)
```

Notable:
- `bed`: path or polars DataFrame with `chrom, chromStart, chromEnd` (0-based). Optional `strand` (`+`/`-`/`.`) controls reverse-complement on read. Extra columns are preserved on `Dataset.regions`.
- `tracks`: a `gvl.BigWigs`, `gvl.Table`, or a list of them. Each must have a unique `.name`. BigWigs need a sample→path mapping (dict or table with `sample`, `path` columns; see `BigWigs.from_table`).
- `max_jitter`: max read-time jitter; pads stored data on both sides of every region by this many bases so `Dataset.with_settings(jitter=j)` works for any `j <= max_jitter`.
- `extend_to_length=True` keeps reading past the BED end until every haplotype is ≥ the region length (matters when deletions would shorten output); set `False` for faster writes if shorter haps are acceptable.
- Inner-joins samples across `variants` and all `tracks`.

Source: `python/genvarloader/_dataset/_write.py`.
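
The `strand` behavior described for `bed` can be illustrated with plain strings. This is a minimal sketch, not GenVarLoader's implementation: the helper `orient` is hypothetical, and it only mirrors the stated rules that `.` is treated as `+` and that `-` regions are reverse-complemented when `rc_neg=True`.

```python
# Complement table for DNA, including N and lowercase (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def orient(seq: str, strand: str, rc_neg: bool = True) -> str:
    """Return seq as read for a BED row with the given strand (illustrative only)."""
    if strand not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand: {strand!r}")
    if rc_neg and strand == "-":
        # Complement every base, then reverse the order.
        return seq.translate(_COMPLEMENT)[::-1]
    return seq  # "+" and "." are returned unchanged
```

For example, `orient("AACG", "-")` gives `"CGTT"`, while `orient("AACG", ".")` leaves the sequence unchanged.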
Dataset.open: key arguments

```python
gvl.Dataset.open(
    path, reference=None, jitter=0, rng=None,
    deterministic=True, rc_neg=True,
    min_af=None, max_af=None,  # SVAR only
    region_names=None,
    splice_info=None,  # see "Spliced haplotypes"
    var_filter=None,  # None | "exonic"
    *, svar=None,
)
```

Without `reference=`, only variants/haplotypes are available (you can't produce reference-overlaid sequences). `svar=` overrides the recorded SVAR location.
Output modes: `with_seqs` × `with_tracks`

| `with_seqs(kind)` | Returns | Use when |
|---|---|---|
| | Reference sequence | Baseline / no personalization |
| `"haplotypes"` | Personalized haplotypes with indels | Standard variant-aware modeling |
| | | Need to map back to variants/ref coords |
| | | Variant-centric tasks |
| | No sequences | Tracks-only datasets |

`with_tracks(tracks=..., kind=...)`
- `tracks`: `None` (default), `False` (disable), a single name, or a list of names.
- `kind`: `"tracks"` (re-aligned numeric values) or `"intervals"` (raw interval representation).

`with_len(L)`
- `"ragged"` (default): returns `gvl.Ragged` (variable length per item).
- `"variable"`: NumPy array right-padded to the batch's longest item (`N` for seqs, `0` for tracks).
- integer `L`: fixed length; jitter/random shift/truncate/pad-with-more-personalized-data combine to meet `L`. Must satisfy `L + 2·jitter ≤ min(region_length) + 2·max_jitter`.

Returns either a `RaggedDataset` or `ArrayDataset` (frozen dataclass views) based on `with_len`. See `docs/source/dataset.md` for diagrams.
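
The `"variable"` mode's right-padding can be pictured with plain Python lists. The sketch below is only an illustration of padding every item to the longest one in the batch; `pad_right` is a hypothetical helper, and the real output is a NumPy array rather than nested lists.

```python
def pad_right(items, pad_value):
    """Right-pad each item to the length of the longest item in the batch."""
    longest = max((len(item) for item in items), default=0)
    return [list(item) + [pad_value] * (longest - len(item)) for item in items]


# Sequences are padded with "N", numeric tracks with 0, as described above.
padded_seqs = pad_right(["ACGT", "AC"], "N")
padded_tracks = pad_right([[1.0, 2.0], [3.0]], 0.0)
```

Because the pad length depends on the longest item, two batches drawn from the same dataset can have different widths in this mode.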
Track insertion fill (only when haps + tracks together)
Indels make track length differ from reference length. `Dataset.with_insertion_fill(fill)` controls what gets written into inserted positions. It is only valid when the dataset returns both haplotypes and tracks; calling it on a reference-only or haplotype-only dataset raises.

| Strategy | Behavior |
|---|---|
| | Repeat the value at variant POS across the insertion. |
| | Repeat |
| | Constant value (default NaN) across the insertion. |
| | Resample with replacement from a 2·flank_width+1 window around POS. |
| | Polynomial interp (order 1/2/3) between flanking reference values. |

Pass a single strategy (applies to every track) or a `dict[track_name, strategy]` (missing tracks fall back to `Repeat5p`). Source: `python/genvarloader/_dataset/_insertion_fill.py`.
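
To make the first and third behaviors in the table concrete, here is a minimal sketch on a plain list of track values. It is not the code in `_insertion_fill.py`: the function `fill_insertion`, its `strategy` strings, and the convention that `pos` is the 0-based index of the anchor base are all assumptions made for illustration.

```python
import math


def fill_insertion(track, pos, ins_len, strategy="repeat", constant=math.nan):
    """Insert ins_len values after index pos of a reference-aligned track (illustrative only)."""
    if strategy == "repeat":
        fill = [track[pos]] * ins_len  # repeat the value at the variant position
    elif strategy == "constant":
        fill = [constant] * ins_len  # same constant everywhere, NaN by default
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # The anchor value stays in place; the fill follows it.
    return track[: pos + 1] + fill + track[pos + 1 :]
```

With `track = [1.0, 2.0, 3.0]`, a 2-base insertion anchored at index 1 becomes `[1.0, 2.0, 2.0, 2.0, 3.0]` under the repeat rule, so the track keeps the same length as the personalized haplotype.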
Spliced haplotypes
Splicing is opt-in at `Dataset.open` (or via `with_settings`). It groups the BED rows for one transcript and concatenates exon-level sequences/tracks per sample.

```python
import genvarloader as gvl

splice_bed = gvl.get_splice_bed("annotation.gtf", transcript_support_level="1")
gvl.write(path="splice.gvl", bed=splice_bed, variants="normed.svar")
sds = gvl.Dataset.open(
    "splice.gvl",
    reference="ref.fa",
    splice_info=("transcript_id", "exon_number"),  # tuple = (group_col, order_col)
    var_filter="exonic",  # optional: drop intronic variants
)
```

`splice_info` accepts:
- a column name string (single grouping column, order inferred from BED row order), or
- a tuple `(group_col, order_col)` (explicit ordering, e.g. exon number).

The BED returned by `get_splice_bed` carries the `transcript_id` and `exon_number` columns used in the example. See `docs/source/splicing.ipynb`.
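
The grouping-and-concatenation idea behind `splice_info=(group_col, order_col)` can be sketched on toy data. This is an assumption-laden illustration, not GenVarLoader's implementation: `splice` is a hypothetical helper, BED rows are modeled as dicts, and exon sequences are plain strings for a single sample.

```python
from collections import defaultdict


def splice(rows, seqs, group_col="transcript_id", order_col="exon_number"):
    """Concatenate per-exon sequences into one sequence per transcript (illustrative only)."""
    grouped = defaultdict(list)
    for row, seq in zip(rows, seqs):
        grouped[row[group_col]].append((row[order_col], seq))
    # Sort each transcript's exons by the ordering column, then join them.
    return {
        transcript: "".join(seq for _, seq in sorted(parts, key=lambda part: part[0]))
        for transcript, parts in grouped.items()
    }


rows = [
    {"transcript_id": "t1", "exon_number": 2},
    {"transcript_id": "t2", "exon_number": 1},
    {"transcript_id": "t1", "exon_number": 1},
]
spliced = splice(rows, ["GG", "TT", "AA"])
```

Here `spliced` maps `t1` to `"AAGG"` and `t2` to `"TT"`: the BED stays a flat list of exons and the concatenation happens only when reading.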
RefDataset splicing
`gvl.RefDataset` takes a `splice_info` argument too, shown here with a single grouping column. Symbols referenced for this mode: `Dataset.open`, `(group_col, sort_col)`, `with_settings(splice_info=False)`, `output_length`, `{"ragged", "variable"}`, `jitter=0`, `deterministic=True`, `subset_to(transcript_ids)`.

```python
import genvarloader as gvl

ref = gvl.Reference.from_path("hg38.fa.bgz")
bed = gvl.get_splice_bed("annotations.gtf")
ref_ds = gvl.RefDataset(ref, bed, splice_info="transcript_id")
seqs = ref_ds[:]  # Ragged[S1], one row per transcript
```

Site-only variants (e.g. ClinVar)
Use `gvl.sites_vcf_to_table(vcf)` → `pl.DataFrame` (bi-allelic SNPs only), then wrap an `ArrayDataset[AnnotatedHaps, ...]` with `gvl.DatasetWithSites(ds, sites, max_variants_per_region=1)`. Returns `(wt_haps, mut_haps, flags[, tracks])`; flags encode applied / deleted-overlap / already-existing. See `_variants/_sitesonly.py`.
Other public surface (one-liners)
- `gvl.Reference.from_path(fasta, contigs=None)`: wrap a FASTA. Cached.
- `gvl.read_bedlike(path)` / `gvl.with_length(bed, L)`: BED helpers (re-exported from `seqpro`).
- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals`: ragged return containers.
- `gvl.to_nested_tensor(ragged)`: convert to a PyTorch nested tensor (requires `torch`).
- `gvl.get_dummy_dataset()`: small in-memory dataset for examples/tests.
- `gvl.RefDataset`: reference-only dataset (no genotypes).
- `gvl.Table`: generic interval track from a DataFrame.
- `gvl.data_registry.fetch(name)`: download public test/demo datasets.

Full list lives in `python/genvarloader/__init__.py` `__all__`.
On-disk layout (quick reference)

```
ds.gvl/
├── metadata.json          # version, samples, contigs, ploidy, max_jitter, svar_link?
├── input_regions.arrow    # BED + region index map
├── genotypes/             # variant_idxs.npy, dosages.npy, variants.arrow
│                          # (absent when sourced from .svar; see svar_link)
└── intervals/<track>/     # per-track interval data
```

See `docs/source/format.md` for the full schema, versioning, and SVAR-link details.
Where to look next

| For… | Read… |
|---|---|
| End-to-end RNA-seq example | |
| Splicing tutorial | `docs/source/splicing.ipynb` |
| Deep-learning eval pipeline | |
| BED / BigWig / bcftools recipes | `docs/source/write.md` |
| On-disk format + SVAR resolution | `docs/source/format.md` |
| FAQ | |
| Auto-generated reference | |
| Track re-alignment internals | |
| Insertion fill internals | `python/genvarloader/_dataset/_insertion_fill.py` |
| SVAR back-reference / migration | `_dataset/_svar_link.py` |

Common gotchas
- `with_insertion_fill` raises unless the dataset has both haplotypes AND tracks active.
- `min_af` / `max_af` raise unless the dataset is SVAR-backed.
- `with_len(L)` requires `L + 2·jitter ≤ min(region_length) + 2·max_jitter`; set `max_jitter` accordingly at `write` time.
- Tracks must have unique `.name`; the on-disk layout is `intervals/<name>/`.
- BED `strand` of `.` is treated as `+`. Reverse-complement happens automatically when `rc_neg=True` (default) and `strand == "-"`.
- Splicing is a read-time setting on a flat BED of exons; do not pre-concatenate exons before `gvl.write`.
- `extend_to_length=False` at write time will produce haplotypes shorter than the BED region when deletions are present; downstream code must tolerate `<` region length.
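
The `with_len(L)` inequality is easy to check before writing a dataset. The sketch below simply rearranges `L + 2·jitter ≤ min(region_length) + 2·max_jitter`; the helper names `max_fixed_length` and `fits` are hypothetical, and the extra `jitter <= max_jitter` guard reflects the `j <= max_jitter` condition stated for `max_jitter`.

```python
def max_fixed_length(region_lengths, jitter, max_jitter):
    """Largest L satisfying L + 2*jitter <= min(region_length) + 2*max_jitter."""
    if jitter > max_jitter:
        raise ValueError("jitter must not exceed max_jitter")
    return min(region_lengths) + 2 * max_jitter - 2 * jitter


def fits(length, region_lengths, jitter, max_jitter):
    """True if a fixed output length is compatible with the stored padding."""
    return length <= max_fixed_length(region_lengths, jitter, max_jitter)


# With max_jitter=128 and jitter=32, a shortest region of 1856 bp allows L up to 2048.
limit = max_fixed_length([2000, 1856], jitter=32, max_jitter=128)
```

Running the numbers this way before `gvl.write` shows how much `max_jitter` is needed for the fixed length and jitter planned at read time.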
Maintaining this skill
Whenever a PR changes the public API (anything in `python/genvarloader/__init__.py` `__all__`, or the docstring/signature of `gvl.write`, `Dataset.open`, or any `Dataset.with_*` method), the author must also update this `SKILL.md`. New public symbols, removed symbols, renamed args, changed defaults, and new output modes are all in scope. CLAUDE.md enforces this as part of the contribution checklist.