genvarloader

GenVarLoader Public API

GenVarLoader (`gvl`) reconstructs personalized haplotypes and re-aligns functional genomic tracks on the fly from a reference + variants + BigWig/Table tracks, without writing personalized genomes to disk. Variable-length output is the norm (indels make lengths region- and sample-dependent).
This skill is a pointer-dense overview. Symbol names link to where to find the authoritative docstring or source.

End-to-end shape

```python
import genvarloader as gvl

# 1. Preprocess variants outside Python (see "Variant preprocessing")

# 2. Write the dataset
gvl.write(
    path="ds.gvl",
    bed="rois.bed",
    variants="normed.bcf",  # or .pgen, or .svar directory
    tracks=[gvl.BigWigs.from_table("signal", "bw_table.tsv")],
    max_jitter=128,
)

# 3. Open and configure (chainable fluent API)
ds = (
    gvl.Dataset.open("ds.gvl", reference="ref.fa")
    .with_seqs("haplotypes")
    .with_tracks(["signal"])
    .with_insertion_fill(gvl.Repeat5pNormalized())
    .with_len(2048)  # or "ragged" / "variable"
    .with_settings(jitter=32, deterministic=False)
)

# 4. Eager indexing: dataset[region_idx, sample_idx]
batch = ds[0:8, :]  # shape depends on with_* state; see "Output shapes"
```

Variant preprocessing requirements

Variants passed to `gvl.write` must be left-aligned, bi-allelic, and atomized (no MNPs or compound MNP-indels). VCFs must be indexed.

```bash
# VCF/BCF
bcftools norm -f ref.fa \
  -a --atom-overlaps . \
  -m -any --multi-overlaps . \
  -O b -o normed.bcf in.vcf.gz
bcftools index normed.bcf

# PGEN
plink2 --make-bpgen --pfile in --out tmp
plink2 --make-pgen --normalize --ref-from-fa --fa ref.fa --bpfile tmp --out normed
```

See `docs/source/write.md` for the canonical recipe and BED/BigWig table layouts.

When to use SVAR vs BCF/PGEN

`.svar` is a sparse columnar variant archive (from `genoray`). Pass it to `gvl.write(variants="x.svar")` exactly like a BCF or PGEN; the resulting dataset stores a back-reference instead of duplicating per-variant arrays.
Use SVAR when:
  • You need allele-frequency filtering at read time (`Dataset.open(min_af=..., max_af=...)` requires SVAR-backed genotypes and raises otherwise).
  • Many datasets share the same variant source: SVAR avoids duplicating `variant_idxs.npy` / `dosages.npy` / `variants.arrow` into each `.gvl` directory.
  • You're working at population scale and want compact on-disk variant storage.
Use BCF/PGEN directly when you have a one-off dataset and don't need AF filtering.
Create an SVAR from a normalized VCF/PGEN with `genoray`:
```python
from genoray._svar import dense2sparse
from genoray import VCF

dense2sparse(VCF("normed.bcf"), "normed.svar")  # writes a .svar/ directory
```
SVARs are resolved at `Dataset.open` time via `metadata.json` → caller `svar=` arg → recorded relative path → recorded absolute path → sibling `*.svar`. See `docs/source/format.md` ("SVAR resolution at open time") and `_dataset/_svar_link.py`. Legacy symlink-based SVAR layouts: run `gvl.migrate_svar_link(path)` once to upgrade.

`gvl.write`: key arguments

```python
gvl.write(
    path, bed, variants=None, tracks=None,
    samples=None, max_jitter=None, overwrite=False,
    max_mem="4g", extend_to_length=True,
)
```
Notable:
  • `bed`: path or polars DataFrame with `chrom, chromStart, chromEnd` (0-based). Optional `strand` (`+` / `-` / `.`) controls reverse-complement on read. Extra columns are preserved on `Dataset.regions`.
  • `tracks`: a `gvl.BigWigs`, `gvl.Table`, or a list of them. Each must have a unique `.name`. BigWigs need a sample→path mapping (dict or table with `sample`, `path` columns; see `BigWigs.from_table`).
  • `max_jitter`: max read-time jitter; pads stored data on both sides of every region by this many bases so `Dataset.with_settings(jitter=j)` works for any `j <= max_jitter`.
  • `extend_to_length=True` keeps reading past the BED end until every haplotype is ≥ the region length (matters when deletions would shorten output); set `False` for faster writes if shorter haps are acceptable.
  • Inner-joins samples across `variants` and all `tracks`.
Source: `python/genvarloader/_dataset/_write.py`.
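To make the `bed` coordinate convention and the sample inner-join concrete, here is a minimal pure-Python sketch. It does not call GenVarLoader: `to_bed_interval` and `inner_join_samples` are hypothetical helpers written for illustration, and the sample order that GenVarLoader itself keeps after the join is not specified here.

```python
def to_bed_interval(start_1based: int, end_1based_inclusive: int) -> tuple[int, int]:
    """Convert a 1-based, inclusive interval (VCF/GTF style) to BED's
    0-based, half-open (chromStart, chromEnd)."""
    if start_1based < 1 or end_1based_inclusive < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based_inclusive


def inner_join_samples(*sources: list[str]) -> list[str]:
    """Keep only samples present in every source, in the order of the first source."""
    if not sources:
        return []
    common = set(sources[0]).intersection(*sources[1:])
    return [s for s in sources[0] if s in common]


# Bases 101..200 (1-based, inclusive) become chromStart=100, chromEnd=200.
chrom_start, chrom_end = to_bed_interval(101, 200)

# Hypothetical sample lists from a variant file and two tracks.
variant_samples = ["NA1", "NA2", "NA3"]
signal_samples = ["NA2", "NA3", "NA4"]
table_samples = ["NA3", "NA2"]
kept = inner_join_samples(variant_samples, signal_samples, table_samples)
print(chrom_start, chrom_end, kept)
```

Only `NA2` and `NA3` appear in all three lists, so a sample that is missing from any one source drops out of the joined set.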

`Dataset.open`: key arguments

```python
gvl.Dataset.open(
    path, reference=None, jitter=0, rng=None,
    deterministic=True, rc_neg=True,
    min_af=None, max_af=None,           # SVAR only
    region_names=None,
    splice_info=None,                    # see "Spliced haplotypes"
    var_filter=None,                     # None | "exonic"
    *, svar=None,
)
```
Without `reference=`, only variants/haplotypes are available (you can't produce reference-overlaid sequences). `svar=` overrides the recorded SVAR location.

Output modes: `with_seqs` × `with_tracks`

`with_seqs(kind)` selects the sequence output channel:

| `kind` | Returns | Use when |
| --- | --- | --- |
| `"reference"` | Reference sequence (`S1`) | Baseline / no personalization |
| `"haplotypes"` | Personalized haplotypes with indels (`S1`) | Standard variant-aware modeling |
| `"annotated"` | `AnnotatedHaps` (haps + var_idxs + ref_coords) | Need to map back to variants/ref coords |
| `"variants"` | `RaggedVariants` (variants only, no seq) | Variant-centric tasks |
| `None` | No sequences | Tracks-only datasets |

`with_tracks(tracks=..., kind=...)` selects tracks:
  • `tracks`: `None` (default), `False` (disable), a single name, or a list of names.
  • `kind`: `"tracks"` (re-aligned numeric values) or `"intervals"` (raw interval representation).
`with_len(L)` controls output shape:
  • `"ragged"` (default): returns `gvl.Ragged` (variable length per item).
  • `"variable"`: NumPy array right-padded to the batch's longest item (`N` for seqs, `0` for tracks).
  • integer `L`: fixed length; jitter/random shift/truncate/pad-with-more-personalized-data combine to meet `L`. Must satisfy `L + 2·jitter ≤ min(region_length) + 2·max_jitter`.
Returns either a `RaggedDataset` or `ArrayDataset` (frozen dataclass views) based on `with_len`. See `docs/source/dataset.md` for diagrams.
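The `with_len` rules can be illustrated without GenVarLoader. The following is a minimal pure-Python sketch, not the library's implementation: `fits_fixed_length` and `right_pad` are hypothetical helpers that restate the fixed-length constraint and mimic `"variable"`-style padding on plain strings and lists.

```python
def fits_fixed_length(L: int, jitter: int, region_lengths: list[int], max_jitter: int) -> bool:
    """Check L + 2*jitter <= min(region_length) + 2*max_jitter."""
    return L + 2 * jitter <= min(region_lengths) + 2 * max_jitter


def right_pad(items, pad_value):
    """Right-pad every item to the length of the longest item in the batch."""
    longest = max(len(item) for item in items)
    return [list(item) + [pad_value] * (longest - len(item)) for item in items]


# Regions of 2000 and 2100 bp written with max_jitter=128:
# 2048 + 2*32 = 2112 <= 2000 + 2*128 = 2256, so a 2048-long output is allowed.
ok = fits_fixed_length(2048, jitter=32, region_lengths=[2000, 2100], max_jitter=128)

# With max_jitter=0 the same request fails: 2112 > 2000.
too_long = fits_fixed_length(2048, jitter=32, region_lengths=[2000, 2100], max_jitter=0)

# "variable"-style padding: N for sequences, 0 for tracks.
seqs = right_pad(["ACGT", "AC"], "N")
tracks = right_pad([[1.0, 2.0, 3.0], [4.0]], 0.0)
print(ok, too_long, seqs, tracks)
```

The second call shows why `max_jitter` has to be chosen at write time: the stored padding is what gives a fixed-length request room beyond the shortest region.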

Track insertion fill (only when haps + tracks together)

Indels make track length differ from reference length. `Dataset.with_insertion_fill(fill)` controls what gets written into inserted positions. It is only valid when the dataset returns both haplotypes and tracks; calling it on a pure-reference or pure-haplotype dataset raises.

| Strategy | Behavior |
| --- | --- |
| `gvl.Repeat5p()` (default) | Repeat the value at variant POS across the insertion. |
| `gvl.Repeat5pNormalized()` | Repeat `track[POS] / (insertion_len + 1)`. Preserves sum. |
| `gvl.Constant(value=nan)` | Constant value (default NaN) across the insertion. |
| `gvl.FlankSample(flank_width=5)` | Resample with replacement from a 2·flank_width+1 window around POS. |
| `gvl.Interpolate(order=1)` | Polynomial interp (order 1/2/3) between flanking reference values. |

Pass a single strategy (applies to every track) or a `dict[track_name, strategy]` (missing tracks fall back to `Repeat5p`). Source: `python/genvarloader/_dataset/_insertion_fill.py`.
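As a rough illustration of the two repeat-based strategies, the sketch below re-implements the idea on a plain Python list. It is not GenVarLoader's code: `fill_insertion` is a hypothetical helper, `pos` is a 0-based index into the track, and it assumes that an insertion of length `k` turns the single value at `pos` into `k + 1` output values (the anchor base plus the inserted bases).

```python
def fill_insertion(track: list[float], pos: int, insertion_len: int, normalized: bool) -> list[float]:
    """Expand track[pos] into insertion_len + 1 values.

    normalized=False mimics a Repeat5p-style fill (repeat the value at pos).
    normalized=True mimics a Repeat5pNormalized-style fill, dividing by
    insertion_len + 1 so the total signal is unchanged.
    """
    value = track[pos]
    if normalized:
        value = value / (insertion_len + 1)
    expanded = [value] * (insertion_len + 1)
    return track[:pos] + expanded + track[pos + 1:]


ref_track = [1.0, 6.0, 2.0]  # one value per reference base

repeated = fill_insertion(ref_track, pos=1, insertion_len=2, normalized=False)
normalized = fill_insertion(ref_track, pos=1, insertion_len=2, normalized=True)

print(repeated)    # [1.0, 6.0, 6.0, 6.0, 2.0]
print(normalized)  # [1.0, 2.0, 2.0, 2.0, 2.0]
print(sum(ref_track), sum(normalized))  # 9.0 9.0
```

The plain repeat inflates the total signal by the inserted length, while the normalized variant keeps the sum equal to the reference track, which matters for count-like tracks.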

Spliced haplotypes

Splicing is opt-in at `Dataset.open` (or via `with_settings`). It groups the BED rows for one transcript and concatenates exon-level sequences/tracks per sample.
```python
splice_bed = gvl.get_splice_bed("annotation.gtf", transcript_support_level="1")
gvl.write(path="splice.gvl", bed=splice_bed, variants="normed.svar")

sds = gvl.Dataset.open(
    "splice.gvl",
    reference="ref.fa",
    splice_info=("transcript_id", "exon_number"),  # tuple = (group_col, order_col)
    var_filter="exonic",                            # optional: drop intronic variants
)
```
`splice_info` accepts:
  • a column name string (single grouping column, order inferred from BED row order), or
  • a `(group_col, order_col)` tuple (explicit ordering, e.g. exon number).
`get_splice_bed` does GTF→BED with TSL filtering and an optional "CDS length multiple of 3" filter. To roll your own splice BED, just include `transcript_id` (or any grouping column) and `exon_number` columns on the BED. See `docs/source/splicing.ipynb`.
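Conceptually, splicing is a group, sort, and concatenate over exon rows. The sketch below shows that idea on a toy reference string with plain Python. It is not how GenVarLoader implements it: the `splice` helper and the row dictionaries are hypothetical, and strand handling and variants are left out.

```python
from collections import defaultdict


def splice(reference: str, exon_rows: list[dict], group_col: str, order_col: str) -> dict[str, str]:
    """Concatenate exon sequences per group, ordered by order_col.

    Rows use BED-style 0-based, half-open chromStart/chromEnd.
    """
    groups = defaultdict(list)
    for row in exon_rows:
        groups[row[group_col]].append(row)
    spliced = {}
    for name, rows in groups.items():
        rows = sorted(rows, key=lambda r: r[order_col])
        spliced[name] = "".join(reference[r["chromStart"]:r["chromEnd"]] for r in rows)
    return spliced


reference = "AAACCCGGGTTT"
exon_rows = [
    # Deliberately out of order to show the effect of order_col.
    {"chromStart": 6, "chromEnd": 9, "transcript_id": "tx1", "exon_number": 2},
    {"chromStart": 0, "chromEnd": 3, "transcript_id": "tx1", "exon_number": 1},
    {"chromStart": 3, "chromEnd": 6, "transcript_id": "tx2", "exon_number": 1},
]

transcripts = splice(reference, exon_rows, "transcript_id", "exon_number")
print(transcripts)  # {'tx1': 'AAAGGG', 'tx2': 'CCC'}
```

Passing only a grouping column corresponds to skipping the sort and trusting the row order, which is why a hand-made splice BED should either be pre-sorted or carry an explicit ordering column.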

RefDataset splicing

`gvl.RefDataset` accepts the same `splice_info` argument as `Dataset.open`. Pass either a transcript-ID column name (rows already in splice order) or a `(group_col, sort_col)` tuple to reorder exons. `with_settings(splice_info=False)` disables splicing on an existing `RefDataset`; pass a new value to re-enable. Splicing requires `output_length` in `{"ragged", "variable"}`, `jitter=0`, and `deterministic=True`. `subset_to(transcript_ids)` works the same as for `Dataset`.
```python
ref = gvl.Reference.from_path("hg38.fa.bgz")
bed = gvl.get_splice_bed("annotations.gtf")
ref_ds = gvl.RefDataset(ref, bed, splice_info="transcript_id")
seqs = ref_ds[:]  # Ragged[S1], one row per transcript
```

Site-only variants (e.g. ClinVar)

Use `gvl.sites_vcf_to_table(vcf)` to get a `pl.DataFrame` of sites (bi-allelic SNPs only), then wrap an `ArrayDataset[AnnotatedHaps, ...]` with `gvl.DatasetWithSites(ds, sites, max_variants_per_region=1)`. It returns `(wt_haps, mut_haps, flags[, tracks])`; the flags encode applied / deleted-overlap / already-existing. See `_variants/_sitesonly.py`.

Other public surface (one-liners)

  • `gvl.Reference.from_path(fasta, contigs=None)`: wrap a FASTA. Cached.
  • `gvl.read_bedlike(path)` / `gvl.with_length(bed, L)`: BED helpers (re-exported from `seqpro`).
  • `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals`: ragged return containers.
  • `gvl.to_nested_tensor(ragged)`: convert to a PyTorch nested tensor (requires `torch`).
  • `gvl.get_dummy_dataset()`: small in-memory dataset for examples/tests.
  • `gvl.RefDataset`: reference-only dataset (no genotypes).
  • `gvl.Table`: generic interval track from a DataFrame.
  • `gvl.data_registry.fetch(name)`: download public test/demo datasets.
Full list lives in `python/genvarloader/__init__.py` `__all__`.

On-disk layout (quick reference)

```
ds.gvl/
├── metadata.json          # version, samples, contigs, ploidy, max_jitter, svar_link?
├── input_regions.arrow    # BED + region index map
├── genotypes/             # variant_idxs.npy, dosages.npy, variants.arrow
│                          # (absent when sourced from .svar; see svar_link)
└── intervals/<track>/     # per-track interval data
```
See `docs/source/format.md` for the full schema, versioning, and SVAR-link details.
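A quick way to sanity-check a dataset directory against this layout is to inspect it with the standard library. The sketch below is an illustration under assumptions, not a GenVarLoader API: `describe_gvl_dir` is a hypothetical helper that relies only on the file and directory names listed in the layout, and it reports the top-level keys of `metadata.json` without assuming what they are called.

```python
import json
import tempfile
from pathlib import Path


def describe_gvl_dir(path: str) -> dict:
    """Summarize a .gvl directory using only the names from the layout."""
    root = Path(path)
    metadata = json.loads((root / "metadata.json").read_text())
    intervals = root / "intervals"
    if intervals.is_dir():
        tracks = sorted(p.name for p in intervals.iterdir() if p.is_dir())
    else:
        tracks = []
    return {
        "metadata_keys": sorted(metadata),
        # genotypes/ is absent when variants come from a linked .svar
        "has_genotypes": (root / "genotypes").is_dir(),
        "tracks": tracks,
    }


# Build a throwaway directory that mimics the layout (the contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ds.gvl"
    (root / "intervals" / "signal").mkdir(parents=True)
    (root / "metadata.json").write_text(json.dumps({"samples": ["NA1"], "max_jitter": 128}))
    summary = describe_gvl_dir(str(root))

print(summary)
```

For anything beyond a quick look, `docs/source/format.md` remains the authoritative description of the schema.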

Where to look next

| For… | Read… |
| --- | --- |
| End-to-end RNA-seq example | `docs/source/geuvadis.ipynb` |
| Splicing tutorial | `docs/source/splicing.ipynb` |
| Deep-learning eval pipeline | `docs/source/basenji2_eval.ipynb` |
| BED / BigWig / bcftools recipes | `docs/source/write.md` |
| `Dataset` shapes, ragged/var/fixed | `docs/source/dataset.md` |
| On-disk format + SVAR resolution | `docs/source/format.md` |
| FAQ (`with_*` design, typing) | `docs/source/faq.md` |
| Auto-generated reference | `docs/source/api.md`, https://genvarloader.readthedocs.io |
| Track re-alignment internals | `python/genvarloader/_dataset/_tracks.py`, `_reconstruct.py` |
| Insertion fill internals | `python/genvarloader/_dataset/_insertion_fill.py` |
| SVAR back-reference / migration | `python/genvarloader/_dataset/_svar_link.py` |

Common gotchas

  • `with_insertion_fill` raises unless the dataset has both haplotypes AND tracks active.
  • `min_af` / `max_af` raise unless the dataset is SVAR-backed.
  • `with_len(L)` requires `L + 2·jitter ≤ min(region_length) + 2·max_jitter`, so set `max_jitter` accordingly at `write` time.
  • Tracks must have unique `.name`; the on-disk layout is `intervals/<name>/`.
  • BED `strand` of `.` is treated as `+`. Reverse-complement happens automatically when `rc_neg=True` (default) and `strand == "-"`.
  • Splicing is a read-time setting on a flat BED of exons; do not pre-concatenate exons before `gvl.write`.
  • `extend_to_length=False` at write time will produce haplotypes shorter than the BED region when deletions are present; downstream code must tolerate lengths `<` region length.
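The strand rule in the list above amounts to a small conditional. Here is a minimal sketch of it on plain DNA strings; `reverse_complement` and `orient` are hypothetical helpers written for illustration, not part of GenVarLoader.

```python
# Complement table for DNA, including N and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq: str, strand: str, rc_neg: bool = True) -> str:
    """Return seq as read for the given BED strand.

    "." is handled like "+"; only "-" with rc_neg=True is reverse-complemented.
    """
    if strand not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand: {strand!r}")
    if rc_neg and strand == "-":
        return reverse_complement(seq)
    return seq


print(orient("AACGN", "+"))                # AACGN
print(orient("AACGN", "."))                # AACGN
print(orient("AACGN", "-"))                # NCGTT
print(orient("AACGN", "-", rc_neg=False))  # AACGN
```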

Maintaining this skill

Whenever a PR changes the public API (anything in `python/genvarloader/__init__.py` `__all__`, or the docstring/signature of `gvl.write`, `Dataset.open`, or any `Dataset.with_*` method), the author must also update this `SKILL.md`. New public symbols, removed symbols, renamed args, changed defaults, and new output modes are all in scope. CLAUDE.md enforces this as part of the contribution checklist.