Use when writing or reading GenVarLoader (gvl) datasets — preparing VCF/PGEN/SVAR variant sources with bcftools/plink2, calling gvl.write, configuring gvl.Dataset for haplotype/reference/annotated/variants output modes, attaching BigWig or Table tracks, setting up spliced haplotypes from a GTF, choosing track insertion-fill strategies for indels, or filtering variants by allele frequency.
npx skill4agent add mcvickerlab/genvarloader genvarloader

import genvarloader as gvl
# 1. Preprocess variants outside Python (see "Variant preprocessing")
# 2. Write the dataset
gvl.write(
    path="ds.gvl",
    bed="rois.bed",
    variants="normed.bcf",  # or .pgen, or .svar directory
    tracks=[gvl.BigWigs.from_table("signal", "bw_table.tsv")],
    max_jitter=128,
)
# 3. Open and configure (chainable fluent API)
ds = (
    gvl.Dataset.open("ds.gvl", reference="ref.fa")
    .with_seqs("haplotypes")
    .with_tracks(["signal"])
    .with_insertion_fill(gvl.Repeat5pNormalized())
    .with_len(2048)  # or "ragged" / "variable"
    .with_settings(jitter=32, deterministic=False)
)
# 4. Eager indexing: dataset[region_idx, sample_idx]
batch = ds[0:8, :]  # shape depends on with_* state; see "Output shapes"

Variant preprocessing (before gvl.write)

# VCF/BCF
bcftools norm -f ref.fa \
    -a --atom-overlaps . \
    -m -any --multi-overlaps . \
    -O b -o normed.bcf in.vcf.gz
bcftools index normed.bcf
# PGEN
plink2 --make-bpgen --pfile in --out tmp
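plink2 --make-pgen --normalize --ref-from-fa --fa ref.fa --bpfile tmp --out normed

See docs/source/write.md.

SVAR (.svar, produced with genoray) is passed as gvl.write(variants="x.svar"). Dataset.open(min_af=..., max_af=...) applies to SVAR sources only, and variant_idxs.npy, dosages.npy, and variants.arrow are absent from the .gvl directory when the dataset is sourced from a .svar. Convert with genoray:

from genoray._svar import dense2sparse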
from genoray import VCF
dense2sparse(VCF("normed.bcf"), "normed.svar")  # writes a .svar/ directory

SVAR resolution (Dataset.open): metadata.json, svar=, *.svar; see docs/source/format.md and _dataset/_svar_link.py. Migration: gvl.migrate_svar_link(path).

gvl.write

gvl.write(
    path, bed, variants=None, tracks=None,
    samples=None, max_jitter=None, overwrite=False,
    max_mem="4g", extend_to_length=True,
)

bed: chrom, chromStart, chromEnd; strand (+, -, .); Dataset.regions
tracks: gvl.BigWigs, gvl.Table; .name; sample, path (BigWigs.from_table)
max_jitter: Dataset.with_settings(jitter=j) requires j <= max_jitter
extend_to_length: True, False
variants, tracks: python/genvarloader/_dataset/_write.py

Dataset.open

gvl.Dataset.open(
    path, reference=None, jitter=0, rng=None,
    deterministic=True, rc_neg=True,
    min_af=None, max_af=None,  # SVAR only
    region_names=None,
    splice_info=None,  # see "Spliced haplotypes"
    var_filter=None,  # None | "exonic"
    *, svar=None,
)

reference=, svar=

with_seqs / with_tracks

| with_seqs(kind) | Returns | Use when |
|---|---|---|
| "reference" | Reference sequence | Baseline / no personalization |
| "haplotypes" | Personalized haplotypes with indels | Standard variant-aware modeling |
| "annotated" | | Need to map back to variants/ref coords |
| "variants" | | Variant-centric tasks |
| None | No sequences | Tracks-only datasets |
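
To make the "haplotypes" mode concrete, here is a minimal pure-Python sketch of how applying SNPs and indels to a reference window yields a personalized haplotype whose length can differ from the reference. It is an illustration only and not GenVarLoader's implementation; the function name `apply_variants` and the `(pos, ref, alt)` tuple layout are assumptions made for this example.

```python
def apply_variants(reference: str, variants: list[tuple[int, str, str]]) -> str:
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    `pos` is 0-based relative to the start of `reference`.
    Hypothetical helper for illustration; not part of the genvarloader API.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match the reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged bases before the variant
        pieces.append(alt)                    # substitute the ALT allele
        cursor = pos + len(ref)               # skip over the REF allele
    pieces.append(reference[cursor:])         # remaining unchanged bases
    return "".join(pieces)


reference = "ACGTACGT"
variants = [
    (1, "C", "T"),    # SNP
    (3, "T", "TGG"),  # 2 bp insertion
    (5, "CG", "C"),   # 1 bp deletion
]
haplotype = apply_variants(reference, variants)
print(haplotype, len(reference), len(haplotype))
```

Because indels change the sequence length, a haplotype generally does not line up base-for-base with the reference, which is why the "annotated" mode exists for mapping back to variants and reference coordinates.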
with_tracks(tracks=..., kind=...)
tracks: None, False. kind: "tracks", "intervals".

with_len(L)
"ragged" (gvl.Ragged), "variable" (N, 0), or L, subject to L + 2·jitter ≤ min(region_length) + 2·max_jitter. Return types: RaggedDataset, ArrayDataset (with_len). See docs/source/dataset.md.

Dataset.with_insertion_fill(fill)

| Strategy | Behavior |
|---|---|
| | Repeat the value at variant POS across the insertion. |
| | Repeat |
| | Constant value (default NaN) across the insertion. |
| | Resample with replacement from a 2·flank_width+1 window around POS. |
| | Polynomial interp (order 1/2/3) between flanking reference values. |
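
The behaviors in the table can be pictured with a small NumPy sketch. It assumes a 1-D float track and an insertion of `ins_len` new positions placed directly after index `pos`; the function `fill_insertion` and its strategy labels are hypothetical names for this example and are not the genvarloader strategy classes. It covers three of the listed behaviors: repeating the value at POS, a constant fill, and order-1 (linear) interpolation between the flanking values.

```python
import numpy as np


def fill_insertion(track, pos, ins_len, strategy="repeat", constant=np.nan):
    """Return a copy of `track` with `ins_len` values inserted after index `pos`.

    Hypothetical helper for illustration; not part of the genvarloader API.
    """
    track = np.asarray(track, dtype=float)
    if strategy == "repeat":
        # Repeat the value at the variant position across the insertion.
        fill = np.full(ins_len, track[pos])
    elif strategy == "constant":
        # Use one constant value (NaN by default) across the insertion.
        fill = np.full(ins_len, constant)
    elif strategy == "linear":
        # Linear interpolation between the two flanking values; keep only the
        # interior points so the flanks themselves are not duplicated.
        left = track[pos]
        right = track[pos + 1] if pos + 1 < len(track) else track[pos]
        fill = np.linspace(left, right, ins_len + 2)[1:-1]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return np.concatenate([track[: pos + 1], fill, track[pos + 1:]])


track = np.array([1.0, 2.0, 4.0, 8.0])
repeated = fill_insertion(track, pos=1, ins_len=2, strategy="repeat")
constant = fill_insertion(track, pos=1, ins_len=2, strategy="constant")
linear = fill_insertion(track, pos=1, ins_len=3, strategy="linear")
print(repeated, constant, linear, sep="\n")
```

Repeating a value inflates the track's total signal across the insertion, a constant NaN marks the inserted positions as missing, and interpolation gives a smooth transition between the flanks.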
Per-track: dict[track_name, strategy]. Repeat5p. Source: python/genvarloader/_dataset/_insertion_fill.py

Spliced haplotypes (Dataset.open, with_settings)

splice_bed = gvl.get_splice_bed("annotation.gtf", transcript_support_level="1")
gvl.write(path="splice.gvl", bed=splice_bed, variants="normed.svar")
sds = gvl.Dataset.open(
    "splice.gvl",
    reference="ref.fa",
    splice_info=("transcript_id", "exon_number"),  # tuple = (group_col, order_col)
    var_filter="exonic",  # optional: drop intronic variants
)

splice_info: (group_col, order_col); with get_splice_bed output, transcript_id and exon_number. See docs/source/splicing.ipynb.

gvl.RefDataset

splice_info: as in Dataset.open, (group_col, sort_col); with_settings(splice_info=False). RefDataset: output_length in {"ragged", "variable"}, jitter=0, deterministic=True, subset_to(transcript_ids). Dataset.

ref = gvl.Reference.from_path("hg38.fa.bgz")
bed = gvl.get_splice_bed("annotations.gtf")
ref_ds = gvl.RefDataset(ref, bed, splice_info="transcript_id")
seqs = ref_ds[:]  # Ragged[S1], one row per transcript

Site-only variants: gvl.sites_vcf_to_table(vcf) (pl.DataFrame); ArrayDataset[AnnotatedHaps, ...]; gvl.DatasetWithSites(ds, sites, max_variants_per_region=1); (wt_haps, mut_haps, flags[, tracks]); _variants/_sitesonly.py

Other exports: gvl.Reference.from_path(fasta, contigs=None), gvl.read_bedlike(path), gvl.with_length(bed, L) (seqpro), gvl.Ragged, gvl.RaggedAnnotatedHaps, gvl.RaggedVariants, gvl.RaggedIntervals, gvl.to_nested_tensor(ragged) (torch), gvl.get_dummy_dataset(), gvl.RefDataset, gvl.Table, gvl.data_registry.fetch(name). See python/genvarloader/__init__.py (__all__).

On-disk format

ds.gvl/
├── metadata.json # version, samples, contigs, ploidy, max_jitter, svar_link?
├── input_regions.arrow # BED + region index map
├── genotypes/ # variant_idxs.npy, dosages.npy, variants.arrow
│ # (absent when sourced from .svar; see svar_link)
└── intervals/<track>/    # per-track interval data

See docs/source/format.md.

| For… | Read… |
|---|---|
| End-to-end RNA-seq example | |
| Splicing tutorial | |
| Deep-learning eval pipeline | |
| BED / BigWig / bcftools recipes | |
| | |
| On-disk format + SVAR resolution | |
| FAQ | |
| Auto-generated reference | |
| Track re-alignment internals | |
| Insertion fill internals | |
| SVAR back-reference / migration | |
with_insertion_fill
min_af, max_af: SVAR only
with_len(L): L + 2·jitter ≤ min(region_length) + 2·max_jitter; max_jitter is set at write
.name: intervals/<name>/
strand: ".", "+"; rc_neg=True; strand == "-"
gvl.write: extend_to_length=False, <
python/genvarloader/__init__.py (__all__): gvl.write, Dataset.open, Dataset.with_*
SKILL.md
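
The length constraint L + 2·jitter ≤ min(region_length) + 2·max_jitter can be checked before a fixed length is requested. The sketch below is plain arithmetic on that inequality; the helper names `max_fixed_length` and `length_is_valid` are hypothetical and not part of the genvarloader API, and the region lengths are made-up example values.

```python
def max_fixed_length(region_lengths, jitter, max_jitter):
    """Largest L with L + 2*jitter <= min(region_lengths) + 2*max_jitter.

    Hypothetical helper for illustration; not part of the genvarloader API.
    """
    if jitter > max_jitter:
        raise ValueError("jitter must not exceed the max_jitter used at write time")
    return min(region_lengths) + 2 * max_jitter - 2 * jitter


def length_is_valid(length, region_lengths, jitter, max_jitter):
    """True when `length` satisfies the constraint for every region."""
    return length <= max_fixed_length(region_lengths, jitter, max_jitter)


# Example values: three regions, written with max_jitter=128, read with jitter=32.
region_lengths = [2048, 2048, 3000]
limit = max_fixed_length(region_lengths, jitter=32, max_jitter=128)
print(limit, length_is_valid(2048, region_lengths, jitter=32, max_jitter=128))
```

With these numbers the shortest region (2048) plus twice the written jitter (256) minus twice the requested jitter (64) gives a limit of 2240, so a fixed length of 2048 fits.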