genoray-api


genoray public API


`genoray` is a NumPy-first range-query layer over VCF/BCF (cyvcf2), PGEN (pgenlib), and a sparse memmap format (`SparseVar` / `.svar`).

Public surface


`import genoray` exposes exactly:
- `genoray.VCF`: VCF/BCF reader
- `genoray.PGEN`: PLINK 2 PGEN reader
- `genoray.SparseVar`: sparse `.svar` reader/writer
- `genoray.Reader`: type alias for `VCF | PGEN | SparseVar`
- `genoray.exprs`: polars filter expressions for `.gvi` indexes

Nothing else is public. Anything starting with `_` (e.g. `genoray._vcf`) is internal; do not import it from user code.

Where to look for details


Prefer reading these over guessing:
- `docs/source/index.md`: narrative tour with full examples (VCF, PGEN, filtering, chunking)
- `docs/source/svar.md`: SparseVar usage
- `genoray/__init__.py`: confirms the public surface
- `genoray/_vcf.py`: the `VCF` class (constructor, `read`, `chunk`, and the mode constants near the top of the class)
- `genoray/_pgen.py`: the `PGEN` class (constructor, `read`, `chunk`, `read_ranges`, `chunk_ranges`, and the mode constants near the top of the class)
- `genoray/_svar.py`: `SparseVar` (`__init__`, `from_vcf`, `from_pgen`, `read_ranges`, `with_fields`)
- `genoray/exprs.py`: the complete set of pre-built filter expressions (currently 4: `is_snp`, `is_indel`, `is_biallelic`, `ILEN`)

When a signature, kwarg, or shape is unclear, read the docstring in the source rather than reasoning from first principles.

Cross-cutting conventions


- Ranges are 0-based, half-open `[start, end)`.
- `max_mem` accepts strings like `"4g"`, `"512m"`, `"2GB"`.
- Contig names auto-normalize: `"chr1"` and `"1"` both work regardless of file convention (`ContigNormalizer`).
- Missing genotype = `-1` (int). Missing dosage = `np.nan` (float32).
- Ploidy is always 2.
- All return arrays are NumPy; `mode` selects which arrays you get back.
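
The contig-name convention can be pictured with a small stand-in. This is a hypothetical sketch of the idea only, not genoray's `ContigNormalizer`; the function name `normalize_contig` and its matching logic are assumptions made for illustration.

```python
def normalize_contig(name, file_contigs):
    """Map a user-supplied contig name onto the spelling used in the file.

    Hypothetical helper: it only illustrates the "chr1" vs "1" convention
    and is not genoray's ContigNormalizer.
    """
    if name in file_contigs:
        return name
    # Try the other convention by toggling the "chr" prefix.
    alt = name[3:] if name.startswith("chr") else f"chr{name}"
    return alt if alt in file_contigs else None


ucsc_style = ["chr1", "chr2", "chrX"]
ensembl_style = ["1", "2", "X"]

print(normalize_contig("1", ucsc_style))        # chr1
print(normalize_contig("chr1", ucsc_style))     # chr1
print(normalize_contig("chrX", ensembl_style))  # X
print(normalize_contig("chr7", ensembl_style))  # None (contig absent from the file)
```

With genoray itself none of this is needed in user code: per the convention above, either spelling can be passed straight to a reader.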

Mode constants — gotcha


Modes are class attributes, not top-level names:

```python
genoray.VCF.Genos8           # not genoray.Genos8
genoray.PGEN.GenosPhasingDosages
```

To discover the available modes for a class, read the class body in `_vcf.py` / `_pgen.py` (search for `Genos` near the top).

When a mode bundles multiple arrays, the return tuple follows the order in the constant name. `PGEN.GenosPhasingDosages` returns `(genos, phasing, dosages)`; `VCF.Genos8Dosages` returns `(genos, dosages)`.
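
As a mnemonic for the ordering rule, the array names can be read off the constant's name. The helper below is a hypothetical illustration (`arrays_in_mode` is not part of genoray) and works on the name as a plain string.

```python
import re


def arrays_in_mode(mode_name):
    """Split a mode constant's name into the arrays it bundles, in order.

    Mnemonic only: digits such as the 8 in "Genos8" are ignored.
    """
    return [part.lower() for part in re.findall(r"[A-Z][a-z]+", mode_name)]


print(arrays_in_mode("GenosPhasingDosages"))  # ['genos', 'phasing', 'dosages']
print(arrays_in_mode("Genos8Dosages"))        # ['genos', 'dosages']
print(arrays_in_mode("Genos8"))               # ['genos']
```

The lists match the unpacking order given above, e.g. `genos, phasing, dosages = ...` for `PGEN.GenosPhasingDosages`. Check the class body for the authoritative set of modes.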

VCF — quick reference


```python
vcf = genoray.VCF(
    "file.vcf.gz",
    phasing=True,           # constructor-time, not per-read
    dosage_field="DS",      # required to read dosages; FORMAT field with Number=A
    filter=lambda v: ...,   # cyvcf2.Variant -> bool
)
```

Single range


```python
arr = vcf.read("chr1", start=0, end=1_000_000, mode=genoray.VCF.Genos8)
```

Chunked


```python
for chunk in vcf.chunk("chr1", start=0, end=1_000_000, max_mem="2g", mode=genoray.VCF.Genos8Dosages):
    ...
```

- Shape with `phasing=False`: `(samples, ploidy=2, variants)`.
- Shape with `phasing=True`: `(samples, ploidy+1=3, variants)` — the 3rd row along the ploidy axis is `0` (unphased) / `1` (phased), matching cyvcf2.
- Dosage arrays drop the ploidy axis: `(samples, variants)`, dtype `float32`.
- VCF intentionally has **no `read_ranges`** — benchmarking showed no throughput benefit.
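
To make the `phasing=True` layout concrete, here is a minimal sketch that splits such an array into allele calls and phase flags. The array is synthetic; its values and the `int8` dtype are assumptions chosen for illustration, not output from a real file.

```python
import numpy as np

# Synthetic stand-in for the array a phasing=True read returns:
# 2 samples, ploidy + 1 = 3 rows, 4 variants.
arr = np.array(
    [
        [[0, 1, -1, 0], [1, 1, -1, 0], [1, 0, 0, 1]],
        [[0, 0, 1, 1], [0, 1, 1, 0], [0, 1, 1, 1]],
    ],
    dtype=np.int8,
)

genos = arr[:, :2, :]                # (samples, 2, variants) allele calls
phased = arr[:, 2, :].astype(bool)   # (samples, variants) phase flags
missing = (genos == -1).any(axis=1)  # (samples, variants) True if any allele is missing

print(genos.shape, phased.shape)  # (2, 2, 4) (2, 4)
print(missing[0].tolist())        # [False, False, True, False]
```

Slicing keeps the genotype block as a view, so this adds no copy beyond the boolean arrays.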

PGEN — quick reference


```python
pgen = genoray.PGEN(
    "hardcalls.pgen",                # hardcalls live in the main path
    dosage_path="dosages.pgen",      # optional; defaults to the main path
    filter=genoray.exprs.is_snp & genoray.exprs.is_biallelic,
)
```

Important: when you have a dosage-only PGEN and a separate hardcalls PGEN, hardcalls go in the main path and dosages go in `dosage_path`. If you only pass one path, both hardcalls and dosages come from it (with the hardcalls inferred from the dosage threshold; see the PLINK 2 docs).

A `.gvi` index file is created next to the PGEN on first construction. Don't delete it.

Single range


```python
genos = pgen.read("chr2", start=0, end=1000)
```

Multiple ranges in one call (PGEN-only optimization)


```python
data, offsets = pgen.read_ranges(
    "chr2",
    starts=[0, 1000, 2000],
    ends=[1000, 2000, 3000],
    mode=genoray.PGEN.GenosPhasingDosages,
)
```

`data` matches the mode (a tuple when the mode bundles multiple arrays).

`offsets` has shape `(n_ranges + 1,)`. Slice range `i` with `arr[..., offsets[i]:offsets[i+1]]`.
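
The `offsets` slicing rule can be tried on synthetic data. The sketch below fabricates a genotype-shaped array and an `offsets` vector (both are illustrative assumptions, not real `read_ranges` output) and splits the variant axis back into per-range pieces.

```python
import numpy as np

# Synthetic stand-in for read_ranges output: 2 samples, ploidy 2, and
# 6 variants in total spread over 3 ranges (3, 0, and 3 variants).
data = np.arange(2 * 2 * 6, dtype=np.int32).reshape(2, 2, 6)
offsets = np.array([0, 3, 3, 6])  # shape (n_ranges + 1,)

per_range = [
    data[..., offsets[i] : offsets[i + 1]] for i in range(len(offsets) - 1)
]

print([chunk.shape for chunk in per_range])  # [(2, 2, 3), (2, 2, 0), (2, 2, 3)]
```

Because the slice uses `...`, the same expression should apply to each array in a bundled tuple, whether or not it carries a ploidy axis. A range with no variants simply yields a zero-length last axis.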

Chunked variants of both


```python
for chunk in pgen.chunk("chr2", 0, 1000, max_mem="4g"):
    ...

for range_iter in pgen.chunk_ranges("chr2", starts, ends, max_mem="4g"):
    for chunk in range_iter:
        ...
```

Genotype dtype: `int32`. Dosage dtype: `float32`. Phasing is a separate
`bool` array of shape `(samples, variants)` — *not* an extra row in the
genotype array (unlike VCF with `phasing=True`).

SparseVar (`.svar`): quick reference


Build:

From a configured VCF reader


```python
vcf = genoray.VCF("file.vcf.gz", dosage_field="DS")
genoray.SparseVar.from_vcf("out.svar", vcf, max_mem="4g", with_dosages=True, overwrite=True)
```

Or from a PGEN


```python
genoray.SparseVar.from_pgen("out.svar", "file.pgen", max_mem="4g")
```

Read:

Plain ragged: data is just variant indices


```python
svar = genoray.SparseVar("out.svar")
ragged = svar.read_ranges("chr1", starts=[0, 50_000], ends=[10_000, 60_000], samples=["S1", "S2"])
# shape: (ranges, samples, ploidy, ~variants); last axis is ragged
```

With extra fields attached


```python
svar = genoray.SparseVar("out.svar", fields={"dosages": np.float32})
```

or, on an existing instance:


```python
svar_with = svar.with_fields({"dosages": np.float32})
result = svar_with.read_ranges("chr1", [0], [10_000])
result.genos    # Ragged of variant indices (uint32)
result.dosages  # Ragged of dosages (float32)
```

`with_fields(False)` drops all extras and returns a plain
`Ragged[V_IDX_TYPE]` again from subsequent reads.

Each leaf value in the ragged result is a **variant index** — a row number
into `svar.index`, a polars `DataFrame` with at least `CHROM, POS, REF,
ALT (list[str]), ILEN`. To map indices back to chrom/pos/ref/alt, row-index
that DataFrame.

```python
v_idxs = ragged[0, 0, 0].to_numpy()
rows = svar.index[v_idxs.tolist()].select("CHROM", "POS", "REF", "ALT")
```

`svar.index["POS"]` is 1-based (VCF convention), while query coordinates are 0-based half-open. Don't conflate them.
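
The off-by-one between `POS` and query coordinates is easy to get wrong, so here is a minimal sketch of the conversion. The helper `pos_in_query` is hypothetical and looks only at a variant's start position; how genoray treats variants whose REF allele extends into a range is not covered here.

```python
def pos_in_query(pos_1_based, start, end):
    """Return True if a 1-based VCF POS falls in the 0-based half-open [start, end)."""
    pos_0_based = pos_1_based - 1
    return start <= pos_0_based < end


# The first base of a contig has POS 1, which is coordinate 0 in query space.
print(pos_in_query(1, start=0, end=1000))     # True
print(pos_in_query(1000, start=0, end=1000))  # True  (0-based 999)
print(pos_in_query(1001, start=0, end=1000))  # False (0-based 1000 is excluded)
```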

Filtering


VCF: pass a `Callable[[cyvcf2.Variant], bool]` to `filter=`.

PGEN: pass a polars `pl.Expr` returning a boolean mask, operating on the `.gvi` index columns. Built-in expressions in `genoray.exprs` (the complete list):
- `is_snp`
- `is_indel`
- `is_biallelic`
- `ILEN` (an expression yielding indel length, not a boolean)

For anything else, write `pl.col(...)` against the `.gvi` schema; read `genoray/exprs.py` for the available columns. Combining two `exprs` expressions with `&` / `|` works without importing polars; you only need `import polars as pl` to build custom predicates.
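
For the VCF side, a filter is just a function of one variant. The sketch below shows one possible predicate, `keep_variant` (a hypothetical name), using the cyvcf2 `Variant` attributes `is_snp`, `FILTER`, and `QUAL`. The QUAL threshold of 30 is an arbitrary choice, and the test objects are plain stand-ins so the example runs without cyvcf2 installed.

```python
from types import SimpleNamespace


def keep_variant(v):
    """Keep SNPs that passed all filters and have QUAL of at least 30."""
    passed = v.FILTER is None  # cyvcf2 reports PASS (or ".") as None
    good_qual = v.QUAL is not None and v.QUAL >= 30
    return bool(v.is_snp and passed and good_qual)


# Stand-ins exposing only the attributes the predicate touches.
good = SimpleNamespace(is_snp=True, FILTER=None, QUAL=50.0)
low_qual = SimpleNamespace(is_snp=True, FILTER=None, QUAL=10.0)
indel = SimpleNamespace(is_snp=False, FILTER=None, QUAL=50.0)

print([keep_variant(v) for v in (good, low_qual, indel)])  # [True, False, False]
```

Such a function would be passed as `filter=keep_variant` when constructing the `VCF` reader.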

Common mistakes


| Mistake | Fix |
| --- | --- |
| `genoray.Genos8` | `genoray.VCF.Genos8` (class attribute) |
| `vcf.read(..., phasing=True)` | Set `phasing=True` on the `VCF()` constructor |
| Reading dosages from a VCF without `dosage_field=` | Pass `dosage_field="DS"` (or appropriate `Number=A` field) on the constructor |
| Putting a dosage-only PGEN in the main path when you also have hardcalls | Hardcalls in main path, dosages in `dosage_path=` |
| Importing `from genoray._vcf import VCF` | Use `from genoray import VCF` |
| Expecting VCF to have `read_ranges` | VCF doesn't; loop over single-range `read` calls, or use PGEN/SparseVar |
| Treating `svar.index["POS"]` as 0-based | It's 1-based; subtract 1 to compare with query coords |
| Calling `read_ranges` and assuming a flat array | PGEN returns `(data, offsets)`; SparseVar returns a Ragged (or awkward record with `fields`) |
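
The "loop over single-range `read` calls" fix for VCF can be sketched as a small wrapper. `read_each_range` is a hypothetical helper, not a genoray function. It assumes only the `read(contig, start=..., end=..., ...)` call shape shown in the VCF quick reference, and it is exercised here with a fake reader so it runs without genoray installed.

```python
def read_each_range(reader, contig, starts, ends, **read_kwargs):
    """Call reader.read once per (start, end) pair and collect the results."""
    if len(starts) != len(ends):
        raise ValueError("starts and ends must have the same length")
    return [
        reader.read(contig, start=s, end=e, **read_kwargs)
        for s, e in zip(starts, ends)
    ]


class FakeReader:
    """Stand-in for a VCF reader; it echoes its arguments instead of reading."""

    def read(self, contig, start, end, **kwargs):
        return (contig, start, end)


results = read_each_range(FakeReader(), "chr1", [0, 50_000], [10_000, 60_000])
print(results)  # [('chr1', 0, 10000), ('chr1', 50000, 60000)]
```

The result is a list with one entry per range rather than the `(data, offsets)` pair that PGEN's `read_ranges` returns.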

When this skill needs updating


Any PR that adds, removes, renames, or changes the semantics of a public name (anything reachable from `import genoray` without underscores) must update this skill alongside the code change. See the project `CLAUDE.md`.