genoray-api

genoray public API

`genoray` is a NumPy-first range-query layer over VCF/BCF (cyvcf2), PGEN (pgenlib), and a sparse memmap format (`SparseVar`, `.svar`).

Public surface

`import genoray` exposes exactly:

- `genoray.VCF`: VCF/BCF reader
- `genoray.PGEN`: PLINK 2 PGEN reader
- `genoray.SparseVar`: sparse `.svar` reader/writer
- `genoray.Reader`: type alias for `VCF | PGEN | SparseVar`
- `genoray.exprs`: polars filter expressions for `.gvi` indexes

Nothing else is public. Anything starting with `_` (e.g. `genoray._vcf`) is internal; do not import it from user code.

Where to look for details

Prefer reading these over guessing:

- `docs/source/index.md`: narrative tour with full examples (VCF, PGEN, filtering, chunking)
- `docs/source/svar.md`: SparseVar usage
- `genoray/__init__.py`: confirms the public surface
- `genoray/_vcf.py`: the `VCF` class (constructor, `read`, `chunk`, and the mode constants near the top of the class)
- `genoray/_pgen.py`: the `PGEN` class (constructor, `read`, `chunk`, `read_ranges`, `chunk_ranges`, and the mode constants near the top of the class)
- `genoray/_svar.py`: the `SparseVar` class (`__init__`, `from_vcf`, `from_pgen`, `read_ranges`, `with_fields`)
- `genoray/exprs.py`: the complete set of pre-built filter expressions (currently 4: `is_snp`, `is_indel`, `is_biallelic`, `ILEN`)

When a signature, kwarg, or shape is unclear, read the docstring in the source rather than reasoning from first principles.

Cross-cutting conventions

- Ranges are 0-based, half-open `[start, end)`.
- `max_mem` accepts strings like `"4g"`, `"512m"`, `"2GB"`.
- Contig names auto-normalize: `"chr1"` and `"1"` both work regardless of file convention (`ContigNormalizer`).
- Missing genotype = `-1` (int). Missing dosage = `np.nan` (float32).
- Ploidy is always 2.
- All return arrays are NumPy; `mode` selects which arrays you get back.
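
To illustrate the contig-name convention above, here is a minimal sketch of how a `chr`-prefix-insensitive lookup could work. The function `normalize_contig` is hypothetical and is not genoray's `ContigNormalizer`; it only shows the idea of mapping a query name onto whichever convention the file uses.

```python
from typing import Optional, Sequence


def normalize_contig(query: str, file_contigs: Sequence[str]) -> Optional[str]:
    """Map a query contig name onto the file's own naming convention.

    Hypothetical helper for illustration only; genoray's ContigNormalizer
    may behave differently (for example around mitochondrial names).
    """
    if query in file_contigs:
        return query
    # Try the other convention by toggling the "chr" prefix.
    alt = query[3:] if query.startswith("chr") else f"chr{query}"
    return alt if alt in file_contigs else None


ucsc_style = ["chr1", "chr2", "chrX"]
ensembl_style = ["1", "2", "X"]

print(normalize_contig("1", ucsc_style))        # chr1
print(normalize_contig("chr2", ensembl_style))  # 2
print(normalize_contig("chr7", ensembl_style))  # None
```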

Mode constants — gotcha

Modes are class attributes, not top-level names:

```python
genoray.VCF.Genos8           # not genoray.Genos8
genoray.PGEN.GenosPhasingDosages
```

To discover the available modes for a class, read the class body in `_vcf.py` or `_pgen.py` (search for `Genos` near the top).

When a mode bundles multiple arrays, the return tuple follows the order in the constant name. `PGEN.GenosPhasingDosages` returns `(genos, phasing, dosages)`; `VCF.Genos8Dosages` returns `(genos, dosages)`.

VCF — quick reference

```python
import genoray

vcf = genoray.VCF(
    "file.vcf.gz",
    phasing=True,           # constructor-time, not per-read
    dosage_field="DS",      # required to read dosages; FORMAT field with Number=A
    filter=lambda v: ...,   # cyvcf2.Variant -> bool
)

# Single range
arr = vcf.read("chr1", start=0, end=1_000_000, mode=genoray.VCF.Genos8)

# Chunked
for chunk in vcf.chunk("chr1", start=0, end=1_000_000, max_mem="2g", mode=genoray.VCF.Genos8Dosages):
    ...
```

- Shape with `phasing=False`: `(samples, ploidy=2, variants)`.
- Shape with `phasing=True`: `(samples, ploidy+1=3, variants)` — the 3rd row along the ploidy axis is `0` (unphased) / `1` (phased), matching cyvcf2.
- Dosage arrays drop the ploidy axis: `(samples, variants)`, dtype `float32`.
- VCF intentionally has **no `read_ranges`** — benchmarking showed no throughput benefit.
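
As an illustration of the shapes above, the sketch below builds a small synthetic array in the `phasing=True` layout and splits it into allele calls and a phasing mask. The values are made up, and the `int8` dtype is an assumption based on the `Genos8` mode name; no genoray call is involved.

```python
import numpy as np

# Synthetic stand-in for vcf.read(...) output with phasing=True:
# shape (samples=2, ploidy+1=3, variants=4). dtype int8 is an assumption.
arr = np.array(
    [
        [[0, 1, -1, 0],   # sample 0, haplotype 0
         [1, 1, -1, 0],   # sample 0, haplotype 1
         [1, 0,  0, 1]],  # sample 0, phased flag per variant
        [[0, 0,  1, 1],   # sample 1, haplotype 0
         [0, 1,  1, 0],   # sample 1, haplotype 1
         [0, 1,  1, 1]],  # sample 1, phased flag per variant
    ],
    dtype=np.int8,
)

genos = arr[:, :2, :]                # (samples, ploidy, variants)
phased = arr[:, 2, :].astype(bool)   # (samples, variants)
missing = (genos == -1).any(axis=1)  # True where any allele is missing (-1)

print(genos.shape, phased.shape)     # (2, 2, 4) (2, 4)
print(int(missing.sum()))            # 1
```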

PGEN — quick reference

```python
import genoray

pgen = genoray.PGEN(
    "hardcalls.pgen",                # hardcalls live in the main path
    dosage_path="dosages.pgen",      # optional; defaults to the main path
    filter=genoray.exprs.is_snp & genoray.exprs.is_biallelic,
)
```

Important: when you have a dosage-only PGEN and a separate hardcalls PGEN, hardcalls go in the main path and dosages go in `dosage_path`. If you only pass one path, both hardcalls and dosages come from it (with the hardcalls inferred from the dosage threshold; see the PLINK 2 docs).

A `.gvi` index file is created next to the PGEN on first construction. Don't delete it.

```python
# Single range
genos = pgen.read("chr2", start=0, end=1000)

# Multiple ranges in one call (PGEN-only optimization)
data, offsets = pgen.read_ranges(
    "chr2",
    starts=[0, 1000, 2000],
    ends=[1000, 2000, 3000],
    mode=genoray.PGEN.GenosPhasingDosages,
)
# `data` matches the mode (tuple when mode bundles multiple arrays)
# `offsets` shape: (n_ranges + 1,). Slice range i with: arr[..., offsets[i]:offsets[i+1]]

# Chunked variants of both
for chunk in pgen.chunk("chr2", 0, 1000, max_mem="4g"):
    ...

starts, ends = [0, 1000, 2000], [1000, 2000, 3000]  # same ranges as above
for range_iter in pgen.chunk_ranges("chr2", starts, ends, max_mem="4g"):
    for chunk in range_iter:
        ...
```

Genotype dtype: `int32`. Dosage dtype: `float32`. Phasing is a separate
`bool` array of shape `(samples, variants)` — *not* an extra row in the
genotype array (unlike VCF with `phasing=True`).
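
The `offsets` slicing rule for `read_ranges` can be tried on synthetic data. The sketch below assumes `data` is a single genotype array with variants on the last axis, and that a range with no variants shows up as two equal consecutive offsets; the arrays are invented and no PGEN file is read.

```python
import numpy as np

# Synthetic stand-in for `data, offsets = pgen.read_ranges(...)` with 3 ranges
# holding 2, 0 and 3 variants. Shape: (samples=2, ploidy=2, variants=5).
data = np.arange(20, dtype=np.int32).reshape(2, 2, 5)
offsets = np.array([0, 2, 2, 5])  # shape (n_ranges + 1,)

# Slice out each range along the last (variant) axis.
per_range = [
    data[..., offsets[i]:offsets[i + 1]]
    for i in range(len(offsets) - 1)
]

for i, block in enumerate(per_range):
    print(i, block.shape)
# 0 (2, 2, 2)
# 1 (2, 2, 0)
# 2 (2, 2, 3)
```

When the mode bundles several arrays, the same slice applies to each array in the returned tuple.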

SparseVar (`.svar`) — quick reference

Build:

```python
import genoray

# From a configured VCF reader
vcf = genoray.VCF("file.vcf.gz", dosage_field="DS")
genoray.SparseVar.from_vcf("out.svar", vcf, max_mem="4g", with_dosages=True, overwrite=True)

# Or from a PGEN
genoray.SparseVar.from_pgen("out.svar", "file.pgen", max_mem="4g")
```

Read:

```python
import numpy as np
import genoray

# Plain ragged: data is just variant indices
svar = genoray.SparseVar("out.svar")
ragged = svar.read_ranges("chr1", starts=[0, 50_000], ends=[10_000, 60_000], samples=["S1", "S2"])
# shape: (ranges, samples, ploidy, ~variants); the last axis is ragged

# With extra fields attached
svar = genoray.SparseVar("out.svar", fields={"dosages": np.float32})
# or, on an existing instance:
svar_with = svar.with_fields({"dosages": np.float32})
result = svar_with.read_ranges("chr1", [0], [10_000])
result.genos    # Ragged of variant indices (uint32)
result.dosages  # Ragged of dosages (float32)
```

`with_fields(False)` drops all extras, so subsequent reads return a plain `Ragged[V_IDX_TYPE]` again.

Each leaf value in the ragged result is a **variant index**: a row number into `svar.index`, a polars `DataFrame` with at least `CHROM, POS, REF, ALT (list[str]), ILEN`. To map indices back to chrom/pos/ref/alt, row-index that DataFrame.

```python
v_idxs = ragged[0, 0, 0].to_numpy()
rows = svar.index[v_idxs.tolist()].select("CHROM", "POS", "REF", "ALT")
```

`svar.index["POS"]` is 1-based (VCF convention), while query coordinates are 0-based half-open. Don't conflate them.
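
Because `POS` is 1-based and queries are 0-based half-open, a position check needs an explicit shift. A minimal sketch with plain integers follows; the helper name `pos_in_range` is hypothetical, and it only considers a variant's start position, not how a reader treats variants that overlap a range boundary.

```python
def pos_in_range(pos_1based: int, start: int, end: int) -> bool:
    """Return True if a 1-based VCF POS falls inside the 0-based, half-open [start, end)."""
    return start <= pos_1based - 1 < end


# A query of [0, 1000) covers VCF positions 1 through 1000 inclusive.
print(pos_in_range(1, 0, 1000))     # True
print(pos_in_range(1000, 0, 1000))  # True
print(pos_in_range(1001, 0, 1000))  # False
```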

Filtering

VCF: pass a `Callable[[cyvcf2.Variant], bool]` to `filter=`.

PGEN: pass a polars `pl.Expr` returning a boolean mask, operating on the `.gvi` index columns. Built-in expressions in `genoray.exprs` (the complete list):

- `is_snp`
- `is_indel`
- `is_biallelic`
- `ILEN` (an expression yielding indel length, not a boolean)

For anything else, write `pl.col(...)` against the `.gvi` schema; read `genoray/exprs.py` for the available columns. Combining two `exprs` expressions with `&` works without importing polars; you only need `import polars as pl` to build custom predicates.
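
For the VCF side, a filter is just a predicate over a record. The sketch below relies on the `is_snp` and `ALT` attributes of `cyvcf2.Variant`; the `FakeVariant` class is a hypothetical stand-in so the predicate can be exercised without opening a file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeVariant:
    """Hypothetical stand-in exposing the two cyvcf2.Variant attributes used below."""
    ALT: List[str]
    is_snp: bool


def biallelic_snp(v) -> bool:
    """Keep only SNPs with exactly one ALT allele."""
    return bool(v.is_snp) and len(v.ALT) == 1


print(biallelic_snp(FakeVariant(ALT=["T"], is_snp=True)))       # True
print(biallelic_snp(FakeVariant(ALT=["T", "G"], is_snp=True)))  # False
print(biallelic_snp(FakeVariant(ALT=["TA"], is_snp=False)))     # False

# With a real file this would be passed as:
#   genoray.VCF("file.vcf.gz", filter=biallelic_snp)
```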

Common mistakes

| Mistake | Fix |
| --- | --- |
| `genoray.Genos8` | `genoray.VCF.Genos8` (class attribute) |
| `vcf.read(..., phasing=True)` | Set `phasing=True` on the `VCF()` constructor |
| Reading dosages from a VCF without `dosage_field=` | Pass `dosage_field="DS"` (or the appropriate `Number=A` field) on the constructor |
| Putting a dosage-only PGEN in the main path when you also have hardcalls | Hardcalls in main path, dosages in `dosage_path=` |
| Importing `from genoray._vcf import VCF` | Use `from genoray import VCF` |
| Expecting VCF to have `read_ranges` | VCF doesn't; loop over single-range `read` calls, or use PGEN/SparseVar |
| Treating `svar.index["POS"]` as 0-based | It's 1-based; subtract 1 to compare with query coords |
| Calling `read_ranges` and assuming a flat array | PGEN returns `(data, offsets)`; SparseVar returns a Ragged (or an awkward record with `fields`) |
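
The "loop over single-range `read` calls" fix can be wrapped so that the result has the same `(data, offsets)` layout PGEN returns. This is a sketch only: `read_many` and `FakeReader` are hypothetical names, and it assumes every `read` call returns one array with variants on the last axis. A real reader may signal an empty range differently, which this sketch does not handle.

```python
import numpy as np


def read_many(reader, contig, starts, ends, **read_kwargs):
    """Loop over single-range reads and return (data, offsets), PGEN-style.

    Hypothetical helper. Assumes reader.read(...) returns one array with
    variants on the last axis for every range.
    """
    blocks = [
        reader.read(contig, start=s, end=e, **read_kwargs)
        for s, e in zip(starts, ends)
    ]
    counts = [b.shape[-1] for b in blocks]
    offsets = np.concatenate([[0], np.cumsum(counts)])
    return np.concatenate(blocks, axis=-1), offsets


class FakeReader:
    """Stand-in reader: one variant per 100 bp, 2 samples, ploidy 2."""

    def read(self, contig, start, end):
        n = (end - start) // 100
        return np.zeros((2, 2, n), dtype=np.int8)


data, offsets = read_many(FakeReader(), "chr1", [0, 1000], [300, 1200])
print(data.shape, offsets.tolist())  # (2, 2, 5) [0, 3, 5]
```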

When this skill needs updating

Any PR that adds, removes, renames, or changes the semantics of a public name (anything reachable from `import genoray` without underscores) must update this skill alongside the code change. See the project `CLAUDE.md`.
