genoray-api

genoray public API

Public surface

`import genoray` exposes:

- `genoray.VCF`: VCF/BCF reader
- `genoray.PGEN`: PLINK 2 PGEN reader
- `genoray.SparseVar`: sparse `.svar` reader/writer
- `genoray.Reader`: type alias `VCF | PGEN | SparseVar`
- `genoray.exprs`: polars filter expressions for `.gvi` indexes

Nothing else is public. Anything starting with `_` (e.g. `genoray._vcf`) is internal; do not import it from user code.

Where to look for details

Prefer reading these over guessing:

- `docs/source/index.md`: narrative tour with full examples (VCF, PGEN, filtering, chunking)
- `docs/source/svar.md`: SparseVar usage
- `genoray/__init__.py`: confirms the public surface
- `genoray/_vcf.py`: `VCF` class: constructor, `read`, `chunk`, mode constants near the top of the class
- `genoray/_pgen.py`: `PGEN` class: constructor, `read`, `chunk`, `read_ranges`, `chunk_ranges`, mode constants near the top of the class
- `genoray/_svar.py`: `SparseVar`: `__init__`, `from_vcf`, `from_pgen`, `read_ranges`, `with_fields`
- `genoray/exprs.py`: the complete set of pre-built filter expressions (currently 4: `is_snp`, `is_indel`, `is_biallelic`, `ILEN`)

When a signature, kwarg, or shape is unclear, read the docstring in the source rather than reasoning from first principles.

Cross-cutting conventions

- Ranges are 0-based, half-open `[start, end)`.
- `max_mem` accepts strings like `"4g"`, `"512m"`, `"2GB"`.
- Contig names auto-normalize: `"chr1"` and `"1"` both work regardless of file convention (`ContigNormalizer`).
- Missing genotype = `-1` (int). Missing dosage = `np.nan` (float32).
- Ploidy is always 2.
- All return arrays are NumPy; `mode` selects which arrays you get back.

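The missing-value conventions above matter as soon as you aggregate over samples. The following is a minimal sketch on synthetic NumPy arrays standing in for reader output; the array contents and variable names are illustrative and not taken from genoray, and the allele counting assumes biallelic 0/1 coding.

```python
import numpy as np

# Synthetic stand-in for a genotype array shaped (samples, ploidy, variants).
# -1 marks a missing allele.
genos = np.array(
    [
        [[0, 1, -1], [1, 1, -1]],  # sample 0: haplotype 0, haplotype 1
        [[0, 0, 1], [0, -1, 1]],   # sample 1: haplotype 0, haplotype 1
    ],
    dtype=np.int8,
)

# Synthetic stand-in for a dosage array shaped (samples, variants).
# NaN marks a missing dosage.
dosages = np.array([[1.0, 2.0, np.nan], [0.0, np.nan, 2.0]], dtype=np.float32)

# Count ALT alleles per variant, ignoring missing calls.
called = genos >= 0
alt_counts = np.where(called, genos, 0).sum(axis=(0, 1))
n_called = called.sum(axis=(0, 1))

# Mean dosage per variant, skipping NaN entries.
mean_dosage = np.nanmean(dosages, axis=0)
```

Summing the raw array without the mask would let each `-1` silently subtract from the allele count, and a plain `mean` over dosages would return NaN for any variant with a missing sample.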
Mode constants: gotcha

Modes are class attributes, not top-level names:

```python
import genoray

genoray.VCF.Genos8                 # not genoray.Genos8
genoray.PGEN.GenosPhasingDosages
```

To discover the available modes for a class, read the class body in `_vcf.py` / `_pgen.py` (search for `Genos` near the top).

When a mode bundles multiple arrays, the return tuple follows the order in the constant name: `PGEN.GenosPhasingDosages` returns `(genos, phasing, dosages)`; `VCF.Genos8Dosages` returns `(genos, dosages)`.

VCF: quick reference

```python
import genoray

vcf = genoray.VCF(
    "file.vcf.gz",
    phasing=True,            # constructor-time, not per-read
    dosage_field="DS",       # required to read dosages; FORMAT field with Number=A
    filter=lambda v: ...,    # cyvcf2.Variant -> bool
)

# Single range
arr = vcf.read("chr1", start=0, end=1_000_000, mode=genoray.VCF.Genos8)

# Chunked
for chunk in vcf.chunk("chr1", start=0, end=1_000_000,
                       max_mem="2g", mode=genoray.VCF.Genos8Dosages):
    ...
```

- Shape with `phasing=False`: `(samples, ploidy=2, variants)`.
- Shape with `phasing=True`: `(samples, ploidy+1=3, variants)`; the 3rd row along the ploidy axis is `0` (unphased) / `1` (phased), matching cyvcf2.
- Dosage arrays drop the ploidy axis: `(samples, variants)`, dtype `float32`.
- VCF intentionally has **no `read_ranges`**; benchmarking showed no throughput benefit.

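With `phasing=True` the phase flags ride along the ploidy axis, so downstream code usually wants to peel them off first. The sketch below uses a synthetic array in place of real `vcf.read(...)` output; the `int8` dtype is an assumption read off the `Genos8` name, and the values are made up.

```python
import numpy as np

# Synthetic stand-in for a phasing=True result:
# shape (samples=2, ploidy+1=3, variants=2).
arr = np.array(
    [
        [[0, 1], [1, 1], [1, 0]],    # sample 0: hap 0, hap 1, phase flags
        [[0, -1], [0, -1], [0, 0]],  # sample 1: hap 0, hap 1, phase flags
    ],
    dtype=np.int8,
)

# First two rows along axis 1 are the alleles; the third row is the phase flag.
alleles = arr[:, :2, :]               # (samples, 2, variants)
phased = arr[:, 2, :].astype(bool)    # (samples, variants)
```

After the split, `alleles` has the same layout as a `phasing=False` read, and `phased` has the same layout as a dosage array.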

PGEN: quick reference

```python
import genoray

pgen = genoray.PGEN(
    "hardcalls.pgen",              # hardcalls live in the main path
    dosage_path="dosages.pgen",    # optional; defaults to the main path
    filter=genoray.exprs.is_snp & genoray.exprs.is_biallelic,
)
```

Important: when you have a dosage-only PGEN and a separate hardcalls PGEN, hardcalls go in the main path and dosages go in `dosage_path`. If you only pass one path, both hardcalls and dosages come from it (with the hardcalls inferred from the dosage threshold; see the PLINK 2 docs).

A `.gvi` index file is created next to the PGEN on first construction. Don't delete it.

```python
# Single range
genos = pgen.read("chr2", start=0, end=1000)

# Multiple ranges in one call (PGEN-only optimization)
starts = [0, 1000, 2000]
ends = [1000, 2000, 3000]
data, offsets = pgen.read_ranges(
    "chr2",
    starts=starts,
    ends=ends,
    mode=genoray.PGEN.GenosPhasingDosages,
)
# `data` matches the mode (tuple when mode bundles multiple arrays)
# `offsets` shape: (n_ranges + 1,). Slice range i with: arr[..., offsets[i]:offsets[i+1]]

# Chunked variants of both
for chunk in pgen.chunk("chr2", 0, 1000, max_mem="4g"): ...
for range_iter in pgen.chunk_ranges("chr2", starts, ends, max_mem="4g"):
    for chunk in range_iter: ...
```

Genotype dtype: `int32`. Dosage dtype: `float32`. Phasing is a separate `bool` array of shape `(samples, variants)`, *not* an extra row in the genotype array (unlike VCF with `phasing=True`).
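The `offsets` slicing rule for `read_ranges` can be sketched with synthetic arrays. Everything below is a stand-in: the array contents are made up, and whether a real call can return an empty range is an assumption, included only to show that the slicing handles it.

```python
import numpy as np

# Synthetic stand-in for a genotype array covering 3 ranges back to back:
# 2 samples, ploidy 2, 6 variants in total.
genos = np.arange(2 * 2 * 6, dtype=np.int32).reshape(2, 2, 6)

# offsets has n_ranges + 1 entries; range i spans offsets[i]:offsets[i + 1].
offsets = np.array([0, 2, 2, 6])  # the middle range holds no variants

# Slice the variant (last) axis once per range.
per_range = [
    genos[..., offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)
]

# Number of variants that fell in each range.
n_variants = np.diff(offsets)
```

Because the slice uses `...`, the same expression works on a `(samples, variants)` dosage or phasing array, so each array in a multi-array mode tuple can be split the same way.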

SparseVar (`.svar`): quick reference

Build:

```python
import genoray

# From a configured VCF reader
vcf = genoray.VCF("file.vcf.gz", dosage_field="DS")
genoray.SparseVar.from_vcf("out.svar", vcf, max_mem="4g",
                           with_dosages=True, overwrite=True)

# Or from a PGEN
genoray.SparseVar.from_pgen("out.svar", "file.pgen", max_mem="4g")
```

Read:

```python
import numpy as np

# Plain ragged: data is just variant indices
svar = genoray.SparseVar("out.svar")
ragged = svar.read_ranges("chr1", starts=[0, 50_000], ends=[10_000, 60_000],
                          samples=["S1", "S2"])
# shape: (ranges, samples, ploidy, ~variants); last axis is ragged

# With extra fields attached
svar = genoray.SparseVar("out.svar", fields={"dosages": np.float32})
# or, on an existing instance:
svar_with = svar.with_fields({"dosages": np.float32})
result = svar_with.read_ranges("chr1", [0], [10_000])
result.genos     # Ragged of variant indices (uint32)
result.dosages   # Ragged of dosages (float32)
```

`with_fields(False)` drops all extras and returns a plain `Ragged[V_IDX_TYPE]` again from subsequent reads.

Each leaf value in the ragged result is a **variant index**: a row number into `svar.index`, a polars `DataFrame` with at least `CHROM, POS, REF, ALT (list[str]), ILEN`. To map indices back to chrom/pos/ref/alt, row-index that DataFrame.

```python
v_idxs = ragged[0, 0, 0].to_numpy()
rows = svar.index[v_idxs.tolist()].select("CHROM", "POS", "REF", "ALT")
```

Filtering

VCF: pass a `Callable[[cyvcf2.Variant], bool]` to `filter=`.

PGEN: pass a polars `pl.Expr` returning a boolean mask, operating on the `.gvi` index columns. Built-in expressions in `genoray.exprs` (the complete list):

- `is_snp`
- `is_indel`
- `is_biallelic`
- `ILEN` (an expression yielding indel length, not a boolean)

For anything else, write `pl.col(...)` against the `.gvi` schema; read `genoray/exprs.py` for the available columns. Combining two `exprs` expressions with `&` / `|` works without importing polars; you only need `import polars as pl` to build custom predicates.
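For the VCF side, a named function is often easier to read and test than an inline lambda. The sketch below is one possible predicate, not something shipped with genoray: `keep_biallelic_snp` is a hypothetical name, and it relies only on the `is_snp` and `ALT` attributes of `cyvcf2.Variant`. The stand-in objects exist so the predicate can be exercised without a VCF file.

```python
from types import SimpleNamespace


def keep_biallelic_snp(variant) -> bool:
    """Return True for SNPs with exactly one ALT allele."""
    return bool(variant.is_snp) and len(variant.ALT) == 1


# Stand-in objects exposing the two attributes the predicate touches.
snp = SimpleNamespace(is_snp=True, ALT=["T"])
multiallelic = SimpleNamespace(is_snp=True, ALT=["T", "G"])
indel = SimpleNamespace(is_snp=False, ALT=["AT"])

kept = [keep_biallelic_snp(v) for v in (snp, multiallelic, indel)]
```

With a real file the function would be passed at construction time, as in `genoray.VCF("file.vcf.gz", filter=keep_biallelic_snp)`.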

Common mistakes

| Mistake | Fix |
|---|---|
| `genoray.Genos8` | `genoray.VCF.Genos8` |
| `vcf.read(..., phasing=True)` | Set `phasing=True` in the constructor |
| Reading dosages from a VCF without `dosage_field` | Pass `dosage_field="DS"` to the constructor |
| Putting a dosage-only PGEN in the main path when you also have hardcalls | Hardcalls in main path, dosages in `dosage_path` |
| Importing `genoray._vcf` | Use `genoray.VCF` |
| Expecting VCF to have `read_ranges` | VCF doesn't; loop over single-range `read` |
| Treating `svar.index.POS` as 0-based | It's 1-based; subtract 1 to compare with query coords |
| Calling … | PGEN returns … |

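The 1-based versus 0-based mismatch is the easiest of these to get wrong silently. The following is a small sketch of the arithmetic only; `pos_in_query` and `region_to_query` are hypothetical helpers, not part of genoray.

```python
def pos_in_query(pos_1based: int, start: int, end: int) -> bool:
    """True if a 1-based position falls inside the 0-based half-open [start, end)."""
    return start <= pos_1based - 1 < end


def region_to_query(region: str) -> tuple[str, int, int]:
    """Convert a 1-based inclusive 'contig:first-last' string to 0-based half-open."""
    # rsplit keeps contig names that themselves contain ':' intact.
    contig, span = region.rsplit(":", 1)
    first, last = span.replace(",", "").split("-")
    # Only the start shifts: an inclusive 1-based end equals an exclusive 0-based end.
    return contig, int(first) - 1, int(last)
```

For example, a variant at 1-based position 10 is inside the query `[0, 10)`, while one at position 11 is not.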
When this skill needs updating

Any PR that adds, removes, renames, or changes the semantics of a public name (anything reachable from `import genoray` without underscores) must update this skill alongside the code change. See the project `CLAUDE.md`.