TileDB-VCF

Overview

TileDB-VCF is a high-performance C++ library with Python and CLI interfaces for efficient storage and retrieval of genomic variant-call data. Built on TileDB's sparse array technology, it enables scalable ingestion of VCF/BCF files, incremental sample addition without expensive merging operations, and efficient parallel queries of variant data stored locally or in the cloud.

When to Use This Skill

Use this skill when:

- Learning TileDB-VCF concepts and workflows
- Prototyping genomics analyses and pipelines
- Working with small-to-medium datasets (< 1000 samples)
- Adding new samples incrementally to existing datasets
- Querying specific genomic regions efficiently across many samples
- Working with cloud-stored variant data (S3, Azure, GCS)
- Exporting subsets of large VCF datasets
- Building variant databases for cohort studies
- Supporting educational projects and method development
- Running variant data operations where performance is critical

Quick Start

Installation

**Preferred Method: Conda/Mamba**
```bash
# Enter the following two lines if you are on an M1 Mac
CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64

# Create the conda environment
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf

# Mamba is a faster and more reliable alternative to conda
conda install -c conda-forge mamba

# Install TileDB-Py and TileDB-VCF, along with other useful libraries
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy
```

**Alternative: Docker Images**
```bash
docker pull tiledb/tiledbvcf-py     # Python interface
docker pull tiledb/tiledbvcf-cli    # Command-line interface
```

Basic Examples

**Create and populate a dataset:**
```python
import tiledbvcf

# Create a new dataset (the arrays must be created before the first ingestion)
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w")
ds.create_dataset()

# Ingest VCF files (must be single-sample with indexes)
# Requirements:
# - VCFs must be single-sample (not multi-sample)
# - Must have indexes: .csi (bcftools) or .tbi (tabix)
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
```

**Query variant data:**
```python
# Open existing dataset for reading
ds = tiledbvcf.Dataset(uri="my_dataset", mode="r")

# Query specific regions and samples
df = ds.read(
    attrs=["sample_name", "pos_start", "pos_end", "alleles", "fmt_GT"],
    regions=["chr1:1000000-2000000", "chr2:500000-1500000"],
    samples=["sample1", "sample2", "sample3"]
)
print(df.head())
```

**Export to VCF:**
```python
import os

# Export two VCF samples
ds.export(
    regions=["chr21:8220186-8405573"],
    samples=["HG00101", "HG00097"],
    output_format="v",
    output_dir=os.path.expanduser("~"),
)
```
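Because ingestion requires an index for every VCF, a quick pre-flight check can catch missing `.csi` or `.tbi` files before `ingest_samples` is called. The helper below is a minimal sketch and not part of TileDB-VCF: `find_unindexed` is a hypothetical name, and it assumes each index sits next to its file as `<file>.csi` or `<file>.tbi` on the local filesystem.

```python
from pathlib import Path


def find_unindexed(vcf_paths):
    """Return the VCF paths that have no .csi or .tbi index beside them."""
    missing = []
    for vcf in map(Path, vcf_paths):
        # Assumption: the index is named <vcf>.csi or <vcf>.tbi
        has_index = any(Path(f"{vcf}{ext}").exists() for ext in (".csi", ".tbi"))
        if not has_index:
            missing.append(str(vcf))
    return missing
```

Calling `find_unindexed(["sample1.vcf.gz", "sample2.vcf.gz"])` before ingestion returns the files that still need indexing with bcftools or tabix.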

Core Capabilities

1. Dataset Creation and Ingestion

Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.

Requirements:

- **Single-sample VCFs only**: Multi-sample VCFs are not supported
- **Index files required**: VCF/BCF files must have indexes (.csi or .tbi)

Common operations:

- Create new datasets with optimized array schemas
- Ingest single or multiple VCF/BCF files in parallel
- Add new samples incrementally without re-processing existing data
- Configure memory usage and compression settings
- Handle various VCF formats and INFO/FORMAT fields
- Resume interrupted ingestion processes
- Validate data integrity during ingestion
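Incremental ingestion is often done in batches, so that one very large file list is not handed to the writer at once. The sketch below shows one way to split a list of VCF paths. `sample_batches` is a hypothetical helper, not a TileDB-VCF function, and the batch size of 50 in the usage comment is an arbitrary illustration.

```python
def sample_batches(vcf_paths, batch_size):
    """Yield consecutive slices of vcf_paths with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(vcf_paths), batch_size):
        yield vcf_paths[start:start + batch_size]


# Hypothetical usage with a dataset `ds` opened in write mode:
# for batch in sample_batches(all_vcf_paths, 50):
#     ds.ingest_samples(batch)
```

Each batch only adds new samples, so previously ingested data does not need to be re-processed.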

2. Efficient Querying and Filtering

Query variant data with high performance across genomic regions, samples, and variant attributes. This is appropriate for association studies, variant discovery, and population analysis.

Common operations:

- Query specific genomic regions (single or multiple)
- Filter by sample names or sample groups
- Extract specific variant attributes (position, alleles, genotypes, quality)
- Access INFO and FORMAT fields efficiently
- Combine spatial and attribute-based filtering
- Stream large query results
- Perform aggregations across samples or regions

3. Data Export and Interoperability

Export data in various formats for downstream analysis or integration with other genomics tools. This is appropriate for sharing datasets, creating analysis subsets, or feeding other pipelines.

Common operations:

- Export to standard VCF/BCF formats
- Generate TSV files with selected fields
- Create sample/region-specific subsets
- Maintain data provenance and metadata
- Export losslessly, preserving all annotations
- Write compressed output formats
- Stream exports for large datasets

4. Population Genomics Workflows

TileDB-VCF excels at large-scale population genomics analyses requiring efficient access to variant data across many samples and genomic regions.

Common workflows:

- Genome-wide association studies (GWAS) data preparation
- Rare variant burden testing
- Population stratification analysis
- Allele frequency calculations across populations
- Quality control across large cohorts
- Variant annotation and filtering
- Cross-population comparative analysis
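To make the allele frequency workflow concrete, the sketch below computes per-allele frequencies at a single site from genotype calls. It is plain Python, independent of TileDB-VCF. `allele_frequencies` is a hypothetical helper, and it assumes genotypes arrive as sequences of allele indices (0 for the reference allele) with negative values marking missing calls.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Compute per-allele frequencies at one site from genotype calls.

    genotypes: iterable of allele-index sequences, one per sample,
    e.g. [0, 1] for a heterozygous call. Negative indices are treated
    as missing calls and skipped (an assumption of this sketch).
    """
    counts = Counter(
        allele for gt in genotypes for allele in gt if allele >= 0
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}
```

Running the same function separately on each population's samples gives the per-population frequencies needed for a cross-population comparison.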

Key Concepts

Array Schema and Data Model

TileDB-VCF Data Model:

- Variants are stored in sparse arrays with genomic coordinates as dimensions
- Samples are indexed alongside the genomic coordinates, allowing efficient sample-specific queries
- INFO and FORMAT fields are preserved with their original data types
- Compression and chunking are applied automatically for efficient storage

Schema Configuration:

```python
import tiledbvcf

# Read configuration: memory budget and query partitioning
config = tiledbvcf.ReadConfig(
    memory_budget_mb=2048,     # Memory budget in MB
    region_partition=(0, 1),   # (partition index, number of partitions): all regions
    sample_partition=(0, 1),   # (partition index, number of partitions): all samples
)
```

Coordinate Systems and Regions

Critical: TileDB-VCF uses 1-based genomic coordinates, following the VCF standard:

- Positions are 1-based (the first base is position 1)
- Ranges are inclusive on both ends
- Region "chr1:1000-2000" includes positions 1000 through 2000 (1001 bases in total)

Region specification formats:

```python
# Single region
regions = ["chr1:1000000-2000000"]

# Multiple regions
regions = ["chr1:1000000-2000000", "chr2:500000-1500000"]

# Whole chromosome
regions = ["chr1"]

# BED-style intervals are 0-based and half-open, so convert them first:
# BED "chr1 999999 2000000" corresponds to the 1-based region below
regions = ["chr1:1000000-2000000"]
```
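The coordinate arithmetic can be checked with two small helpers. Both are hypothetical illustrations, not TileDB-VCF functions: `bed_to_region` converts a 0-based, half-open BED interval into a 1-based inclusive region string, and `region_length` counts the bases a region string covers.

```python
def bed_to_region(contig, bed_start, bed_end):
    """Convert a 0-based, half-open BED interval to a 1-based inclusive region."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("BED interval must satisfy 0 <= start < end")
    # Only the start shifts by one; the half-open BED end already equals
    # the 1-based inclusive end.
    return f"{contig}:{bed_start + 1}-{bed_end}"


def region_length(region):
    """Number of bases covered by a 1-based inclusive 'contig:start-end' region."""
    _, span = region.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return end - start + 1
```

For example, `region_length("chr1:1000-2000")` gives 1001, matching the inclusive range described above.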

Memory Management

Performance considerations:

- Set an appropriate memory budget based on available system memory
- Use streaming queries for very large result sets
- Partition large ingestions to avoid memory exhaustion
- Configure the tile cache for repeated region access
- Use parallel ingestion for multiple files
- Optimize region queries by combining nearby regions
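Combining nearby regions can be done before the query is issued. The sketch below merges overlapping or closely spaced intervals into fewer region strings. `merge_regions` and its `max_gap` parameter are hypothetical names, and the input is assumed to be 1-based inclusive `(contig, start, end)` tuples.

```python
def merge_regions(regions, max_gap=0):
    """Merge 1-based inclusive (contig, start, end) tuples that overlap, touch,
    or are separated by at most max_gap bases on the same contig."""
    merged = []
    for contig, start, end in sorted(regions):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + max_gap + 1:
            # Extend the previous interval instead of starting a new one
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return [f"{c}:{s}-{e}" for c, s, e in merged]
```

A non-zero `max_gap` trades fewer regions for some extra records from the gaps, so results may need filtering back to the original intervals afterwards.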

Cloud Storage Integration

TileDB-VCF works with datasets in cloud object storage, addressed by URI:

```python
# S3 dataset
ds = tiledbvcf.Dataset(uri="s3://bucket/dataset", mode="r")

# Azure Blob Storage
ds = tiledbvcf.Dataset(uri="azure://container/dataset", mode="r")

# Google Cloud Storage
ds = tiledbvcf.Dataset(uri="gcs://bucket/dataset", mode="r")
```

Common Pitfalls

- **Memory exhaustion during ingestion**: Use an appropriate memory budget and batch processing for large VCF files
- **Inefficient region queries**: Combine nearby regions instead of issuing many separate queries
- **Missing sample names**: Ensure sample names in VCF headers match the sample names used in queries
- **Coordinate system confusion**: Remember that TileDB-VCF uses 1-based coordinates, like the VCF standard
- **Large result sets**: Use streaming or pagination for queries returning millions of variants
- **Cloud permissions**: Ensure proper authentication for cloud storage access
- **Concurrent access**: Multiple writers to the same dataset can cause corruption; coordinate writers with appropriate locking

CLI Usage

TileDB-VCF provides a command-line interface with the following subcommands:

Available Subcommands:

- `create` - Creates an empty TileDB-VCF dataset
- `store` - Ingests samples into a TileDB-VCF dataset
- `export` - Exports data from a TileDB-VCF dataset
- `list` - Lists all sample names present in a TileDB-VCF dataset
- `stat` - Prints high-level statistics about a TileDB-VCF dataset
- `utils` - Utils for working with a TileDB-VCF dataset
- `version` - Prints the version information and exits

```bash
# Create empty dataset
tiledbvcf create --uri my_dataset

# Ingest samples (requires single-sample VCFs with indexes)
tiledbvcf store --uri my_dataset sample1.vcf.gz sample2.vcf.gz

# Export data
tiledbvcf export --uri my_dataset \
  --regions "chr1:1000000-2000000" \
  --sample-names "sample1,sample2"

# List all samples
tiledbvcf list --uri my_dataset

# Show dataset statistics
tiledbvcf stat --uri my_dataset
```

Advanced Features

Allele Frequency Analysis

```python
import tiledbvcf.allele_frequency

# Calculate allele frequencies for one region
# (the reader takes a dataset URI and a single region string)
af_df = tiledbvcf.allele_frequency.read_allele_frequency(
    dataset_uri="my_dataset",
    region="chr1:1000000-2000000",
)
```

Sample Quality Control

```python
# Perform sample QC
qc_results = tiledbvcf.sample_qc(
    uri="my_dataset",
    samples=["sample1", "sample2"]
)
```

Custom Configurations

```python
# Advanced configuration
config = tiledbvcf.ReadConfig(
    memory_budget_mb=4096,
    tiledb_config={
        "sm.tile_cache_size": "1000000000",
        "vfs.s3.region": "us-east-1"
    }
)
```

Resources

Getting Help

Open Source TileDB-VCF Resources

Open Source Documentation:

- TileDB Academy: https://cloud.tiledb.com/academy/
- Population Genomics Guide: https://cloud.tiledb.com/academy/structure/life-sciences/population-genomics/
- TileDB-VCF GitHub: https://github.com/TileDB-Inc/TileDB-VCF

TileDB-Cloud Resources

For Large-Scale/Production Genomics:

- TileDB-Cloud Platform: https://cloud.tiledb.com
- TileDB Academy (All Documentation): https://cloud.tiledb.com/academy/

Getting Started:

- Free account signup: https://cloud.tiledb.com
- Contact: sales@tiledb.com for enterprise needs

Scaling to TileDB-Cloud

When your genomics workloads outgrow single-node processing, TileDB-Cloud provides enterprise-scale capabilities for production genomics pipelines.

Note: This section covers TileDB-Cloud capabilities based on available documentation. For complete API details and current functionality, consult the official TileDB-Cloud documentation and API reference.

Setting Up TileDB-Cloud

**1. Create Account and Get API Token**
```bash
# Sign up at https://cloud.tiledb.com
# Generate API token in your account settings
```

**2. Install TileDB-Cloud Python Client**
```bash
# Base installation
pip install tiledb-cloud

# With genomics-specific functionality
pip install "tiledb-cloud[life-sciences]"
```

**3. Configure Authentication**
```bash
# Set environment variable with your API token
export TILEDB_REST_TOKEN="your_api_token"
```

```python
import tiledb.cloud

# Authentication is automatic via TILEDB_REST_TOKEN
# No explicit login required in code
```
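Since authentication depends entirely on the environment variable, a script can fail early with a clear message when the token is missing. The check below is a small sketch using only the standard library. `require_rest_token` is a hypothetical helper, not part of the `tiledb.cloud` package.

```python
import os


def require_rest_token(env=None):
    """Return the TileDB-Cloud API token from the environment, or raise."""
    env = os.environ if env is None else env
    token = env.get("TILEDB_REST_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "TILEDB_REST_TOKEN is not set; export it before using tiledb.cloud"
        )
    return token
```

Calling `require_rest_token()` at the top of a pipeline script surfaces a missing token before any remote work starts.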

Migrating from Open Source to TileDB-Cloud

**Large-Scale Ingestion**
```python
# TileDB-Cloud: Distributed VCF ingestion
import tiledb.cloud.vcf

# Use specialized VCF ingestion module
# Note: Exact API requires TileDB-Cloud documentation
# This represents the available functionality structure
tiledb.cloud.vcf.ingestion.ingest_vcf_dataset(
    source="s3://my-bucket/vcf-files/",
    output="tiledb://my-namespace/large-dataset",
    namespace="my-namespace",
    acn="my-s3-credentials",
    ingest_resources={"cpu": "16", "memory": "64Gi"}
)
```

**Distributed Query Processing**
```python
# TileDB-Cloud: VCF querying across distributed storage
import tiledb.cloud.vcf
import tiledbvcf

# Define the dataset URI
dataset_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"

# Get all samples from the dataset
# (cfg is a TileDB configuration that this snippet does not define)
ds = tiledbvcf.Dataset(dataset_uri, tiledb_config=cfg)
samples = ds.samples()

# Define attributes and ranges to query on
attrs = ["sample_name", "fmt_GT", "fmt_AD", "fmt_DP"]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]

# Perform the read, which is executed in a distributed fashion
df = tiledb.cloud.vcf.read(
    dataset_uri=dataset_uri,
    regions=regions,
    samples=samples,
    attrs=attrs,
    namespace="my-namespace",  # specifies which account to charge
)
df.to_pandas()
```

Enterprise Features

**Data Sharing and Collaboration**
```python
# TileDB-Cloud provides enterprise data sharing capabilities
# through namespace-based permissions and group management

# Access shared datasets via TileDB-Cloud URIs
dataset_uri = "tiledb://shared-namespace/population-study"

# Collaborate through shared notebooks and compute resources
# (Specific API requires TileDB-Cloud documentation)
```


**Cost Optimization**
- **Serverless Compute**: Pay only for actual compute time
- **Auto-scaling**: Automatically scale up/down based on workload
- **Spot Instances**: Use cost-optimized compute for batch jobs
- **Data Tiering**: Automatic hot/cold storage management

**Security and Compliance**
- **End-to-end Encryption**: Data encrypted in transit and at rest
- **Access Controls**: Fine-grained permissions and audit logs
- **HIPAA/SOC2 Compliance**: Enterprise security standards
- **VPC Support**: Deploy in private cloud environments


When to Migrate Checklist

✅ Migrate to TileDB-Cloud if you:

- Have datasets with more than 1000 samples
- Need to process more than 100 GB of VCF data
- Require distributed computing
- Have multiple team members who need access
- Need enterprise security/compliance
- Want cost-optimized serverless compute
- Require 24/7 production uptime

Getting Started with TileDB-Cloud

- **Start Free**: TileDB-Cloud offers a free tier for evaluation
- **Migration Support**: The TileDB team provides migration assistance
- **Training**: Access to genomics-specific tutorials and examples
- **Professional Services**: Custom deployment and optimization

Next Steps:

- Visit https://cloud.tiledb.com to create an account
- Review the documentation at https://cloud.tiledb.com/academy/
- Contact sales@tiledb.com for enterprise needs
