# AnnData
## Overview
AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.
## When to Use This Skill

Use this skill when:

- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
## Installation

```bash
uv pip install anndata
```

With optional dependencies (quoted so the brackets survive shells like zsh):

```bash
uv pip install "anndata[dev,test,doc]"
```
## Quick Start

### Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)
```
```python
# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```
### Reading data

```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10X HDF5 files are read via scanpy, not anndata
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```
### Writing data

```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```
### Basic operations

```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```
## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure, including the X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See references/data_structure.md for comprehensive information on:

- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices
### 2. Input/Output Operations

Read and write data in various formats, with support for compression, backed mode, and cloud storage.

See references/io_operations.md for details on:

- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization

Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format (transpose so observations are rows)
adata = ad.read_mtx('matrix.mtx').T
```
### 3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See references/concatenation.md for comprehensive coverage of:

- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets

Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation (AnnCollection takes AnnData objects, not file paths)
from anndata.experimental import AnnCollection
collection = AnnCollection(
    [ad.read_h5ad('data1.h5ad', backed='r'), ad.read_h5ad('data2.h5ad', backed='r')],
    join_obs='outer',
    label='dataset'
)
```
### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See references/manipulation.md for detailed guidance on:

- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering

Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```
### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See references/best_practices.md for guidelines on:

- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions

Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
## Integration with the Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:
### Scanpy (single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```
### Muon (multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
### PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create a DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)
for batch in dataloader:
    X = batch.X
    # Train model
```
## Common Workflows

### Single-cell RNA-seq analysis

```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10X HDF5 files are read via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (flatten sparse-matrix sums before assigning to obs)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[(adata.obs['n_genes'] > 200) & (adata.obs['n_counts'] < 50000)].copy()

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']].copy()

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```
### Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process the subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```
## Troubleshooting

### Out-of-memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```
### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```
### Index alignment issues

Always align external data on the index:

```python
# Wrong: relies on positional order matching adata's observations
adata.obs['new_col'] = external_data['values']

# Correct: align on observation names
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
## Additional Resources

- Official documentation: https://anndata.readthedocs.io/
- Scanpy tutorials: https://scanpy.readthedocs.io/
- Scverse ecosystem: https://scverse.org/
- GitHub repository: https://github.com/scverse/anndata