cellxgene-census

CZ CELLxGENE Census

Overview

The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
  • 61+ million cells from human and mouse
  • Standardized metadata (cell types, tissues, diseases, donors)
  • Raw gene expression matrices
  • Pre-calculated embeddings and statistics
  • Integration with PyTorch, scanpy, and other analysis tools

When to Use This Skill

This skill should be used when:
  • Querying single-cell expression data by cell type, tissue, or disease
  • Exploring available single-cell datasets and metadata
  • Training machine learning models on single-cell data
  • Performing large-scale cross-dataset analyses
  • Integrating Census data with scanpy or other analysis frameworks
  • Computing statistics across millions of cells
  • Accessing pre-calculated embeddings or model predictions

Installation and Setup

Install the Census API:
bash
uv pip install cellxgene-census
For machine learning workflows, install additional dependencies:
bash
uv pip install "cellxgene-census[experimental]"

Core Workflow Patterns

1. Opening the Census

Always use the context manager to ensure proper resource cleanup:
python
import cellxgene_census

# Open the latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data

# Open a specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data

**Key points:**
- Use the context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses
- The default opens the latest "stable" release
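Pinning a release is easier to enforce when the version string is checked before a session is opened. The sketch below assumes that versions are either the moving aliases `stable` and `latest` or a date-stamped release in the `YYYY-MM-DD` form used above; `validate_census_version` is a hypothetical helper, not part of the `cellxgene_census` API.

```python
from datetime import date

# Aliases assumed for illustration; dated releases follow the YYYY-MM-DD pattern.
ALIASES = {"stable", "latest"}


def validate_census_version(version: str, allow_alias: bool = False) -> str:
    """Return the version unchanged if it is acceptable, else raise ValueError."""
    if version in ALIASES:
        if allow_alias:
            return version
        raise ValueError(f"{version!r} is a moving alias; pin a dated release instead")
    try:
        date.fromisoformat(version)  # raises ValueError on malformed dates
    except ValueError as exc:
        raise ValueError(f"{version!r} is not a YYYY-MM-DD release date") from exc
    return version
```

The returned string can then be passed as `census_version` to `open_soma()`. Rejecting aliases by default is a design choice that favors reproducibility over convenience.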

2. Exploring Census Information

Before querying expression data, explore the available datasets and metadata.
Access summary information:
python
# Get summary statistics (the summary table stores label/value pairs)
summary = census["census_info"]["summary"].read().concat().to_pandas()
total_cells = summary.loc[summary["label"] == "total_cell_count", "value"].iloc[0]
print(f"Total cells: {total_cells}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria (the datasets table has no disease column,
# so match on the collection name instead)
covid_datasets = datasets[datasets["collection_name"].str.contains("COVID", na=False)]

**Query cell metadata to understand available data:**
python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"],
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by tissue
tissue_counts = cell_metadata.groupby("tissue_general", observed=True).size()

**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells, unless you are specifically analyzing duplicates.

3. Querying Expression Data (Small to Medium Scale)

For queries that return fewer than 100k cells and fit in memory, use `get_anndata()`:
python
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)

# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)

**Filter syntax:**
- Use `obs_value_filter` for cell filtering
- Use `var_value_filter` for gene filtering
- Combine conditions with `and`, `or`
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
- Select only needed columns with `obs_column_names`

**Getting metadata separately:**
python
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"],
)

# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"],
)
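The filter strings in this section follow a regular pattern (equality with `==`, membership with `in`, clauses joined by `and`), so they can be assembled from a dictionary rather than written by hand. This is a minimal sketch; `build_value_filter` is a hypothetical helper, not a Census function, and it only covers the equality and membership cases.

```python
def build_value_filter(conditions: dict, primary_only: bool = True) -> str:
    """Join equality / membership conditions with `and` into a filter string."""
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Membership test, e.g. tissue_general in ['lung', 'liver']
            parts.append(f"{column} in {list(value)!r}")
        else:
            # Equality test, e.g. cell_type == 'B cell'
            parts.append(f"{column} == {value!r}")
    if primary_only:
        # Appended by default so duplicate cells are excluded
        parts.append("is_primary_data == True")
    return " and ".join(parts)


obs_filter = build_value_filter(
    {"cell_type": "B cell", "tissue_general": ["lung", "liver"]}
)
```

The resulting string can be passed as `obs_value_filter` to `get_anndata()` or as `value_filter` to `get_obs()`.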

4. Large-Scale Queries (Out-of-Core Processing)

For queries exceeding available RAM, use `axis_query()` with iterative processing:
python
import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    ),
)

# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)  # placeholder for your own per-chunk function

**Computing incremental statistics:**
python
# Example: Calculate mean expression
n_observations = 0
sum_values = 0.0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()

# X is sparse, so this is the mean over stored (non-zero) values only
mean_expression = sum_values / n_observations
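Because each chunk carries only the stored entries as (cell, gene, value) triples, a per-gene mean that accounts for implicit zeros needs the sums accumulated per gene and then divided by the total cell count. The sketch below shows that accumulation on toy data; plain dictionaries of lists stand in for the `pyarrow.Table` batches, and the function names are illustrative.

```python
from collections import defaultdict


def accumulate_gene_sums(batches):
    """Sum expression per gene coordinate across batches of sparse triples.

    Each batch is a dict of equal-length lists keyed like the columns described
    above: soma_dim_0 (cell), soma_dim_1 (gene), soma_data (value).
    """
    sums = defaultdict(float)
    for batch in batches:
        for gene, value in zip(batch["soma_dim_1"], batch["soma_data"]):
            sums[gene] += value
    return dict(sums)


def gene_means(sums, n_cells):
    """Divide by the total number of cells so implicit zeros count toward the mean."""
    return {gene: total / n_cells for gene, total in sums.items()}


# Toy data: 4 cells, 2 genes (coordinates 10 and 11), only non-zero entries stored.
batches = [
    {"soma_dim_0": [0, 1], "soma_dim_1": [10, 10], "soma_data": [2.0, 4.0]},
    {"soma_dim_0": [2, 3], "soma_dim_1": [10, 11], "soma_data": [6.0, 8.0]},
]
means = gene_means(accumulate_gene_sums(batches), n_cells=4)
```

Only the running sums are held in memory, so the same pattern scales to queries that do not fit in RAM.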

5. Machine Learning with PyTorch

For training models, use the experimental PyTorch integration:
python
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels

            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Train/test splitting:
python
from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from experiment
# (experiment_axis_query is an axis query such as the one built in section 4)
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)

# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42,
)
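The idea behind a seeded 80/20 split can be illustrated without any Census objects: shuffle the row indices with a fixed seed, then cut the list. This is a generic sketch of the technique, not the library's implementation, and `split_indices` is a hypothetical name.

```python
import random


def split_indices(n_items: int, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle 0..n_items-1 with a fixed seed and cut into train/test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG leaves the global random state alone
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(10)
```

Reusing the same seed reproduces the same partition, which keeps evaluation comparable across runs.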

6. Integration with Scanpy

Census queries return AnnData objects, so results plug directly into scanpy workflows:
python
import scanpy as sc

# Load data from Census
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
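The first two preprocessing steps reduce to simple per-cell arithmetic: scale each cell's counts so they sum to `target_sum`, then take `log(1 + x)`. The sketch below reproduces that arithmetic for a single cell in plain Python to make the transformation concrete; it is an illustration, not scanpy's implementation.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]


cell = [1, 3, 0, 6]  # raw counts for four genes in one cell
normalized = normalize_log1p(cell)
```

Zero counts stay at zero after the transform, which is why the result remains sparse.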

7. Multi-Dataset Integration

Query and integrate multiple datasets:
python
import anndata as ad

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    adata.obs["queried_tissue"] = tissue  # separate column keeps the Census `tissue` field intact
    adatas.append(adata)

# Concatenate (AnnData.concatenate is deprecated in favor of anndata.concat)
combined = ad.concat(adatas, index_unique="-")

# Strategy 2: Query multiple tissues in a single call
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)

Key Concepts and Best Practices

Always Filter for Primary Data

Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
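A toy example shows why the flag matters. The records below are invented for illustration: the `cell` key is a hypothetical label marking the same biological cell appearing in two datasets, with only one copy flagged as primary.

```python
# Toy obs records: the same cell can appear in several datasets,
# but only one copy carries is_primary_data=True.
obs = [
    {"cell": "c1", "dataset_id": "A", "is_primary_data": True},
    {"cell": "c1", "dataset_id": "B", "is_primary_data": False},
    {"cell": "c2", "dataset_id": "A", "is_primary_data": True},
    {"cell": "c3", "dataset_id": "B", "is_primary_data": True},
]

all_rows = len(obs)  # counts the duplicate copy of c1
primary_rows = sum(1 for row in obs if row["is_primary_data"])  # one row per cell
```

Without the filter, the row count overstates the number of distinct cells by one in this example.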

Specify Census Version for Reproducibility

Always specify the Census version in production analyses:
python
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Run the analysis against the pinned release

Estimate Query Size Before Loading

For large queries, first check the number of cells to avoid memory issues:
python
# Get cell count
metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"],
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")

# If too large (>100k), use out-of-core processing
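Once the cell count is known, the choice between the two query paths can be made explicit. The sketch below encodes the 100k-cell guideline from this document and adds a rough upper bound on memory, assuming a dense float32 matrix; real results are sparse and usually far smaller. Both function names are hypothetical.

```python
def choose_query_strategy(n_cells: int, max_in_memory_cells: int = 100_000) -> str:
    """Pick the in-memory or out-of-core path from a cell count."""
    return "get_anndata" if n_cells <= max_in_memory_cells else "axis_query"


def dense_matrix_gib(n_cells: int, n_genes: int, bytes_per_value: int = 4) -> float:
    """Worst-case size in GiB if every cell-by-gene value were stored densely."""
    return n_cells * n_genes * bytes_per_value / 2**30


strategy = choose_query_strategy(2_000_000)
```

The threshold is a rule of thumb rather than a hard limit; adjust `max_in_memory_cells` to the memory actually available.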

Use tissue_general for Broader Groupings

The `tissue_general` field provides coarser categories than `tissue`, which is useful for cross-tissue analyses:
python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"

# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"

Select Only Needed Columns

Minimize data transfer by specifying only required metadata columns:
python
obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns

Check Dataset Presence for Gene-Specific Queries

When analyzing specific genes, verify which datasets measured them:
python
# Rows are datasets, columns are genes; both are indexed by soma_joinid
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")

# get_presence_matrix takes no gene filter, so look up the gene columns separately
genes = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["soma_joinid", "feature_name"],
)
gene_presence = presence[:, genes["soma_joinid"].to_numpy()]
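Conceptually, a presence matrix is a dataset-by-gene table of 0/1 flags. The toy sketch below uses nested lists as a stand-in for the sparse matrix the library returns; the dataset IDs and flag values are invented for illustration.

```python
# Toy presence matrix: rows are datasets, columns are genes; 1 means measured.
dataset_ids = ["ds_a", "ds_b", "ds_c"]
gene_names = ["CD4", "CD8A", "CD19"]
presence = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]


def datasets_measuring_all(genes):
    """Return dataset IDs whose row has a 1 in every requested gene column."""
    cols = [gene_names.index(g) for g in genes]
    return [ds for ds, row in zip(dataset_ids, presence) if all(row[c] for c in cols)]


complete = datasets_measuring_all(["CD4", "CD8A"])
```

Restricting an analysis to datasets that measured every gene of interest avoids mistaking "not measured" for "not expressed".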

Two-Step Workflow: Explore Then Query

First explore metadata to understand available data, then query expression:
python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"],
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)

Available Metadata Fields

Cell Metadata (obs)

Key fields for filtering:
  • `cell_type`, `cell_type_ontology_term_id`
  • `tissue`, `tissue_general`, `tissue_ontology_term_id`
  • `disease`, `disease_ontology_term_id`
  • `assay`, `assay_ontology_term_id`
  • `donor_id`, `sex`, `self_reported_ethnicity`
  • `development_stage`, `development_stage_ontology_term_id`
  • `dataset_id`
  • `is_primary_data` (Boolean: True = unique cell)

Gene Metadata (var)

  • `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
  • `feature_name` (Gene symbol, e.g., "FOXP2")
  • `feature_length` (Gene length in base pairs)

Reference Documentation

This skill includes detailed reference documentation:

references/census_schema.md

Comprehensive documentation of:
  • Census data structure and organization
  • All available metadata fields
  • Value filter syntax and operators
  • SOMA object types
  • Data inclusion criteria
When to read: when you need detailed schema information, a full list of metadata fields, or complex filter syntax.

references/common_patterns.md

Examples and patterns for:
  • Exploratory queries (metadata only)
  • Small-to-medium queries (AnnData)
  • Large queries (out-of-core processing)
  • PyTorch integration
  • Scanpy integration workflows
  • Multi-dataset integration
  • Best practices and common pitfalls
When to read: When implementing specific query patterns, looking for code examples, or troubleshooting common issues.

Common Use Cases

Use Case 1: Explore Cell Types in a Tissue

python
with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"]
    )
    print(cells["cell_type"].value_counts())

Use Case 2: Query Marker Gene Expression

python
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )

Use Case 3: Train Cell Type Classifier

python
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Train model
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass

Use Case 4: Cross-Tissue Analysis

python
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

    # Normalize raw counts, then analyze macrophage differences across tissues
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.tl.rank_genes_groups(adata, groupby="tissue_general")

Troubleshooting

Query Returns Too Many Cells

  • Add more specific filters to reduce scope
  • Use `tissue` instead of `tissue_general` for finer granularity
  • Filter by a specific `dataset_id` if known
  • Switch to out-of-core processing for large queries

Memory Errors

  • Reduce query scope with more restrictive filters
  • Select fewer genes with `var_value_filter`
  • Use out-of-core processing with `axis_query()`
  • Process data in batches

Duplicate Cells in Results

  • Always include `is_primary_data == True` in filters
  • Check whether you are intentionally querying across multiple datasets

Gene Not Found

  • Verify gene name spelling (case-sensitive)
  • Try the Ensembl ID with `feature_id` instead of `feature_name`
  • Check the dataset presence matrix to see if the gene was measured
  • Some genes may have been filtered during Census construction
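Spelling and capitalization problems can often be caught before a query is issued. This sketch checks a requested symbol against a list of known `feature_name` values, first case-insensitively and then by fuzzy match; `suggest_gene_names` and the short `known` list are hypothetical and stand in for names fetched from the gene metadata.

```python
import difflib


def suggest_gene_names(query: str, known_names: list[str], limit: int = 3) -> list[str]:
    """Return an exact case-insensitive match if present, else close spellings."""
    by_upper = {name.upper(): name for name in known_names}
    if query.upper() in by_upper:
        return [by_upper[query.upper()]]
    # Fall back to fuzzy matching on the upper-cased names, best match first
    close = difflib.get_close_matches(query.upper(), list(by_upper), n=limit, cutoff=0.6)
    return [by_upper[name] for name in close]


known = ["FOXP2", "FOXP3", "CD4", "CD8A"]
fixed_case = suggest_gene_names("foxp2", known)
candidates = suggest_gene_names("CD8", known)
```

An empty result suggests the gene is absent under that symbol, in which case querying by `feature_id` is the next thing to try.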

Version Inconsistencies

  • Always specify `census_version` explicitly
  • Use the same version across all analyses
  • Check release notes for version-specific changes