LaminDB
Overview
LaminDB is an open-source data framework for biology designed to make data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). It provides a unified platform that combines lakehouse architecture, lineage tracking, feature stores, biological ontologies, LIMS (Laboratory Information Management System), and ELN (Electronic Lab Notebook) capabilities through a single Python API.
Core Value Proposition:
- Queryability: Search and filter datasets by metadata, features, and ontology terms
- Traceability: Automatic lineage tracking from raw data through analysis to results
- Reproducibility: Version control for data, code, and environment
- FAIR Compliance: Standardized annotations using biological ontologies
When to Use This Skill
Use this skill when:
- Managing biological datasets: scRNA-seq, bulk RNA-seq, spatial transcriptomics, flow cytometry, multi-modal data, EHR data
- Tracking computational workflows: Notebooks, scripts, pipeline execution (Nextflow, Snakemake, Redun)
- Curating and validating data: Schema validation, standardization, ontology-based annotation
- Working with biological ontologies: Genes, proteins, cell types, tissues, diseases, pathways (via Bionty)
- Building data lakehouses: Unified query interface across multiple datasets
- Ensuring reproducibility: Automatic versioning, lineage tracking, environment capture
- Integrating ML pipelines: Connecting with Weights & Biases, MLflow, HuggingFace, scVI-tools
- Deploying data infrastructure: Setting up local or cloud-based data management systems
- Collaborating on datasets: Sharing curated, annotated data with standardized metadata
Core Capabilities
LaminDB provides six interconnected capability areas, each documented in detail in the references folder.
1. Core Concepts and Data Lineage
Core entities:
- Artifacts: Versioned datasets (DataFrame, AnnData, Parquet, Zarr, etc.)
- Records: Experimental entities (samples, perturbations, instruments)
- Runs & Transforms: Computational lineage tracking (what code produced what data)
- Features: Typed metadata fields for annotation and querying
Key workflows:
- Create and version artifacts from files or Python objects
- Track notebook/script execution with `ln.track()` and `ln.finish()`
- Annotate artifacts with typed features
- Visualize data lineage graphs with `artifact.view_lineage()`
- Query by provenance (find all outputs from specific code/inputs)

Reference: `references/core-concepts.md` - Read this for detailed information on artifacts, records, runs, transforms, features, versioning, and lineage tracking.
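A minimal sketch of this loop, assuming an already-initialized instance (`lamin init`); `demo/counts.parquet` and the DataFrame are placeholders, and constructor names such as `from_df` can differ between LaminDB versions:

```python
import lamindb as ln
import pandas as pd

# Register this script/notebook as a Transform and open a Run
ln.track()

# Create a versioned artifact from an in-memory DataFrame
df = pd.DataFrame({"sample": ["s1", "s2"], "n_reads": [1200, 980]})
artifact = ln.Artifact.from_df(df, key="demo/counts.parquet").save()

# Visualize provenance: transform -> run -> artifact
artifact.view_lineage()

# Close the run, capturing outputs
ln.finish()
```

Re-running the same script saves a new version of the artifact under the same key rather than a duplicate.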
2. Data Management and Querying
Query capabilities:
- Registry exploration and lookup with auto-complete
- Single record retrieval with `get()`, `one()`, `one_or_none()`
- Filtering with comparison operators (`__gt`, `__lte`, `__contains`, `__startswith`)
- Feature-based queries (query by annotated metadata)
- Cross-registry traversal with double-underscore syntax
- Full-text search across registries
- Advanced logical queries with Q objects (AND, OR, NOT)
- Streaming large datasets without loading into memory
Key workflows:
- Browse artifacts with filters and ordering
- Query by features, creation date, creator, size, etc.
- Stream large files in chunks or with array slicing
- Organize data with hierarchical keys
- Group artifacts into collections
Reference: `references/data-management.md` - Read this for comprehensive query patterns, filtering examples, streaming strategies, and data organization best practices.
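The double-underscore operators follow Django-style field lookups. As a rough plain-Python illustration of the comparison semantics behind filters such as `size__gt=1000` or `key__startswith="scrna/"` (this is not LaminDB code, just what the suffixes mean):

```python
# Plain-Python sketch of Django-style lookup semantics.
def matches(record: dict, **lookups) -> bool:
    for lookup, expected in lookups.items():
        field, _, op = lookup.partition("__")
        value = record[field]
        if op == "":                # bare field name: exact match
            ok = value == expected
        elif op == "gt":
            ok = value > expected
        elif op == "lte":
            ok = value <= expected
        elif op == "contains":
            ok = expected in value
        elif op == "startswith":
            ok = value.startswith(expected)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False            # all lookups are ANDed together
    return True

record = {"key": "scrna/batch_0.h5ad", "size": 2048}
print(matches(record, key__startswith="scrna/", size__gt=1000))  # True
```

OR and NOT combinations of such conditions are what `Q` objects express.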
3. Annotation and Validation
Curation process:
- Validation: Confirm datasets match desired schemas
- Standardization: Fix typos, map synonyms to canonical terms
- Annotation: Link datasets to metadata entities for queryability
Schema types:
- Flexible schemas: Validate only known columns, allow additional metadata
- Minimal required schemas: Specify essential columns, permit extras
- Strict schemas: Complete control over structure and values
Supported data types:
- DataFrames (Parquet, CSV)
- AnnData (single-cell genomics)
- MuData (multi-modal)
- SpatialData (spatial transcriptomics)
- TileDB-SOMA (scalable arrays)
Key workflows:
- Define features and schemas for data validation
- Use `DataFrameCurator` or `AnnDataCurator` for validation
- Standardize values with `.cat.standardize()`
- Map to ontologies with `.cat.add_ontology()`
- Save curated artifacts with schema linkage
- Query validated datasets by features

Reference: `references/annotation-validation.md` - Read this for detailed curation workflows, schema design patterns, handling validation errors, and best practices.
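Conceptually, a flexible-schema check validates the known columns and standardizes synonyms before annotation. A toy pandas sketch of that idea (the real workflow uses `DataFrameCurator`; the column names and synonym map here are made up for illustration):

```python
import pandas as pd

# Toy validate -> standardize pass, mirroring a flexible schema:
# known columns are checked, extra columns are allowed.
required = {"cell_type", "tissue"}
synonyms = {"t-cell": "T cell", "B-cell": "B cell"}

df = pd.DataFrame({
    "cell_type": ["t-cell", "B-cell"],
    "tissue": ["blood", "blood"],
    "extra_note": ["ok", "ok"],   # extra metadata is permitted
})

# Validation: required columns must be present
missing = required - set(df.columns)
assert not missing, f"missing required columns: {missing}"

# Standardization: map synonyms to canonical terms
df["cell_type"] = df["cell_type"].map(lambda v: synonyms.get(v, v))
print(df["cell_type"].tolist())  # ['T cell', 'B cell']
```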
4. Biological Ontologies
Available ontologies (via Bionty):
- Genes (Ensembl), Proteins (UniProt)
- Cell types (CL), Cell lines (CLO)
- Tissues (Uberon), Diseases (Mondo, DOID)
- Phenotypes (HPO), Pathways (GO)
- Experimental factors (EFO), Developmental stages
- Organisms (NCBItaxon), Drugs (DrugBank)
Key workflows:
- Import public ontologies with `bt.CellType.import_source()`
- Search ontologies with keyword or exact matching
- Standardize terms using synonym mapping
- Explore hierarchical relationships (parents, children, ancestors)
- Validate data against ontology terms
- Annotate datasets with ontology records
- Create custom terms and hierarchies
- Handle multi-organism contexts (human, mouse, etc.)
Reference: `references/ontologies.md` - Read this for comprehensive ontology operations, standardization strategies, hierarchy navigation, and annotation workflows.
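Hierarchy queries (parents, children, ancestors) amount to walking the ontology's parent DAG, which Bionty exposes through record relationships. A plain-Python sketch of ancestor computation over a made-up fragment of the cell-type hierarchy:

```python
# Toy parent map: child -> parents (ontologies are DAGs, so a term
# may have several parents).
parents = {
    "CD4-positive T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": [],
}

def ancestors(term: str) -> set:
    """All terms reachable by repeatedly following parent links."""
    out = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in out:
            out.add(t)
            stack.extend(parents.get(t, []))
    return out

print(sorted(ancestors("CD4-positive T cell")))
# ['T cell', 'leukocyte', 'lymphocyte']
```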
5. Integrations
Workflow managers:
- Nextflow: Track pipeline processes and outputs
- Snakemake: Integrate into Snakemake rules
- Redun: Combine with Redun task tracking
MLOps platforms:
- Weights & Biases: Link experiments with data artifacts
- MLflow: Track models and experiments
- HuggingFace: Track model fine-tuning
- scVI-tools: Single-cell analysis workflows
Storage systems:
- Local filesystem, AWS S3, Google Cloud Storage
- S3-compatible (MinIO, Cloudflare R2)
- HTTP/HTTPS endpoints (read-only)
- HuggingFace datasets
Array stores:
- TileDB-SOMA (with cellxgene support)
- DuckDB for SQL queries on Parquet files
Visualization:
- Vitessce for interactive spatial/single-cell visualization
Version control:
- Git integration for source code tracking
Reference: `references/integrations.md` - Read this for integration patterns, code examples, and troubleshooting for third-party systems.
6. Setup and Deployment
Installation:
- Basic: `uv pip install lamindb`
- With extras: `uv pip install 'lamindb[gcp,zarr,fcs]'`
- Modules: bionty, wetlab, clinical
Instance types:
- Local SQLite (development)
- Cloud storage + SQLite (small teams)
- Cloud storage + PostgreSQL (production)
Storage options:
- Local filesystem
- AWS S3 with configurable regions and permissions
- Google Cloud Storage
- S3-compatible endpoints (MinIO, Cloudflare R2)
Configuration:
- Cache management for cloud files
- Multi-user system configurations
- Git repository sync
- Environment variables
Deployment patterns:
- Local dev → Cloud production migration
- Multi-region deployments
- Shared storage with personal instances
Reference: `references/setup-deployment.md` - Read this for detailed installation, configuration, storage setup, database management, security best practices, and troubleshooting.
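A typical first-time setup might look like the following; the storage path and module list are placeholders, and CLI flag names can vary between LaminDB versions:

```shell
# Install with the bionty ontology module
uv pip install 'lamindb[bionty]'

# Authenticate (requires a Lamin account)
lamin login

# Initialize a local SQLite-backed instance with bionty enabled
lamin init --storage ./lamin-data --modules bionty
```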
Common Use Case Workflows
Use Case 1: Single-Cell RNA-seq Analysis with Ontology Validation
```python
import anndata as ad
import bionty as bt
import lamindb as ln

# Start tracking
ln.track(params={"analysis": "scRNA-seq QC and annotation"})

# Import cell type ontology
bt.CellType.import_source()

# Load data
adata = ad.read_h5ad("raw_counts.h5ad")

# Validate and standardize cell types
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# Curate with schema (`schema` defined beforehand, see references/annotation-validation.md)
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="scrna/validated.h5ad")

# Link ontology annotations
cell_types = bt.CellType.from_values(adata.obs.cell_type)
artifact.feature_sets.add_ontology(cell_types)

ln.finish()
```
Use Case 2: Building a Queryable Data Lakehouse
```python
import anndata as ad
import lamindb as ln

# Register multiple experiments
# (data_files, tissues, conditions come from the surrounding pipeline)
for i, file in enumerate(data_files):
    artifact = ln.Artifact.from_anndata(
        ad.read_h5ad(file),
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}",
    ).save()
    # Annotate with features
    artifact.features.add_values({
        "batch": i,
        "tissue": tissues[i],
        "condition": conditions[i],
    })

# Query across all experiments
immune_datasets = ln.Artifact.filter(
    key__startswith="scrna/",
    tissue="PBMC",
    condition="treated",
)
immune_datasets.to_dataframe()  # inspect matches as a table

# Load specific datasets
for artifact in immune_datasets:
    adata = artifact.load()
    # Analyze
```
Use Case 3: ML Pipeline with W&B Integration
```python
import joblib
import lamindb as ln
import wandb

# Initialize both systems
wandb.init(project="drug-response", name="exp-42")
ln.track(params={"model": "random_forest", "n_estimators": 100})

# Load training data from LaminDB
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model (train_model is user-defined)
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95})

# Save model in LaminDB with W&B linkage
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact("model.pkl", key="models/exp-42.pkl").save()
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```
Use Case 4: Nextflow Pipeline Integration
```python
# In a Nextflow process script; ${batch_id} is interpolated by Nextflow
import lamindb as ln

ln.track()

# Load input artifact
input_artifact = ln.Artifact.get(key="raw/batch_${batch_id}.fastq.gz")
input_path = input_artifact.cache()

# Process (alignment, quantification, etc.)
# ... Nextflow process logic ...

# Save output
output_artifact = ln.Artifact(
    "counts.csv",
    key="processed/batch_${batch_id}_counts.csv",
).save()
ln.finish()
```
Getting Started Checklist
To start using LaminDB effectively:

- Installation & Setup (`references/setup-deployment.md`)
  - Install LaminDB and required extras
  - Authenticate with `lamin login`
  - Initialize instance with `lamin init --storage ...`
- Learn Core Concepts (`references/core-concepts.md`)
  - Understand Artifacts, Records, Runs, Transforms
  - Practice creating and retrieving artifacts
  - Implement `ln.track()` and `ln.finish()` in workflows
- Master Querying (`references/data-management.md`)
  - Practice filtering and searching registries
  - Learn feature-based queries
  - Experiment with streaming large files
- Set Up Validation (`references/annotation-validation.md`)
  - Define features relevant to research domain
  - Create schemas for data types
  - Practice curation workflows
- Integrate Ontologies (`references/ontologies.md`)
  - Import relevant biological ontologies (genes, cell types, etc.)
  - Validate existing annotations
  - Standardize metadata with ontology terms
- Connect Tools (`references/integrations.md`)
  - Integrate with existing workflow managers
  - Link ML platforms for experiment tracking
  - Configure cloud storage and compute
Key Principles
Follow these principles when working with LaminDB:

- Track everything: Use `ln.track()` at the start of every analysis for automatic lineage capture
- Validate early: Define schemas and validate data before extensive analysis
- Use ontologies: Leverage public biological ontologies for standardized annotations
- Organize with keys: Structure artifact keys hierarchically (e.g., `project/experiment/batch/file.h5ad`)
- Query metadata first: Filter and search before loading large files
- Version, don't duplicate: Use built-in versioning instead of creating new keys for modifications
- Annotate with features: Define typed features for queryable metadata
- Document thoroughly: Add descriptions to artifacts, schemas, and transforms
- Leverage lineage: Use `view_lineage()` to understand data provenance
- Start local, scale cloud: Develop locally with SQLite, deploy to cloud with PostgreSQL
Reference Files
This skill includes comprehensive reference documentation organized by capability:

- `references/core-concepts.md` - Artifacts, records, runs, transforms, features, versioning, lineage
- `references/data-management.md` - Querying, filtering, searching, streaming, organizing data
- `references/annotation-validation.md` - Schema design, curation workflows, validation strategies
- `references/ontologies.md` - Biological ontology management, standardization, hierarchies
- `references/integrations.md` - Workflow managers, MLOps platforms, storage systems, tools
- `references/setup-deployment.md` - Installation, configuration, deployment, troubleshooting

Read the relevant reference file(s) based on the specific LaminDB capability needed for the task at hand.
Additional Resources
- Official Documentation: https://docs.lamin.ai
- API Reference: https://docs.lamin.ai/api
- GitHub Repository: https://github.com/laminlabs/lamindb
- Tutorial: https://docs.lamin.ai/tutorial
- FAQ: https://docs.lamin.ai/faq