LaminDB
Overview
LaminDB is an open-source data framework for biology designed to make data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). It provides a unified platform that combines lakehouse architecture, lineage tracking, feature stores, biological ontologies, LIMS (Laboratory Information Management System), and ELN (Electronic Lab Notebook) capabilities through a single Python API.
Core Value Proposition:
- Queryability: Search and filter datasets by metadata, features, and ontology terms
- Traceability: Automatic lineage tracking from raw data through analysis to results
- Reproducibility: Version control for data, code, and environment
- FAIR Compliance: Standardized annotations using biological ontologies
When to Use This Skill
Use this skill when:
- Managing biological datasets: scRNA-seq, bulk RNA-seq, spatial transcriptomics, flow cytometry, multi-modal data, EHR data
- Tracking computational workflows: Notebooks, scripts, pipeline execution (Nextflow, Snakemake, Redun)
- Curating and validating data: Schema validation, standardization, ontology-based annotation
- Working with biological ontologies: Genes, proteins, cell types, tissues, diseases, pathways (via Bionty)
- Building data lakehouses: Unified query interface across multiple datasets
- Ensuring reproducibility: Automatic versioning, lineage tracking, environment capture
- Integrating ML pipelines: Connecting with Weights & Biases, MLflow, HuggingFace, scVI-tools
- Deploying data infrastructure: Setting up local or cloud-based data management systems
- Collaborating on datasets: Sharing curated, annotated data with standardized metadata
Core Capabilities
LaminDB provides six interconnected capability areas, each documented in detail in the references folder.
1. Core Concepts and Data Lineage
Core entities:
- Artifacts: Versioned datasets (DataFrame, AnnData, Parquet, Zarr, etc.)
- Records: Experimental entities (samples, perturbations, instruments)
- Runs & Transforms: Computational lineage tracking (what code produced what data)
- Features: Typed metadata fields for annotation and querying
Key workflows:
- Create and version artifacts from files or Python objects
- Track notebook/script execution with `ln.track()` and `ln.finish()`
- Annotate artifacts with typed features
- Visualize data lineage graphs with `artifact.view_lineage()`
- Query by provenance (find all outputs from specific code/inputs)

Reference: `references/core-concepts.md` - Read this for detailed information on artifacts, records, runs, transforms, features, versioning, and lineage tracking.
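A minimal sketch of this loop, assuming an already-initialized instance (`lamin init`); `demo/counts.parquet` and the DataFrame are placeholders, and constructor names such as `from_df` can differ between LaminDB versions:

```python
import lamindb as ln
import pandas as pd

# Register this script/notebook as a Transform and open a Run
ln.track()

# Create a versioned artifact from an in-memory DataFrame
df = pd.DataFrame({"sample": ["s1", "s2"], "n_reads": [1200, 980]})
artifact = ln.Artifact.from_df(df, key="demo/counts.parquet").save()

# Visualize provenance: transform -> run -> artifact
artifact.view_lineage()

# Close the run, capturing outputs
ln.finish()
```

Re-running the same script saves a new version of the artifact under the same key rather than a duplicate.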
2. Data Management and Querying
Query capabilities:
- Registry exploration and lookup with auto-complete
- Single record retrieval with `get()`, `one()`, `one_or_none()`
- Filtering with comparison operators (`__gt`, `__lte`, `__contains`, `__startswith`)
- Feature-based queries (query by annotated metadata)
- Cross-registry traversal with double-underscore syntax
- Full-text search across registries
- Advanced logical queries with Q objects (AND, OR, NOT)
- Streaming large datasets without loading into memory
Key workflows:
- Browse artifacts with filters and ordering
- Query by features, creation date, creator, size, etc.
- Stream large files in chunks or with array slicing
- Organize data with hierarchical keys
- Group artifacts into collections
Reference: `references/data-management.md` - Read this for comprehensive query patterns, filtering examples, streaming strategies, and data organization best practices.
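The double-underscore operators follow Django-style field lookups. As a rough plain-Python illustration of the comparison semantics behind filters such as `size__gt=1000` or `key__startswith="scrna/"` (this is not LaminDB code, just what the suffixes mean):

```python
# Plain-Python sketch of Django-style lookup semantics.
def matches(record: dict, **lookups) -> bool:
    for lookup, expected in lookups.items():
        field, _, op = lookup.partition("__")
        value = record[field]
        if op == "":                # bare field name: exact match
            ok = value == expected
        elif op == "gt":
            ok = value > expected
        elif op == "lte":
            ok = value <= expected
        elif op == "contains":
            ok = expected in value
        elif op == "startswith":
            ok = value.startswith(expected)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False            # all lookups are ANDed together
    return True

record = {"key": "scrna/batch_0.h5ad", "size": 2048}
print(matches(record, key__startswith="scrna/", size__gt=1000))  # True
```

OR and NOT combinations of such conditions are what `Q` objects express.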
3. Annotation and Validation
Curation process:
- Validation: Confirm datasets match desired schemas
- Standardization: Fix typos, map synonyms to canonical terms
- Annotation: Link datasets to metadata entities for queryability
Schema types:
- Flexible schemas: Validate only known columns, allow additional metadata
- Minimal required schemas: Specify essential columns, permit extras
- Strict schemas: Complete control over structure and values
Supported data types:
- DataFrames (Parquet, CSV)
- AnnData (single-cell genomics)
- MuData (multi-modal)
- SpatialData (spatial transcriptomics)
- TileDB-SOMA (scalable arrays)
Key workflows:
- Define features and schemas for data validation
- Use `DataFrameCurator` or `AnnDataCurator` for validation
- Standardize values with `.cat.standardize()`
- Map to ontologies with `.cat.add_ontology()`
- Save curated artifacts with schema linkage
- Query validated datasets by features

Reference: `references/annotation-validation.md` - Read this for detailed curation workflows, schema design patterns, handling validation errors, and best practices.
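Conceptually, a flexible-schema check validates the known columns and standardizes synonyms before annotation. A toy pandas sketch of that idea (the real workflow uses `DataFrameCurator`; the column names and synonym map here are made up for illustration):

```python
import pandas as pd

# Toy validate -> standardize pass, mirroring a flexible schema:
# known columns are checked, extra columns are allowed.
required = {"cell_type", "tissue"}
synonyms = {"t-cell": "T cell", "B-cell": "B cell"}

df = pd.DataFrame({
    "cell_type": ["t-cell", "B-cell"],
    "tissue": ["blood", "blood"],
    "extra_note": ["ok", "ok"],   # extra metadata is permitted
})

# Validation: required columns must be present
missing = required - set(df.columns)
assert not missing, f"missing required columns: {missing}"

# Standardization: map synonyms to canonical terms
df["cell_type"] = df["cell_type"].map(lambda v: synonyms.get(v, v))
print(df["cell_type"].tolist())  # ['T cell', 'B cell']
```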
4. Biological Ontologies
Available ontologies (via Bionty):
- Genes (Ensembl), Proteins (UniProt)
- Cell types (CL), Cell lines (CLO)
- Tissues (Uberon), Diseases (Mondo, DOID)
- Phenotypes (HPO), Pathways (GO)
- Experimental factors (EFO), Developmental stages
- Organisms (NCBItaxon), Drugs (DrugBank)
Key workflows:
- Import public ontologies with `bt.CellType.import_source()`
- Search ontologies with keyword or exact matching
- Standardize terms using synonym mapping
- Explore hierarchical relationships (parents, children, ancestors)
- Validate data against ontology terms
- Annotate datasets with ontology records
- Create custom terms and hierarchies
- Handle multi-organism contexts (human, mouse, etc.)
Reference: `references/ontologies.md` - Read this for comprehensive ontology operations, standardization strategies, hierarchy navigation, and annotation workflows.
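Hierarchy queries (parents, children, ancestors) amount to walking the ontology's parent DAG, which Bionty exposes through record relationships. A plain-Python sketch of ancestor computation over a made-up fragment of the cell-type hierarchy:

```python
# Toy parent map: child -> parents (ontologies are DAGs, so a term
# may have several parents).
parents = {
    "CD4-positive T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": [],
}

def ancestors(term: str) -> set:
    """All terms reachable by repeatedly following parent links."""
    out = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in out:
            out.add(t)
            stack.extend(parents.get(t, []))
    return out

print(sorted(ancestors("CD4-positive T cell")))
# ['T cell', 'leukocyte', 'lymphocyte']
```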
5. Integrations
Workflow managers:
- Nextflow: Track pipeline processes and outputs
- Snakemake: Integrate into Snakemake rules
- Redun: Combine with Redun task tracking
MLOps platforms:
- Weights & Biases: Link experiments with data artifacts
- MLflow: Track models and experiments
- HuggingFace: Track model fine-tuning
- scVI-tools: Single-cell analysis workflows
Storage systems:
- Local filesystem, AWS S3, Google Cloud Storage
- S3-compatible (MinIO, Cloudflare R2)
- HTTP/HTTPS endpoints (read-only)
- HuggingFace datasets
Array stores:
- TileDB-SOMA (with cellxgene support)
- DuckDB for SQL queries on Parquet files
Visualization:
- Vitessce for interactive spatial/single-cell visualization
Version control:
- Git integration for source code tracking
Reference: `references/integrations.md` - Read this for integration patterns, code examples, and troubleshooting for third-party systems.
6. Setup and Deployment
Installation:
- Basic: `uv pip install lamindb`
- With extras: `uv pip install 'lamindb[gcp,zarr,fcs]'`
- Modules: bionty, wetlab, clinical
Instance types:
- Local SQLite (development)
- Cloud storage + SQLite (small teams)
- Cloud storage + PostgreSQL (production)
Storage options:
- Local filesystem
- AWS S3 with configurable regions and permissions
- Google Cloud Storage
- S3-compatible endpoints (MinIO, Cloudflare R2)
Configuration:
- Cache management for cloud files
- Multi-user system configurations
- Git repository sync
- Environment variables
Deployment patterns:
- Local dev → Cloud production migration
- Multi-region deployments
- Shared storage with personal instances
Reference: `references/setup-deployment.md` - Read this for detailed installation, configuration, storage setup, database management, security best practices, and troubleshooting.
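A typical first-time setup might look like the following; the storage path and module list are placeholders, and CLI flag names can vary between LaminDB versions:

```shell
# Install with the bionty ontology module
uv pip install 'lamindb[bionty]'

# Authenticate (requires a Lamin account)
lamin login

# Initialize a local SQLite-backed instance with bionty enabled
lamin init --storage ./lamin-data --modules bionty
```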
Common Use Case Workflows
Use Case 1: Single-Cell RNA-seq Analysis with Ontology Validation
```python
import anndata as ad
import bionty as bt
import lamindb as ln

# Start tracking
ln.track(params={"analysis": "scRNA-seq QC and annotation"})

# Import cell type ontology
bt.CellType.import_source()

# Load data
adata = ad.read_h5ad("raw_counts.h5ad")

# Validate and standardize cell types
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# Curate with schema (`schema` defined beforehand, see references/annotation-validation.md)
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="scrna/validated.h5ad")

# Link ontology annotations
cell_types = bt.CellType.from_values(adata.obs.cell_type)
artifact.feature_sets.add_ontology(cell_types)

ln.finish()
```
Use Case 2: Building a Queryable Data Lakehouse
```python
import anndata as ad
import lamindb as ln

# Register multiple experiments
# (data_files, tissues, conditions come from the surrounding pipeline)
for i, file in enumerate(data_files):
    artifact = ln.Artifact.from_anndata(
        ad.read_h5ad(file),
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}",
    ).save()
    # Annotate with features
    artifact.features.add_values({
        "batch": i,
        "tissue": tissues[i],
        "condition": conditions[i],
    })

# Query across all experiments
immune_datasets = ln.Artifact.filter(
    key__startswith="scrna/",
    tissue="PBMC",
    condition="treated",
)
immune_datasets.to_dataframe()  # inspect matches as a table

# Load specific datasets
for artifact in immune_datasets:
    adata = artifact.load()
    # Analyze
```
Use Case 3: ML Pipeline with W&B Integration
```python
import joblib
import lamindb as ln
import wandb

# Initialize both systems
wandb.init(project="drug-response", name="exp-42")
ln.track(params={"model": "random_forest", "n_estimators": 100})

# Load training data from LaminDB
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model (train_model is user-defined)
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95})

# Save model in LaminDB with W&B linkage
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact("model.pkl", key="models/exp-42.pkl").save()
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```
Use Case 4: Nextflow Pipeline Integration
```python
# In a Nextflow process script; ${batch_id} is interpolated by Nextflow
import lamindb as ln

ln.track()

# Load input artifact
input_artifact = ln.Artifact.get(key="raw/batch_${batch_id}.fastq.gz")
input_path = input_artifact.cache()

# Process (alignment, quantification, etc.)
# ... Nextflow process logic ...

# Save output
output_artifact = ln.Artifact(
    "counts.csv",
    key="processed/batch_${batch_id}_counts.csv",
).save()
ln.finish()
```
Getting Started Checklist
To start using LaminDB effectively:

- Installation & Setup (`references/setup-deployment.md`)
  - Install LaminDB and required extras
  - Authenticate with `lamin login`
  - Initialize instance with `lamin init --storage ...`
- Learn Core Concepts (`references/core-concepts.md`)
  - Understand Artifacts, Records, Runs, Transforms
  - Practice creating and retrieving artifacts
  - Implement `ln.track()` and `ln.finish()` in workflows
- Master Querying (`references/data-management.md`)
  - Practice filtering and searching registries
  - Learn feature-based queries
  - Experiment with streaming large files
- Set Up Validation (`references/annotation-validation.md`)
  - Define features relevant to research domain
  - Create schemas for data types
  - Practice curation workflows
- Integrate Ontologies (`references/ontologies.md`)
  - Import relevant biological ontologies (genes, cell types, etc.)
  - Validate existing annotations
  - Standardize metadata with ontology terms
- Connect Tools (`references/integrations.md`)
  - Integrate with existing workflow managers
  - Link ML platforms for experiment tracking
  - Configure cloud storage and compute
Key Principles
Follow these principles when working with LaminDB:

- Track everything: Use `ln.track()` at the start of every analysis for automatic lineage capture
- Validate early: Define schemas and validate data before extensive analysis
- Use ontologies: Leverage public biological ontologies for standardized annotations
- Organize with keys: Structure artifact keys hierarchically (e.g., `project/experiment/batch/file.h5ad`)
- Query metadata first: Filter and search before loading large files
- Version, don't duplicate: Use built-in versioning instead of creating new keys for modifications
- Annotate with features: Define typed features for queryable metadata
- Document thoroughly: Add descriptions to artifacts, schemas, and transforms
- Leverage lineage: Use `view_lineage()` to understand data provenance
- Start local, scale cloud: Develop locally with SQLite, deploy to cloud with PostgreSQL
Reference Files
This skill includes comprehensive reference documentation organized by capability:

- `references/core-concepts.md` - Artifacts, records, runs, transforms, features, versioning, lineage
- `references/data-management.md` - Querying, filtering, searching, streaming, organizing data
- `references/annotation-validation.md` - Schema design, curation workflows, validation strategies
- `references/ontologies.md` - Biological ontology management, standardization, hierarchies
- `references/integrations.md` - Workflow managers, MLOps platforms, storage systems, tools
- `references/setup-deployment.md` - Installation, configuration, deployment, troubleshooting

Read the relevant reference file(s) based on the specific LaminDB capability needed for the task at hand.
Additional Resources
- Official Documentation: https://docs.lamin.ai
- API Reference: https://docs.lamin.ai/api
- GitHub Repository: https://github.com/laminlabs/lamindb
- Tutorial: https://docs.lamin.ai/tutorial
- FAQ: https://docs.lamin.ai/faq