pytdc

PyTDC (Therapeutics Data Commons)

Overview

PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).

When to Use This Skill

This skill should be used when:
  • Working with drug discovery or therapeutic ML datasets
  • Benchmarking machine learning models on standardized pharmaceutical tasks
  • Predicting molecular properties (ADME, toxicity, bioactivity)
  • Predicting drug-target or drug-drug interactions
  • Generating novel molecules with desired properties
  • Accessing curated datasets with proper train/test splits (scaffold, cold-split)
  • Using molecular oracles for property optimization

Installation & Setup

Install PyTDC with pip (the commands below use uv's pip interface):
bash
uv pip install PyTDC
To upgrade to the latest version:
bash
uv pip install PyTDC --upgrade
Core dependencies (automatically installed):
  • numpy, pandas, tqdm, seaborn, scikit_learn, fuzzywuzzy
Additional packages are installed automatically as needed for specific features.

Quick Start

The basic pattern for accessing any TDC dataset follows this structure:
python
from tdc.<problem> import <Task>
data = <Task>(name='<Dataset>')
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])
df = data.get_data(format='df')
Where:
  • <problem>: One of single_pred, multi_pred, or generation
  • <Task>: Specific task category (e.g., ADME, DTI, MolGen)
  • <Dataset>: Dataset name within that task
Example - Loading ADME data:
python
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold')
# Returns dict with 'train', 'valid', 'test' DataFrames
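
The `frac` and `seed` arguments in the pattern above can be read as train/valid/test proportions and a reproducibility seed. The sketch below shows that idea with a plain random split in pure Python. `random_split` is a hypothetical helper written for illustration, not TDC's implementation, and a scaffold split would additionally group molecules by scaffold before assigning them to folds.

```python
import random

def random_split(rows, frac=(0.7, 0.1, 0.2), seed=1):
    """Shuffle rows reproducibly and cut them into train/valid/test lists."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("frac must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order
    n_train = round(len(shuffled) * frac[0])
    n_valid = round(len(shuffled) * frac[1])
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }

# Illustrative input: 100 row indices standing in for dataset rows.
split = random_split(range(100))
sizes = {name: len(part) for name, part in split.items()}
```

Calling the helper twice with the same seed gives the same partition, which is the property the `seed` argument is meant to provide.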

Single-Instance Prediction Tasks

Single-instance prediction involves forecasting properties of individual biomedical entities (molecules, proteins, etc.).

Available Task Categories

1. ADME (Absorption, Distribution, Metabolism, Excretion)

Predict pharmacokinetic properties of drug molecules.
python
from tdc.single_pred import ADME
data = ADME(name='Caco2_Wang')  # Intestinal permeability
# Other datasets: HIA_Hou, Bioavailability_Ma, Lipophilicity_AstraZeneca, etc.


**Common ADME datasets:**
- Caco2 - Intestinal permeability
- HIA - Human intestinal absorption
- Bioavailability - Oral bioavailability
- Lipophilicity - Octanol-water partition coefficient
- Solubility - Aqueous solubility
- BBB - Blood-brain barrier penetration
- CYP - Cytochrome P450 metabolism

2. Toxicity (Tox)

Predict toxicity and adverse effects of compounds.
python
from tdc.single_pred import Tox
data = Tox(name='hERG')  # Cardiotoxicity
# Other datasets: AMES, DILI, Carcinogens_Lagunin, etc.


**Common toxicity datasets:**
- hERG - Cardiac toxicity
- AMES - Mutagenicity
- DILI - Drug-induced liver injury
- Carcinogens - Carcinogenicity
- ClinTox - Clinical trial toxicity

3. HTS (High-Throughput Screening)

Bioactivity predictions from screening data.
python
from tdc.single_pred import HTS
data = HTS(name='SARSCoV2_Vitro_Touret')

4. QM (Quantum Mechanics)

Quantum mechanical properties of molecules.
python
from tdc.single_pred import QM
data = QM(name='QM7b')

5. Other Single Prediction Tasks

  • Yields: Chemical reaction yield prediction
  • Epitope: Epitope prediction for biologics
  • Develop: Antibody developability prediction
  • CRISPROutcome: Gene editing outcome prediction

Data Format

Single prediction datasets typically return DataFrames with columns:
  • Drug_ID or Compound_ID: Unique identifier
  • Drug or X: SMILES string or molecular representation
  • Y: Target label (continuous or binary)
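
As a small illustration of this layout, the sketch below builds a few rows with the `Drug_ID`, `Drug`, and `Y` columns and checks whether the label looks binary or continuous, which in turn decides whether a classification or regression metric applies. The rows and the `label_kind` helper are made up for illustration; real TDC datasets come back as pandas DataFrames.

```python
# Made-up rows in the Drug_ID / Drug / Y layout (not real TDC data).
rows = [
    {"Drug_ID": "mol_1", "Drug": "CCO", "Y": -4.9},
    {"Drug_ID": "mol_2", "Drug": "c1ccccc1", "Y": -5.3},
    {"Drug_ID": "mol_3", "Drug": "CC(=O)O", "Y": -6.1},
]

def label_kind(values):
    """Return 'binary' if every label is 0 or 1, otherwise 'continuous'."""
    return "binary" if set(values) <= {0, 1} else "continuous"

kind = label_kind(row["Y"] for row in rows)
```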

Multi-Instance Prediction Tasks

Multi-instance prediction involves forecasting properties of interactions between multiple biomedical entities.

Available Task Categories

1. DTI (Drug-Target Interaction)

Predict binding affinity between drugs and protein targets.
python
from tdc.multi_pred import DTI
data = DTI(name='BindingDB_Kd')
split = data.get_split()
Available datasets:
  • BindingDB_Kd - Dissociation constant (52,284 pairs)
  • BindingDB_IC50 - Half-maximal inhibitory concentration (991,486 pairs)
  • BindingDB_Ki - Inhibition constant (375,032 pairs)
  • DAVIS, KIBA - Kinase binding datasets
Data format: Drug_ID, Target_ID, Drug (SMILES), Target (sequence), Y (binding affinity)

2. DDI (Drug-Drug Interaction)

Predict interactions between drug pairs.
python
from tdc.multi_pred import DDI
data = DDI(name='DrugBank')
split = data.get_split()
Multi-class classification task predicting interaction types. Dataset contains 191,808 DDI pairs with 1,706 drugs.

3. PPI (Protein-Protein Interaction)

Predict protein-protein interactions.
python
from tdc.multi_pred import PPI
data = PPI(name='HuRI')

4. Other Multi-Prediction Tasks

  • GDA: Gene-disease associations
  • DrugRes: Drug resistance prediction
  • DrugSyn: Drug synergy prediction
  • PeptideMHC: Peptide-MHC binding
  • AntibodyAff: Antibody affinity prediction
  • MTI: miRNA-target interactions
  • Catalyst: Catalyst prediction
  • TrialOutcome: Clinical trial outcome prediction

Generation Tasks

Generation tasks involve creating novel biomedical entities with desired properties.

1. Molecular Generation (MolGen)

Generate diverse, novel molecules with desirable chemical properties.
python
from tdc.generation import MolGen
data = MolGen(name='ChEMBL_V29')
split = data.get_split()
Use with oracles to optimize for specific properties:
python
from tdc import Oracle
oracle = Oracle(name='GSK3B')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')  # Evaluate SMILES
See references/oracles.md for all available oracle functions.

2. Retrosynthesis (RetroSyn)

Predict reactants needed to synthesize a target molecule.
python
from tdc.generation import RetroSyn
data = RetroSyn(name='USPTO')
split = data.get_split()
Dataset contains 1,939,253 reactions from USPTO database.

3. Paired Molecule Generation

Generate molecule pairs (e.g., prodrug-drug pairs).
python
from tdc.generation import PairMolGen
data = PairMolGen(name='Prodrug')
For detailed oracle documentation and molecular generation workflows, refer to references/oracles.md and scripts/molecular_generation.py.

Benchmark Groups

Benchmark groups provide curated collections of related datasets for systematic model evaluation.

ADMET Benchmark Group

python
from tdc.benchmark_group import admet_group
group = admet_group(path='data/')

predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    # Get the benchmark dataset and its seed-specific train/valid split
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train model here (`model` is a user-implemented placeholder)
    predictions_list.append({name: model.predict(test['Drug'])})

# Evaluate with required 5 seeds
results = group.evaluate_many(predictions_list)

**ADMET Group includes 22 datasets** covering absorption, distribution, metabolism, excretion, and toxicity.
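
The five-seed protocol exists so that a result is reported as an average with a spread rather than as a single run. A minimal sketch of that aggregation step follows; `summarize` is a hypothetical helper, the scores are invented placeholders, and the sample standard deviation used here is an assumption that may differ from what the benchmark group computes internally.

```python
from statistics import mean, stdev

def summarize(per_seed_scores):
    """Return (mean, sample standard deviation) across seeds, rounded to 3 places."""
    scores = list(per_seed_scores.values())
    return round(mean(scores), 3), round(stdev(scores), 3)

# Hypothetical MAE values for seeds 1-5; these are not real results.
mae_by_seed = {1: 0.40, 2: 0.42, 3: 0.38, 4: 0.41, 5: 0.39}
avg, spread = summarize(mae_by_seed)
```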

Other Benchmark Groups

Available benchmark groups include collections for:
  • ADMET properties
  • Drug-target interactions
  • Drug combination prediction
  • And more specialized therapeutic tasks
For benchmark evaluation workflows, see scripts/benchmark_evaluation.py.

Data Functions

TDC provides comprehensive data processing utilities organized into four categories.

1. Dataset Splits

Retrieve train/validation/test partitions with various strategies:
python
# Scaffold split (default for most tasks)
split = data.get_split(method='scaffold', seed=1, frac=[0.7, 0.1, 0.2])

# Random split
split = data.get_split(method='random', seed=42, frac=[0.8, 0.1, 0.1])

# Cold split (for DTI/DDI tasks)
split = data.get_split(method='cold_drug', seed=1)    # Unseen drugs in test
split = data.get_split(method='cold_target', seed=1)  # Unseen targets in test

**Available split strategies:**
- `random`: Random shuffling
- `scaffold`: Scaffold-based (for chemical diversity)
- `cold_drug`, `cold_target`, `cold_drug_target`: For DTI tasks
- `temporal`: Time-based splits for temporal datasets
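
A cold split differs from a random split in that whole entities, not individual pairs, are assigned to folds, so every drug (or target) in the test set is unseen during training. The sketch below illustrates the idea in pure Python. `cold_split` and the toy pairs are hypothetical and only approximate what the library does.

```python
import random

def cold_split(pairs, key, frac=(0.7, 0.1, 0.2), seed=1):
    """Split pair records so each entity under `key` appears in only one fold."""
    entities = sorted({p[key] for p in pairs})
    random.Random(seed).shuffle(entities)
    n_train = round(len(entities) * frac[0])
    n_valid = round(len(entities) * frac[1])
    fold_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            fold_of[entity] = "train"
        elif i < n_train + n_valid:
            fold_of[entity] = "valid"
        else:
            fold_of[entity] = "test"
    split = {"train": [], "valid": [], "test": []}
    for p in pairs:
        split[fold_of[p[key]]].append(p)
    return split

# Toy data: 10 drugs x 3 targets, labels are placeholders.
pairs = [{"Drug_ID": f"D{d}", "Target_ID": f"T{t}", "Y": 0.0}
         for d in range(10) for t in range(3)]
split = cold_split(pairs, key="Drug_ID")
train_drugs = {p["Drug_ID"] for p in split["train"]}
test_drugs = {p["Drug_ID"] for p in split["test"]}
```

Passing `key="Target_ID"` instead gives the cold-target variant of the same idea.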

2. Model Evaluation

Use standardized metrics for evaluation:
python
from tdc import Evaluator

# For binary classification
evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred)

# For regression
evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)

**Available metrics:** ROC-AUC, PR-AUC, F1, Accuracy, RMSE, MAE, R2, Spearman, Pearson, and more.
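
To make clear what three of these metrics measure, here is a dependency-free sketch of RMSE, MAE, and ROC-AUC. These are textbook definitions written for illustration, not the library's code, and the sample values are invented.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, y_score):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

reg_true, reg_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
clf_true, clf_score = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(clf_true, clf_score)
```

Lower is better for RMSE and MAE, while higher is better for ROC-AUC, which matters when comparing scores across task types.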

3. Data Processing

TDC provides 11 key processing utilities:
python
from tdc.chem_utils import MolConvert

# Molecule format conversion
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

**Processing utilities include:**
- Molecule format conversion (SMILES, SELFIES, PyG, DGL, ECFP, etc.)
- Molecule filters (PAINS, drug-likeness)
- Label binarization and unit conversion
- Data balancing (over/under-sampling)
- Negative sampling for pair data
- Graph transformation
- Entity retrieval (CID to SMILES, UniProt to sequence)

For comprehensive utilities documentation, see `references/utilities.md`.
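
Negative sampling for pair data, one of the utilities listed above, can be pictured as drawing drug-target combinations that never appear among the known positives. The sketch below is a hypothetical pure-Python version, not the library's routine. Note that an unobserved pair is only assumed to be negative, since it may simply be untested.

```python
import random

def sample_negatives(positives, n, seed=0):
    """Draw n (drug, target) pairs that never occur among the positive pairs."""
    known = set(positives)
    drugs = sorted({d for d, _ in positives})
    targets = sorted({t for _, t in positives})
    pool = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    if n > len(pool):
        raise ValueError("not enough unobserved pairs to sample from")
    return random.Random(seed).sample(pool, n)

# Made-up positive pairs for illustration.
positives = [("D1", "T1"), ("D2", "T2"), ("D3", "T1")]
negatives = sample_negatives(positives, n=2)
```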

4. Molecule Generation Oracles

TDC provides 17+ oracle functions for molecular optimization:
python
from tdc import Oracle

# Single oracle, single molecule
oracle = Oracle(name='DRD2')
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# Single oracle, multiple molecules (replace the placeholders with real SMILES strings)
oracle = Oracle(name='JNK3')
scores = oracle(['SMILES1', 'SMILES2', 'SMILES3'])

For complete oracle documentation, see `references/oracles.md`.
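
Because an oracle is just a callable from a SMILES string to a score, it slots directly into a selection loop. The sketch below shows the ranking step of goal-directed generation with a stand-in scorer so that it runs without any downloads. `toy_oracle` and `top_k` are hypothetical names, and the toy score (share of characters that are `C` or `c`) has no chemical meaning.

```python
def toy_oracle(smiles):
    """Stand-in scorer: fraction of characters that are 'C' or 'c' (illustration only)."""
    return sum(ch in "Cc" for ch in smiles) / len(smiles)

def top_k(candidates, oracle, k=2):
    """Score every candidate with the oracle and return the k best (score, SMILES) pairs."""
    scored = sorted(((oracle(s), s) for s in candidates), reverse=True)
    return scored[:k]

candidates = ["CCO", "c1ccccc1", "O=C=O", "CCCC"]
best = top_k(candidates, toy_oracle)
```

In a real workflow the stand-in would be replaced by an oracle object such as the ones constructed above, and the survivors would seed the next round of candidate generation.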

Advanced Features

Retrieve Available Datasets

python
from tdc.utils import retrieve_dataset_names

# Get all ADME datasets
adme_datasets = retrieve_dataset_names('ADME')

# Get all DTI datasets
dti_datasets = retrieve_dataset_names('DTI')

Label Transformations

python
# Get label mapping
label_map = data.get_label_map(name='DrugBank')

# Convert labels
from tdc.chem_utils import label_transform
transformed = label_transform(y, from_unit='nM', to_unit='p')
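
The nM-to-p conversion mentioned above is the usual negative log10 of the molar concentration, so 1 nM corresponds to a p-value of 9. A minimal sketch of that arithmetic follows. `nm_to_p` is a hypothetical helper, and the library's own transform may handle zeros or add small offsets differently.

```python
import math

def nm_to_p(value_nm):
    """Convert a concentration in nanomolar to the negative log10 molar scale."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # -log10(value_nm * 1e-9) simplifies to 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)

p_values = [nm_to_p(v) for v in (1.0, 10.0, 1000.0)]
```

On this scale a larger number means tighter binding, which reverses the ordering of the raw nM values.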

Database Queries

python
from tdc.utils import cid2smiles, uniprot2seq

# Convert PubChem CID to SMILES
smiles = cid2smiles(2244)

# Convert UniProt ID to amino acid sequence
sequence = uniprot2seq('P12345')

Common Workflows

Workflow 1: Train a Single Prediction Model

See scripts/load_and_split_data.py for a complete example:
python
from tdc.single_pred import ADME
from tdc import Evaluator

# Load data
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42)
train, valid, test = split['train'], split['valid'], split['test']

# Train model (user implements `model`)
model.fit(train['Drug'], train['Y'])
predictions = model.predict(test['Drug'])

# Evaluate
evaluator = Evaluator(name='MAE')
score = evaluator(test['Y'], predictions)

Workflow 2: Benchmark Evaluation

See scripts/benchmark_evaluation.py for a complete example with multiple seeds and proper evaluation protocol.

Workflow 3: Molecular Generation with Oracles

See scripts/molecular_generation.py for an example of goal-directed generation using oracle functions.

Resources

This skill includes bundled resources for common TDC workflows:

scripts/

  • load_and_split_data.py: Template for loading and splitting TDC datasets with various strategies
  • benchmark_evaluation.py: Template for running benchmark group evaluations with proper 5-seed protocol
  • molecular_generation.py: Template for molecular generation using oracle functions

references/

  • datasets.md: Comprehensive catalog of all available datasets organized by task type
  • oracles.md: Complete documentation of all 17+ molecule generation oracles
  • utilities.md: Detailed guide to data processing, splitting, and evaluation utilities

Additional Resources