tooluniverse-proteomics-data-retrieval

Proteomics Data Retrieval

Find and retrieve metadata for publicly available proteomics datasets from MassIVE and ProteomeXchange repositories. Supports searching by species, keyword, or accession, and returns detailed dataset metadata including instruments, publications, species, and post-translational modifications.

When to Use This Skill

Triggers:

"Find proteomics datasets for [organism/disease/protein]"
"Search MassIVE for [keyword]"
"Get details for PXD000001" or "Look up MSV000079514"
"What public mass spectrometry datasets exist for [topic]?"
"Find MS datasets with [PTM type] data"
"List recent human proteomics datasets"

Use Cases:

Dataset Discovery: Search repositories for proteomics experiments related to a research topic
Accession Lookup: Get full metadata for a known dataset accession (PXD or MSV)
Species-Filtered Search: Find all datasets for a specific organism
Cross-Repository Search: Query both MassIVE and ProteomeXchange for comprehensive coverage
Experimental Context: Find published datasets to validate or complement in-house results

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

KEY PRINCIPLES

ProteomeXchange is the aggregator -- it indexes datasets from PRIDE, MassIVE, PeptideAtlas, jPOST, and iProX
MassIVE has richer metadata -- includes summaries, keywords, modifications, and contacts
Search both repositories -- ProteomeXchange for breadth, MassIVE for detail
Species uses NCBI taxonomy IDs -- human = 9606, mouse = 10090, rat = 10116
Accession formats: PXD (ProteomeXchange), MSV (MassIVE) -- both accepted by MassIVE_get_dataset
LOOK UP DON'T GUESS -- Never assume which datasets exist, their accessions, or their instrument types. Always search and retrieve metadata to confirm.

Domain Reasoning: Dataset Quality Assessment

Dataset quality depends on instrument, sample preparation, and quantification method. TMT/iTRAQ (isobaric labeling) datasets have ratio compression and co-isolation interference biases that differ from label-free quantification (LFQ). DIA datasets require different analysis pipelines than DDA. Check the original publication for methods before reusing data in a meta-analysis or cross-study comparison. Instrument resolution (Orbitrap > ion trap) and acquisition mode (DIA > DDA for completeness) directly affect how many proteins are quantified and at what confidence.

Core Repositories Integrated

Repository	Coverage	Strengths
MassIVE	10,000+ datasets	Rich metadata (summaries, keywords, modifications, contacts), species filtering by taxonomy ID
ProteomeXchange	Aggregates PRIDE, MassIVE, PeptideAtlas, jPOST, iProX	Broadest coverage, standardized PXD accessions

Workflow Overview

Query (keyword / species / accession)
|
+-- PHASE 0: Input Resolution
|   Determine search type: keyword, species, or accession lookup
|
+-- PHASE 1: Repository Search
|   Search MassIVE and/or ProteomeXchange based on query type
|
+-- PHASE 2: Dataset Detail Retrieval
|   Get full metadata for promising hits
|
+-- PHASE 3: Result Synthesis
    Compile datasets with metadata, publications, and relevance assessment

Phase 0: Input Resolution

Objective: Determine the query type and prepare appropriate search parameters.

Decision Logic

Accession provided (e.g., `PXD000001`, `MSV000079514`):
- PXD accession: call `ProteomeXchange_get_dataset` and optionally `MassIVE_get_dataset`
- MSV accession: call `MassIVE_get_dataset`
- Skip Phase 1, go directly to Phase 2
Species name provided (e.g., "human", "mouse"):
- Map to NCBI taxonomy ID: human=9606, mouse=10090, rat=10116, yeast=559292, zebrafish=7955, fly=7227, worm=6239, arabidopsis=3702
- Use `MassIVE_search_datasets` with the `species` filter
Keyword provided (e.g., "phosphoproteomics", "breast cancer"):
- Use `ProteomeXchange_search_datasets` with the `query` parameter
- MassIVE does not support keyword search -- use ProteomeXchange for keyword queries
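
The decision logic above can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself: the function name `resolve_input`, the returned dictionary layout, and the assumption that accessions are a `PXD` or `MSV` prefix followed only by digits are all hypothetical choices made for the example.

```python
import re

# Species names from the mapping above -> NCBI taxonomy IDs. Values are
# strings because MassIVE_search_datasets takes the ID as a string.
SPECIES_TAXIDS = {
    "human": "9606", "mouse": "10090", "rat": "10116", "yeast": "559292",
    "zebrafish": "7955", "fly": "7227", "worm": "6239", "arabidopsis": "3702",
}

# Assumption: an accession is a PXD or MSV prefix followed by digits only.
ACCESSION_RE = re.compile(r"^(PXD|MSV)\d+$", re.IGNORECASE)


def resolve_input(query: str) -> dict:
    """Classify a raw query and name the tool(s) that Phase 0 points to."""
    text = query.strip()
    match = ACCESSION_RE.match(text)
    if match:
        if match.group(1).upper() == "PXD":
            tools = ["ProteomeXchange_get_dataset", "MassIVE_get_dataset"]
        else:
            tools = ["MassIVE_get_dataset"]
        return {"type": "accession", "accession": text.upper(), "tools": tools}

    taxid = SPECIES_TAXIDS.get(text.lower())
    if taxid is not None:
        return {"type": "species", "species": taxid,
                "tools": ["MassIVE_search_datasets"]}

    # Anything else is treated as a keyword, which only ProteomeXchange supports.
    return {"type": "keyword", "query": text,
            "tools": ["ProteomeXchange_search_datasets"]}
```

Accession queries skip the search phase, so a caller could branch on `result["type"] == "accession"` to go straight to detail retrieval.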

Phase 1: Repository Search

Objective: Find relevant datasets across repositories.

Tools

MassIVE_search_datasets:

`page_size`: Number of results to return (integer, max 100, default 10)
`species`: NCBI taxonomy ID string to filter by species (e.g., `"9606"` for human)
Returns: Array of dataset objects with `accessions` (array), `title`, `summary`, `species`, `instruments`, `keywords`
Note: No keyword/text search parameter -- filtering is by species only

ProteomeXchange_search_datasets:

`query`: Optional search filter -- keyword or dataset accession (e.g., `"phosphoproteomics"`, `"PXD"`)
`limit`: Max results (1-50, default 10)
Returns: `{data: [{accession, title, species}], metadata: {source, total_returned, query}}`

Workflow

For species-specific search:
- Call `MassIVE_search_datasets(page_size=20, species="9606")` for species-filtered results
- Call `ProteomeXchange_search_datasets(limit=20)` for a broader listing

For keyword search:
- Call `ProteomeXchange_search_datasets(query="keyword", limit=20)`
- Review titles for relevance

For comprehensive discovery:
- Call both tools in parallel
- Merge results, deduplicate by accession (PXD accessions may appear in both)
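
The merge-and-deduplicate step can be sketched as follows. It assumes the two result lists have already been retrieved and have the shapes described in the Tools section (MassIVE records carry an `accessions` array, ProteomeXchange records a single `accession`). The function name, the `sources` field, and the accessions used in any example data are hypothetical.

```python
def merge_search_results(massive_hits: list, px_hits: list) -> list:
    """Merge hits from both searches, keeping one record per dataset.

    A ProteomeXchange hit is treated as a duplicate when its accession
    already appears in the `accessions` array of a MassIVE hit.
    """
    merged = []
    index_by_accession = {}  # accession -> position in `merged`

    for hit in massive_hits:
        accessions = list(hit.get("accessions", []))
        merged.append({
            "accessions": accessions,
            "title": hit.get("title"),
            "sources": ["MassIVE"],
        })
        for accession in accessions:
            index_by_accession[accession] = len(merged) - 1

    for hit in px_hits:
        accession = hit["accession"]
        if accession in index_by_accession:
            # Same dataset seen in both repositories: record the second source.
            record = merged[index_by_accession[accession]]
            if "ProteomeXchange" not in record["sources"]:
                record["sources"].append("ProteomeXchange")
        else:
            merged.append({
                "accessions": [accession],
                "title": hit.get("title"),
                "sources": ["ProteomeXchange"],
            })
            index_by_accession[accession] = len(merged) - 1

    return merged
```

Keeping the `sources` list makes it easy to see later which detail tool is likely to return the richer record for each dataset.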

Response Format Notes

MassIVE_search_datasets: Returns a direct array (no `{data: ...}` wrapper)
ProteomeXchange_search_datasets: Returns `{data: [...], metadata: {...}}`
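
Because the two tools differ in whether the payload is wrapped, downstream code may find it convenient to normalize responses first. The helper below is a minimal sketch; the name `unwrap` is hypothetical, and it assumes that an unwrapped MassIVE payload never has both `data` and `metadata` as top-level keys.

```python
def unwrap(response):
    """Return the payload whether or not it sits in a {data, metadata} wrapper."""
    is_wrapped = (
        isinstance(response, dict)
        and "data" in response
        and "metadata" in response
    )
    return response["data"] if is_wrapped else response
```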

Phase 2: Dataset Detail Retrieval

Objective: Get full metadata for datasets of interest.

Tools

MassIVE_get_dataset:

`accession`: Dataset accession -- accepts both MSV and PXD formats (e.g., `"MSV000079514"`, `"PXD003971"`)
Returns: Object with `accessions`, `title`, `summary`, `species`, `instruments`, `keywords`, `contacts`, `publications`, `modifications`

ProteomeXchange_get_dataset:

`px_id`: ProteomeXchange identifier in PXD format (e.g., `"PXD000001"`)
Returns: `{data: {px_id, title, species, identifiers, instruments, publications, file_count}, metadata: {...}}`

Workflow

For each promising dataset from Phase 1, call the appropriate detail tool
Extract key metadata: title, species, instruments, publications (PubMed/DOI), modifications
For PXD accessions: prefer `ProteomeXchange_get_dataset` for file count; use `MassIVE_get_dataset` for richer summary/keywords
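
For a PXD accession where both detail tools were called, the two responses can be folded into one record. The sketch below assumes the response shapes listed under Tools; the function name `combine_details` and the precedence rule (ProteomeXchange first for shared fields, MassIVE as fallback) are illustrative choices rather than anything the tools prescribe.

```python
from typing import Optional

SHARED_FIELDS = ("title", "species", "instruments", "publications")
MASSIVE_ONLY_FIELDS = ("summary", "keywords", "modifications")


def combine_details(px_response: Optional[dict],
                    massive_record: Optional[dict]) -> dict:
    """Fold ProteomeXchange and MassIVE detail responses into one record."""
    # ProteomeXchange wraps its payload in {data, metadata}; MassIVE does not.
    px = (px_response or {}).get("data", {})
    massive = massive_record or {}

    combined = {}
    for field in SHARED_FIELDS:
        value = px.get(field)
        # Fall back to MassIVE when ProteomeXchange has nothing for the field.
        combined[field] = value if value else massive.get(field)
    for field in MASSIVE_ONLY_FIELDS:
        combined[field] = massive.get(field)
    combined["file_count"] = px.get("file_count")
    return combined
```

Either argument may be `None`, so the same function covers MSV-only datasets and the case where one of the calls failed.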

Key Fields to Extract

title: Dataset name/description
species: Organism(s) studied
instruments: Mass spectrometer(s) used (e.g., Orbitrap, Q Exactive, TripleTOF)
publications: PubMed IDs and DOIs for associated papers
modifications: PTMs studied (from MassIVE only)
file_count: Number of raw files (from ProteomeXchange only)
keywords: Topic tags (from MassIVE only)

Phase 3: Result Synthesis

Objective: Compile and present dataset results in a structured format.

Report Format

Proteomics Dataset Search Results

Query: [original query]
Date: YYYY-MM-DD
Repositories searched: MassIVE, ProteomeXchange

Summary

Found N datasets matching [criteria].

Datasets

1. [Title]

Accession: PXD/MSV number
Species: [organism]
Instruments: [MS platforms]
Publications: [PubMed IDs / DOIs]
Modifications: [PTMs if available]
Files: [count if available]
Summary: [brief description]

2. [Title]

...

Data Gaps

[Note any limitations in search coverage]

---
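
The report template can be filled in programmatically once records have been compiled. The sketch below assumes each dataset is a plain dictionary with the keys named in `FIELDS`; those key names, the `render_report` signature, and the "not available" placeholder are hypothetical conventions chosen for the example.

```python
from datetime import date

# (label in the template, key in the dataset record)
FIELDS = [
    ("Accession", "accession"),
    ("Species", "species"),
    ("Instruments", "instruments"),
    ("Publications", "publications"),
    ("Modifications", "modifications"),
    ("Files", "file_count"),
    ("Summary", "summary"),
]


def _show(value):
    """Format a field value; lists are comma-joined, gaps are made explicit."""
    if value is None or value in ("", [], ()):
        return "not available"
    if isinstance(value, (list, tuple)):
        return ", ".join(str(item) for item in value)
    return str(value)


def render_report(query, datasets, gaps=(), report_date=None):
    """Fill the report template with retrieved dataset records."""
    report_date = report_date or date.today()
    lines = [
        "Proteomics Dataset Search Results", "",
        f"Query: {query}",
        f"Date: {report_date.isoformat()}",
        "Repositories searched: MassIVE, ProteomeXchange", "",
        "Summary", "",
        f"Found {len(datasets)} datasets matching {query}.", "",
        "Datasets", "",
    ]
    for number, record in enumerate(datasets, start=1):
        lines += [f"{number}. {_show(record.get('title'))}", ""]
        lines += [f"{label}: {_show(record.get(key))}" for label, key in FIELDS]
        lines.append("")
    lines += ["Data Gaps", ""]
    lines += list(gaps) or ["None noted."]
    return "\n".join(lines)
```

Printing missing fields as "not available" rather than omitting them keeps metadata gaps visible to the reader, which matters when datasets are later compared for quality.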

Tool Parameter Reference

Tool	Parameter	Notes
`MassIVE_search_datasets`	`page_size`	Integer, max 100. Default 10
`MassIVE_search_datasets`	`species`	NCBI taxonomy ID as string (e.g., `"9606"` not `9606`)
`MassIVE_get_dataset`	`accession`	Accepts both MSV and PXD formats
`ProteomeXchange_search_datasets`	`query`	Optional keyword or accession filter
`ProteomeXchange_search_datasets`	`limit`	Integer, 1-50
`ProteomeXchange_get_dataset`	`px_id`	PXD format only (e.g., `"PXD000001"`)

Response Format Notes:

MassIVE_search_datasets: Returns direct array of dataset objects (no wrapper)
MassIVE_get_dataset: Returns direct object (no wrapper)
ProteomeXchange_search_datasets: Returns `{data: [...], metadata: {...}}`
ProteomeXchange_get_dataset: Returns `{data: {...}, metadata: {...}}`

Fallback Strategies

Situation	Fallback
MassIVE search returns empty	Use ProteomeXchange search (broader coverage)
ProteomeXchange search returns empty	Try broader/simpler query terms
MassIVE_get_dataset fails for PXD accession	Use ProteomeXchange_get_dataset instead
Species taxonomy ID unknown	Search ProteomeXchange by keyword (organism name)
No keyword search results	Try individual terms instead of multi-word queries
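
The last row of the table, retrying with individual terms when a multi-word query returns nothing, can be sketched as below. The `search` argument is a hypothetical callable that wraps `ProteomeXchange_search_datasets` and returns the list of hits; how the tool is actually invoked is left to the caller.

```python
from typing import Callable, List, Tuple


def search_with_fallback(query: str,
                         search: Callable[[str], List[dict]]) -> Tuple[str, List[dict]]:
    """Try the full query first, then each individual term in order.

    Returns the query string that produced hits together with those hits,
    or the original query and an empty list if nothing matched.
    """
    hits = search(query)
    if hits:
        return query, hits

    terms = query.split()
    if len(terms) > 1:
        for term in terms:
            hits = search(term)
            if hits:
                return term, hits

    return query, []
```

Returning the term that actually matched lets the report state that results come from a broadened query rather than the original one.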

Common Species Taxonomy IDs

Species	Taxonomy ID
Human	9606
Mouse	10090
Rat	10116
Zebrafish	7955
Fruit fly	7227
C. elegans	6239
S. cerevisiae	559292
A. thaliana	3702
E. coli	562

Interpretation Framework

Quality Indicator	Good	Acceptable	Caution
Instrument	Orbitrap Exploris/Eclipse, timsTOF	Q Exactive, TripleTOF 6600	Older LTQ, ion trap only
Publication	Peer-reviewed with PubMed ID	Preprint or DOI only	No associated publication
Metadata completeness	Species + instrument + PTMs + summary	Species + instrument only	Title only, no annotations

Interpreting dataset search results:

Datasets with both MassIVE and ProteomeXchange accessions generally have richer metadata; MassIVE provides summaries and keywords while ProteomeXchange provides file counts -- cross-reference both for a complete picture.
Instrument type determines data quality ceiling: high-resolution instruments (Orbitrap, timsTOF) produce higher mass accuracy and more reliable quantification than older ion trap platforms.
A dataset lacking a peer-reviewed publication may still be valuable, but its experimental design and processing pipeline cannot be independently verified -- weight such datasets lower in meta-analyses.
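
The quality table can be turned into simple tiering helpers. This is only a rough sketch: instrument matching is by substring against the names in the table, any instrument not listed falls into "caution" by assumption, and the function names and input shapes (separate lists of PubMed IDs and DOIs, a record dictionary with `species`, `instruments`, `modifications`, `summary`) are hypothetical.

```python
GOOD_INSTRUMENTS = ("exploris", "eclipse", "timstof")
ACCEPTABLE_INSTRUMENTS = ("q exactive", "tripletof 6600")


def instrument_tier(instruments) -> str:
    """Tier by the best instrument listed; unlisted platforms count as caution."""
    names = [name.lower() for name in instruments or []]
    if any(key in name for name in names for key in GOOD_INSTRUMENTS):
        return "good"
    if any(key in name for name in names for key in ACCEPTABLE_INSTRUMENTS):
        return "acceptable"
    return "caution"


def publication_tier(pubmed_ids, dois) -> str:
    """PubMed ID -> good; DOI only (including preprints) -> acceptable."""
    if pubmed_ids:
        return "good"
    if dois:
        return "acceptable"
    return "caution"


def metadata_tier(record: dict) -> str:
    """Tier by which annotation fields are populated."""
    has_core = bool(record.get("species")) and bool(record.get("instruments"))
    has_extra = bool(record.get("modifications")) and bool(record.get("summary"))
    if has_core and has_extra:
        return "good"
    if has_core:
        return "acceptable"
    return "caution"
```

The tiers are a screening aid only; the methods section of the associated publication remains the authority on whether a dataset suits a given reanalysis.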

Synthesis questions to address in the report:

Do multiple independent datasets for the same organism/condition show consistent protein identifications, or do discrepancies suggest batch effects?
Is the instrument platform appropriate for the analysis type (e.g., DIA requires high-resolution; TMT requires MS3 or calibrated MS2)?
Are the reported PTM types and species consistent with the user's research question, or is additional filtering needed?

Limitations

MassIVE: No keyword/text search -- only species-based filtering via the `species` parameter
ProteomeXchange: Limited metadata in search results (no summaries or keywords); get details via `ProteomeXchange_get_dataset`
No full-text search: Cannot search within dataset descriptions or abstracts across repositories
No download: These tools retrieve metadata only, not raw data files
Rate limits: Both APIs may throttle under heavy load; keep `page_size` / `limit` reasonable
Coverage: ProteomeXchange is the most comprehensive but may lag behind individual repositories for very recent submissions

Integration with Other Skills

Skill	Relationship
`tooluniverse-proteomics-analysis`	Use retrieved datasets as input for MS data analysis
`tooluniverse-protein-modification-analysis`	Find PTM-specific datasets to complement iPTMnet annotations
`tooluniverse-multi-omics-integration`	Discover proteomics datasets for cross-omics integration

References

MassIVE: https://massive.ucsd.edu
ProteomeXchange: http://www.proteomexchange.org
PRIDE: https://www.ebi.ac.uk/pride
ProXI API: https://github.com/PRIDE-Archive/proxi-schemas
