tooluniverse-proteomics-data-retrieval

Proteomics Data Retrieval

Find and retrieve metadata for publicly available proteomics datasets from MassIVE and ProteomeXchange repositories. Supports searching by species, keyword, or accession, and returns detailed dataset metadata including instruments, publications, species, and post-translational modifications.

When to Use This Skill

Triggers:
  • "Find proteomics datasets for [organism/disease/protein]"
  • "Search MassIVE for [keyword]"
  • "Get details for PXD000001" or "Look up MSV000079514"
  • "What public mass spectrometry datasets exist for [topic]?"
  • "Find MS datasets with [PTM type] data"
  • "List recent human proteomics datasets"
Use Cases:
  1. Dataset Discovery: Search repositories for proteomics experiments related to a research topic
  2. Accession Lookup: Get full metadata for a known dataset accession (PXD or MSV)
  3. Species-Filtered Search: Find all datasets for a specific organism
  4. Cross-Repository Search: Query both MassIVE and ProteomeXchange for comprehensive coverage
  5. Experimental Context: Find published datasets to validate or complement in-house results

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
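As a minimal sketch of computing rather than describing, the snippet below tallies how many datasets report each instrument. The records are made-up placeholders, and it assumes `instruments` arrives as a list of strings, which the tool documentation here does not confirm.

```python
from collections import Counter

# Placeholder records loosely shaped like MassIVE_get_dataset output.
# Titles and instrument lists are illustrative, not real datasets.
datasets = [
    {"title": "Example A", "instruments": ["Q Exactive"]},
    {"title": "Example B", "instruments": ["Orbitrap Fusion", "Q Exactive"]},
    {"title": "Example C", "instruments": []},
]


def instrument_counts(records):
    """Count how many datasets report each instrument."""
    counts = Counter()
    for rec in records:
        # set() keeps one dataset from counting the same instrument twice
        counts.update(set(rec.get("instruments") or []))
    return counts


counts = instrument_counts(datasets)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```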

KEY PRINCIPLES

  1. ProteomeXchange is the aggregator -- it indexes datasets from PRIDE, MassIVE, PeptideAtlas, jPOST, and iProX
  2. MassIVE has richer metadata -- includes summaries, keywords, modifications, and contacts
  3. Search both repositories -- ProteomeXchange for breadth, MassIVE for detail
  4. Species uses NCBI taxonomy IDs -- human = 9606, mouse = 10090, rat = 10116
  5. Accession formats: PXD (ProteomeXchange), MSV (MassIVE) -- both accepted by MassIVE_get_dataset
  6. LOOK UP DON'T GUESS -- Never assume which datasets exist, their accessions, or their instrument types. Always search and retrieve metadata to confirm.

Domain Reasoning: Dataset Quality Assessment

Dataset quality depends on instrument, sample preparation, and quantification method. TMT/iTRAQ (isobaric labeling) datasets have ratio compression and co-isolation interference biases that differ from label-free quantification (LFQ). DIA datasets require different analysis pipelines than DDA. Check the original publication for methods before reusing data in a meta-analysis or cross-study comparison. Instrument resolution (Orbitrap > ion trap) and acquisition mode (DIA > DDA for completeness) directly affect how many proteins are quantified and at what confidence.

Core Repositories Integrated

| Repository | Coverage | Strengths |
|---|---|---|
| MassIVE | 10,000+ datasets | Rich metadata (summaries, keywords, modifications, contacts), species filtering by taxonomy ID |
| ProteomeXchange | Aggregates PRIDE, MassIVE, PeptideAtlas, jPOST, iProX | Broadest coverage, standardized PXD accessions |

Workflow Overview

Query (keyword / species / accession)
|
+-- PHASE 0: Input Resolution
|   Determine search type: keyword, species, or accession lookup
|
+-- PHASE 1: Repository Search
|   Search MassIVE and/or ProteomeXchange based on query type
|
+-- PHASE 2: Dataset Detail Retrieval
|   Get full metadata for promising hits
|
+-- PHASE 3: Result Synthesis
    Compile datasets with metadata, publications, and relevance assessment

Phase 0: Input Resolution

Objective: Determine the query type and prepare appropriate search parameters.

Decision Logic

  • Accession provided (e.g., PXD000001, MSV000079514):
    • PXD accession: call ProteomeXchange_get_dataset and optionally MassIVE_get_dataset
    • MSV accession: call MassIVE_get_dataset
    • Skip Phase 1, go directly to Phase 2
  • Species name provided (e.g., "human", "mouse"):
    • Map to NCBI taxonomy ID: human=9606, mouse=10090, rat=10116, yeast=559292, zebrafish=7955, fly=7227, worm=6239, arabidopsis=3702
    • Use MassIVE_search_datasets with the species filter
  • Keyword provided (e.g., "phosphoproteomics", "breast cancer"):
    • Use ProteomeXchange_search_datasets with the query parameter
    • MassIVE does not support keyword search -- use ProteomeXchange for keyword queries
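
The decision logic above can be sketched as a small classifier. The function name `resolve_query`, its return labels, and the accession regex (prefix followed by digits) are assumptions made for illustration; only the taxonomy IDs come from the mapping listed above.

```python
import re

# Taxonomy IDs copied from the mapping above, kept as strings because
# MassIVE_search_datasets expects a string for `species`.
SPECIES_TO_TAXID = {
    "human": "9606", "mouse": "10090", "rat": "10116", "yeast": "559292",
    "zebrafish": "7955", "fly": "7227", "worm": "6239", "arabidopsis": "3702",
}

# Assumed pattern: PXD or MSV followed by digits only.
ACCESSION_RE = re.compile(r"^(PXD|MSV)\d+$", re.IGNORECASE)


def resolve_query(text):
    """Return (query_type, value) for a raw user query (hypothetical helper)."""
    cleaned = text.strip()
    match = ACCESSION_RE.match(cleaned)
    if match:
        # e.g. "pxd_accession" or "msv_accession"
        return match.group(1).lower() + "_accession", cleaned.upper()
    taxid = SPECIES_TO_TAXID.get(cleaned.lower())
    if taxid:
        return "species", taxid
    # Anything else is treated as a keyword for ProteomeXchange
    return "keyword", cleaned
```

An accession result skips Phase 1, a species result goes to MassIVE_search_datasets, and a keyword result goes to ProteomeXchange_search_datasets.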

Phase 1: Repository Search

Objective: Find relevant datasets across repositories.

Tools

MassIVE_search_datasets:
  • page_size: Number of results to return (integer, max 100, default 10)
  • species: NCBI taxonomy ID string to filter by species (e.g., "9606" for human)
  • Returns: Array of dataset objects with accessions (array), title, summary, species, instruments, keywords
  • Note: No keyword/text search parameter -- filtering is by species only
ProteomeXchange_search_datasets:
  • query: Optional search filter -- keyword or dataset accession (e.g., "phosphoproteomics", "PXD")
  • limit: Max results (1-50, default 10)
  • Returns: {data: [{accession, title, species}], metadata: {source, total_returned, query}}
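
A minimal sketch of building arguments that respect the documented bounds. The helper names are hypothetical, and the lower bound of 1 for page_size is an assumption, since only the maximum is stated above.

```python
def massive_search_args(page_size=10, species=None):
    """Build MassIVE_search_datasets arguments (hypothetical helper)."""
    # Documented maximum is 100; the minimum of 1 is assumed.
    args = {"page_size": max(1, min(int(page_size), 100))}
    if species is not None:
        # The taxonomy ID must be passed as a string, e.g. "9606" not 9606
        args["species"] = str(species)
    return args


def px_search_args(query=None, limit=10):
    """Build ProteomeXchange_search_datasets arguments (hypothetical helper)."""
    # Documented range for limit is 1-50
    args = {"limit": max(1, min(int(limit), 50))}
    if query:
        args["query"] = query
    return args
```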

Workflow

  1. For species-specific search:
    • Call MassIVE_search_datasets(page_size=20, species="9606") for species-filtered results
    • Call ProteomeXchange_search_datasets(limit=20) for a broader listing
  2. For keyword search:
    • Call ProteomeXchange_search_datasets(query="keyword", limit=20)
    • Review titles for relevance
  3. For comprehensive discovery:
    • Call both tools in parallel
    • Merge results, deduplicate by accession (PXD accessions may appear in both)
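
The merge-and-deduplicate step might look like the sketch below. It assumes the two response shapes described in this document and that each entry of a MassIVE `accessions` array is a plain accession string; `merge_search_results` is a hypothetical helper, not part of either tool.

```python
def merge_search_results(massive_results, px_response):
    """Merge hits from both tools, collapsing entries that share an accession."""
    entries = []
    by_accession = {}  # accession -> merged entry

    def add(accessions, title, source):
        accessions = set(accessions)
        # Reuse an existing entry if any accession was already seen
        entry = next((by_accession[a] for a in accessions if a in by_accession), None)
        if entry is None:
            entry = {"accessions": set(), "title": title, "sources": set()}
            entries.append(entry)
        entry["accessions"] |= accessions
        entry["sources"].add(source)
        for a in accessions:
            by_accession[a] = entry

    # ProteomeXchange wraps hits in {"data": [...]}, one accession per hit
    for hit in px_response.get("data", []):
        add([hit["accession"]], hit.get("title"), "ProteomeXchange")
    # MassIVE returns a bare list; each hit carries an "accessions" array
    for hit in massive_results:
        add(hit.get("accessions", []), hit.get("title"), "MassIVE")
    return entries
```

Keeping a `sources` set on each entry makes it easy to see which datasets can be looked up in both repositories during Phase 2.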

Response Format Notes

  • MassIVE_search_datasets: Returns a direct array (no {data: ...} wrapper)
  • ProteomeXchange_search_datasets: Returns {data: [...], metadata: {...}}

Phase 2: Dataset Detail Retrieval

Objective: Get full metadata for datasets of interest.

Tools

MassIVE_get_dataset:
  • accession: Dataset accession -- accepts both MSV and PXD formats (e.g., "MSV000079514", "PXD003971")
  • Returns: Object with accessions, title, summary, species, instruments, keywords, contacts, publications, modifications
ProteomeXchange_get_dataset:
  • px_id: ProteomeXchange identifier in PXD format (e.g., "PXD000001")
  • Returns: {data: {px_id, title, species, identifiers, instruments, publications, file_count}, metadata: {...}}

Workflow

  1. For each promising dataset from Phase 1, call the appropriate detail tool
  2. Extract key metadata: title, species, instruments, publications (PubMed/DOI), modifications
  3. For PXD accessions: prefer ProteomeXchange_get_dataset for file count; use MassIVE_get_dataset for richer summary/keywords

Key Fields to Extract

  • title: Dataset name/description
  • species: Organism(s) studied
  • instruments: Mass spectrometer(s) used (e.g., Orbitrap, Q Exactive, TripleTOF)
  • publications: PubMed IDs and DOIs for associated papers
  • modifications: PTMs studied (from MassIVE only)
  • file_count: Number of raw files (from ProteomeXchange only)
  • keywords: Topic tags (from MassIVE only)
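
One way to combine the two detail responses into a single record with the fields listed above is sketched here. `combine_details` is a hypothetical helper. It prefers MassIVE values where both sources may provide a field and assumes the response shapes documented in this section.

```python
def combine_details(massive_record=None, px_response=None):
    """Flatten MassIVE and ProteomeXchange detail responses into one record."""
    mv = massive_record or {}                 # MassIVE returns a bare object
    px = (px_response or {}).get("data", {})  # ProteomeXchange wraps in "data"
    return {
        "title": mv.get("title") or px.get("title"),
        "species": mv.get("species") or px.get("species"),
        "instruments": mv.get("instruments") or px.get("instruments"),
        "publications": mv.get("publications") or px.get("publications"),
        "modifications": mv.get("modifications"),  # MassIVE only
        "keywords": mv.get("keywords"),            # MassIVE only
        "summary": mv.get("summary"),              # MassIVE only
        "file_count": px.get("file_count"),        # ProteomeXchange only
    }
```

Fields that neither source returned come back as None, so the report can omit them rather than guess.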

Phase 3: Result Synthesis

Objective: Compile and present dataset results in a structured format.

Report Format

Proteomics Dataset Search Results

Query: [original query]
Date: YYYY-MM-DD
Repositories searched: MassIVE, ProteomeXchange

Summary

Found N datasets matching [criteria].

Datasets

1. [Title]

  • Accession: PXD/MSV number
  • Species: [organism]
  • Instruments: [MS platforms]
  • Publications: [PubMed IDs / DOIs]
  • Modifications: [PTMs if available]
  • Files: [count if available]
  • Summary: [brief description]

2. [Title]

...

Data Gaps

[Note any limitations in search coverage]

---
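
The template above could be filled programmatically. The sketch below assumes each dataset is a flat dict with the keys shown in `FIELDS`; both `FIELDS` and `render_report` are hypothetical names. It leaves out any field the repositories did not return.

```python
from datetime import date

# (label in the report, key in the dataset dict); the key names are assumed
FIELDS = [
    ("Accession", "accession"), ("Species", "species"),
    ("Instruments", "instruments"), ("Publications", "publications"),
    ("Modifications", "modifications"), ("Files", "file_count"),
    ("Summary", "summary"),
]


def render_report(query, criteria, datasets, gaps="None noted"):
    """Render the plain-text report from a list of dataset dicts."""
    lines = [
        "Proteomics Dataset Search Results", "",
        f"Query: {query}",
        f"Date: {date.today().isoformat()}",
        "Repositories searched: MassIVE, ProteomeXchange", "",
        "Summary", "",
        f"Found {len(datasets)} datasets matching {criteria}.", "",
        "Datasets", "",
    ]
    for i, ds in enumerate(datasets, start=1):
        lines += [f"{i}. {ds.get('title', '[untitled]')}", ""]
        for label, key in FIELDS:
            value = ds.get(key)
            if value in (None, "", []):
                continue  # omit fields that were not provided
            if isinstance(value, list):
                value = ", ".join(str(v) for v in value)
            lines.append(f"  • {label}: {value}")
        lines.append("")
    lines += ["Data Gaps", "", gaps]
    return "\n".join(lines)
```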

Tool Parameter Reference

| Tool | Parameter | Notes |
|---|---|---|
| MassIVE_search_datasets | page_size | Integer, max 100. Default 10 |
| MassIVE_search_datasets | species | NCBI taxonomy ID as string (e.g., "9606" not 9606) |
| MassIVE_get_dataset | accession | Accepts both MSV and PXD formats |
| ProteomeXchange_search_datasets | query | Optional keyword or accession filter |
| ProteomeXchange_search_datasets | limit | Integer, 1-50 |
| ProteomeXchange_get_dataset | px_id | PXD format only (e.g., "PXD000001") |

Response Format Notes:
  • MassIVE_search_datasets: Returns direct array of dataset objects (no wrapper)
  • MassIVE_get_dataset: Returns direct object (no wrapper)
  • ProteomeXchange_search_datasets: Returns {data: [...], metadata: {...}}
  • ProteomeXchange_get_dataset: Returns {data: {...}, metadata: {...}}

Fallback Strategies

| Situation | Fallback |
|---|---|
| MassIVE search returns empty | Use ProteomeXchange search (broader coverage) |
| ProteomeXchange search returns empty | Try broader/simpler query terms |
| MassIVE_get_dataset fails for PXD accession | Use ProteomeXchange_get_dataset instead |
| Species taxonomy ID unknown | Search ProteomeXchange by keyword (organism name) |
| No keyword search results | Try individual terms instead of multi-word queries |
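
The detail-lookup fallback from the table can be sketched as follows. The two callables stand in for however the tools are invoked in your environment. Treating a raised exception or an empty result as a failure is an assumption, since the tools' actual error behavior is not documented here.

```python
def get_dataset_with_fallback(accession, massive_get, px_get):
    """Try MassIVE first; fall back to ProteomeXchange for PXD accessions.

    massive_get and px_get are caller-supplied callables (hypothetical
    wrappers around MassIVE_get_dataset and ProteomeXchange_get_dataset).
    Returns (source_name, record), or (None, None) if nothing was found.
    """
    try:
        record = massive_get(accession)
        if record:
            return "MassIVE", record
    except Exception:
        pass  # assumed failure mode; fall through to ProteomeXchange
    if accession.upper().startswith("PXD"):
        # ProteomeXchange wraps its payload in {"data": {...}}
        return "ProteomeXchange", px_get(accession)["data"]
    return None, None
```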

Common Species Taxonomy IDs

| Species | Taxonomy ID |
|---|---|
| Human | 9606 |
| Mouse | 10090 |
| Rat | 10116 |
| Zebrafish | 7955 |
| Fruit fly | 7227 |
| C. elegans | 6239 |
| S. cerevisiae | 559292 |
| A. thaliana | 3702 |
| E. coli | 562 |

Interpretation Framework

| Quality Indicator | Good | Acceptable | Caution |
|---|---|---|---|
| Instrument | Orbitrap Exploris/Eclipse, timsTOF | Q Exactive, TripleTOF 6600 | Older LTQ, ion trap only |
| Publication | Peer-reviewed with PubMed ID | Preprint or DOI only | No associated publication |
| Metadata completeness | Species + instrument + PTMs + summary | Species + instrument only | Title only, no annotations |

Interpreting dataset search results:
  • Datasets with both MassIVE and ProteomeXchange accessions generally have richer metadata; MassIVE provides summaries and keywords while ProteomeXchange provides file counts -- cross-reference both for a complete picture.
  • Instrument type determines data quality ceiling: high-resolution instruments (Orbitrap, timsTOF) produce higher mass accuracy and more reliable quantification than older ion trap platforms.
  • A dataset lacking a peer-reviewed publication may still be valuable, but its experimental design and processing pipeline cannot be independently verified -- weight such datasets lower in meta-analyses.
Synthesis questions to address in the report:
  1. Do multiple independent datasets for the same organism/condition show consistent protein identifications, or do discrepancies suggest batch effects?
  2. Is the instrument platform appropriate for the analysis type (e.g., DIA requires high-resolution; TMT requires MS3 or calibrated MS2)?
  3. Are the reported PTM types and species consistent with the user's research question, or is additional filtering needed?
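
As a rough sketch, the instrument and publication rows of the quality table can be turned into tier labels. The keyword matching is crude and the function names are hypothetical. Instruments that match no keyword return "unknown" rather than a guess.

```python
# Lowercase substrings taken from the quality table; matching is heuristic
GOOD = ("exploris", "eclipse", "timstof")
ACCEPTABLE = ("q exactive", "tripletof 6600")
CAUTION = ("ltq", "ion trap")


def instrument_tier(instruments):
    """Return the best tier matched by any instrument name in the list."""
    names = [name.lower() for name in instruments]
    for tier, keywords in (("good", GOOD), ("acceptable", ACCEPTABLE), ("caution", CAUTION)):
        if any(k in n for n in names for k in keywords):
            return tier
    return "unknown"  # not covered by the table; check manually


def publication_tier(pubmed_id=None, doi=None):
    """PubMed ID -> good; DOI only -> acceptable; neither -> caution."""
    if pubmed_id:
        return "good"
    if doi:
        return "acceptable"
    return "caution"
```

These labels are meant for triage only. The original publication's methods section remains the authority on whether a dataset suits a given analysis.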

Limitations

  • MassIVE: No keyword/text search -- only species-based filtering via the species parameter
  • ProteomeXchange: Limited metadata in search results (no summaries or keywords); get details via ProteomeXchange_get_dataset
  • No full-text search: Cannot search within dataset descriptions or abstracts across repositories
  • No download: These tools retrieve metadata only, not raw data files
  • Rate limits: Both APIs may throttle under heavy load; keep page_size / limit reasonable
  • Coverage: ProteomeXchange is the most comprehensive but may lag behind individual repositories for very recent submissions

Integration with Other Skills

| Skill | Relationship |
|---|---|
| tooluniverse-proteomics-analysis | Use retrieved datasets as input for MS data analysis |
| tooluniverse-protein-modification-analysis | Find PTM-specific datasets to complement iPTMnet annotations |
| tooluniverse-multi-omics-integration | Discover proteomics datasets for cross-omics integration |
