tooluniverse-dataset-discovery

Dataset Discovery

When to Use

  • User asks "find me data about X" or "where can I get data on Y"
  • User wants to analyze a relationship between variables
  • User needs specific study designs (longitudinal, cross-sectional, experimental)
  • User asks about specific surveys or cohorts

Step 1: Understand What the Research Question Requires

Before searching, determine the minimum data requirements:
Study design needed:
  • "Does X predict CHANGES in Y over time?" → longitudinal (same people measured repeatedly). Cross-sectional data CANNOT answer this — don't settle for it.
  • "Is X associated with Y?" → cross-sectional is sufficient (one-time measurement)
  • "Does intervention X cause outcome Y?" → experimental (clinical trial with controls)
  • "What genes/proteins are involved in X?" → omics (sequencing, expression, proteomics)
Variables needed:
  • List the specific exposure, outcome, and confounder variables
  • For each variable, note the measurement type (continuous, categorical, biomarker vs self-report)
  • Identify minimum confounders needed (age, sex are almost always required; domain-specific confounders depend on the question)
Population needed:
  • Age range, geography, clinical status, sample size requirements
  • Power analysis: to detect a small effect (r=0.1), you need ~800 subjects at 80% power (see the sketch below)
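
A minimal power-calculation sketch for that rule of thumb, using the Fisher z approximation for a correlation; the alpha and power defaults are assumptions to adjust for your design:

```python
# Required N to detect a correlation r (two-sided test), via Fisher z.
# Assumptions: alpha = 0.05, power = 0.80; numpy and scipy are available.
import numpy as np
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    fisher_z = 0.5 * np.log((1 + r) / (1 - r))
    return int(np.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3))

print(n_for_correlation(0.1))   # ~783, consistent with the "~800 subjects" rule of thumb
print(n_for_correlation(0.3))   # ~85, medium effects need far fewer subjects
```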

Step 2: Search Strategy

Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast a wide net): use search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
  • Search by: research topic keywords, variable names, population descriptors
  • Look for: DOI-registered datasets, repository listings, government data portals
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
  • Health surveys: CDC, NHANES (search by variable name, not topic keywords)
  • Genomics: SRA, ENA, ArrayExpress, GEO
  • Proteomics: PRIDE, MassIVE
  • Metabolomics: MetaboLights, Metabolomics Workbench
  • Clinical: ClinicalTrials.gov (for trial data with results)
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
  • Search PubMed/EuropePMC for papers that analyzed the relationship you're interested in
  • Read their methods: "We used data from [DATASET NAME]" tells you exactly what exists
  • Check supplementary materials for deposited data (GEO/SRA accession numbers)
  • This is often the MOST effective strategy for finding niche datasets (a search sketch follows this list)
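
A minimal sketch of the literature-based route, using Europe PMC's public REST search endpoint; the query terms are illustrative placeholders to swap for the actual exposure and outcome:

```python
# Search Europe PMC for papers on the relationship of interest, then scan
# the hits for dataset names mentioned in their methods sections.
import requests

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {
    "query": '"grip strength" AND "cognitive decline" AND longitudinal',  # placeholder query
    "format": "json",
    "pageSize": 25,
}
resp = requests.get(EPMC_SEARCH, params=params, timeout=30)
resp.raise_for_status()
for rec in resp.json().get("resultList", {}).get("result", []):
    print(rec.get("pubYear"), rec.get("title"), rec.get("journalTitle"), sep=" | ")
```

The papers returned name their cohorts in the methods; follow those names back to the Layer 1/2 repositories or to the study's own access portal.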

Step 3: Evaluate Dataset Fitness

For each candidate dataset, assess these dimensions (a quick screening sketch follows the checklist):
Variables:
  • Does it contain your SPECIFIC exposure and outcome variables?
  • Are they measured the way you need? (biomarker vs self-report, continuous vs categorical)
  • Are key confounders available? (missing confounders = biased analysis)
Design match:
  • If you need longitudinal: does it follow the SAME individuals over time? How many waves? What's the follow-up interval?
  • Beware: "repeated cross-sections" (different people each wave) are NOT longitudinal
  • If you need experimental: is there a proper control group? Randomization?
Sample:
  • Is the sample large enough for your analysis? (logistic regression needs ~10 events per predictor)
  • Does the population match? (age range, geography, clinical characteristics)
  • Are there subgroups you need? (stratified by sex, race, disease status)
Access:
  • Publicly downloadable (best) vs registration required (days) vs collaboration agreement (months) vs restricted (may be impossible)
  • Data format: CSV/TSV (easy), XPT/SAS (need conversion), proprietary database (may need special software)
Quality:
  • Is it from a well-known study with published methods? (NHANES, HRS, UK Biobank = high quality)
  • Has it been used in peer-reviewed publications? (indicates data is usable)
  • What's the response rate / missingness pattern?
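
A rough screening sketch for one candidate dataset, assuming it is already loaded as a pandas DataFrame named df; the variable names are placeholders to map against the dataset's codebook:

```python
# Quick fitness screen: required variables, population match, events per predictor.
required = ["exposure_var", "outcome_var", "age", "sex"]   # placeholder names

missing = [v for v in required if v not in df.columns]
print("Missing required variables:", missing or "none")

# Population match: age range and sample size
print("Age range:", df["age"].min(), "-", df["age"].max(), "| N =", len(df))

# Rule of thumb for logistic regression: ~10 outcome events per predictor
n_events = int(df["outcome_var"].sum())     # assumes a 0/1 coded outcome
n_predictors = len(required) - 1            # exposure plus confounders
print("Events per predictor:", round(n_events / n_predictors, 1))
```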

Step 4: Download and Analyze

Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.

Data Loading Cookbook

Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.

```python
import requests, io, pandas as pd

# --- Tabular files (most common) ---
df = pd.read_csv("data.csv")                          # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                       # Excel
df = pd.read_stata("data.dta")                        # Stata
df = pd.read_sas("data.xpt", format="xport")          # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")  # SAS native
df = pd.read_parquet("data.parquet")                  # Parquet
df = pd.read_json("data.json")                        # JSON (records or columnar)
df = pd.read_fwf("data.dat")                          # Fixed-width (some legacy surveys)

# --- Download from URL first, then parse ---
resp = requests.get(url, timeout=120)
content = resp.content

# Detect format from URL or content header
if url.endswith(".XPT") or url.endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif url.endswith(".csv") or url.endswith(".csv.gz"):
    df = pd.read_csv(io.BytesIO(content))
elif url.endswith(".tsv") or url.endswith(".tsv.gz"):
    df = pd.read_csv(io.BytesIO(content), sep="\t")
elif url.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Try CSV first, then inspect
    df = pd.read_csv(io.BytesIO(content))

# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
import json
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=100", timeout=30)
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
```
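
A worked instance of the URL path above, assuming NHANES as the source; the file URL and variable names (SEQN, RIDAGEYR) follow NHANES conventions but should be verified on the CDC site before relying on them:

```python
# Download one NHANES demographics file (SAS transport format) and inspect it.
import io
import requests
import pandas as pd

url = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"   # assumed example file
resp = requests.get(url, timeout=120)
resp.raise_for_status()
demo = pd.read_sas(io.BytesIO(resp.content), format="xport")

print(demo.shape)                          # rows = participants, columns = variables
print(demo[["SEQN", "RIDAGEYR"]].head())   # SEQN is the participant ID used for merges
```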

Merge, Clean, Analyze

```python
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")

# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()

# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure_var", "outcome_var"])

# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())

# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
```

Always run the code and report actual numbers (β, p-value, CI, N).

Step 5: Report Honestly

Structure the report as:
  1. Best available dataset — name, what it contains, access method, key limitation
  2. Analysis results — actual statistics (β, p-value, CI, N) from running the code
  3. Alternative datasets — ranked by fitness, with tradeoffs
  4. What CANNOT be answered — if no dataset matches the study design needed, say so clearly
  5. Recommended next steps — apply for access to longitudinal data, replicate in other cohorts
Critical honesty rules:
  • Never claim a dataset answers a temporal question if it's cross-sectional
  • Distinguish "data exists but needs registration" from "data doesn't exist"
  • Report actual computed statistics, not hypothetical analyses
  • State the strongest analysis possible with available data, even if it's weaker than what was asked

LOOK UP, DON'T GUESS

Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.