tooluniverse-dataset-discovery

Dataset Discovery

When to Use

  • User asks "find me data about X" or "where can I get data on Y"
  • User wants to analyze a relationship between variables
  • User needs specific study designs (longitudinal, cross-sectional, experimental)
  • User asks about specific surveys or cohorts

Step 1: Understand What the Research Question Requires

Before searching, determine the minimum data requirements:
Study design needed:
  • "Does X predict CHANGES in Y over time?" → longitudinal (same people measured repeatedly). Cross-sectional data CANNOT answer this — don't settle for it.
  • "Is X associated with Y?" → cross-sectional is sufficient (one-time measurement)
  • "Does intervention X cause outcome Y?" → experimental (clinical trial with controls)
  • "What genes/proteins are involved in X?" → omics (sequencing, expression, proteomics)
Variables needed:
  • List the specific exposure, outcome, and confounder variables
  • For each variable, note the measurement type (continuous, categorical, biomarker vs self-report)
  • Identify minimum confounders needed (age, sex are almost always required; domain-specific confounders depend on the question)
Population needed:
  • Age range, geography, clinical status, sample size requirements
  • Power analysis: to detect a small effect (r=0.1), you need ~800 subjects at 80% power (see the sketch below)
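
A minimal power-calculation sketch for that rule of thumb, using the Fisher z approximation for a correlation; the alpha and power defaults are assumptions to adjust for your design:

```python
# Required N to detect a correlation r (two-sided test), via Fisher z.
# Assumptions: alpha = 0.05, power = 0.80; numpy and scipy are available.
import numpy as np
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    fisher_z = 0.5 * np.log((1 + r) / (1 - r))
    return int(np.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3))

print(n_for_correlation(0.1))   # ~783, consistent with the "~800 subjects" rule of thumb
print(n_for_correlation(0.3))   # ~85, medium effects need far fewer subjects
```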

Step 2: Search Strategy

Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast a wide net): use search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
  • Search by: research topic keywords, variable names, population descriptors
  • Look for: DOI-registered datasets, repository listings, government data portals
Layer 2 — Domain-specific repositories: Search repositories specialized for your data type.
  • Health surveys: CDC, NHANES (search by variable name, not topic keywords)
  • Genomics: SRA, ENA, ArrayExpress, GEO
  • Proteomics: PRIDE, MassIVE
  • Metabolomics: MetaboLights, Metabolomics Workbench
  • Clinical: ClinicalTrials.gov (for trial data with results)
Layer 3 — Literature-based discovery: Many datasets aren't in any repository — they're described in paper methods sections.
  • Search PubMed/EuropePMC for papers that analyzed the relationship you're interested in
  • Read their methods: "We used data from [DATASET NAME]" tells you exactly what exists
  • Check supplementary materials for deposited data (GEO/SRA accession numbers)
  • This is often the MOST effective strategy for finding niche datasets (a search sketch follows this list)
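
A minimal sketch of the literature-based route, using Europe PMC's public REST search endpoint; the query terms are illustrative placeholders to swap for the actual exposure and outcome:

```python
# Search Europe PMC for papers on the relationship of interest, then scan
# the hits for dataset names mentioned in their methods sections.
import requests

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {
    "query": '"grip strength" AND "cognitive decline" AND longitudinal',  # placeholder query
    "format": "json",
    "pageSize": 25,
}
resp = requests.get(EPMC_SEARCH, params=params, timeout=30)
resp.raise_for_status()
for rec in resp.json().get("resultList", {}).get("result", []):
    print(rec.get("pubYear"), rec.get("title"), rec.get("journalTitle"), sep=" | ")
```

The papers returned name their cohorts in the methods; follow those names back to the Layer 1/2 repositories or to the study's own access portal.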

Step 3: Evaluate Dataset Fitness

For each candidate dataset, assess these dimensions (a quick screening sketch follows the checklist):
Variables:
  • Does it contain your SPECIFIC exposure and outcome variables?
  • Are they measured the way you need? (biomarker vs self-report, continuous vs categorical)
  • Are key confounders available? (missing confounders = biased analysis)
Design match:
  • If you need longitudinal: does it follow the SAME individuals over time? How many waves? What's the follow-up interval?
  • Beware: "repeated cross-sections" (different people each wave) are NOT longitudinal
  • If you need experimental: is there a proper control group? Randomization?
Sample:
  • Is the sample large enough for your analysis? (logistic regression needs ~10 events per predictor)
  • Does the population match? (age range, geography, clinical characteristics)
  • Are there subgroups you need? (stratified by sex, race, disease status)
Access:
  • Publicly downloadable (best) vs registration required (days) vs collaboration agreement (months) vs restricted (may be impossible)
  • Data format: CSV/TSV (easy), XPT/SAS (need conversion), proprietary database (may need special software)
Quality:
  • Is it from a well-known study with published methods? (NHANES, HRS, UK Biobank = high quality)
  • Has it been used in peer-reviewed publications? (indicates data is usable)
  • What's the response rate / missingness pattern?
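
A rough screening sketch for one candidate dataset, assuming it is already loaded as a pandas DataFrame named df; the variable names are placeholders to map against the dataset's codebook:

```python
# Quick fitness screen: required variables, population match, events per predictor.
required = ["exposure_var", "outcome_var", "age", "sex"]   # placeholder names

missing = [v for v in required if v not in df.columns]
print("Missing required variables:", missing or "none")

# Population match: age range and sample size
print("Age range:", df["age"].min(), "-", df["age"].max(), "| N =", len(df))

# Rule of thumb for logistic regression: ~10 outcome events per predictor
n_events = int(df["outcome_var"].sum())     # assumes a 0/1 coded outcome
n_predictors = len(required) - 1            # exposure plus confounders
print("Events per predictor:", round(n_events / n_predictors, 1))
```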

Step 4: Download and Analyze

Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.

Data Loading Cookbook

Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.

```python
import requests, io, pandas as pd

# --- Tabular files (most common) ---
df = pd.read_csv("data.csv")                          # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                       # Excel
df = pd.read_stata("data.dta")                        # Stata
df = pd.read_sas("data.xpt", format="xport")          # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")  # SAS native
df = pd.read_parquet("data.parquet")                  # Parquet
df = pd.read_json("data.json")                        # JSON (records or columnar)
df = pd.read_fwf("data.dat")                          # Fixed-width (some legacy surveys)

# --- Download from URL first, then parse ---
resp = requests.get(url, timeout=120)
content = resp.content

# Detect format from URL or content header
if url.endswith(".XPT") or url.endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif url.endswith(".csv") or url.endswith(".csv.gz"):
    df = pd.read_csv(io.BytesIO(content))
elif url.endswith(".tsv") or url.endswith(".tsv.gz"):
    df = pd.read_csv(io.BytesIO(content), sep="\t")
elif url.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Try CSV first, then inspect
    df = pd.read_csv(io.BytesIO(content))

# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
import json
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=100", timeout=30)
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
```
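
A worked instance of the URL path above, assuming NHANES as the source; the file URL and variable names (SEQN, RIDAGEYR) follow NHANES conventions but should be verified on the CDC site before relying on them:

```python
# Download one NHANES demographics file (SAS transport format) and inspect it.
import io
import requests
import pandas as pd

url = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"   # assumed example file
resp = requests.get(url, timeout=120)
resp.raise_for_status()
demo = pd.read_sas(io.BytesIO(resp.content), format="xport")

print(demo.shape)                          # rows = participants, columns = variables
print(demo[["SEQN", "RIDAGEYR"]].head())   # SEQN is the participant ID used for merges
```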

Merge, Clean, Analyze

```python
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")

# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()

# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure_var", "outcome_var"])

# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())

# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
```

Always run the code and report actual numbers (β, p-value, CI, N).

Step 5: Report Honestly

Structure the report as:
  1. Best available dataset — name, what it contains, access method, key limitation
  2. Analysis results — actual statistics (β, p-value, CI, N) from running the code
  3. Alternative datasets — ranked by fitness, with tradeoffs
  4. What CANNOT be answered — if no dataset matches the study design needed, say so clearly
  5. Recommended next steps — apply for access to longitudinal data, replicate in other cohorts
Critical honesty rules:
  • Never claim a dataset answers a temporal question if it's cross-sectional
  • Distinguish "data exists but needs registration" from "data doesn't exist"
  • Report actual computed statistics, not hypothetical analyses
  • State the strongest analysis possible with available data, even if it's weaker than what was asked

LOOK UP, DON'T GUESS

Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.