tooluniverse-dataset-discovery
Dataset Discovery
When to Use
- User asks "find me data about X" or "where can I get data on Y"
- User wants to analyze a relationship between variables
- User needs specific study designs (longitudinal, cross-sectional, experimental)
- User asks about specific surveys or cohorts
Step 1: Understand What the Research Question Requires
Before searching, determine the minimum data requirements:
Study design needed:
- "Does X predict CHANGES in Y over time?" → longitudinal (same people measured repeatedly). Cross-sectional data CANNOT answer this — don't settle for it.
- "Is X associated with Y?" → cross-sectional is sufficient (one-time measurement)
- "Does intervention X cause outcome Y?" → experimental (clinical trial with controls)
- "What genes/proteins are involved in X?" → omics (sequencing, expression, proteomics)
Variables needed:
- List the specific exposure, outcome, and confounder variables
- For each variable, note the measurement type (continuous, categorical, biomarker vs self-report)
- Identify minimum confounders needed (age, sex are almost always required; domain-specific confounders depend on the question)
Population needed:
- Age range, geography, clinical status, sample size requirements
- Power analysis: to detect a small effect (r=0.1), you need ~800 subjects at 80% power
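The sample-size rule of thumb above can be checked with the standard Fisher z-transform approximation. This is a sketch, not part of any dataset tool: `n_for_correlation` is a hypothetical helper name.

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    # Approximate N to detect a Pearson correlation r (two-sided test)
    # using the Fisher z-transform: n = ((z_a + z_b) / atanh(r))^2 + 3.
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = norm.ppf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.1))  # 783, consistent with the ~800 figure above
```

Larger effects need far fewer subjects, which is why pinning down the expected effect size before searching saves time.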
Step 2: Search Strategy
Search from broadest to most specific. Use find_tools to discover available dataset search tools — don't rely on memorized tool names.
Layer 1 — Cross-repository search (cast a wide net):
Search tools that index datasets across thousands of repositories. These find datasets you didn't know existed.
- Search by: research topic keywords, variable names, population descriptors
- Look for: DOI-registered datasets, repository listings, government data portals
Layer 2 — Domain-specific repositories:
Search repositories specialized for your data type.
- Health surveys: CDC, NHANES (search by variable name, not topic keywords)
- Genomics: SRA, ENA, ArrayExpress, GEO
- Proteomics: PRIDE, MassIVE
- Metabolomics: MetaboLights, Metabolomics Workbench
- Clinical: ClinicalTrials.gov (for trial data with results)
Layer 3 — Literature-based discovery:
Many datasets aren't in any repository — they're described in paper methods sections.
- Search PubMed/EuropePMC for papers that analyzed the relationship you're interested in
- Read their methods: "We used data from [DATASET NAME]" tells you exactly what exists
- Check supplementary materials for deposited data (GEO/SRA accession numbers)
- This is often the MOST effective strategy for finding niche datasets
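Scanning methods sections and supplements for accession numbers can be automated. A minimal sketch (the regex covers only a few common GEO/SRA/ENA/BioProject prefixes and is an illustration, not an exhaustive pattern; `find_accessions` is a hypothetical helper):

```python
import re

# Common repository accession formats: GEO series/samples, SRA studies/runs,
# NCBI BioProjects, ArrayExpress experiments.
ACCESSION_RE = re.compile(r"\b(GSE\d+|GSM\d+|SRP\d+|SRR\d+|PRJNA\d+|E-MTAB-\d+)\b")

def find_accessions(text):
    # Return the unique accession numbers mentioned in a block of text.
    return sorted(set(ACCESSION_RE.findall(text)))

methods = ("Raw reads were deposited in the SRA under SRP123456; "
           "processed expression matrices are available in GEO (GSE98765).")
print(find_accessions(methods))  # ['GSE98765', 'SRP123456']
```

Each accession can then be resolved directly in the matching repository from Layer 2.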
Step 3: Evaluate Dataset Fitness
For each candidate dataset, assess these dimensions:
Variables:
- Does it contain your SPECIFIC exposure and outcome variables?
- Are they measured the way you need? (biomarker vs self-report, continuous vs categorical)
- Are key confounders available? (missing confounders = biased analysis)
Design match:
- If you need longitudinal: does it follow the SAME individuals over time? How many waves? What's the follow-up interval?
- Beware: "repeated cross-sections" (different people each wave) are NOT longitudinal
- If you need experimental: is there a proper control group? Randomization?
Sample:
- Is the sample large enough for your analysis? (logistic regression needs ~10 events per predictor)
- Does the population match? (age range, geography, clinical characteristics)
- Are there subgroups you need? (stratified by sex, race, disease status)
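The events-per-predictor rule translates into a quick capacity check. A sketch under the stated rule of thumb; `max_predictors` is a hypothetical helper:

```python
def max_predictors(n_events, events_per_variable=10):
    # Rule of thumb for logistic regression: ~10 outcome events per predictor.
    # n_events is the count of the *rarer* outcome class, not the total N.
    return n_events // events_per_variable

# A dataset with 5,000 participants but only 120 cases of the outcome
# supports roughly 12 predictors, regardless of the total sample size.
print(max_predictors(120))  # 12
```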
Access:
- Publicly downloadable (best) vs registration required (days) vs collaboration agreement (months) vs restricted (may be impossible)
- Data format: CSV/TSV (easy), XPT/SAS (need conversion), proprietary database (may need special software)
Quality:
- Is it from a well-known study with published methods? (NHANES, HRS, UK Biobank = high quality)
- Has it been used in peer-reviewed publications? (indicates data is usable)
- What's the response rate / missingness pattern?
Step 4: Download and Analyze
Don't stop at finding datasets — download and analyze them. Write and run Python code via Bash. Never describe what you "would do" — execute it.
Data Loading Cookbook
Choose the loader that matches your data source. When unsure of the format, download a small sample first and inspect.

```python
import requests, io, pandas as pd

# --- Tabular files (most common) ---
df = pd.read_csv("data.csv")                          # CSV / TSV (use sep="\t" for TSV)
df = pd.read_excel("data.xlsx")                       # Excel
df = pd.read_stata("data.dta")                        # Stata
df = pd.read_sas("data.xpt", format="xport")          # SAS transport (XPT)
df = pd.read_sas("data.sas7bdat", format="sas7bdat")  # SAS native
df = pd.read_parquet("data.parquet")                  # Parquet
df = pd.read_json("data.json")                        # JSON (records or columnar)
df = pd.read_fwf("data.dat")                          # Fixed-width (some legacy surveys)
```
```python
# --- Download from URL first, then parse ---
resp = requests.get(url, timeout=120)
content = resp.content

# Detect format from URL or content header
if url.lower().endswith(".xpt"):
    df = pd.read_sas(io.BytesIO(content), format="xport")
elif url.endswith((".csv", ".csv.gz")):
    # pandas cannot infer compression from a BytesIO, so pass it explicitly
    df = pd.read_csv(io.BytesIO(content),
                     compression="gzip" if url.endswith(".gz") else None)
elif url.endswith((".tsv", ".tsv.gz")):
    df = pd.read_csv(io.BytesIO(content), sep="\t",
                     compression="gzip" if url.endswith(".gz") else None)
elif url.endswith(".json"):
    df = pd.read_json(io.BytesIO(content))
else:
    # Unknown extension: try CSV first, then inspect
    df = pd.read_csv(io.BytesIO(content))
```
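When the URL extension is missing or misleading, sniffing the leading bytes avoids picking the wrong parser. A minimal sketch using well-known magic numbers; `sniff_format` is a hypothetical helper:

```python
def sniff_format(content: bytes) -> str:
    # Identify common container formats by their leading magic bytes.
    if content.startswith(b"PK\x03\x04"):
        return "zip"   # also covers xlsx, which is a zip archive
    if content.startswith(b"\x1f\x8b"):
        return "gzip"
    if content[:1] in (b"{", b"["):
        return "json"
    return "text"      # likely CSV/TSV; inspect the first lines

print(sniff_format(b"\x1f\x8b\x08\x00"))  # gzip
```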
```python
# --- REST API pagination (common for GDC, ClinicalTrials.gov, etc.) ---
all_records = []
offset = 0
while True:
    resp = requests.get(f"{api_url}?offset={offset}&limit=100", timeout=30)
    batch = resp.json().get("data", [])
    if not batch:
        break
    all_records.extend(batch)
    offset += len(batch)
df = pd.DataFrame(all_records)
```
Merge, Clean, Analyze
```python
# Merge multiple files on participant/sample ID
merged = df1.merge(df2, on="id_col", how="inner")

# Filter population
subset = merged[(merged["age"] >= 60) & (merged["age"] <= 80)].copy()

# Handle missing values
missing_pct = subset.isnull().mean() * 100
print("Missing % per variable:\n", missing_pct[missing_pct > 0].sort_values(ascending=False))
subset = subset.dropna(subset=["exposure_var", "outcome_var"])

# Quick regression
import statsmodels.formula.api as smf
model = smf.ols("outcome ~ exposure + age + sex", data=subset).fit()
print(model.summary())

# Visualization
import matplotlib.pyplot as plt
plt.scatter(subset["exposure"], subset["outcome"], alpha=0.3)
plt.xlabel("Exposure"); plt.ylabel("Outcome")
plt.savefig("/tmp/scatter.png", dpi=150, bbox_inches="tight")
```

Always run the code and report actual numbers (β, p-value, CI, N).

Step 5: Report Honestly
Structure the report as:
- Best available dataset — name, what it contains, access method, key limitation
- Analysis results — actual statistics (β, p-value, CI, N) from running the code
- Alternative datasets — ranked by fitness, with tradeoffs
- What CANNOT be answered — if no dataset matches the study design needed, say so clearly
- Recommended next steps — apply for access to longitudinal data, replicate in other cohorts
Critical honesty rules:
- Never claim a dataset answers a temporal question if it's cross-sectional
- Distinguish "data exists but needs registration" from "data doesn't exist"
- Report actual computed statistics, not hypothetical analyses
- State the strongest analysis possible with available data, even if it's weaker than what was asked
LOOK UP, DON'T GUESS
Never assume a dataset exists — search for it. Never assume access is public — check. Never assume variables are measured the way you need — verify the codebook.