tooluniverse-population-genetics-1000genomes
COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
Population Genetics with 1000 Genomes (IGSR)
Use IGSR tools to search 1000 Genomes populations and samples, explore data collections, and
combine with GWAS tools for population-stratified analysis.
When to Use
- "List all African (AFR) populations in the 1000 Genomes Project"
- "Find samples from the YRI (Yoruba) population"
- "What 1000 Genomes data collections are available?"
- "Which GWAS SNPs for type 2 diabetes have population-specific effects?"
- "Find all SNPs mapped to TCF7L2 in GWAS studies"
NOT for (use other skills instead)
- Allele frequencies from gnomAD -> Use `tooluniverse-population-genetics`
- ClinVar / OMIM variant interpretation -> Use `tooluniverse-variant-interpretation`
- GWAS fine-mapping -> Use `tooluniverse-gwas-finemapping`
Phase 1: Search 1000 Genomes Populations
`IGSR_search_populations`: `superpopulation` (string/null, one of AFR/AMR/EAS/EUR/SAS), `query` (string/null, free-text search by name), `limit` (int).
Returns `{status, data: {total, populations: [{code, name, description, sample_count, superpopulation_code, superpopulation_name, latitude, longitude}]}, metadata: {source, filter_superpopulation, filter_query}}`.

Superpopulation codes:
| Code | Ancestry |
|---|---|
| AFR | African |
| AMR | Admixed American |
| EAS | East Asian |
| EUR | European |
| SAS | South Asian |
```json
// List all AFR populations
{"superpopulation": "AFR", "limit": 10}
// Search by name (free-text)
{"query": "Yoruba", "limit": 5}
// List all populations
{"limit": 26}
```

Response example:
```json
{
  "status": "success",
  "data": {
    "total": 3,
    "populations": [
      {"code": "YRI", "name": "Yoruba", "description": "Yoruba in Ibadan, Nigeria",
       "sample_count": 188, "superpopulation_code": "AFR", "superpopulation_name": "African Ancestry"}
    ]
  }
}
```
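A response shaped like the example above can be reduced to a list of population codes in a few lines of Python. This is only a minimal sketch: `population_codes` is a hypothetical helper, and it assumes the tool result is already available as a Python dict with the documented fields.

```python
from __future__ import annotations

from typing import Any, Optional


def population_codes(response: dict[str, Any],
                     superpopulation: Optional[str] = None) -> list[str]:
    """Extract population codes from an IGSR_search_populations-style response.

    Hypothetical helper: assumes the documented {status, data: {populations}} shape.
    """
    if response.get("status") != "success":
        raise ValueError(f"unexpected status: {response.get('status')!r}")
    populations = response.get("data", {}).get("populations", [])
    return [
        p["code"]
        for p in populations
        if superpopulation is None
        or p.get("superpopulation_code") == superpopulation
    ]


# Response copied from the example above.
example = {
    "status": "success",
    "data": {
        "total": 3,
        "populations": [
            {"code": "YRI", "name": "Yoruba",
             "description": "Yoruba in Ibadan, Nigeria",
             "sample_count": 188, "superpopulation_code": "AFR",
             "superpopulation_name": "African Ancestry"},
        ],
    },
}

afr_codes = population_codes(example, "AFR")
print(afr_codes)
```

Raising on a non-success status keeps a failed lookup from silently turning into an empty population list.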
Phase 2: Search Samples by Population
`IGSR_search_samples`: `population` (string/null, population code e.g. "YRI"), `data_collection` (string/null, collection title), `sample_name` (string/null, specific sample e.g. "NA12878"), `limit` (int).
Returns `{status, data: {total, samples: [{name, sex, biosample_id, populations: [{code, name, superpopulation}], data_collections: [...]}]}}`.

```json
// Find all YRI samples
{"population": "YRI", "limit": 10}
// Look up the reference sample NA12878
{"sample_name": "NA12878", "limit": 1}
// Find samples in the 30x high-coverage collection
{"data_collection": "1000 Genomes 30x on GRCh38", "limit": 5}
```

NOTE: `population` takes a population code (e.g. "YRI", "GBR", "CHB"), not a superpopulation code. Use `IGSR_search_populations` first to get population codes if starting from a superpopulation.
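The two-step lookup described in the note (superpopulation to population codes, then one sample query per code) could be sketched as follows. `sample_query` and `sample_queries` are hypothetical helper names, and the EUR response is trimmed to the fields the sketch needs.

```python
SUPERPOPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}


def sample_query(population_code, limit=100):
    """Build an IGSR_search_samples payload, rejecting superpopulation codes."""
    if population_code in SUPERPOPULATIONS:
        raise ValueError(
            f"{population_code} is a superpopulation code; "
            "resolve it to population codes with IGSR_search_populations first"
        )
    return {"population": population_code, "limit": limit}


def sample_queries(populations_response, limit=100):
    """One payload per population in an IGSR_search_populations-style response."""
    populations = populations_response["data"]["populations"]
    return [sample_query(p["code"], limit) for p in populations]


# Trimmed, illustrative response for {"superpopulation": "EUR"}.
eur_response = {
    "status": "success",
    "data": {"populations": [{"code": "GBR"}, {"code": "FIN"}]},
}

queries = sample_queries(eur_response, limit=10)
print(queries)
```

Failing early on a superpopulation code is a design choice for the sketch; it surfaces the most likely mistake before any request is sent.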
Phase 3: List Data Collections
`IGSR_list_data_collections`: `limit` (int).
Returns `{status, data: {total, collections: [{code, title, short_title, sample_count, population_count, data_types, website}]}}`.

```json
{"limit": 20}
```

Key collections available (18 total):
| Collection | Description | Data Types |
|---|---|---|
| 1000 Genomes on GRCh38 | 2709 samples, 26 populations | sequence, alignment, variants |
| 1000 Genomes 30x on GRCh38 | High-coverage resequencing | sequence, alignment, variants |
| 1000 Genomes phase 3 release | Original phase 3 | sequence, alignment, variants |
| Human Genome Structural Variation Consortium | HGSVC SV discovery | sequence, alignment |
| MAGE RNA-seq | RNA-seq data | - |
| Geuvadis | Expression + genotype | - |
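To pick out the collections that carry a given data type, the `collections` list can be filtered on `data_types`. The sketch below assumes `data_types` is a list of strings (the response schema above does not state its type) and uses a trimmed, illustrative response built from three rows of the table; `collections_with` is a hypothetical helper.

```python
def collections_with(response, data_type):
    """Titles of collections whose data_types include `data_type`.

    Assumes data_types is a list of strings; a missing or null value is
    treated as an empty list.
    """
    collections = response["data"]["collections"]
    return [
        c["title"]
        for c in collections
        if data_type in (c.get("data_types") or [])
    ]


# Trimmed, illustrative response (only title and data_types are shown).
example = {
    "status": "success",
    "data": {
        "collections": [
            {"title": "1000 Genomes on GRCh38",
             "data_types": ["sequence", "alignment", "variants"]},
            {"title": "Human Genome Structural Variation Consortium",
             "data_types": ["sequence", "alignment"]},
            {"title": "Geuvadis", "data_types": []},
        ]
    },
}

variant_collections = collections_with(example, "variants")
print(variant_collections)
```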
Phase 4: GWAS Context for Population Stratification
Search GWAS associations for a trait
`gwas_search_associations`: `trait` (string, free text), `limit` (int).
Returns GWAS associations with rsID, p-value, mapped genes, EFO trait IDs.

```json
{"trait": "type 2 diabetes", "limit": 10}
```
Get variants for a specific trait (by EFO ID)
`gwas_get_variants_for_trait`: `trait` (string, EFO ID e.g. "EFO_0001645"), `limit` (int).

```json
{"trait": "EFO_0001645", "limit": 10}
```
Find SNPs in a gene from GWAS catalog
`gwas_get_snps_for_gene`: `gene_symbol` (string), `limit` (int).
Returns SNPs mapped to the gene with rsIDs, genomic positions, functional classes.

```json
{"gene_symbol": "TCF7L2", "limit": 10}
```
Workflow: Population Stratification in GWAS
Step 1 -- Find populations of interest (`IGSR_search_populations`):

```json
// Get all EUR populations
{"superpopulation": "EUR", "limit": 10}
// -> Returns codes like GBR, FIN, CEU, TSI, IBS
```

Step 2 -- Get samples from target population (`IGSR_search_samples`):

```json
// Get YRI samples (AFR)
{"population": "YRI", "limit": 100}
```

Step 3 -- Get GWAS SNPs for the gene or trait (`gwas_get_snps_for_gene`):

```json
// GWAS hits for TCF7L2 (T2D gene)
{"gene_symbol": "TCF7L2", "limit": 20}
```

Step 4 -- Cross-reference with population data for stratification analysis.
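One way to approach Step 4 is to attach per-superpopulation allele frequencies to each GWAS SNP and rank SNPs by how much the frequency differs between groups. The sketch below makes several assumptions: the tools listed here do not return allele frequencies, so `af_table` stands in for frequencies obtained elsewhere; the `rsid` field name and the `cross_reference` helper are hypothetical; the frequencies are the illustrative rs7903146 values used in this document.

```python
def cross_reference(gwas_snps, af_table):
    """Rank GWAS SNPs by the spread of allele frequency across superpopulations.

    gwas_snps: list of dicts with an "rsid" key (assumed field name).
    af_table:  {rsid: {superpopulation_code: allele_frequency}} from an
               external source (not returned by the IGSR/GWAS tools above).
    """
    rows = []
    for snp in gwas_snps:
        afs = af_table.get(snp["rsid"])
        if not afs or len(afs) < 2:
            continue  # nothing to compare for this SNP
        highest = max(afs, key=afs.get)
        lowest = min(afs, key=afs.get)
        rows.append({
            "rsid": snp["rsid"],
            "highest": highest,
            "lowest": lowest,
            "spread": round(afs[highest] - afs[lowest], 4),
        })
    return sorted(rows, key=lambda r: r["spread"], reverse=True)


# Illustrative inputs; "rs_placeholder" is a made-up ID with no frequency data.
gwas_snps = [{"rsid": "rs7903146"}, {"rsid": "rs_placeholder"}]
af_table = {"rs7903146": {"EUR": 0.30, "EAS": 0.05}}

ranked = cross_reference(gwas_snps, af_table)
for row in ranked:
    print(row)
```

SNPs without frequency data are skipped rather than reported with a spread of zero, so a missing lookup is not mistaken for a lack of differentiation.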
Common Population Codes
| Code | Population | Superpopulation |
|---|---|---|
| YRI | Yoruba in Ibadan, Nigeria | AFR |
| LWK | Luhya in Webuye, Kenya | AFR |
| GWD | Gambian Mandinka | AFR |
| CEU | Utah residents (CEPH) | EUR |
| GBR | British in England/Scotland | EUR |
| FIN | Finnish in Finland | EUR |
| TSI | Toscani in Italia | EUR |
| CHB | Han Chinese in Beijing | EAS |
| JPT | Japanese in Tokyo | EAS |
| CHS | Southern Han Chinese | EAS |
| MXL | Mexican Ancestry in LA | AMR |
| PUR | Puerto Rican in Puerto Rico | AMR |
| GIH | Gujarati Indian in Houston | SAS |
| PJL | Punjabi from Lahore | SAS |
Reasoning Framework for Result Interpretation
Evidence Grading
| Grade | Criteria | Example |
|---|---|---|
| Strong | AF difference > 0.2 across superpopulations, GWAS p < 5e-8, replicated in multiple cohorts | rs7903146 (TCF7L2) with AF = 0.30 EUR vs 0.05 EAS, GWAS p = 1e-40 |
| Moderate | AF difference 0.05-0.2, GWAS p < 5e-8 in one ancestry, nominal in others | Variant with AF = 0.15 AFR vs 0.08 EUR, GWAS p < 5e-8 in EUR only |
| Weak | AF difference < 0.05, GWAS p < 5e-8 but single study, no cross-ancestry replication | Common variant with similar AF across populations, significant in one cohort |
| Population-specific | Variant common (AF > 0.01) in one superpopulation, rare (AF < 0.01) in others | Sickle cell variant (rs334) AF ~0.10 in AFR, < 0.001 elsewhere |
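The grading table can be expressed as a small decision function. This is a simplified sketch, not a definitive rule set: `grade_evidence` is a hypothetical helper, replication is reduced to a count of cohorts reaching p < 5e-8, and two cases the table leaves open are resolved by assumption (a non-significant association is returned as "Ungraded", and a large AF difference without replication is graded "Moderate").

```python
GENOME_WIDE_P = 5e-8


def grade_evidence(af_by_superpop, gwas_p, n_cohorts_replicated):
    """Assign an evidence grade following the table above (simplified)."""
    afs = list(af_by_superpop.values())
    common = [af for af in afs if af > 0.01]
    rare = [af for af in afs if af < 0.01]
    # Common in exactly one superpopulation, rare in every other one.
    if len(afs) >= 2 and len(common) == 1 and len(rare) == len(afs) - 1:
        return "Population-specific"
    if gwas_p >= GENOME_WIDE_P:
        return "Ungraded"  # assumption: the table only grades p < 5e-8
    spread = max(afs) - min(afs)
    if spread > 0.2 and n_cohorts_replicated >= 2:
        return "Strong"
    if spread >= 0.05:
        return "Moderate"  # assumption: includes spread > 0.2 without replication
    return "Weak"


# Allele frequencies and p-values follow the table's examples; the
# replication counts are illustrative placeholders.
print(grade_evidence({"EUR": 0.30, "EAS": 0.05}, 1e-40, 2))
print(grade_evidence({"AFR": 0.15, "EUR": 0.08}, 1e-9, 1))
print(grade_evidence({"AFR": 0.10, "EUR": 0.0005}, 1e-9, 1))
```

The population-specific check runs first because such a variant would otherwise fall into the "Moderate" band on AF difference alone.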
Interpretation Guidance
- Allele frequency interpretation by ancestry: Allele frequencies vary across superpopulations (AFR, AMR, EAS, EUR, SAS) due to genetic drift, selection, and demographic history. AFR populations have the highest genetic diversity and the shortest haplotype blocks, because recombination has had more generations to break them up. Disease-risk alleles may be common in one ancestry and rare in another, leading to differential genetic risk across populations.
- Fst significance thresholds: Fst measures population differentiation (0 = no differentiation, 1 = complete fixation of different alleles). Global Fst for human populations averages ~0.12. Locus-specific Fst > 0.3 suggests strong differentiation (possible selection). Fst > 0.5 is extreme and rare in humans outside known selection targets (e.g., SLC24A5 for skin pigmentation). Compare locus Fst against genome-wide distribution to identify outliers.
- LD interpretation: Linkage disequilibrium (LD) patterns differ by ancestry. AFR populations have shorter LD blocks because of their older demographic history and larger long-term effective population size, so tagging common variation requires denser genotyping. EUR and EAS populations have longer LD blocks. When a GWAS hit is in LD with multiple variants, the causal variant is more likely to be resolved in AFR-ancestry data, where fewer variants are correlated with it. Report r-squared values: r2 > 0.8 = strong LD, 0.2-0.8 = moderate, < 0.2 = weak.
- Population stratification: Uncontrolled population structure in GWAS inflates the false-positive rate. The 1000 Genomes superpopulation labels provide a framework for stratified analysis. Admixed samples (e.g., AMR) require local ancestry inference for accurate interpretation.
- Sample size context: 1000 Genomes has ~2500 samples across 26 populations (2,504 in the phase 3 release), so a typical population contributes only about 100 individuals. Population-specific allele frequencies therefore have limited precision, especially for smaller populations (N < 100). For rare variants (AF < 0.01), larger resources like gnomAD provide more reliable estimates.
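The quantities discussed above can be computed directly from allele frequencies. The sketch below is illustrative only: it uses a simple two-population Fst based on expected heterozygosity, Fst = (H_T - H_S) / H_T with equal weights, not the Weir-Cockerham estimator usually applied to genotype data; the binomial standard error sqrt(p(1-p)/2N) for an allele frequency estimated from N diploid individuals; and the r2 bands given in the LD bullet. All function names are hypothetical.

```python
import math


def fst_two_pops(p1, p2):
    """Two-population Fst from allele frequencies, equal weights.

    H_S is the mean within-population expected heterozygosity and H_T the
    expected heterozygosity at the pooled (mean) allele frequency.
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def af_standard_error(p, n_individuals):
    """Binomial standard error of an allele frequency from N diploid samples."""
    return math.sqrt(p * (1 - p) / (2 * n_individuals))


def ld_class(r2):
    """Classify LD strength using the r2 bands given above."""
    if r2 > 0.8:
        return "strong"
    if r2 >= 0.2:
        return "moderate"
    return "weak"


# Illustrative frequencies from the evidence table (EUR 0.30 vs EAS 0.05).
fst = fst_two_pops(0.30, 0.05)
se = af_standard_error(0.01, 100)
print(f"Fst = {fst:.3f}")
print(f"SE at AF 0.01, N=100: {se:.4f}")
print(ld_class(0.85), ld_class(0.5), ld_class(0.1))
```

With N = 100, the standard error at AF = 0.01 is of the same order as the frequency itself, which is why the guidance points to larger resources for rare variants.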
Synthesis Questions
- Does the allele frequency of the variant of interest differ meaningfully (absolute difference > 0.05) across superpopulations, and could this explain differential disease prevalence or GWAS effect sizes?
- Is the GWAS association replicated across ancestries, or is it population-specific, potentially due to LD structure differences or population-specific selection?
- For fine-mapping, does the LD pattern in AFR populations narrow the association signal compared to EUR, helping identify the likely causal variant?
- Are the population labels and sample sizes in the 1000 Genomes dataset adequate for the analysis, or is the target population underrepresented?
- Could population stratification (uncontrolled ancestry differences between cases and controls) explain the observed association, rather than a true genetic effect?
Tool Parameter Quick Reference
| Tool | Key Parameters | Notes |
|---|---|---|
| IGSR_search_populations | superpopulation, query, limit | superpopulation: AFR/AMR/EAS/EUR/SAS |
| IGSR_search_samples | population, data_collection, sample_name, limit | population = population code (e.g. YRI) |
| IGSR_list_data_collections | limit | 18 collections total |
| gwas_search_associations | trait, limit | free-text trait search |
| gwas_get_variants_for_trait | trait, limit | trait = EFO ID |
| gwas_get_snps_for_gene | gene_symbol, limit | returns mapped SNPs |
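The quick reference can double as a lightweight payload check before a tool call. The sketch below simply mirrors the table as a Python dict; `TOOL_PARAMS` and `unknown_params` are hypothetical names, and the check covers only the parameters listed here, so a tool may accept others.

```python
TOOL_PARAMS = {
    "IGSR_search_populations": {"superpopulation", "query", "limit"},
    "IGSR_search_samples": {"population", "data_collection", "sample_name", "limit"},
    "IGSR_list_data_collections": {"limit"},
    "gwas_search_associations": {"trait", "limit"},
    "gwas_get_variants_for_trait": {"trait", "limit"},
    "gwas_get_snps_for_gene": {"gene_symbol", "limit"},
}


def unknown_params(tool, payload):
    """Return payload keys that the quick reference does not list for `tool`."""
    if tool not in TOOL_PARAMS:
        raise KeyError(f"tool not in quick reference: {tool}")
    return sorted(set(payload) - TOOL_PARAMS[tool])


# A superpopulation filter passed to the samples tool is flagged.
print(unknown_params("IGSR_search_samples", {"superpopulation": "AFR", "limit": 10}))
print(unknown_params("IGSR_search_samples", {"population": "YRI", "limit": 10}))
```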