tooluniverse-population-genetics-1000genomes

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

Population Genetics with 1000 Genomes (IGSR)

Use IGSR tools to search 1000 Genomes populations and samples, explore data collections, and combine with GWAS tools for population-stratified analysis.

When to Use

  • "List all African (AFR) populations in the 1000 Genomes Project"
  • "Find samples from the YRI (Yoruba) population"
  • "What 1000 Genomes data collections are available?"
  • "Which GWAS SNPs for type 2 diabetes have population-specific effects?"
  • "Find all SNPs mapped to TCF7L2 in GWAS studies"

NOT for (use other skills instead)

  • Allele frequencies from gnomAD -> Use tooluniverse-population-genetics
  • ClinVar / OMIM variant interpretation -> Use tooluniverse-variant-interpretation
  • GWAS fine-mapping -> Use tooluniverse-gwas-finemapping

Phase 1: Search 1000 Genomes Populations

IGSR_search_populations: superpopulation (string/null, one of AFR/AMR/EAS/EUR/SAS), query (string/null, free-text search by name), limit (int). Returns {status, data: {total, populations: [{code, name, description, sample_count, superpopulation_code, superpopulation_name, latitude, longitude}]}, metadata: {source, filter_superpopulation, filter_query}}.

Superpopulation codes:

| Code | Ancestry |
|------|----------|
| AFR | African |
| AMR | Admixed American |
| EAS | East Asian |
| EUR | European |
| SAS | South Asian |

```json
// List all AFR populations
{"superpopulation": "AFR", "limit": 10}

// Search by name (free-text)
{"query": "Yoruba", "limit": 5}

// List all populations
{"limit": 26}
```

Response example:

```json
{
  "status": "success",
  "data": {
    "total": 3,
    "populations": [
      {"code": "YRI", "name": "Yoruba", "description": "Yoruba in Ibadan, Nigeria",
       "sample_count": 188, "superpopulation_code": "AFR", "superpopulation_name": "African Ancestry"}
    ]
  }
}
```

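A minimal sketch of how the response above might be flattened for downstream analysis. It assumes the response has already been obtained as a Python dict with the shape documented above; the helper name `summarize_populations` is hypothetical and not part of ToolUniverse.

```python
from __future__ import annotations

from typing import Any


def summarize_populations(response: dict[str, Any]) -> list[tuple[str, str, int]]:
    """Return (code, superpopulation_code, sample_count) rows, largest first."""
    if response.get("status") != "success":
        raise ValueError(f"IGSR query failed: {response.get('status')!r}")
    rows = [
        (p["code"], p.get("superpopulation_code", ""), int(p.get("sample_count") or 0))
        for p in response.get("data", {}).get("populations", [])
    ]
    # Sort by sample count so the best-represented populations come first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Example input copied from the response example in this section.
example = {
    "status": "success",
    "data": {
        "total": 3,
        "populations": [
            {"code": "YRI", "name": "Yoruba",
             "description": "Yoruba in Ibadan, Nigeria",
             "sample_count": 188, "superpopulation_code": "AFR",
             "superpopulation_name": "African Ancestry"}
        ],
    },
}

for code, superpop, n in summarize_populations(example):
    print(code, superpop, n)
```

Checking `status` before reading `data` keeps a failed query from being silently treated as an empty result.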

Phase 2: Search Samples by Population

IGSR_search_samples: population (string/null, population code e.g. "YRI"), data_collection (string/null, collection title), sample_name (string/null, specific sample e.g. "NA12878"), limit (int). Returns {status, data: {total, samples: [{name, sex, biosample_id, populations: [{code, name, superpopulation}], data_collections: [...]}]}}.

```json
// Find all YRI samples
{"population": "YRI", "limit": 10}

// Look up the reference sample NA12878
{"sample_name": "NA12878", "limit": 1}

// Find samples in the 30x high-coverage collection
{"data_collection": "1000 Genomes 30x on GRCh38", "limit": 5}
```

NOTE: population takes a population code (e.g. "YRI", "GBR", "CHB"), not a superpopulation code. Use IGSR_search_populations first to get population codes if starting from a superpopulation.

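The note about population codes versus superpopulation codes can be sketched as a small helper that turns an IGSR_search_populations response into one IGSR_search_samples payload per population. This is an illustration only: `sample_queries` is a hypothetical name, and the example response is a hand-written stub that keeps just the fields the helper reads.

```python
from __future__ import annotations

SUPERPOPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}


def sample_queries(populations_response: dict, limit: int = 100) -> list[dict]:
    """Build one IGSR_search_samples payload per population code."""
    payloads = []
    for pop in populations_response["data"]["populations"]:
        code = pop["code"]
        # Guard against passing a superpopulation code as a population code.
        if code in SUPERPOPULATIONS:
            raise ValueError(f"{code} is a superpopulation code, not a population code")
        payloads.append({"population": code, "limit": limit})
    return payloads


# Hand-written stub of a superpopulation = "EUR" response (fields trimmed).
eur_response = {
    "status": "success",
    "data": {
        "total": 2,
        "populations": [
            {"code": "GBR", "superpopulation_code": "EUR"},
            {"code": "FIN", "superpopulation_code": "EUR"},
        ],
    },
}

for payload in sample_queries(eur_response, limit=10):
    print(payload)
```

Each printed payload has the same shape as the JSON examples in this phase, so it can be passed to IGSR_search_samples unchanged.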

Phase 3: List Data Collections

IGSR_list_data_collections: limit (int). Returns {status, data: {total, collections: [{code, title, short_title, sample_count, population_count, data_types, website}]}}.

```json
{"limit": 20}
```

Key collections available (18 total):

| Collection | Description | Data Types |
|------------|-------------|------------|
| 1000 Genomes on GRCh38 | 2709 samples, 26 populations | sequence, alignment, variants |
| 1000 Genomes 30x on GRCh38 | High-coverage resequencing | sequence, alignment, variants |
| 1000 Genomes phase 3 release | Original phase 3 | sequence, alignment, variants |
| Human Genome Structural Variation Consortium | HGSVC SV discovery | sequence, alignment |
| MAGE RNA-seq | RNA-seq data | - |
| Geuvadis | Expression + genotype | - |

Phase 4: GWAS Context for Population Stratification

Search GWAS associations for a trait

gwas_search_associations: trait (string, free text), limit (int). Returns GWAS associations with rsID, p-value, mapped genes, EFO trait IDs.

```json
{"trait": "type 2 diabetes", "limit": 10}
```

Get variants for a specific trait (by EFO ID)

gwas_get_variants_for_trait: trait (string, EFO ID e.g. "EFO_0001645"), limit (int).

```json
{"trait": "EFO_0001645", "limit": 10}
```

Find SNPs in a gene from GWAS catalog

gwas_get_snps_for_gene: gene_symbol (string), limit (int). Returns SNPs mapped to the gene with rsIDs, genomic positions, functional classes.

```json
{"gene_symbol": "TCF7L2", "limit": 10}
```

Workflow: Population Stratification in GWAS

Step 1 -- Find populations of interest:

```json
// Get all EUR populations
{"superpopulation": "EUR", "limit": 10}
// -> Returns codes like GBR, FIN, CEU, TSI, IBS
```

Step 2 -- Get samples from target population:

```json
// Get YRI samples (AFR)
{"population": "YRI", "limit": 100}
```

Step 3 -- Get GWAS SNPs for the gene or trait:

```json
// GWAS hits for TCF7L2 (T2D gene)
{"gene_symbol": "TCF7L2", "limit": 20}
```

Step 4 -- Cross-reference with population data for stratification analysis.

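One possible starting point for Step 4 is to check how the retrieved samples are distributed before any stratified comparison. The sketch below tallies samples by population code and sex from an IGSR_search_samples response. The helper name `tally_samples` is hypothetical, the sample records are placeholders rather than real 1000 Genomes entries, and the `sex` values shown are an assumption about how the tool encodes that field.

```python
from __future__ import annotations

from collections import Counter


def tally_samples(samples_response: dict) -> Counter:
    """Count samples per (population code, sex) pair."""
    counts: Counter = Counter()
    for sample in samples_response.get("data", {}).get("samples", []):
        sex = sample.get("sex") or "unknown"
        # A sample may list more than one population entry; count each one.
        for pop in sample.get("populations", []):
            counts[(pop["code"], sex)] += 1
    return counts


# Placeholder records shaped like the documented response (not real samples).
toy_response = {
    "status": "success",
    "data": {
        "total": 3,
        "samples": [
            {"name": "SAMPLE_A", "sex": "female", "populations": [{"code": "YRI"}]},
            {"name": "SAMPLE_B", "sex": "female", "populations": [{"code": "YRI"}]},
            {"name": "SAMPLE_C", "sex": "male", "populations": [{"code": "GBR"}]},
        ],
    },
}

for (code, sex), n in sorted(tally_samples(toy_response).items()):
    print(code, sex, n)
```

Strata with very few samples are a signal that any population-specific conclusion drawn from them should be treated with caution.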

Common Population Codes

| Code | Population | Superpopulation |
|------|------------|-----------------|
| YRI | Yoruba in Ibadan, Nigeria | AFR |
| LWK | Luhya in Webuye, Kenya | AFR |
| GWD | Gambian Mandinka | AFR |
| CEU | Utah residents (CEPH) | EUR |
| GBR | British in England/Scotland | EUR |
| FIN | Finnish in Finland | EUR |
| TSI | Toscani in Italia | EUR |
| CHB | Han Chinese in Beijing | EAS |
| JPT | Japanese in Tokyo | EAS |
| CHS | Southern Han Chinese | EAS |
| MXL | Mexican Ancestry in LA | AMR |
| PUR | Puerto Rican in Puerto Rico | AMR |
| GIH | Gujarati Indian in Houston | SAS |
| PJL | Punjabi from Lahore | SAS |

Reasoning Framework for Result Interpretation

Evidence Grading

| Grade | Criteria | Example |
|-------|----------|---------|
| Strong | AF difference > 0.2 across superpopulations, GWAS p < 5e-8, replicated in multiple cohorts | rs7903146 (TCF7L2) with AF = 0.30 EUR vs 0.05 EAS, GWAS p = 1e-40 |
| Moderate | AF difference 0.05-0.2, GWAS p < 5e-8 in one ancestry, nominal in others | Variant with AF = 0.15 AFR vs 0.08 EUR, GWAS p < 5e-8 in EUR only |
| Weak | AF difference < 0.05, GWAS p < 5e-8 but single study, no cross-ancestry replication | Common variant with similar AF across populations, significant in one cohort |
| Population-specific | Variant common (AF > 0.01) in one superpopulation, rare (AF < 0.01) in others | Sickle cell variant (rs334) AF ~0.10 in AFR, < 0.001 elsewhere |
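
The grading table can be expressed as a small function, shown here as a sketch rather than a definitive rule set. Several choices are assumptions that the table does not spell out: the population-specific check runs first and ignores the p-value, a variant that misses genome-wide significance is returned as "Ungraded", and a large AF difference without replication falls back to "Moderate". The name `grade_evidence` is hypothetical.

```python
from __future__ import annotations

GWAS_THRESHOLD = 5e-8  # genome-wide significance, as used in the table
RARE_AF = 0.01         # common/rare boundary, as used in the table


def grade_evidence(af_by_superpop: dict[str, float], gwas_p: float,
                   replicated: bool) -> str:
    """Map allele frequencies and GWAS support onto the table's grades."""
    afs = list(af_by_superpop.values())
    n_common = sum(af > RARE_AF for af in afs)
    n_rare = sum(af < RARE_AF for af in afs)
    # Common in exactly one superpopulation, rare in all the others.
    if len(afs) > 1 and n_common == 1 and n_rare == len(afs) - 1:
        return "Population-specific"
    if gwas_p >= GWAS_THRESHOLD:
        return "Ungraded"  # assumption: the table does not cover this case
    af_diff = max(afs) - min(afs)
    if af_diff > 0.2 and replicated:
        return "Strong"
    if af_diff >= 0.05:
        return "Moderate"
    return "Weak"


# Inputs taken from the Example column of the table.
print(grade_evidence({"EUR": 0.30, "EAS": 0.05}, gwas_p=1e-40, replicated=True))
print(grade_evidence({"AFR": 0.15, "EUR": 0.08}, gwas_p=1e-9, replicated=False))
print(grade_evidence({"AFR": 0.10, "EUR": 0.0005}, gwas_p=1.0, replicated=False))
```

The allele frequencies themselves are not returned by the IGSR tools described here, so they would need to come from another source before this function is applied.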

Interpretation Guidance

  • Allele frequency interpretation by ancestry: Allele frequencies vary across superpopulations (AFR, AMR, EAS, EUR, SAS) due to genetic drift, selection, and demographic history. AFR populations have the highest genetic diversity and the shortest haplotypes, because more generations of recombination have broken them up. Disease-risk alleles may be common in one ancestry and rare in another, leading to differential genetic risk across populations.
  • Fst significance thresholds: Fst measures population differentiation (0 = no differentiation, 1 = complete fixation of different alleles). Global Fst for human populations averages ~0.12. Locus-specific Fst > 0.3 suggests strong differentiation (possible selection). Fst > 0.5 is extreme and rare in humans outside known selection targets (e.g., SLC24A5 for skin pigmentation). Compare locus Fst against genome-wide distribution to identify outliers.
  • LD interpretation: Linkage disequilibrium (LD) patterns differ by ancestry. AFR populations have shorter LD blocks due to older demographic history, requiring denser genotyping for fine-mapping. EUR and EAS populations have longer LD blocks. When a GWAS hit is in LD with multiple variants, the causal variant is more likely to be resolved in AFR-ancestry data. Report r-squared values: r2 > 0.8 = strong LD, 0.2-0.8 = moderate, < 0.2 = weak.
  • Population stratification: Uncontrolled population structure in GWAS inflates false positives. The 1000 Genomes superpopulation labels provide a framework for stratified analysis. Mixed-ancestry samples (e.g., AMR) require local ancestry deconvolution for accurate interpretation.
  • Sample size context: 1000 Genomes has ~2500 samples across 26 populations. Population-specific allele frequencies have limited precision for smaller populations (N < 100). For rare variants (AF < 0.01), larger resources like gnomAD provide more reliable estimates.
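
The Fst, LD, and sample-size points above can be made concrete with a few lines of arithmetic. The sketch below uses a simple heterozygosity-based Fst for one biallelic locus and two equally weighted populations, Fst = (H_T - H_S) / H_T. It applies no sample-size correction, so it is not the Weir-Cockerham or Hudson estimator. It also computes the binomial standard error of an allele frequency and applies the r-squared bands from the text. All function names are hypothetical.

```python
import math


def fst_two_populations(p1: float, p2: float) -> float:
    """Heterozygosity-based Fst for one biallelic locus, equal weights."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return 0.0  # locus is monomorphic in both populations
    return (h_total - h_within) / h_total


def af_standard_error(af: float, n_individuals: int) -> float:
    """Binomial standard error of an allele frequency from diploid samples."""
    return math.sqrt(af * (1 - af) / (2 * n_individuals))


def ld_strength(r2: float) -> str:
    """Classify r-squared using the bands given in the guidance above."""
    if r2 > 0.8:
        return "strong"
    if r2 >= 0.2:
        return "moderate"
    return "weak"


# AF values from the rs7903146 example in the evidence table (EUR vs EAS).
print(round(fst_two_populations(0.30, 0.05), 3))
# A rare variant (AF = 0.01) measured in a population of 100 individuals.
print(round(af_standard_error(0.01, 100), 4))
print(ld_strength(0.85))
```

For AF = 0.01 and N = 100 the standard error is of the same order as the frequency itself, which is the reason the text recommends larger resources for rare variants.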

Synthesis Questions

  1. Does the allele frequency of the variant of interest differ meaningfully (> 5%) across superpopulations, and could this explain differential disease prevalence or GWAS effect sizes?
  2. Is the GWAS association replicated across ancestries, or is it population-specific, potentially due to LD structure differences or population-specific selection?
  3. For fine-mapping, does the LD pattern in AFR populations narrow the association signal compared to EUR, helping identify the likely causal variant?
  4. Are the population labels and sample sizes in the 1000 Genomes dataset adequate for the analysis, or is the target population underrepresented?
  5. Could population stratification (uncontrolled ancestry differences between cases and controls) explain the observed association, rather than a true genetic effect?

Tool Parameter Quick Reference

| Tool | Key Parameters | Notes |
|------|----------------|-------|
| IGSR_search_populations | superpopulation, query, limit | superpopulation: AFR/AMR/EAS/EUR/SAS |
| IGSR_search_samples | population, data_collection, sample_name, limit | population = population code (e.g. YRI) |
| IGSR_list_data_collections | limit | 18 collections total |
| gwas_search_associations | trait, limit | free-text trait search |
| gwas_get_variants_for_trait | trait, limit | trait = EFO ID |
| gwas_get_snps_for_gene | gene_symbol, limit | returns mapped SNPs |