tooluniverse-population-genetics-1000genomes

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

Population Genetics with 1000 Genomes (IGSR)

Use IGSR tools to search 1000 Genomes populations and samples, explore data collections, and combine with GWAS tools for population-stratified analysis.

When to Use

"List all African (AFR) populations in the 1000 Genomes Project"
"Find samples from the YRI (Yoruba) population"
"What 1000 Genomes data collections are available?"
"Which GWAS SNPs for type 2 diabetes have population-specific effects?"
"Find all SNPs mapped to TCF7L2 in GWAS studies"

NOT for (use other skills instead)

Allele frequencies from gnomAD -> Use `tooluniverse-population-genetics`
ClinVar / OMIM variant interpretation -> Use `tooluniverse-variant-interpretation`
GWAS fine-mapping -> Use `tooluniverse-gwas-finemapping`

Phase 1: Search 1000 Genomes Populations

IGSR_search_populations: superpopulation (string/null, one of AFR/AMR/EAS/EUR/SAS), query (string/null, free-text search by name), limit (int).

Returns {status, data: {total, populations: [{code, name, description, sample_count, superpopulation_code, superpopulation_name, latitude, longitude}]}, metadata: {source, filter_superpopulation, filter_query}}

Superpopulation codes:

Code	Ancestry
AFR	African
AMR	Admixed American
EAS	East Asian
EUR	European
SAS	South Asian

json

// List all AFR populations
{"superpopulation": "AFR", "limit": 10}

// Search by name (free-text)
{"query": "Yoruba", "limit": 5}

// List all populations
{"limit": 26}

Response example:

json

{
  "status": "success",
  "data": {
    "total": 3,
    "populations": [
      {"code": "YRI", "name": "Yoruba", "description": "Yoruba in Ibadan, Nigeria",
       "sample_count": 188, "superpopulation_code": "AFR", "superpopulation_name": "African Ancestry"}
    ]
  }
}
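
A response like the one above can be regrouped in a few lines of Python. This is a minimal sketch that assumes the response has already been obtained as a dict with the documented shape; the helper name `populations_by_superpopulation` and the second population entry (with a placeholder sample count) are illustrative, not part of the tool.

```python
from collections import defaultdict

# Dict shaped like the documented IGSR_search_populations return value.
# The GBR entry uses a placeholder sample_count for illustration only.
response = {
    "status": "success",
    "data": {
        "total": 2,
        "populations": [
            {"code": "YRI", "name": "Yoruba", "sample_count": 188,
             "superpopulation_code": "AFR"},
            {"code": "GBR", "name": "British", "sample_count": 0,
             "superpopulation_code": "EUR"},
        ],
    },
}


def populations_by_superpopulation(response):
    """Group population codes by their superpopulation code."""
    if response.get("status") != "success":
        raise ValueError(f"IGSR query failed: {response.get('status')!r}")
    grouped = defaultdict(list)
    for pop in response["data"]["populations"]:
        grouped[pop["superpopulation_code"]].append(pop["code"])
    return dict(grouped)


grouped = populations_by_superpopulation(response)
print(grouped)
```

Checking `status` before reading `data` keeps a failed query from surfacing later as a confusing `KeyError`.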

Phase 2: Search Samples by Population

IGSR_search_samples: population (string/null, population code e.g. "YRI"), data_collection (string/null, collection title), sample_name (string/null, specific sample e.g. "NA12878"), limit (int).

Returns {status, data: {total, samples: [{name, sex, biosample_id, populations: [{code, name, superpopulation}], data_collections: [...]}]}}

json

// Find all YRI samples
{"population": "YRI", "limit": 10}

// Look up the reference sample NA12878
{"sample_name": "NA12878", "limit": 1}

// Find samples in the 30x high-coverage collection
{"data_collection": "1000 Genomes 30x on GRCh38", "limit": 5}

NOTE: population takes a population code (e.g. "YRI", "GBR", "CHB"), not a superpopulation code. If you are starting from a superpopulation, call IGSR_search_populations first to get the population codes.
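
The two-step lookup in the note can be sketched as follows. This assumes the IGSR_search_populations response is already available as a dict with the documented shape; `sample_queries_for_superpopulation` is a hypothetical helper that only builds the IGSR_search_samples payloads and does not call any tool.

```python
def sample_queries_for_superpopulation(populations_response, limit=100):
    """Build one IGSR_search_samples payload per population code."""
    populations = populations_response["data"]["populations"]
    return [{"population": pop["code"], "limit": limit} for pop in populations]


# Trimmed-down response for {"superpopulation": "EUR", "limit": 10};
# only the "code" field is kept because it is the only one used here.
eur_response = {
    "status": "success",
    "data": {
        "total": 5,
        "populations": [{"code": c} for c in ["GBR", "FIN", "CEU", "TSI", "IBS"]],
    },
}

payloads = sample_queries_for_superpopulation(eur_response, limit=100)
for payload in payloads:
    print(payload)
```

Each payload can then be passed to IGSR_search_samples separately, since the `population` parameter accepts one population code at a time.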

Phase 3: List Data Collections

IGSR_list_data_collections: limit (int).

Returns {status, data: {total, collections: [{code, title, short_title, sample_count, population_count, data_types, website}]}}

json

{"limit": 20}

Key collections available (18 total):

Collection	Description	Data Types
1000 Genomes on GRCh38	2709 samples, 26 populations	sequence, alignment, variants
1000 Genomes 30x on GRCh38	High-coverage resequencing	sequence, alignment, variants
1000 Genomes phase 3 release	Original phase 3	sequence, alignment, variants
Human Genome Structural Variation Consortium	HGSVC SV discovery	sequence, alignment
MAGE RNA-seq	RNA-seq data	-
Geuvadis	Expression + genotype	-
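
To pick collections by data type, the response can be filtered locally. This is a rough sketch that assumes `data_types` arrives as a list of strings (the documented return shape does not state its type) and that collections shown with "-" in the table carry an empty or missing list; `collections_with` is a hypothetical helper.

```python
def collections_with(response, data_type):
    """Return titles of collections that list the given data type."""
    return [
        c["title"]
        for c in response["data"]["collections"]
        if data_type in (c.get("data_types") or [])
    ]


# Trimmed-down response using three titles from the table above.
# Treating data_types as a list of strings is an assumption.
response = {
    "status": "success",
    "data": {
        "total": 3,
        "collections": [
            {"title": "1000 Genomes 30x on GRCh38",
             "data_types": ["sequence", "alignment", "variants"]},
            {"title": "Human Genome Structural Variation Consortium",
             "data_types": ["sequence", "alignment"]},
            {"title": "Geuvadis", "data_types": None},
        ],
    },
}

with_variants = collections_with(response, "variants")
print(with_variants)
```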

Phase 4: GWAS Context for Population Stratification

Search GWAS associations for a trait

gwas_search_associations: trait (string, free text), limit (int). Returns GWAS associations with rsID, p-value, mapped genes, EFO trait IDs.

json

{"trait": "type 2 diabetes", "limit": 10}
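
Returned associations usually need filtering before interpretation. The sketch below keeps only genome-wide significant hits (p < 5e-8). The field names `rs_id` and `p_value` and the sample records are assumptions made for illustration; the text above lists what is returned but not the exact keys, so adjust them to the real response.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def significant_hits(associations, threshold=GENOME_WIDE_P):
    """Keep associations with a p-value below the threshold, most significant first."""
    kept = [
        a for a in associations
        if a.get("p_value") is not None and a["p_value"] < threshold
    ]
    return sorted(kept, key=lambda a: a["p_value"])


# Hypothetical records: rsIDs and p-values are placeholders, not real results.
associations = [
    {"rs_id": "rs_hypothetical_1", "p_value": 3e-6},
    {"rs_id": "rs_hypothetical_2", "p_value": 1e-40},
    {"rs_id": "rs_hypothetical_3", "p_value": None},
    {"rs_id": "rs_hypothetical_4", "p_value": 2e-9},
]

hits = significant_hits(associations)
print([h["rs_id"] for h in hits])
```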

Get variants for a specific trait (by EFO ID)

gwas_get_variants_for_trait: trait (string, EFO ID e.g. "EFO_0001645"), limit (int).

json

{"trait": "EFO_0001645", "limit": 10}

Find SNPs in a gene from GWAS catalog

gwas_get_snps_for_gene: gene_symbol (string), limit (int). Returns SNPs mapped to the gene with rsIDs, genomic positions, functional classes.

json

{"gene_symbol": "TCF7L2", "limit": 10}

Workflow: Population Stratification in GWAS

Step 1 -- Find populations of interest:

json

// Get all EUR populations
{"superpopulation": "EUR", "limit": 10}
// -> Returns codes like GBR, FIN, CEU, TSI, IBS

Step 2 -- Get samples from target population:

json

// Get YRI samples (AFR)
{"population": "YRI", "limit": 100}

Step 3 -- Get GWAS SNPs for the gene or trait:

json

// GWAS hits for TCF7L2 (T2D gene)
{"gene_symbol": "TCF7L2", "limit": 20}

Step 4 -- Cross-reference with population data for stratification analysis.
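
One way to approach Step 4 is sketched below. The IGSR return shapes documented in this skill carry population and sample metadata but no allele frequencies, so the sketch assumes a per-superpopulation frequency table has been obtained elsewhere. The rs7903146 values are the illustrative ones from the Evidence Grading table; the second SNP is a placeholder; `af_spread` is a hypothetical helper.

```python
def af_spread(af_by_superpop):
    """Largest absolute allele-frequency difference between superpopulations."""
    values = list(af_by_superpop.values())
    return max(values) - min(values)


# rsIDs as they might come back from gwas_get_snps_for_gene (Step 3).
gwas_hits = ["rs7903146", "rs_hypothetical"]

# Assumed frequency table from another resource, keyed by superpopulation code.
af_table = {
    "rs7903146": {"EUR": 0.30, "EAS": 0.05},
    "rs_hypothetical": {"AFR": 0.15, "EUR": 0.08},
}

report = []
for rsid in gwas_hits:
    afs = af_table.get(rsid)
    if afs is None:
        continue  # no frequency data for this SNP; skip rather than guess
    spread = af_spread(afs)
    # Flag SNPs whose frequency differs by more than 0.05 across superpopulations.
    report.append((rsid, round(spread, 3), spread > 0.05))

for row in report:
    print(row)
```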

Common Population Codes

Code	Population	Superpopulation
YRI	Yoruba in Ibadan, Nigeria	AFR
LWK	Luhya in Webuye, Kenya	AFR
GWD	Gambian Mandinka	AFR
CEU	Utah residents (CEPH)	EUR
GBR	British in England/Scotland	EUR
FIN	Finnish in Finland	EUR
TSI	Toscani in Italia	EUR
CHB	Han Chinese in Beijing	EAS
JPT	Japanese in Tokyo	EAS
CHS	Southern Han Chinese	EAS
MXL	Mexican Ancestry in LA	AMR
PUR	Puerto Rican in Puerto Rico	AMR
GIH	Gujarati Indian in Houston	SAS
PJL	Punjabi from Lahore	SAS

Reasoning Framework for Result Interpretation

Evidence Grading

Grade	Criteria	Example
Strong	AF difference > 0.2 across superpopulations, GWAS p < 5e-8, replicated in multiple cohorts	rs7903146 (TCF7L2) with AF = 0.30 EUR vs 0.05 EAS, GWAS p = 1e-40
Moderate	AF difference 0.05-0.2, GWAS p < 5e-8 in one ancestry, nominal in others	Variant with AF = 0.15 AFR vs 0.08 EUR, GWAS p < 5e-8 in EUR only
Weak	AF difference < 0.05, GWAS p < 5e-8 but single study, no cross-ancestry replication	Common variant with similar AF across populations, significant in one cohort
Population-specific	Variant common (AF > 0.01) in one superpopulation, rare (AF < 0.01) in others	Sickle cell variant (rs334) AF ~0.10 in AFR, < 0.001 elsewhere
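
The grading table can be turned into a small function so that grades are applied consistently. This is one possible reading of the criteria, not a definitive rule set: the table does not cover every combination (for example a large AF difference without replication), and the sketch assigns those cases to the nearest lower grade. The function name, its arguments, and the 0.0005 frequency are hypothetical.

```python
GENOME_WIDE_P = 5e-8


def grade_evidence(af_by_superpop, gwas_p, replicated):
    """Assign an evidence grade from allele frequencies, GWAS p-value, and replication."""
    afs = list(af_by_superpop.values())
    common = [af for af in afs if af > 0.01]
    rare = [af for af in afs if af < 0.01]
    # Common in exactly one superpopulation and rare in all the others.
    if len(afs) > 1 and len(common) == 1 and len(rare) == len(afs) - 1:
        return "Population-specific"
    if gwas_p >= GENOME_WIDE_P:
        return "Ungraded"  # assumption: the table only grades genome-wide significant hits
    diff = max(afs) - min(afs)
    if diff > 0.2 and replicated:
        return "Strong"
    if diff >= 0.05:
        return "Moderate"
    return "Weak"


# Examples taken from the table; 0.0005 stands in for "< 0.001 elsewhere".
strong = grade_evidence({"EUR": 0.30, "EAS": 0.05}, gwas_p=1e-40, replicated=True)
moderate = grade_evidence({"AFR": 0.15, "EUR": 0.08}, gwas_p=1e-9, replicated=False)
specific = grade_evidence({"AFR": 0.10, "EUR": 0.0005}, gwas_p=1e-9, replicated=False)
print(strong, moderate, specific)
```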

Interpretation Guidance

Allele frequency interpretation by ancestry: Allele frequencies vary across superpopulations (AFR, AMR, EAS, EUR, SAS) due to genetic drift, selection, and demographic history. AFR populations have the highest genetic diversity and the shortest haplotype blocks, because recombination has had more generations to break them up. Disease-risk alleles may be common in one ancestry and rare in another, leading to differential genetic risk across populations.
Fst significance thresholds: Fst measures population differentiation (0 = no differentiation, 1 = complete fixation of different alleles). Global Fst for human populations averages ~0.12. Locus-specific Fst > 0.3 suggests strong differentiation (possible selection). Fst > 0.5 is extreme and rare in humans outside known selection targets (e.g., SLC24A5 for skin pigmentation). Compare locus Fst against genome-wide distribution to identify outliers.
LD interpretation: Linkage disequilibrium (LD) patterns differ by ancestry. AFR populations have shorter LD blocks due to older demographic history, so tagging variants requires denser genotyping, but association signals resolve to smaller regions. EUR and EAS populations have longer LD blocks. When a GWAS hit is in LD with multiple variants, the causal variant is more likely to be resolved in AFR-ancestry data. Report r-squared values: r2 > 0.8 = strong LD, 0.2-0.8 = moderate, < 0.2 = weak.
Population stratification: Uncontrolled population structure in GWAS inflates false positives. The 1000 Genomes superpopulation labels provide a framework for stratified analysis. Admixed samples (e.g., AMR) require local ancestry deconvolution for accurate interpretation.
Sample size context: The 1000 Genomes phase 3 release has 2,504 samples across 26 populations, roughly 100 per population. Population-specific allele frequencies therefore have limited precision, especially for populations with N < 100. For rare variants (AF < 0.01), larger resources like gnomAD provide more reliable estimates.
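
The quantities in this guidance can be computed directly. The sketch below uses a simple heterozygosity-based Fst for one biallelic locus with equally weighted populations (not the Weir-Cockerham estimator that dedicated packages typically use), the r2 bands stated above, and the binomial standard error of an allele frequency estimated from N diploid samples. All function names are hypothetical.

```python
import math


def fst_biallelic(allele_freqs):
    """Fst = (Ht - Hs) / Ht for one biallelic locus, populations weighted equally."""
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    h_within = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    if h_total == 0:
        return 0.0  # locus is monomorphic everywhere; no differentiation to measure
    return (h_total - h_within) / h_total


def ld_class(r2):
    """Classify r-squared using the bands given in the guidance."""
    if r2 > 0.8:
        return "strong"
    if r2 >= 0.2:
        return "moderate"
    return "weak"


def af_standard_error(p, n_samples):
    """Binomial standard error of an allele frequency from n diploid samples."""
    return math.sqrt(p * (1 - p) / (2 * n_samples))


# Illustrative frequencies from the Evidence Grading table (EUR 0.30 vs EAS 0.05).
fst = fst_biallelic([0.30, 0.05])
print(f"Fst = {fst:.3f}")
print(ld_class(0.85), ld_class(0.5), ld_class(0.1))
print(f"SE of AF 0.01 with N = 100: {af_standard_error(0.01, 100):.4f}")
```

The standard error shows why rare variants are poorly estimated in a single population: with N = 100 the uncertainty is of the same order as an allele frequency of 0.01.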

Synthesis Questions

Does the allele frequency of the variant of interest differ meaningfully (absolute difference > 0.05) across superpopulations, and could this explain differential disease prevalence or GWAS effect sizes?
Is the GWAS association replicated across ancestries, or is it population-specific, potentially due to LD structure differences or population-specific selection?
For fine-mapping, does the LD pattern in AFR populations narrow the association signal compared to EUR, helping identify the likely causal variant?
Are the population labels and sample sizes in the 1000 Genomes dataset adequate for the analysis, or is the target population underrepresented?
Could population stratification (uncontrolled ancestry differences between cases and controls) explain the observed association, rather than a true genetic effect?

Tool Parameter Quick Reference

Tool	Key Parameters	Notes
IGSR_search_populations	superpopulation, query, limit	superpopulation: AFR/AMR/EAS/EUR/SAS
IGSR_search_samples	population, data_collection, sample_name, limit	population = population code (e.g. YRI)
IGSR_list_data_collections	limit	18 collections total
gwas_search_associations	trait, limit	free-text trait search
gwas_get_variants_for_trait	trait, limit	trait = EFO ID
gwas_get_snps_for_gene	gene_symbol, limit	returns mapped SNPs
