tooluniverse-regulatory-variant-analysis


COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

Regulatory Variant Analysis Skill

Systematic regulatory variant interpretation: discover trait associations from GWAS, map eQTL effects, annotate chromatin context, assess regulatory element overlap, and produce evidence-graded functional impact predictions for non-coding variants.

When to Use

  • "What GWAS associations exist for rs12913832?"
  • "Find eQTLs for the APOE locus in brain tissue"
  • "What regulatory elements overlap this variant region?"
  • "Which SNPs are associated with type 2 diabetes from GWAS?"
  • "Is this intronic variant in an active enhancer?"
  • "What is the RegulomeDB score for rs429358?"
  • "Find ENCODE histone marks at the BRCA1 promoter region"
  • "Map trait ontology terms for 'blood pressure' to EFO IDs"
NOT for (use other skills instead):
  • Coding variant pathogenicity -> Use tooluniverse-variant-interpretation
  • Full clinical variant classification (ACMG) -> Use tooluniverse-variant-interpretation
  • Gene-disease associations (not variant-specific) -> Use tooluniverse-gene-disease-association
  • Pharmacogenomic variant annotation -> Use tooluniverse-pharmacogenomics
  • Epigenomics data processing (BED/narrowPeak files) -> Use tooluniverse-epigenomics

Non-Coding Variant Impact Reasoning

When evaluating a non-coding variant, build evidence across four questions:
1. Is the variant in a regulatory element? Use RegulomeDB to assess whether the variant overlaps TF binding sites, chromatin accessibility peaks, or known regulatory annotations. A low RegulomeDB score (categories 1a-2a) indicates strong evidence that the position is functionally active. Confirm with ENCODE histone marks: H3K27ac signals active enhancers and active promoters; H3K4me1 alone marks poised enhancers; H3K4me3 marks active promoters; H3K27me3 marks silenced regions.
2. Does it alter a transcription factor binding site? Check RegulomeDB's TF binding evidence and ENCODE TF ChIP-seq experiments. A variant that falls within a TF footprint and disrupts the consensus motif is mechanistically actionable, especially if the TF is known to be relevant in the disease tissue.
3. Is there eQTL evidence linking it to a gene? Query GTEx to determine whether the variant (or variants in tight LD) modulates expression of a nearby gene in a tissue-specific or ubiquitous manner. A tissue-specific eQTL suggests cell-type-specific regulation; a ubiquitous eQTL suggests a core regulatory element. The direction of the NES (positive = alternative allele increases expression, negative = decreases) and effect size matter for interpretation.
4. Is there GWAS evidence for trait association? Search the GWAS Catalog for the rsID or the surrounding locus. Genome-wide significant associations (p < 5×10⁻⁸) in relevant traits anchor the variant's biological importance. Cross-reference with OpenTargets for locus-to-gene mapping from multiple GWAS studies.
Synthesizing the evidence: Build a multi-layer case. A variant with GWAS significance + eQTL evidence + RegulomeDB score 1a-2a + active chromatin (H3K27ac) in the relevant tissue represents high-confidence regulatory impact. Two or three converging lines of evidence (e.g., eQTL plus active enhancer) constitute moderate confidence. A single line, or a variant only in a poised but not active regulatory context, represents lower confidence.
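
The histone-mark rules from question 1 can be written down as a small lookup. This is a minimal sketch, not part of any ToolUniverse tool: the function name `chromatin_state`, the returned labels, and the precedence between marks (active marks win over H3K27me3) are assumptions made for illustration.

```python
def chromatin_state(marks):
    """Map the histone marks observed at a locus to a coarse state.

    Encodes only the four rules stated in the text; mixed or bivalent
    combinations are resolved by a simple, assumed precedence.
    """
    marks = set(marks)
    if "H3K27ac" in marks and "H3K4me3" in marks:
        return "active promoter"
    if "H3K27ac" in marks and "H3K4me1" in marks:
        return "active enhancer"
    if "H3K27ac" in marks:
        # H3K27ac alone does not distinguish enhancer from promoter
        return "active enhancer or promoter"
    if "H3K4me3" in marks:
        return "active promoter"
    if "H3K27me3" in marks:
        return "silenced"
    if "H3K4me1" in marks:
        return "poised enhancer"
    return "unannotated"


for observed in (["H3K27ac", "H3K4me1"], ["H3K4me1"], ["H3K27me3"], []):
    print(observed, "->", chromatin_state(observed))
```

Only "active" states count toward the high-confidence case described above; a "poised enhancer" result on its own stays in the lower-confidence tier.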

Workflow Overview

Input (rsID, genomic coordinates, trait/disease, gene)
  |
  v
Phase 0: Variant/Trait Resolution
  Resolve rsIDs, map trait names to EFO/MONDO IDs via OLS
  |
  v
Phase 1: GWAS Association Lookup
  GWAS Catalog associations, p-values, effect sizes, study metadata
  |
  v
Phase 2: eQTL Analysis
  GTEx tissue-specific eQTLs, target gene identification
  |
  v
Phase 3: Regulatory Element Annotation
  ENCODE histone marks, RegulomeDB scores, chromatin state
  |
  v
Phase 4: OpenTargets GWAS Integration
  OpenTargets GWAS study aggregation, locus-to-gene mapping
  |
  v
Phase 5: Functional Impact Synthesis
  Integrate all evidence, assign regulatory impact level
  |
  v
Phase 6: Report
  Evidence-graded regulatory variant report

Phase 0: Variant/Trait Resolution

Use ols_search_terms to resolve trait names to ontology IDs before GWAS queries. Restrict to ontology="efo" for GWAS traits; OpenTargets prefers MONDO IDs (e.g., MONDO_0005148 for type 2 diabetes rather than EFO_0001360). Use EnsemblVEP_annotate_rsid (param is variant_id, not rsid) for initial consequence annotation and nearest gene identification.
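
Because the GWAS Catalog and OpenTargets want different ontologies for the same trait, it helps to pick both IDs once and carry them through the workflow. The sketch below assumes the ontology search returns a list of dicts with a `short_form` field such as "MONDO_0005148"; that field name, the helper `pick_ontology_id`, and the mock hits are illustrative assumptions, not the documented output of ols_search_terms.

```python
def pick_ontology_id(hits, prefix):
    """Return the first hit whose ID belongs to the given ontology prefix."""
    for hit in hits:
        short_form = hit.get("short_form", "")
        if short_form.startswith(prefix + "_"):
            return short_form
    return None


# Mock search hits for "type 2 diabetes" (shape is an assumption)
hits = [
    {"label": "type 2 diabetes", "short_form": "MONDO_0005148"},
    {"label": "type 2 diabetes", "short_form": "EFO_0001360"},
]

gwas_trait_id = pick_ontology_id(hits, "EFO")      # for GWAS Catalog queries
opentargets_id = pick_ontology_id(hits, "MONDO")   # for OpenTargets queries
print(gwas_trait_id, opentargets_id)
```

A `None` result signals that the trait term should be broadened or searched in the other ontology before moving on to Phase 1.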

Phase 1: GWAS Association Lookup

gwas_search_associations is the primary tool: it accepts disease_trait (free text), efo_id (preferred for precision), rs_id, and a p_value threshold. Use p_value=5e-8 for genome-wide significance. For locus-level discovery, gwas_get_variants_for_trait retrieves all SNPs for a trait, and gwas_get_snps_for_gene finds GWAS-cataloged SNPs mapped to a specific gene.
Reasoning tip: When GWAS Catalog returns empty for a free-text trait, switch to the efo_id parameter; the catalog uses controlled vocabulary and free-text matching is imprecise.
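
Once associations are retrieved, the genome-wide threshold can be re-applied locally so that downstream steps only see significant hits. This is a sketch under assumptions: the record keys `trait` and `p_value`, the helper name, and the example values are invented for illustration and do not reflect real catalog entries or the tool's actual response schema.

```python
GENOME_WIDE = 5e-8  # conventional genome-wide significance threshold


def significant_associations(records, threshold=GENOME_WIDE):
    """Keep records with p < threshold, most significant first."""
    kept = [
        r for r in records
        if r.get("p_value") is not None and r["p_value"] < threshold
    ]
    return sorted(kept, key=lambda r: r["p_value"])


# Made-up records, for illustration only
records = [
    {"trait": "trait A", "p_value": 3e-12},
    {"trait": "trait B", "p_value": 2e-6},    # suggestive, not genome-wide
    {"trait": "trait C", "p_value": None},    # missing p-value is dropped
    {"trait": "trait D", "p_value": 1e-40},
]

for hit in significant_associations(records):
    print(hit["trait"], hit["p_value"])
```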

Phase 2: eQTL Analysis

GTEx_query_eqtl accepts a gene symbol (auto-resolved to GENCODE ID) or Ensembl gene ID. It returns tissue-specific SNP-gene associations with NES (normalized effect size) and p-value per tissue.
When interpreting results, ask: does the eQTL effect occur in the tissue most relevant to the disease? A brain-specific eQTL for a neurodegenerative disease variant is more compelling than a ubiquitous one. Use GTEx_get_median_gene_expression to confirm that the target gene is actually expressed in the relevant tissue before placing weight on eQTL evidence.
Note: GTEx API uses v8 data; gtex_v10 endpoints may return empty for some queries.
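
The two checks above (disease-relevant tissue, and the gene actually being expressed there) can be combined into one filter. The sketch assumes eQTL rows with `tissue`, `nes`, and `p_value` keys and a tissue-to-median-TPM mapping; those shapes, the 1.0 TPM cutoff, and all numbers are illustrative assumptions rather than GTEx output.

```python
def relevant_eqtls(eqtls, median_tpm, tissues_of_interest, min_tpm=1.0):
    """Keep eQTLs in relevant tissues where the gene is expressed.

    Adds a human-readable direction derived from the sign of NES.
    """
    kept = []
    for row in eqtls:
        tissue = row["tissue"]
        if tissue not in tissues_of_interest:
            continue
        if median_tpm.get(tissue, 0.0) < min_tpm:
            continue  # gene not appreciably expressed; discount the eQTL
        if row["nes"] > 0:
            direction = "alt allele increases expression"
        elif row["nes"] < 0:
            direction = "alt allele decreases expression"
        else:
            direction = "no effect"
        kept.append({**row, "direction": direction})
    return sorted(kept, key=lambda r: r["p_value"])


# Made-up values, for illustration only
eqtls = [
    {"tissue": "Brain_Cortex", "nes": -0.42, "p_value": 1e-9},
    {"tissue": "Liver", "nes": 0.30, "p_value": 1e-12},
    {"tissue": "Brain_Hippocampus", "nes": 0.15, "p_value": 1e-5},
]
median_tpm = {"Brain_Cortex": 35.0, "Liver": 120.0, "Brain_Hippocampus": 0.2}

for row in relevant_eqtls(eqtls, median_tpm, {"Brain_Cortex", "Brain_Hippocampus"}):
    print(row["tissue"], row["nes"], row["direction"])
```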

Phase 3: Regulatory Element Annotation

RegulomeDB_query_variant (param: rsid) returns a regulatory score and feature annotations. Scores in categories 1a–2a indicate strong regulatory evidence (eQTL overlap + TF binding + chromatin accessibility). Scores 3a–6 represent progressively weaker evidence.
ENCODE_search_histone_experiments accepts histone_mark (e.g., "H3K27ac") and biosample_term_name (tissue or cell line name, NOT a disease name; ENCODE uses biological sample names like "liver" or "breast epithelium"). Use assay_title="TF ChIP-seq" (not just "ChIP-seq") when querying TF binding data.
Reasoning tip: RegulomeDB aggregates ENCODE, Roadmap, and other data. If ENCODE doesn't have the specific biosample, RegulomeDB may still have aggregate evidence from related cell types.
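
RegulomeDB categories mix a digit and an optional letter, so comparing them as plain strings or numbers is error-prone. A minimal sketch of an ordering helper follows; it assumes the score arrives as a string such as "1a", "2b", or "4", and the function names and the default "2a" cutoff (taken from the text above) are illustrative choices.

```python
import re

_RANK = re.compile(r"^(\d)([a-z]?)$")


def rank_key(rank):
    """Turn a category like '1f' or '4' into a sortable (digit, letter) tuple."""
    match = _RANK.match(rank.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized RegulomeDB category: {rank!r}")
    return int(match.group(1)), match.group(2)


def is_strong(rank, cutoff="2a"):
    """True when the category is at least as strong as the cutoff."""
    return rank_key(rank) <= rank_key(cutoff)


for category in ("1a", "1f", "2a", "2b", "4"):
    print(category, is_strong(category))
```

Lower tuples mean stronger evidence, so sorting a list of variants by `rank_key` puts the best-supported positions first.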

Phase 4: OpenTargets GWAS Integration

OpenTargets_search_gwas_studies_by_disease takes diseaseIds as an array of MONDO IDs. It provides locus-to-gene (L2G) scores from multiple GWAS studies, which go beyond simple proximity to incorporate colocalisation, eQTL, and chromatin data. Use OpenTargets_multi_entity_search or OpenTargets_get_disease_id_description_by_name to resolve disease names to MONDO/EFO IDs first.
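
Since diseaseIds must be an array and the IDs in this document use the underscore form (MONDO_0005148), a small guard can catch malformed input before the query is sent. This is a sketch: the helper name, the regular expression, and the colon-to-underscore normalization are assumptions for illustration, not validation performed by OpenTargets itself.

```python
import re

_DISEASE_ID = re.compile(r"^(MONDO|EFO)_\d+$")


def disease_ids_argument(ids):
    """Build the diseaseIds argument, accepting one ID or a list of IDs."""
    if isinstance(ids, str):
        ids = [ids]  # a bare string would otherwise be sent instead of an array
    cleaned = [i.strip().replace(":", "_") for i in ids]
    bad = [i for i in cleaned if not _DISEASE_ID.match(i)]
    if bad:
        raise ValueError(f"not MONDO/EFO IDs: {bad}")
    return {"diseaseIds": cleaned}


print(disease_ids_argument("MONDO:0005148"))
```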

Phase 5: Functional Impact Synthesis

After collecting evidence, reason through the layers:
  • High impact: GWAS genome-wide significant + eQTL with meaningful NES + RegulomeDB score ≤ 2 + active chromatin (H3K27ac) in relevant tissue. Multiple independent lines converge on the same locus and gene.
  • Moderate impact: Two to three lines of evidence (e.g., eQTL + active enhancer overlap, or GWAS significant + RegulomeDB ≤ 3) without full convergence.
  • Low impact: Single line of evidence, or only computational annotation (VEP consequence category) without functional data.
  • No evidence: No regulatory annotations in any source; the variant may be in a non-functional region or the relevant cell type is not represented in available datasets.
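
The tiers above reduce to counting how many independent evidence lines are present. The sketch below is a deliberately simplified reading: it treats each line as a boolean, does not model the RegulomeDB ≤ 2 versus ≤ 3 distinction, and handles the VEP-only case through a separate flag. The function name and signature are assumptions for illustration.

```python
def classify_impact(gwas_significant, eqtl, regulome_strong,
                    active_chromatin, vep_annotation_only=False):
    """Assign an impact tier from four boolean evidence lines."""
    lines = sum(bool(x) for x in
                (gwas_significant, eqtl, regulome_strong, active_chromatin))
    if lines == 4:
        return "High"
    if lines >= 2:
        return "Moderate"
    if lines == 1 or vep_annotation_only:
        return "Low"
    return "No evidence"


print(classify_impact(True, True, True, True))      # all four lines converge
print(classify_impact(False, True, False, True))    # eQTL + active enhancer
print(classify_impact(False, False, False, False))  # nothing found
```

A "No evidence" result should be reported with the caveat from the text: the relevant cell type may simply be missing from the available datasets.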

Fallback Strategies

  • GWAS Catalog returns empty: Switch from free-text disease_trait to efo_id; broaden the trait term.
  • GTEx eQTL empty for gene: Verify gene symbol spelling; try Ensembl ID; increase size parameter.
  • RegulomeDB returns no data: Query ENCODE directly; the variant may lack regulatory annotations in available data.
  • OpenTargets GWAS returns None: Verify MONDO/EFO ID format; try OpenTargets_multi_entity_search first to confirm the correct ID.
  • ENCODE tissue not found: ENCODE uses specific biosample names; RegulomeDB aggregates data from many cell types and may cover the gap.
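
Each fallback above follows the same pattern: try the first query, and move to the next only when the result is empty. A minimal sketch of that pattern is shown here with a stand-in function; `first_non_empty`, `fake_gwas_search`, and the placeholder rsID are hypothetical and exist only to make the example runnable without network access.

```python
def first_non_empty(attempts):
    """Run (label, callable) pairs in order; return the first non-empty result."""
    for label, attempt in attempts:
        result = attempt()
        if result:  # treats None, [] and {} as "empty"
            return label, result
    return None, []


def fake_gwas_search(**arguments):
    # Stand-in for a real tool call: only the efo_id query returns hits
    if arguments.get("efo_id") == "EFO_0001360":
        return [{"rs_id": "rs0000001"}]  # placeholder, not a real finding
    return []


label, hits = first_non_empty([
    ("free text", lambda: fake_gwas_search(disease_trait="type 2 diabetes")),
    ("efo_id", lambda: fake_gwas_search(efo_id="EFO_0001360")),
])
print(label, hits)
```

Recording which attempt succeeded (the returned label) keeps the final report honest about how each piece of evidence was obtained.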

Example Workflows

GWAS Variant Functional Annotation (rs429358 / APOE)

Step 1: gwas_search_associations(rs_id="rs429358")
  -> All trait associations (Alzheimer's disease, LDL cholesterol, etc.)

Step 2: GTEx_query_eqtl(gene_symbol="APOE")
  -> Tissue-specific eQTL evidence; note effect in brain vs liver

Step 3: RegulomeDB_query_variant(rsid="rs429358")
  -> Regulatory score and TF binding annotations

Step 4: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name="brain")
  -> Active enhancer context near the variant

Step 5: Synthesize: does GWAS significance + eQTL + active chromatin converge on one gene?

Non-Coding Variant Assessment (Intronic/UTR Variant)

Step 1: EnsemblVEP_annotate_rsid(variant_id="rs12345678")
  -> Confirm non-coding consequence, identify nearest gene

Step 2: RegulomeDB_query_variant(rsid="rs12345678")
  -> Is this position in a regulatory context?

Step 3: gwas_search_associations(rs_id="rs12345678")
  -> Any GWAS associations in relevant traits?

Step 4: GTEx_query_eqtl(gene_symbol=nearest_gene)
  -> Does this variant or nearby variants modulate expression?

Step 5: ENCODE_search_histone_experiments(histone_mark="H3K27ac", biosample_term_name=relevant_tissue)
  -> Active chromatin confirmation

Step 6: Classify impact based on convergence of evidence lines
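
The first five steps can be captured as data so that the call order and parameter names stay consistent between runs. This sketch only builds the plan; it does not execute anything. The tool and parameter names come from the steps above, while the helper name, the tuple layout, and the placeholder gene "GENE1" are assumptions. In practice the nearest gene would come from the Step 1 result.

```python
def noncoding_assessment_plan(rsid, nearest_gene, tissue):
    """Return the ordered (tool_name, arguments) pairs for steps 1-5."""
    return [
        ("EnsemblVEP_annotate_rsid", {"variant_id": rsid}),
        ("RegulomeDB_query_variant", {"rsid": rsid}),
        ("gwas_search_associations", {"rs_id": rsid}),
        ("GTEx_query_eqtl", {"gene_symbol": nearest_gene}),
        ("ENCODE_search_histone_experiments",
         {"histone_mark": "H3K27ac", "biosample_term_name": tissue}),
    ]


plan = noncoding_assessment_plan("rs12345678", "GENE1", "liver")
for step, (tool, arguments) in enumerate(plan, start=1):
    print(f"Step {step}: {tool} {arguments}")
```

Note that the three variant-level tools each use a different parameter name for the same rsID (variant_id, rsid, rs_id), which is the most common source of empty results.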

Limitations

  • GWAS Catalog covers published GWAS only; unpublished studies are not included.
  • GTEx eQTL data is from v8; v10 endpoints may return empty.
  • RegulomeDB annotations depend on available ENCODE/Roadmap data for the specific cell type.
  • eQTL analysis identifies correlation, not causation; fine-mapping is needed to identify causal variants.
  • RegulomeDB scores are heuristic; a score of 1a does not guarantee functional impact.
  • GWAS associations are population-level; individual variant effects depend on genetic background.