alphagenome-single-variant-analysis

Variant Analysis using AlphaGenome

Prerequisites

  1. `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
  2. User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory then (1) prominently notify the user to check the terms at https://deepmind.google.com/science/alphagenome/, then (2) create the file recording the notification text and timestamp.
  3. `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
  4. `ALPHAGENOME_API_KEY`: This skill requires an API key to function. You must ask the user for an API key if this skill looks relevant to their request and you do not have one in `.env`. The user can sign up at https://deepmind.google.com/science/alphagenome/. Do NOT ask the user to paste their key into the chat (this would leak the key into the agent's context). Instead, explain that a key is necessary to use AlphaGenome and give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:

     ```bash
     printf "Enter AlphaGenome API key (typing hidden): " && read -s key && echo && echo "ALPHAGENOME_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
     ```

     The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.

     When running in sandbox, `dotenv.load_dotenv()` will be a no-op, and instead the sandbox will read credentials and inject them directly.
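The user-notification step in item 2 can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `notify_once`, the wording of `NOTICE`, and the layout of the recorded file are hypothetical, since the source only requires that the notification text and a timestamp be stored in LICENSE_NOTIFICATION.txt.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile

# Hypothetical wording; the source only requires pointing the user at the terms URL.
NOTICE = (
    "This analysis uses AlphaGenome. Please check the terms at "
    "https://deepmind.google.com/science/alphagenome/"
)


def notify_once(skill_dir: Path, notice: str = NOTICE) -> bool:
    """Notify the user and record it, unless a record already exists.

    Returns True if a notification was issued on this call.
    """
    marker = skill_dir / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False
    print(f"IMPORTANT: {notice}")  # (1) prominent notification comes first
    timestamp = datetime.now(timezone.utc).isoformat()
    marker.write_text(f"{timestamp}\n{notice}\n", encoding="utf-8")  # (2) then record it
    return True


# Demonstration in a throwaway directory standing in for the skill directory.
with tempfile.TemporaryDirectory() as tmp:
    first_call = notify_once(Path(tmp))
    second_call = notify_once(Path(tmp))
    recorded = (Path(tmp) / "LICENSE_NOTIFICATION.txt").read_text(encoding="utf-8")
```

Keeping the existence check and the write in one function means that repeated invocations of the skill notify the user only once per skill directory.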

Core Rules

  • NEVER run `python3` or `python3 -c` directly. The system Python does not necessarily have pandas, numpy, and other key dependencies. ALWAYS use `uv run` to run ALL Python code, including scripts, ad-hoc analysis files, and one-liners. Do not attempt to `pip install` or create new venvs; `uv` manages an isolated environment automatically.
  • Offline Only: NEVER use external APIs (e.g., MyGene.info, Ensembl REST) for gene/transcript lookup. Use `lookup_gene_info.py` with the local GTF. If it fails, fix the environment/paths; do not switch to external APIs.
  • API Key is required: `ALPHAGENOME_API_KEY` must be set before running any script (in sandbox, credentials are injected automatically).
  • Notification: If this skill is used, ensure this is mentioned in the output.
  • Report Format: Always use the templates in `docs/report-templates.md` for generating analysis reports, and be sure to include the table of top hits from the discovery scan.

Environment Setup & Troubleshooting

Python Environment

All scripts must be executed using `uv run`, which manages an isolated virtual environment with the correct dependencies via `uv`.

```bash
uv run <script_name> [args...]
```

For ad-hoc scripts (e.g., inline analysis code saved to a temp file), pass the full path instead of a short name:

```bash
uv run --project $SKILL_DIR /tmp/my_analysis.py --arg1 val1
```

> [!NOTE]
> The first invocation resolves and installs dependencies (~10s). Subsequent runs use the cached environment and start instantly. The cache lives in `~/.cache/uv/`.

Common Issues

  • Column Names: `tidy_scores` and metadata often use `gene_name` (not `gene_symbol`) and `output_type` (not `modality`). Always inspect `df.columns` before filtering.
  • Large Genes: Genes > 500kb (e.g., `USH2A`) break the `whole_gene` view. Use `--view detail` or manual regional windows instead.
  • Sashimi Strand Error: `plot_components.Sashimi` does NOT accept a `strand` argument directly. Filter input tracks instead.
  • KeyError: 'ontology_curie': Not all tracks have `ontology_curie`. Check `track.metadata.columns` before filtering.
  • Python Path: If `exec: "python": executable file not found` occurs, ensure you are using `uv run` instead of bare `python` / `python3`.
  • NotImplementedError (pandas): "iLocation based boolean indexing on an integer type is not available". This occurs when using boolean masks with `.iloc` on integer-indexed DataFrames in newer pandas versions. Fix: Convert boolean masks to integer indices using `np.flatnonzero(mask)`.
  • GTF Feather Case Sensitivity: The AlphaGenome GTF Feather file uses capitalized column names (`Feature`, `Start`, `End`, `Strand`), unlike standard GTF files. Always check `df.columns` if you get KeyErrors.
  • `score_variant` ontology filtering: `score_variant` does NOT accept `ontology_terms` as an argument. You must filter the returned AnnData objects manually by inspecting `adata.var` columns. In contrast, `predict_variant` DOES accept `ontology_terms` directly.
  • Sashimi Zoom Logic: To ensure "skipping" arcs are visible, expand the zoom to include the flanking exons rather than relying on junction overlap alone.
  • Junction Scores: Raw `Junction` objects from `prediction` may be simple Intervals. Use `junction_data.get_junctions_to_plot(predictions=..., name=...)` to retrieve objects with the `.k` (abundance/score) attribute.
  • `uv` Not Found: If `exec: uv: not found` occurs, follow the installation instructions in Prerequisites.
  • Registry Authentication Error (401): If `uv` fails with 401 Unauthorized for a private registry, set `UV_INDEX_URL=https://pypi.org/simple` before running the script.
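Two of the pandas-related items above (checking column names before filtering, and converting a boolean mask before using `.iloc`) can be illustrated on a toy table. The DataFrame below is a made-up stand-in for a `tidy_scores` result; its gene names and score values are placeholders, not real predictions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a tidy_scores table; values are invented for illustration.
df = pd.DataFrame({
    "gene_name": ["GENE_A", "GENE_B", "GENE_C"],
    "output_type": ["RNA_SEQ", "ATAC", "RNA_SEQ"],
    "raw_score": [0.8, -0.1, 0.3],
})

# Inspect the columns before filtering instead of assuming their names.
required = {"gene_name", "output_type"}
missing = required - set(df.columns)
if missing:
    raise KeyError(f"Missing columns {sorted(missing)}; available: {list(df.columns)}")

# Build a boolean mask, then convert it to integer positions for .iloc.
mask = df["output_type"] == "RNA_SEQ"
positions = np.flatnonzero(mask.to_numpy())
rna_rows = df.iloc[positions]
print(rna_rows["gene_name"].tolist())
```

When positional semantics are not needed, label-based selection with `df.loc[mask]` or plain `df[mask]` avoids the issue entirely.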

References

  • alphagenome-api.md: API reference and code patterns
  • interpretation-guide.md: Interpretation guide, score magnitude rules, ISM, and checklist
  • report-templates.md: Full report templates
  • `scripts/visualize_variant_effects.py`: Single-variant visualization template (Ref/Alt comparisons, Splicing).
    • Splicing Zoom Strategy: Uses a Hybrid Approach for optimal visibility:
      1. Base Interval: Variant +/- 1 downstream and upstream exon (Structural Context).
      2. Junction Expansion: Expands to include the full span of any significant splicing junction (e.g., exon skipping events that span multiple exons).
      3. Anchor Enforcement: Ensures the exons anchoring these long junctions are fully visible.
    • Lesson: Simple fixed windows (e.g., 2kb) or nearest-exon logic often fail for skipping events. Always use the observed junction data to drive zoom levels.
  • `examples/splicing/`: Splicing analysis examples
  • `examples/model_limitation_RNU4ATAC/`: ncRNA structure limitation case study
  • `examples/polyadenylation_HBA2/`: 3' UTR / Polyadenylation case study
  • `examples/regulatory/`: Regulatory variant examples
  • `examples/negative_result_GATA4/`: Negative results (mathematical artefact)
  • `examples/negative_result_TGFB3/`: Negative results (proxies)
  • `scripts/lookup_gene_info.py`: Gene & transcript lookup
  • `scripts/resolve_ontology_terms.py`: Ontology term resolution (UBERON/CL IDs)
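The three-step Splicing Zoom Strategy listed above can be sketched in plain Python. This is not the code in `scripts/visualize_variant_effects.py`; the function name `hybrid_zoom`, the `(start, end)` tuple representation, and the assumption that `junctions` has already been filtered to significant ones are all hypothetical choices made for illustration. "Upstream" and "downstream" are taken here in genomic-coordinate order, ignoring strand.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) with start < end


def hybrid_zoom(variant_pos: int, exons: List[Span], junctions: List[Span]) -> Span:
    """Return a plotting window following the three-step hybrid strategy."""
    exons = sorted(exons)

    # 1. Base interval: the nearest exon on each side of the variant.
    upstream = [e for e in exons if e[1] <= variant_pos]
    downstream = [e for e in exons if e[0] > variant_pos]
    start = upstream[-1][0] if upstream else variant_pos
    end = downstream[0][1] if downstream else variant_pos

    # 2. Junction expansion: cover the full span of every supplied junction.
    for j_start, j_end in junctions:
        start = min(start, j_start)
        end = max(end, j_end)

    # 3. Anchor enforcement: fully include any exon that overlaps or abuts
    #    the window, so exons anchoring long junctions are not cut off.
    touching = [e for e in exons if e[1] >= start and e[0] <= end]
    start = min([start] + [e[0] for e in touching])
    end = max([end] + [e[1] for e in touching])
    return start, end


# Five toy exons; the variant sits inside the second one.
toy_exons = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
base_only = hybrid_zoom(350, toy_exons, junctions=[])
with_skip = hybrid_zoom(350, toy_exons, junctions=[(400, 900)])
print(base_only, with_skip)
```

In the toy case, a junction from the end of exon 2 to the start of exon 5 widens the window from the flanking exons alone to the full span including the distal anchoring exon, which a fixed-size window around the variant would miss.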

Code Patterns

Broad Discovery Scan

Use `score_variant` across differential scorers only to discover unexpected tissue effects.

```python
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
from alphagenome.data import genome
import os
import pandas as pd

# Setup API Key and Client
dna_model = dna_client.create(
    api_key=os.environ.get('ALPHAGENOME_API_KEY'),
    address='dns:///gdmscience.googleapis.com:443',
)

# Define Variant (example; placeholder coordinates, replace with the real variant)
variant_str = "chr2:1234:A>C"
chrom, pos_str, ref_alt = variant_str.split(':')
ref, alt = ref_alt.split('>')
pos = int(pos_str)

# Use supported sequence length (e.g., 2**20 for optimal performance)
SEQ_LENGTH = 2**20
interval = genome.Interval(chrom, pos - SEQ_LENGTH // 2, pos + SEQ_LENGTH // 2)
variant = genome.Variant(chrom, pos, ref, alt)

# Drop scorers whose names contain ACTIVE, CAGE, or PROCAP
scorers = [
    variant_scorers.RECOMMENDED_VARIANT_SCORERS[m]
    for m in variant_scorers.RECOMMENDED_VARIANT_SCORERS
    if "ACTIVE" not in m and "CAGE" not in m and "PROCAP" not in m
]

print(f"Scoring variant {variant_str}...")
scores_list = dna_model.score_variant(interval=interval, variant=variant, variant_scorers=scorers)

# Process and Display Results
all_dfs = []
for score_adata in scores_list:
    df = variant_scorers.tidy_scores([score_adata], match_gene_strand=True)
    if df is not None:
        all_dfs.append(df)

if all_dfs:
    df = pd.concat(all_dfs)
    # Keep hits in the extreme tails, then rank by effect size
    significant = df[df['quantile_score'].abs() > 0.995]
    ranked = significant.sort_values('raw_score', key=abs, ascending=False)
    print("Top Significant Hits:")
    print(ranked[['biosample_name', 'gene_name', 'output_type', 'quantile_score', 'raw_score']])
```
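The filter-then-rank step of the discovery scan can be tried offline on synthetic data. The sketch below wraps it in a hypothetical helper, `top_hits`, and runs it on a toy DataFrame whose values are invented; only the column names and the 0.995 threshold come from the pattern above.

```python
import pandas as pd


def top_hits(df: pd.DataFrame, quantile_threshold: float = 0.995, n: int = 10) -> pd.DataFrame:
    """Keep rows with |quantile_score| above the threshold, ranked by |raw_score|."""
    significant = df[df["quantile_score"].abs() > quantile_threshold]
    return significant.sort_values("raw_score", key=abs, ascending=False).head(n)


# Invented scores: 'lung' has the largest raw score but a weak quantile score.
toy = pd.DataFrame({
    "biosample_name": ["liver", "heart", "lung", "brain"],
    "gene_name": ["GENE_A"] * 4,
    "output_type": ["RNA_SEQ"] * 4,
    "quantile_score": [0.999, -0.998, 0.50, 0.996],
    "raw_score": [0.2, -1.5, 3.0, 0.7],
})

hits = top_hits(toy)
print(hits["biosample_name"].tolist())
```

Filtering on `quantile_score` before sorting on `raw_score` matters here: the row with the largest raw effect is excluded because it is not extreme relative to its own track's background.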

Extended Search for Disease-Relevant Tissues

```python
# Continues from the discovery scan: `df` is the concatenated tidy_scores table.

# Define keywords based on disease context
disease_keywords = ["liver", "hepatocyte"]

# Filter for any match
mask = df['biosample_name'].str.contains('|'.join(disease_keywords), case=False, na=False)

relevant_hits = df[mask].sort_values('raw_score', key=abs, ascending=False)
print(f"\n--- Extended Analysis (Keywords: {disease_keywords}) ---")
print(relevant_hits.head(20)[['biosample_name', 'output_type', 'raw_score', 'quantile_score']])
```
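The keyword search for disease-relevant tissues joins the keywords into a regular expression, so a keyword containing characters such as `(` or `+` would be interpreted as regex syntax. A hedged variant, using a hypothetical helper `filter_by_keywords` and invented toy data, escapes each keyword first:

```python
import re

import pandas as pd


def filter_by_keywords(df: pd.DataFrame, keywords, column: str = "biosample_name") -> pd.DataFrame:
    """Case-insensitive match on any keyword, ranked by absolute raw_score."""
    if not keywords:
        return df.iloc[0:0]  # an empty pattern would otherwise match every row
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df[column].str.contains(pattern, case=False, na=False, regex=True)
    return df[mask].sort_values("raw_score", key=abs, ascending=False)


# Invented rows, including a missing biosample name.
toy = pd.DataFrame({
    "biosample_name": ["Liver", "hepatocyte (derived)", "heart left ventricle", None, "HepG2"],
    "raw_score": [0.1, -0.9, 2.0, 0.5, 0.4],
})

relevant = filter_by_keywords(toy, ["liver", "hepatocyte"])
print(relevant["biosample_name"].tolist())
```

Note that cell-line names such as HepG2 are not matched by these keywords, so the keyword list should include any disease-relevant cell lines explicitly.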

Workflow Checklist

Variant Analysis Progress:
- [ ] Step 0: Review Golden Examples (MANDATORY)
- [ ] Step 1: Create Output Folder and Setup
- [ ] Step 2: Parse User Query & Research
- [ ] Step 3: Resolve Tissues & Modalities
- [ ] Step 4: Visualize & Save Plots
- [ ] Step 5: Analyze Predictions (view plots, no code). MANDATORY: Read [interpretation-guide.md](docs/interpretation-guide.md) before interpreting results.
- [ ] Step 6: Write Report, save it as `report.md` (MANDATORY)
- [ ] Step 7: Self-Critique (view `report.md` to verify links & claims)
- [ ] Step 8: Make artifact out of `report.md`

Multi-Variant Workflow

If multiple variants are specified, spawn sub-agents to run each variant analysis and then synthesize each `report.md` into a single report.

Script Reference

| Script | Purpose |
| --- | --- |
| `lookup_gene_info` | Comprehensive gene and transcript lookup using GTF data |
| `resolve_ontology_terms` | Biological terms → UBERON/CL/EFO IDs |
| `visualize_variant_effects` | REF/ALT visualization (expression, regulatory, splicing) |
| `analyze_ism` | In-Silico Mutagenesis SeqLogo generation |
| `interpret_splicing` | Quantitative splicing analysis (delta scores, junctions) |
| `visualize_genome_tracks` | Genomic track visualization for a region |