alphagenome-single-variant-analysis

Variant Analysis using AlphaGenome

Prerequisites

- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If `LICENSE_NOTIFICATION.txt` does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://deepmind.google.com/science/alphagenome/, then (2) create the file recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `ALPHAGENOME_API_KEY`: This skill requires an API key to function. You must ask the user for an API key if this skill looks relevant to their request and you do not have one in `.env`. The user can sign up at https://deepmind.google.com/science/alphagenome/. Do NOT ask the user to paste their key into the chat (this would leak the key into the agent's context). Instead, explain that a key is necessary to use AlphaGenome and give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:

  ```bash
  printf "Enter AlphaGenome API key (typing hidden): " && read -s key && echo && echo "ALPHAGENOME_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
  ```

  The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context.
- When running in a sandbox, `dotenv.load_dotenv()` will be a no-op; the sandbox reads the credentials and injects them directly.
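The User Notification step above can be sketched in Python. This is a minimal sketch under stated assumptions, not the skill's own implementation: the helper name `ensure_license_notification`, the message wording, and the layout of the recorded file are hypothetical. Only the file name `LICENSE_NOTIFICATION.txt` and the terms URL come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

TERMS_URL = "https://deepmind.google.com/science/alphagenome/"


def ensure_license_notification(skill_dir):
    """Notify once per skill directory; return True if the notice was just issued."""
    marker = Path(skill_dir) / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified earlier
    # Hypothetical wording: the source only requires pointing the user at the terms.
    message = f"Please review the AlphaGenome terms at {TERMS_URL} before use."
    print(message)  # step (1): surface the notice to the user
    timestamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and timestamp
    marker.write_text(f"{message}\nNotified at: {timestamp}\n", encoding="utf-8")
    return True
```

Calling the helper a second time for the same directory returns `False` and leaves the existing file untouched, so the notice is shown only once.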

Core Rules

- NEVER run `python3` or `python3 -c` directly. The system Python does not necessarily have pandas, numpy, and other key dependencies. ALWAYS use `uv run` to run ALL Python code, including scripts, ad-hoc analysis files, and one-liners. Do not attempt to `pip install` or create new venvs; `uv` manages an isolated environment automatically.
- Offline Only: NEVER use external APIs (e.g., MyGene.info, Ensembl REST) for gene/transcript lookup. Use `lookup_gene_info.py` with the local GTF. If it fails, fix the environment/paths; do not switch to external APIs.
- API Key is required: `ALPHAGENOME_API_KEY` must be set before running any script (in a sandbox, credentials are injected automatically).
- Notification: If this skill is used, state in the output that it was used.
- Report Format: Always use the templates in `docs/report-templates.md` when generating analysis reports, and include the table of top hits from the discovery scan.

Environment Setup & Troubleshooting

Python Environment

All scripts must be executed using `uv run`, which manages an isolated virtual environment with the correct dependencies via `uv`:

```bash
uv run <script_name> [args...]
```

For ad-hoc scripts (e.g., inline analysis code saved to a temp file), pass the full path instead of a short name:

```bash
uv run --project $SKILL_DIR /tmp/my_analysis.py --arg1 val1
```

> [!NOTE]
> The first invocation resolves and installs dependencies (~10s). Subsequent runs use the cached environment and start instantly. The cache lives in `~/.cache/uv/`.
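When the ad-hoc invocation above is assembled programmatically, building the argument list explicitly avoids shell-quoting mistakes. The sketch below is illustrative only: `uv_run_command` is a hypothetical helper, and `/path/to/skill` stands in for the resolved value of `$SKILL_DIR`.

```python
import shlex
from pathlib import Path


def uv_run_command(skill_dir, script, *args):
    """Build the argv for `uv run --project <skill_dir> <script> [args...]`."""
    script_path = Path(script)
    if not script_path.is_absolute():
        # Ad-hoc scripts should be passed by full path, per the note above.
        raise ValueError("pass the full path for ad-hoc scripts")
    return ["uv", "run", "--project", str(skill_dir), str(script_path), *map(str, args)]


cmd = uv_run_command("/path/to/skill", "/tmp/my_analysis.py", "--arg1", "val1")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```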

Common Issues

- Column Names: `tidy_scores` and metadata often use `gene_name` (not `gene_symbol`) and `output_type` (not `modality`). Always inspect `df.columns` before filtering.
- Large Genes: Genes > 500kb (e.g., `USH2A`) break the `whole_gene` view. Use `--view detail` or manual regional windows instead.
- Sashimi Strand Error: `plot_components.Sashimi` does NOT accept a `strand` argument directly. Filter input tracks instead.
- KeyError: 'ontology_curie': Not all tracks have `ontology_curie`. Check `track.metadata.columns` before filtering.
- Python Path: If `exec: "python": executable file not found` occurs, ensure you are using `uv run` instead of bare `python`/`python3`.
- NotImplementedError (pandas): "iLocation based boolean indexing on an integer type is not available". This occurs when using boolean masks with `.iloc` on integer-indexed DataFrames in newer pandas versions. Fix: Convert boolean masks to integer indices using `np.flatnonzero(mask)`.
- GTF Feather Case Sensitivity: The AlphaGenome GTF Feather file uses capitalized column names (`Feature`, `Start`, `End`, `Strand`), unlike standard GTF files. Always check `df.columns` if you get KeyErrors.
- `score_variant` ontology filtering: `score_variant` does NOT accept `ontology_terms` as an argument. You must filter the returned AnnData objects manually by inspecting `adata.var` columns. In contrast, `predict_variant` DOES accept `ontology_terms` directly.
- Sashimi Zoom Logic: To ensure "skipping" arcs are visible, expand the zoom to include the flanking exons rather than relying on junction overlap alone.
- Junction Scores: Raw `Junction` objects from `prediction` may be simple Intervals. Use `junction_data.get_junctions_to_plot(predictions=..., name=...)` to retrieve objects with the `.k` (abundance/score) attribute.
- `uv` Not Found: If you see `exec: uv: not found`, follow the installation instructions in Prerequisites.
- Registry Authentication Error (401): If `uv` fails with 401 Unauthorized for a private registry, set `UV_INDEX_URL=https://pypi.org/simple` before running the script.
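Several of the issues above (column names, GTF Feather capitalization, the missing `ontology_curie` key) come down to checking which columns exist before filtering. The following is a small sketch of that check; `pick_column` is a hypothetical helper, not part of AlphaGenome or pandas, and it accepts any iterable of names such as `df.columns` or `adata.var.columns`.

```python
def pick_column(columns, candidates):
    """Return the first candidate present in `columns`, matching case-insensitively."""
    by_lower = {str(c).lower(): c for c in columns}
    for name in candidates:
        if name.lower() in by_lower:
            return by_lower[name.lower()]
    raise KeyError(f"none of {candidates} found; available columns: {list(columns)}")


# Column lists below mirror the names mentioned in the issues above.
tidy_cols = ["biosample_name", "gene_name", "output_type", "raw_score"]
gtf_cols = ["Feature", "Start", "End", "Strand"]
print(pick_column(tidy_cols, ["gene_symbol", "gene_name"]))  # gene_name
print(pick_column(gtf_cols, ["strand"]))                      # Strand
```

Raising a `KeyError` that lists the available columns makes the failure self-explanatory, instead of surfacing later as an opaque lookup error inside a filter expression.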

References

- alphagenome-api.md: API reference and code patterns.
- interpretation-guide.md: Interpretation guide, score magnitude rules, ISM, and checklist.
- report-templates.md: Full report templates.
- `scripts/visualize_variant_effects.py`: Single-variant visualization template (Ref/Alt comparisons, Splicing).
  - Splicing Zoom Strategy: Uses a hybrid approach for optimal visibility:
    1. Base Interval: The variant plus one exon upstream and one exon downstream (structural context).
    2. Junction Expansion: Expands to include the full span of any significant splicing junction (e.g., exon skipping events that span multiple exons).
    3. Anchor Enforcement: Ensures the exons anchoring these long junctions are fully visible.
  - Lesson: Simple fixed windows (e.g., 2kb) or nearest-exon logic often fail for skipping events. Always use the observed junction data to drive zoom levels.
- `examples/splicing/`: Splicing analysis examples.
- `examples/model_limitation_RNU4ATAC/`: ncRNA structure limitation case study.
- `examples/polyadenylation_HBA2/`: 3' UTR / Polyadenylation case study.
- `examples/regulatory/`: Regulatory variant examples.
- `examples/negative_result_GATA4/`: Negative results (mathematical artefact).
- `examples/negative_result_TGFB3/`: Negative results (proxies).
- `scripts/lookup_gene_info.py`: Gene & transcript lookup.
- `scripts/resolve_ontology_terms.py`: Ontology term resolution (UBERON/CL IDs).
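The three-step Splicing Zoom Strategy described for `scripts/visualize_variant_effects.py` can be sketched on plain tuples. This is a rough illustration, not the script's actual code: the function name `splicing_zoom`, the tuple layouts, the `min_score` threshold, and the rule that only junctions overlapping the base interval are considered are all assumptions.

```python
def splicing_zoom(variant_pos, exons, junctions, min_score):
    """Return (start, end) for a zoom window.

    exons: (start, end) tuples; junctions: (start, end, score) tuples.
    """
    exons = sorted(exons)
    # 1. Base interval: nearest exon upstream and downstream of the variant.
    upstream = [e for e in exons if e[1] < variant_pos]
    downstream = [e for e in exons if e[0] > variant_pos]
    base_start = upstream[-1][0] if upstream else variant_pos
    base_end = downstream[0][1] if downstream else variant_pos
    start, end = base_start, base_end
    # 2. Junction expansion: cover the full span of significant junctions
    #    that overlap the base interval (assumed relevance rule).
    for j_start, j_end, score in junctions:
        if score >= min_score and j_start <= base_end and j_end >= base_start:
            start, end = min(start, j_start), max(end, j_end)
    # 3. Anchor enforcement: fully include exons touching the expanded window.
    win_start, win_end = start, end
    for e_start, e_end in exons:
        if e_start <= win_end and e_end >= win_start:
            start, end = min(start, e_start), max(end, e_end)
    return start, end


exons = [(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
skipping = [(400, 900, 0.8), (800, 900, 0.01)]  # one strong skip, one weak junction
print(splicing_zoom(350, exons, skipping, min_score=0.5))  # (100, 1000)
```

With no junction data the window stays at the base interval `(100, 600)`, which would cut off the exon at 900 to 1000 that anchors the skipping arc; this is the failure mode the lesson above warns about.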

Code Patterns

Broad Discovery Scan

Use `score_variant` across differential scorers only to discover unexpected tissue effects.

```python
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
from alphagenome.data import genome
import os
import pandas as pd

# Setup API Key and Client
dna_model = dna_client.create(api_key=os.environ.get('ALPHAGENOME_API_KEY'), address='dns:///gdmscience.googleapis.com:443')

# Define Variant (example)
variant_str = "chr2:1234:A>C"
chrom, pos_str, ref_alt = variant_str.split(':')
ref, alt = ref_alt.split('>')
pos = int(pos_str)

# Use supported sequence length (e.g., 2**20 for optimal performance)
SEQ_LENGTH = 2**20
interval = genome.Interval(chrom, pos - SEQ_LENGTH // 2, pos + SEQ_LENGTH // 2)
variant = genome.Variant(chrom, pos, ref, alt)

scorers = [
    variant_scorers.RECOMMENDED_VARIANT_SCORERS[m]
    for m in variant_scorers.RECOMMENDED_VARIANT_SCORERS
    if "ACTIVE" not in m and "CAGE" not in m and "PROCAP" not in m
]

print(f"Scoring variant {variant_str}...")
scores_list = dna_model.score_variant(interval=interval, variant=variant, variant_scorers=scorers)

# Process and Display Results
all_dfs = []
for score_adata in scores_list:
    df = variant_scorers.tidy_scores([score_adata], match_gene_strand=True)
    if df is not None:
        all_dfs.append(df)

if all_dfs:
    df = pd.concat(all_dfs)
    significant = df[df['quantile_score'].abs() > 0.995]
    ranked = significant.sort_values('raw_score', key=abs, ascending=False)
    print("Top Significant Hits:")
    print(ranked[['biosample_name', 'gene_name', 'output_type', 'quantile_score', 'raw_score']])
```
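The variant parsing and window arithmetic in the discovery scan can be isolated and validated before any API call is made. The sketch below uses only the standard library; `ParsedVariant`, `parse_variant`, and `centered_window` are hypothetical helpers, and the `chrom:pos:REF>ALT` format is taken from the example string above.

```python
from typing import NamedTuple


class ParsedVariant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant(variant_str):
    """Parse 'chrom:pos:REF>ALT' and reject malformed input early."""
    try:
        chrom, pos_str, ref_alt = variant_str.split(":")
        ref, alt = ref_alt.split(">")
        pos = int(pos_str)
    except ValueError as err:
        raise ValueError(f"expected 'chrom:pos:REF>ALT', got {variant_str!r}") from err
    if pos < 1 or not ref or not alt:
        raise ValueError(f"invalid position or allele in {variant_str!r}")
    return ParsedVariant(chrom, pos, ref.upper(), alt.upper())


def centered_window(pos, seq_length=2**20):
    """Return (start, end) of width seq_length centered on pos."""
    half = seq_length // 2
    return pos - half, pos + half


v = parse_variant("chr2:1234:A>C")
start, end = centered_window(v.pos)
print(v, start, end)
```

For the placeholder position `1234` the computed start is negative, so a real analysis needs an actual variant coordinate; checking `start >= 0` before building the interval is a cheap sanity test.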

Extended Search for Disease-Relevant Tissues

```python
# Define keywords based on disease context
disease_keywords = ["liver", "hepatocyte"]

# Filter for any match
mask = df['biosample_name'].str.contains('|'.join(disease_keywords), case=False, na=False)

relevant_hits = df[mask].sort_values('raw_score', key=abs, ascending=False)
print(f"\n--- Extended Analysis (Keywords: {disease_keywords}) ---")
print(relevant_hits.head(20)[['biosample_name', 'output_type', 'raw_score', 'quantile_score']])
```
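Because the keyword filter joins terms with `|` and `str.contains` treats the result as a regular expression by default, a keyword containing regex metacharacters (for example `CD4+`) would match more than intended. A minimal sketch of escaping the keywords first, shown with the standard `re` module; the biosample names are made up for illustration.

```python
import re


def keyword_pattern(keywords):
    """Join keywords into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(k) for k in keywords)


pattern = keyword_pattern(["liver", "hepatocyte", "CD4+"])
names = ["Liver", "HepG2", "CD4+ T cell", "CD4 monocyte-like", "hepatocyte"]
hits = [n for n in names if re.search(pattern, n, flags=re.IGNORECASE)]
print(hits)  # ['Liver', 'CD4+ T cell', 'hepatocyte']
```

The same pattern string can be passed to `df['biosample_name'].str.contains(pattern, case=False, na=False)` in place of the plain `'|'.join(...)` expression.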

Workflow Checklist

Variant Analysis Progress:
- [ ] Step 0: Review Golden Examples (MANDATORY)
- [ ] Step 1: Create Output Folder and Setup
- [ ] Step 2: Parse User Query & Research
- [ ] Step 3: Resolve Tissues & Modalities
- [ ] Step 4: Visualize & Save Plots
- [ ] Step 5: Analyze Predictions (view plots, no code). MANDATORY: Read [interpretation-guide.md](docs/interpretation-guide.md) before interpreting results.
- [ ] Step 6: Write Report, save it as `report.md` (MANDATORY)
- [ ] Step 7: Self-Critique (view `report.md` to verify links & claims)
- [ ] Step 8: Make artifact out of `report.md`

Multi-Variant Workflow

If multiple variants are specified, spawn one sub-agent per variant to run its analysis, then synthesize the resulting `report.md` files into a single report.

Script Reference

| Script | Purpose |
| --- | --- |
| `lookup_gene_info` | Comprehensive gene and transcript lookup using GTF data |
| `resolve_ontology_terms` | Biological terms → UBERON/CL/EFO IDs |
| `visualize_variant_effects` | REF/ALT visualization (expression, regulatory, splicing) |
| `analyze_ism` | In-Silico Mutagenesis SeqLogo generation |
| `interpret_splicing` | Quantitative splicing analysis (delta scores, junctions) |
| `visualize_genome_tracks` | Genomic track visualization for a region |
