alphagenome-single-variant-analysis
Variant Analysis using AlphaGenome
Prerequisites
- `uv`: Read the `uv` skill and follow its Setup instructions to ensure `uv` is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in this skill directory, then (1) prominently notify the user to check the terms at https://deepmind.google.com/science/alphagenome/, then (2) create the file recording the notification text and timestamp.
- `.env` file: Make sure the `.env` file exists in your home directory. Create one if it does not exist.
- `ALPHAGENOME_API_KEY`: This skill requires an API key to function. You must ask the user for an API key if this skill looks relevant to their request and you do not have one in `.env`. The user can sign up at https://deepmind.google.com/science/alphagenome/. Do NOT ask the user to paste their key into the chat (this would leak the key into the agent's context). Instead, explain that a key is necessary to use AlphaGenome and give the user this command, substituting `ENV_FILE` with the resolved literal path to the `.env` file:

  ```bash
  printf "Enter AlphaGenome API key (typing hidden): " && read -s key && echo && echo "ALPHAGENOME_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
  ```

  The scripts load credentials automatically via `dotenv`. NEVER read, print, or inspect the `.env` file or its variables (e.g. no `cat`, `grep`, `echo`, `printenv`, or `os.environ.get` on keys). Credentials must stay out of the agent's context. When running in sandbox, `dotenv.load_dotenv()` will be a no-op, and instead the sandbox will read credentials and inject them directly.
Core Rules
- NEVER run `python3` or `python3 -c` directly. The system Python does not necessarily have pandas, numpy, and other key dependencies. ALWAYS use `uv run` to run ALL Python code, including scripts, ad-hoc analysis files, and one-liners. Do not attempt to `pip install` or create new venvs; `uv` manages an isolated environment automatically.
- Offline Only: NEVER use external APIs (e.g., MyGene.info, Ensembl REST) for gene/transcript lookup. Use `lookup_gene_info.py` with the local GTF. If it fails, fix the environment/paths, do not switch to external APIs.
- API Key is required: `ALPHAGENOME_API_KEY` must be set before running any script (in sandbox, credentials are injected automatically).
- Notification: If this skill is used, ensure this is mentioned in the output.
- Report Format: Always use the templates in `docs/report-templates.md` for generating analysis reports, and be sure to include the table of top hits from the discovery scan.
Environment Setup & Troubleshooting
Python Environment
All scripts must be executed using `uv run`, which manages an isolated virtual environment with the correct dependencies via `uv`.

```bash
uv run <script_name> [args...]
```

For ad-hoc scripts (e.g., inline analysis code saved to a temp file), pass the full path instead of a short name:

```bash
uv run --project $SKILL_DIR /tmp/my_analysis.py --arg1 val1
```

> [!NOTE]
> The first invocation resolves and installs dependencies (~10s). Subsequent runs use the cached environment and start instantly. The cache lives in `~/.cache/uv/`.
Common Issues
- Column Names: `tidy_scores` and metadata often use `gene_name` (not `gene_symbol`) and `output_type` (not `modality`). Always inspect `df.columns` before filtering.
- Large Genes: Genes > 500kb (e.g., `USH2A`) break the `whole_gene` view. Use `--view detail` or manual regional windows instead.
- Sashimi Strand Error: `plot_components.Sashimi` does NOT accept a `strand` argument directly. Filter input tracks instead.
- KeyError: 'ontology_curie': Not all tracks have `ontology_curie`. Check `track.metadata.columns` before filtering.
- Python Path: If `exec: "python": executable file not found` occurs, ensure you are using `uv run` instead of bare `python`/`python3`.
- NotImplementedError (pandas): "iLocation based boolean indexing on an integer type is not available". This occurs when using boolean masks with `.iloc` on integer-indexed DataFrames in newer pandas versions. Fix: Convert boolean masks to integer indices using `np.flatnonzero(mask)`.
- GTF Feather Case Sensitivity: The AlphaGenome GTF Feather file uses Capitalized column names (`Feature`, `Start`, `End`, `Strand`) unlike standard GTF files. Always check `df.columns` if getting KeyErrors.
- `score_variant` ontology filtering: `score_variant` does NOT accept `ontology_terms` as an argument. You must filter the returned AnnData objects manually by inspecting `adata.var` columns. In contrast, `predict_variant` DOES accept `ontology_terms` directly.
- Sashimi Zoom Logic: To ensure "skipping" arcs are visible, expand the zoom to include the flanking exons rather than relying on junction overlap alone.
- Junction Scores: Raw `Junction` objects from `prediction` may be simple Intervals. Use `junction_data.get_junctions_to_plot(predictions=..., name=...)` to retrieve objects with the `.k` (abundance/score) attribute.
- `uv` Not Found: If `exec: uv: not found` appears, follow the installation instructions in Prerequisites.
- Registry Authentication Error (401): If `uv` fails with 401 Unauthorized for a private registry, set `UV_INDEX_URL=https://pypi.org/simple` before running the script.
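
Two of the issues above (unexpected column names and the `.iloc` boolean-mask error) can be guarded against with small helpers. The following is a minimal sketch on a made-up toy table; `require_columns` and `select_rows` are hypothetical names for illustration, not part of AlphaGenome or of this skill's scripts.

```python
import numpy as np
import pandas as pd


def require_columns(df: pd.DataFrame, needed: list) -> None:
    """Fail early, listing the columns that are actually present."""
    missing = [c for c in needed if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns {missing}; available: {list(df.columns)}")


def select_rows(df: pd.DataFrame, mask) -> pd.DataFrame:
    """Positional selection that avoids passing a boolean Series to .iloc."""
    # Convert the boolean mask into integer positions first.
    positions = np.flatnonzero(np.asarray(mask, dtype=bool))
    return df.iloc[positions]


# Toy data (made up) standing in for a tidy score table.
scores = pd.DataFrame(
    {"gene_name": ["G1", "G2", "G3", "G4"], "raw_score": [0.1, -2.5, 0.7, 3.2]}
)
require_columns(scores, ["gene_name", "raw_score"])
strong = select_rows(scores, scores["raw_score"].abs() > 1.0)
print(strong)
```

Run it through `uv run` like any other ad-hoc script. Checking columns before filtering turns a bare `KeyError` into a message that shows which names the table really uses.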
References
- alphagenome-api.md: API reference and code patterns.
- interpretation-guide.md: Interpretation guide, score magnitude rules, ISM, and checklist.
- report-templates.md: Full report templates.
- `scripts/visualize_variant_effects.py`: Single-variant visualization template (Ref/Alt comparisons, Splicing).
  - Splicing Zoom Strategy: Uses a Hybrid Approach for optimal visibility:
    - Base Interval: Variant +/- 1 downstream and upstream exon (Structural Context).
    - Junction Expansion: Expands to include the full span of any significant splicing junction (e.g., exon skipping events that span multiple exons).
    - Anchor Enforcement: Ensures the exons anchoring these long junctions are fully visible.
  - Lesson: Simple fixed windows (e.g., 2kb) or nearest-exon logic often fail for skipping events. Always use the observed junction data to drive zoom levels.
- `examples/splicing/`: Splicing analysis examples.
- `examples/model_limitation_RNU4ATAC/`: ncRNA structure limitation case study.
- `examples/polyadenylation_HBA2/`: 3' UTR / Polyadenylation case study.
- `examples/regulatory/`: Regulatory variant examples.
- `examples/negative_result_GATA4/`: Negative results (mathematical artefact).
- `examples/negative_result_TGFB3/`: Negative results (proxies).
- `scripts/lookup_gene_info.py`: Gene & transcript lookup.
- `scripts/resolve_ontology_terms.py`: Ontology term resolution (UBERON/CL IDs).
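
The Splicing Zoom Strategy listed under `scripts/visualize_variant_effects.py` can be sketched in plain Python. This is an illustrative reading of the three steps, not the script's actual code: `hybrid_zoom`, the tuple layouts, the `min_score` threshold, and the rule that only junctions overlapping the base interval are considered are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]             # (start, end), end exclusive
Junction = Tuple[int, int, float]  # (start, end, score)


def hybrid_zoom(variant_pos: int, exons: List[Span],
                junctions: List[Junction], min_score: float = 0.5) -> Span:
    """Return a plotting window built from exon structure and junction data."""
    exons = sorted(exons)

    # 1. Base interval: nearest upstream exon, any exon containing the
    #    variant, and nearest downstream exon.
    upstream = [e for e in exons if e[1] <= variant_pos]
    containing = [e for e in exons if e[0] <= variant_pos < e[1]]
    downstream = [e for e in exons if e[0] > variant_pos]
    context = upstream[-1:] + containing + downstream[:1]
    start = min([variant_pos] + [e[0] for e in context])
    end = max([variant_pos + 1] + [e[1] for e in context])

    # 2. Junction expansion: cover the full span of each significant junction
    #    that overlaps the base interval (single pass, assumed rule).
    kept = [j for j in junctions
            if j[2] >= min_score and j[0] < end and j[1] > start]
    for j_start, j_end, _ in kept:
        start, end = min(start, j_start), max(end, j_end)

    # 3. Anchor enforcement: fully include exons touching a junction endpoint.
    for j_start, j_end, _ in kept:
        for ex_start, ex_end in exons:
            if ex_start <= j_start <= ex_end or ex_start <= j_end <= ex_end:
                start, end = min(start, ex_start), max(end, ex_end)
    return start, end


# Toy transcript (made-up coordinates) with one exon-skipping junction.
toy_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
skip = [(400, 700, 0.9)]  # joins exon 2 to exon 4, skipping exon 3
print(hybrid_zoom(350, toy_exons, skip))
```

In the toy case the base interval alone would stop at the end of the third exon, so the acceptor exon of the skipping junction would be cut off; the junction-driven steps extend the window to include it.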
Code Patterns
Broad Discovery Scan
Use `score_variant` across differential scorers only to discover unexpected tissue effects.

```python
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
from alphagenome.data import genome
import os
import pandas as pd

# Setup API Key and Client
dna_model = dna_client.create(api_key=os.environ.get('ALPHAGENOME_API_KEY'),
                              address='dns:///gdmscience.googleapis.com:443')

# Define Variant (example)
variant_str = "chr2:1234:A>C"
chrom, pos_str, ref_alt = variant_str.split(':')
ref, alt = ref_alt.split('>')
pos = int(pos_str)

# Use supported sequence length (e.g., 2**20 for optimal performance)
SEQ_LENGTH = 2**20
interval = genome.Interval(chrom, pos - SEQ_LENGTH // 2, pos + SEQ_LENGTH // 2)
variant = genome.Variant(chrom, pos, ref, alt)

scorers = [
    variant_scorers.RECOMMENDED_VARIANT_SCORERS[m]
    for m in variant_scorers.RECOMMENDED_VARIANT_SCORERS
    if "ACTIVE" not in m and "CAGE" not in m and "PROCAP" not in m
]

print(f"Scoring variant {variant_str}...")
scores_list = dna_model.score_variant(interval=interval, variant=variant, variant_scorers=scorers)

# Process and Display Results
all_dfs = []
for score_adata in scores_list:
    df = variant_scorers.tidy_scores([score_adata], match_gene_strand=True)
    if df is not None:
        all_dfs.append(df)

if all_dfs:
    df = pd.concat(all_dfs)
    significant = df[df['quantile_score'].abs() > 0.995]
    ranked = significant.sort_values('raw_score', key=abs, ascending=False)
    print("Top Significant Hits:")
    print(ranked[['biosample_name', 'gene_name', 'output_type', 'quantile_score', 'raw_score']])
```
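
The scan splits the variant string and centers a fixed-length window on the position. A minimal sketch of the same two steps with basic validation follows; `parse_variant` and `centered_window` are hypothetical helpers, not AlphaGenome functions, and the `chrom:pos:REF>ALT` format is taken from the example string.

```python
from typing import Tuple


def parse_variant(variant_str: str) -> Tuple[str, int, str, str]:
    """Split 'chrom:pos:REF>ALT' into its parts, with a readable error."""
    try:
        chrom, pos_str, ref_alt = variant_str.split(":")
        ref, alt = ref_alt.split(">")
        pos = int(pos_str)
    except ValueError as err:
        raise ValueError(
            f"Expected 'chrom:pos:REF>ALT', got {variant_str!r}"
        ) from err
    if pos < 1 or not ref or not alt:
        raise ValueError(f"Invalid position or alleles in {variant_str!r}")
    return chrom, pos, ref, alt


def centered_window(pos: int, length: int) -> Tuple[int, int]:
    """Return (start, end) of a window of `length` centered on `pos`."""
    half = length // 2
    return pos - half, pos + half


chrom, pos, ref, alt = parse_variant("chr2:1234:A>C")
start, end = centered_window(pos, 2**20)
print(chrom, pos, ref, alt, start, end)
```

With the placeholder position 1234 the computed start is negative, which is worth checking before building an interval from a real variant near a chromosome end. The sketch only surfaces the value; how the client treats such an interval is not covered here.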
Extended Search for Disease-Relevant Tissues

```python
# Continues from the discovery scan: `df` is the concatenated tidy score table.

# Define keywords based on disease context
disease_keywords = ["liver", "hepatocyte"]

# Filter for any match
mask = df['biosample_name'].str.contains('|'.join(disease_keywords), case=False, na=False)
relevant_hits = df[mask].sort_values('raw_score', key=abs, ascending=False)

print(f"\n--- Extended Analysis (Keywords: {disease_keywords}) ---")
print(relevant_hits.head(20)[['biosample_name', 'output_type', 'raw_score', 'quantile_score']])
```
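
The keyword filter joins terms with `|`, so `str.contains` treats them as a regular expression. A minimal sketch of the same filter with the keywords escaped is shown on a made-up toy table; `filter_by_keywords` is a hypothetical helper and the biosample names are invented for illustration.

```python
import re

import pandas as pd


def filter_by_keywords(df: pd.DataFrame, keywords, column: str = "biosample_name"):
    """Keep rows whose `column` matches any keyword, ranked by |raw_score|."""
    # Escape each keyword so characters like '(' or '+' match literally.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df[column].str.contains(pattern, case=False, na=False, regex=True)
    return df[mask].sort_values("raw_score", key=lambda s: s.abs(), ascending=False)


# Toy data (made up); the None row shows that missing names are skipped.
toy = pd.DataFrame({
    "biosample_name": ["Liver", "HepG2 (hepatocyte-derived)", "heart left ventricle", None],
    "raw_score": [0.2, -1.5, 3.0, 0.9],
})
hits = filter_by_keywords(toy, ["liver", "hepatocyte"])
print(hits)
```

Escaping matters only when a keyword contains regex metacharacters; for plain words such as "liver" the result matches the unescaped filter.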
Workflow Checklist
Variant Analysis Progress:
- [ ] Step 0: Review Golden Examples (MANDATORY)
- [ ] Step 1: Create Output Folder and Setup
- [ ] Step 2: Parse User Query & Research
- [ ] Step 3: Resolve Tissues & Modalities
- [ ] Step 4: Visualize & Save Plots
- [ ] Step 5: Analyze Predictions (view plots, no code). MANDATORY: Read [interpretation-guide.md](docs/interpretation-guide.md) before interpreting results.
- [ ] Step 6: Write Report, save it as `report.md` (MANDATORY)
- [ ] Step 7: Self-Critique (view `report.md` to verify links & claims)
- [ ] Step 8: Make artifact out of `report.md`

Multi-Variant Workflow
If multiple variants are specified, spawn sub-agents to run each variant analysis and then synthesize the results into a single `report.md` report.

Script Reference
| Script | Purpose |
|---|---|
| `lookup_gene_info.py` | Comprehensive gene and transcript lookup using GTF data |
| `resolve_ontology_terms.py` | Biological terms → UBERON/CL/EFO IDs |
| `visualize_variant_effects.py` | REF/ALT visualization (expression, regulatory, splicing) |
| … | In-Silico Mutagenesis SeqLogo generation |
| … | Quantitative splicing analysis (delta scores, junctions) |
| … | Genomic track visualization for a region |