tooluniverse-comparative-genomics
Comparative Genomics & Ortholog Analysis
Cross-species gene comparison, ortholog identification, sequence retrieval, and functional conservation analysis integrating Ensembl Compara, NCBI, UniProt, OLS, Monarch, and OpenTargets.
LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.
COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
When to Use This Skill
Triggers:
- "Find the mouse ortholog of [human gene]"
- "Compare [gene] across species"
- "Is [gene] conserved in [organism]?"
- "What are the orthologs of [gene]?"
- "Cross-species comparison of [gene/protein]"
- "Evolutionary conservation of [gene]"
- "Compare GO annotations between human and mouse [gene]"
Use Cases:
- Ortholog Discovery: Find equivalent genes in other species for a human gene
- Conservation Analysis: Assess how conserved a gene is across evolutionary distance
- Functional Comparison: Compare GO terms, domains, and annotations across orthologs
- Model Organism Selection: Determine which model organism best recapitulates human gene function
- Gene Tree Analysis: Visualize evolutionary history of a gene family
- Cross-Species Phenotype Bridging: Link human disease phenotypes to model organism phenotypes via orthologs
Conservation Reasoning Framework
Understanding conservation requires distinguishing between types of evolutionary patterns and what they imply about function.
High conservation signals functional constraint. When a gene is maintained as a 1:1 ortholog from yeast to humans, purifying selection has prevented sequence divergence, which indicates that the gene's function is important and not easily altered. Highly conserved positions (PhastCons > 0.8, or GERP RS > 4, both scored on the underlying genomic sequence) are under strong constraint; mutations at these positions are disproportionately pathogenic. For non-coding regions, conservation in mammals at PhastCons > 0.5 suggests a candidate regulatory element.
Low conservation in one lineage has two possible explanations: relaxed selection or positive selection. Use the dN/dS ratio (nonsynonymous to synonymous substitution rate) to distinguish them. A dN/dS ratio near 1 suggests neutral evolution — the gene is no longer under purifying selection (relaxed constraint, possibly reflecting loss of function in that lineage). A dN/dS ratio > 1 indicates positive selection — the gene is diverging faster than neutral expectation, often because it is adapting to a new environment or function. A dN/dS ratio << 1 is the signature of purifying selection (functional constraint). When a vertebrate gene shows high divergence in a specific branch of the tree, ask which explanation applies before concluding that function is lost.
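The dN/dS reasoning above can be sketched as a small helper. This is a minimal illustration, not a substitute for a codon-model tool: `dnds_ratio` uses raw per-site proportions with no multiple-hit correction, and the `neutral_band` tolerance around 1 is an assumed cutoff chosen for the example, not a value from this skill.

```python
def dnds_ratio(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Crude dN/dS from raw counts (assumption: no multiple-hit correction)."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        raise ValueError("dS is zero; dN/dS is undefined")
    return dn / ds


def classify_dnds(dnds, neutral_band=0.2):
    """Map a dN/dS ratio to a selection regime.

    neutral_band is a hypothetical tolerance: ratios within
    1 +/- neutral_band are treated as "near 1".
    """
    if dnds < 0:
        raise ValueError("dN/dS cannot be negative")
    if dnds > 1 + neutral_band:
        return "positive selection"
    if dnds >= 1 - neutral_band:
        return "neutral evolution (relaxed constraint)"
    return "purifying selection"
```

A ratio that falls in the neutral band on a single branch is a prompt to look closer (for example, for premature stop codons in that lineage), not a conclusion on its own.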
Ortholog relationship type shapes interpretation. A 1:1 ortholog (one gene in human, one in mouse) is the highest-confidence functional equivalent — it has not been duplicated in either lineage, so it most likely performs the same ancestral role. A 1:many relationship (one gene in human, multiple in mouse) means the target species has duplicated the gene; the copies may have subfunctionalized (each copy performs a subset of the original roles) or neofunctionalized (one copy gained a new role). Do not assume both copies retain full ancestral function. A many:many relationship reflects complex duplication history in both species and requires analyzing each paralog pair individually.
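As a rough sketch of how relationship types might be triaged in code: the type strings below follow the homology labels used by Ensembl Compara (`ortholog_one2one`, `ortholog_one2many`, `ortholog_many2many`), while the record shape (`type`, `species`, `gene_id` keys) and the gene IDs are hypothetical placeholders, not the actual tool output.

```python
from collections import defaultdict

# Interpretation notes taken from the framework paragraph above.
INTERPRETATION = {
    "ortholog_one2one": "1:1, highest-confidence functional equivalent",
    "ortholog_one2many": "1:many, check copies for sub- or neofunctionalization",
    "ortholog_many2many": "many:many, analyze each paralog pair individually",
}


def group_orthologs(homologies):
    """Group homology records by interpretation; non-ortholog types are skipped."""
    grouped = defaultdict(list)
    for record in homologies:
        label = INTERPRETATION.get(record["type"])
        if label is not None:
            grouped[label].append((record["species"], record["gene_id"]))
    return dict(grouped)


# Placeholder records (hypothetical shape and IDs, for illustration only).
example = [
    {"type": "ortholog_one2one", "species": "mus_musculus", "gene_id": "MOUSE_GENE_1"},
    {"type": "ortholog_one2many", "species": "danio_rerio", "gene_id": "FISH_GENE_1"},
    {"type": "ortholog_one2many", "species": "danio_rerio", "gene_id": "FISH_GENE_2"},
    {"type": "within_species_paralog", "species": "homo_sapiens", "gene_id": "HUMAN_GENE_2"},
]
grouped = group_orthologs(example)
```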
Conservation depth is a rough guide to essentiality. Conservation across all vertebrates suggests a fundamental cellular process. Conservation only within mammals suggests a more specialized mammalian innovation. A gene present only in primates or only in humans is likely a recent evolutionary acquisition, possibly involved in human-specific biology but often lacking the depth of functional characterization available for deeply conserved genes.
Absence of an ortholog is a finding, not an error. Lineage-specific genes exist and are biologically meaningful. Before concluding a gene is lineage-specific, check: (1) whether BLAST with relaxed thresholds finds distant homologs, (2) whether a highly divergent ortholog exists that Ensembl Compara missed, and (3) whether the gene belongs to a rapidly evolving family (immune genes, olfactory receptors, reproductive proteins) where turnover is expected.
Workflow Overview
Input (gene symbol/ID + reference species)
|
v
Phase 1: Gene Identification & Validation
|
v
Phase 2: Ortholog Discovery (Ensembl Compara + OpenTargets)
|
v
Phase 3: Sequence Retrieval (NCBI + Ensembl)
|
v
Phase 4: Functional Annotation Comparison (UniProt + OLS GO terms)
|
v
Phase 5: Cross-Species Phenotype Bridging (Monarch)
|
v
Phase 6: Gene Tree & Evolutionary Context (Ensembl Compara)
|
v
Report: Conservation summary, ortholog evidence, functional comparison, phenotype bridging
Phase 1: Gene Identification & Validation
Tool: ensembl_lookup_gene (gene_id, species; e.g., species="homo_sapiens").
Phase 2: Ortholog Discovery
Tools: EnsemblCompara_get_orthologues (gene, species, target_species, target_taxon), ensembl_get_homology (sequence="protein", aligned=true), OpenTargets_get_target_homologues_by_ensemblID (ensemblId).
Reasoning: Prioritize 1:1 orthologs as high-confidence functional equivalents. For 1:many cases, report all copies and flag the need for paralog-specific functional analysis. If no Ensembl Compara entry exists, try BLAST as a last resort (note: BLAST protein search against swissprot is slow, 5-30 minutes; against nr may take longer).
Key model organisms to check: mouse (taxon 10090), rat (10116), zebrafish (7955), fruit fly (7227), C. elegans (6239), S. cerevisiae (4932).
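To keep track of which of the key model organisms returned an ortholog, a lookup table along these lines may help. The taxon IDs come from the list above; the function names and the idea of passing in a plain list of taxon IDs are assumptions made for this sketch.

```python
# NCBI taxon IDs for the key model organisms listed above.
MODEL_ORGANISMS = {
    10090: "mouse",
    10116: "rat",
    7955: "zebrafish",
    7227: "fruit fly",
    6239: "C. elegans",
    4932: "S. cerevisiae",
}


def ortholog_coverage(found_taxa):
    """Return {organism name: True/False} for each key model organism."""
    found = set(found_taxa)
    return {name: taxon in found for taxon, name in MODEL_ORGANISMS.items()}


def missing_organisms(found_taxa):
    """List key model organisms for which no ortholog was reported."""
    return [name for name, hit in ortholog_coverage(found_taxa).items() if not hit]
```

An organism that shows up in `missing_organisms` is a candidate for the fallback checks, not proof that the gene is absent in that lineage.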
Phase 3: Sequence Retrieval
Use NCBI_search_nucleotide (takes organism as full name, e.g., "Homo sapiens"; gene; seq_type = "mRNA") to find sequence records, then NCBI_fetch_accessions to convert UIDs to accession numbers, then NCBI_get_sequence to retrieve FASTA data. Prefer RefSeq (NM_* for mRNA, NP_* for protein) over other accessions for canonical sequence.
When aligned sequences are needed directly, ensembl_get_homology with sequence="cdna" or sequence="protein" is faster than running BLAST. Use BLAST only when Ensembl Compara does not find orthologs.
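The RefSeq preference can be applied as a simple prefix filter once accession numbers are in hand. This is a sketch under the assumption that the accessions arrive as a list of strings; `prefer_refseq` is a hypothetical helper, and falling back to the unfiltered list when no RefSeq record exists is a design choice made here so that less-studied organisms are not silently dropped.

```python
# Curated RefSeq prefixes named in the text above.
REFSEQ_PREFIX = {"mRNA": "NM_", "protein": "NP_"}


def prefer_refseq(accessions, seq_type="mRNA"):
    """Keep curated RefSeq accessions; fall back to all if none match."""
    prefix = REFSEQ_PREFIX[seq_type]
    curated = [acc for acc in accessions if acc.startswith(prefix)]
    return curated if curated else list(accessions)
```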
Phase 4: Functional Annotation Comparison
Tools: UniProt_search (e.g., "gene:TP53 AND organism_id:9606 AND reviewed:true"; fields; reviewed:true limits results to reviewed Swiss-Prot entries), UniProt_get_function_by_accession.
For each species being compared, retrieve GO terms and group them by Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Shared GO terms indicate conserved function; terms present in human but absent in the ortholog may reflect annotation bias (less-studied organisms have fewer GO annotations) rather than true functional divergence. Focus conservation claims on shared terms.
Reasoning about annotation gaps: If a mouse ortholog lacks a GO term present in the human protein, consider that this may reflect incomplete annotation of the mouse gene rather than functional divergence. The inverse (a GO term in mouse that is absent in human) is less common but can indicate diverged or acquired function.
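The GO comparison reduces to set operations per aspect. The sketch below assumes the GO terms have already been retrieved and grouped into dictionaries keyed by "BP", "MF", and "CC"; that input shape and the function name are assumptions. The GO IDs in the usage example are real terms (apoptotic process, DNA binding, nucleus) used only as illustrative sets, not as actual annotations of any gene.

```python
ASPECTS = ("BP", "MF", "CC")


def compare_go(human, ortholog):
    """Split GO terms per aspect into shared, human-only, and ortholog-only sets."""
    comparison = {}
    for aspect in ASPECTS:
        h = set(human.get(aspect, ()))
        o = set(ortholog.get(aspect, ()))
        comparison[aspect] = {
            "shared": h & o,          # basis for conservation claims
            "human_only": h - o,      # often annotation bias, not divergence
            "ortholog_only": o - h,   # less common; may indicate acquired function
        }
    return comparison


# Illustrative input only; not real annotations.
human_terms = {"BP": {"GO:0006915"}, "MF": {"GO:0003677"}, "CC": {"GO:0005634"}}
mouse_terms = {"BP": {"GO:0006915"}, "CC": {"GO:0005634"}}
result = compare_go(human_terms, mouse_terms)
```

Here the MF term lands in `human_only`, which under the reasoning above is treated as a possible annotation gap first.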
Phase 5: Cross-Species Phenotype Bridging
Tools: Monarch_search_gene (query), Monarch_get_gene_phenotypes, Monarch_get_gene_diseases.
Phenotype ontologies by species: Human = HP (HPO), Mouse = MP (Mammalian Phenotype), Zebrafish = ZP, Fly = FBcv. Monarch integrates across species; compare phenotype themes (e.g., "tumor susceptibility" in human and "increased tumor incidence" in mouse) rather than requiring exact term matches.
Reasoning for model organism selection: A mouse ortholog that has a 1:1 relationship AND shows phenotypes in Monarch that recapitulate the human disease is a strong disease model candidate. If the mouse phenotype diverges significantly from the human disease phenotype, flag it: the divergence could indicate species-specific function or a limitation of the model.
Phase 6: Gene Tree & Evolutionary Context
Tools: EnsemblCompara_get_gene_tree (gene, species), EnsemblCompara_get_paralogues.
From the gene tree, assess: (1) how many species contain a member of this gene family; (2) when gene duplication events occurred (ancient vs. recent); (3) whether the gene family expanded in particular lineages. A gene present in a single copy across all vertebrates (deep conservation, no duplication) is likely under strong selective constraint.
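The three gene-tree questions can be answered by walking the tree once. The sketch below assumes a nested-dict tree in which internal nodes carry `children` and an `events` entry with a `type` of "duplication" or "speciation", and leaves carry `taxonomy.scientific_name`. That shape is modeled on Ensembl's JSON gene trees but is an assumption here; the actual output of EnsemblCompara_get_gene_tree may differ.

```python
def summarize_gene_tree(root):
    """Count species, gene copies, and duplication nodes in a nested-dict tree."""
    species, n_genes, n_duplications = set(), 0, 0
    stack = [root]
    while stack:
        node = stack.pop()
        children = node.get("children") or []
        if children:
            # Internal node: record duplication events, then descend.
            if node.get("events", {}).get("type") == "duplication":
                n_duplications += 1
            stack.extend(children)
        else:
            # Leaf node: one gene copy in one species.
            n_genes += 1
            species.add(node["taxonomy"]["scientific_name"])
    return {
        "n_species": len(species),
        "n_genes": n_genes,
        "n_duplications": n_duplications,
        "single_copy": n_genes == len(species),
    }


# Toy tree (hypothetical): one human gene, two zebrafish copies from a duplication.
toy_tree = {
    "events": {"type": "speciation"},
    "children": [
        {"taxonomy": {"scientific_name": "Homo sapiens"}},
        {
            "events": {"type": "duplication"},
            "children": [
                {"taxonomy": {"scientific_name": "Danio rerio"}},
                {"taxonomy": {"scientific_name": "Danio rerio"}},
            ],
        },
    ],
}
summary = summarize_gene_tree(toy_tree)
```

Dating duplications as ancient or recent would additionally need the taxonomic level attached to each duplication node, which this sketch does not attempt.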
Synthesis Questions
When interpreting the assembled evidence, work through these questions:
- Is the ortholog relationship 1:1 or has duplication created paralogs that may have diverged in function? This determines how directly findings in the model organism translate to the human gene.
- Do orthologs share conserved GO terms (especially Biological Process), or are there lineage-specific functional annotations suggesting divergence?
- For disease gene studies, does the model organism ortholog recapitulate relevant human phenotypes (via Monarch), supporting its use as a disease model?
- Are non-coding regulatory regions around the gene also conserved (PhastCons/GERP from OpenCRAVAT), suggesting conservation of gene regulation beyond protein function?
- If no ortholog is found, is the gene truly lineage-specific, or might a highly divergent homolog exist that is only detectable by sensitive sequence methods?
Fallback Strategies
- Ortholog not found in Ensembl Compara: Try ensembl_get_homology, then OpenTargets_get_target_homologues_by_ensemblID, then BLAST as last resort
- Sequence retrieval fails: Use ensembl_get_homology with sequence="cdna" as alternative to NCBI
- UniProt returns empty with reviewed:true: Try without that filter; organism may have only TrEMBL entries
- Monarch returns no data: Use MonarchV3_get_associations with category="biolink:GeneToPhenotypicFeatureAssociation" as alternative
- Gene symbol ambiguous across species: Use Ensembl IDs throughout to avoid symbol confusion (e.g., "p53" vs "tp53" in zebrafish)
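The ordered fallbacks above follow one pattern: try each source in turn and stop at the first non-empty result. A minimal sketch of that pattern follows. The callables are hypothetical stand-ins, since how the ToolUniverse tools are invoked is not shown here; in practice each one would wrap a tool such as ensembl_get_homology.

```python
def first_nonempty(query, sources):
    """Try (name, fetch) pairs in order; return the first non-empty result.

    A source that raises is treated like an empty result so that the
    chain keeps going. Returns (None, []) if every source comes up empty.
    """
    for name, fetch in sources:
        try:
            result = fetch(query)
        except Exception:
            continue  # move on to the next fallback
        if result:
            return name, result
    return None, []


# Stub sources standing in for real tool calls (illustration only).
def empty_source(query):
    return []


def failing_source(query):
    raise RuntimeError("service unavailable")


def working_source(query):
    return [query + "_ortholog"]


chain = [("primary", empty_source), ("secondary", failing_source), ("last_resort", working_source)]
winner, hits = first_nonempty("GENE", chain)
```

Returning the source name alongside the result keeps the report honest about which database the ortholog evidence came from. Swallowing every exception is a deliberate simplification; logging the failure would be a sensible addition.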
Limitations
- Ensembl Compara: Best for vertebrates; invertebrate and plant coverage is limited for some gene families
- BLAST_protein_search: Very slow (5-30 min); use only as last resort for ortholog discovery
- Monarch: Phenotype coverage varies by organism; mouse and zebrafish are best covered; fly and worm data are sparser
- UniProt GO annotations: Bias toward well-studied organisms; absence of annotation does not mean absence of function
- NCBI_search_nucleotide: May return many isoforms; filter for RefSeq (NM_*) for canonical transcripts
- Conservation does not equal essentiality: Some highly conserved genes are dispensable in specific organisms