tooluniverse-sequence-analysis
Biological Sequence Analysis
Retrieve, annotate, and compare biological sequences from NCBI, Ensembl, and UniProt. Covers nucleotide search, sequence fetching, gene summaries, ortholog discovery, and protein sequence extraction.
When to Use
- "Get the mRNA sequence for BRCA1"
- "Search NCBI for E. coli K-12 complete genome"
- "Find orthologs of TP53 across species"
- "Fetch the protein sequence for UniProt P04637"
- "Get the CDS sequence for Ensembl transcript ENST00000269305"
Workflow
Input -> Phase 1: Gene ID resolution -> Phase 2: Nucleotide retrieval
-> Phase 3: Protein sequences -> Phase 4: Orthologs -> Output
Phase 1: Gene Identification and Summary
- NCBIGene_search: `term` (string REQUIRED, format `"TP53[Symbol] AND Homo sapiens[Organism]"`), `retmax` (int, default 10). Returns `{status, data: {esearchresult: {idlist: ["7157"]}}}`.
- NCBIGene_get_summary: `id` (string REQUIRED, e.g., "7157"). Returns `{status, data: {result: {"7157": {name, description, summary, chromosome, maplocation, genomicinfo, mim}}}}`. Result is keyed by gene ID string.
- NCBIDatasets_get_gene_by_symbol: `symbol` (string REQUIRED, e.g., "BRCA1"), `taxon` (string, e.g., "human"). Returns gene ID, description, location, cross-references.
- NCBIDatasets_get_gene: `gene_id` (string REQUIRED, e.g., "7157"). Returns comprehensive gene info.
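The Phase 1 response shapes can be unpacked as in the sketch below. It assumes responses arrive as plain Python dicts with the keys documented above. `first_gene_id` and `gene_summary` are hypothetical helper names, and the `"success"` status value is a placeholder, not a confirmed ToolUniverse value.

```python
def first_gene_id(search_response):
    """Return the first NCBI Gene ID from an NCBIGene_search response, or None."""
    idlist = (
        search_response.get("data", {})
        .get("esearchresult", {})
        .get("idlist", [])
    )
    return idlist[0] if idlist else None


def gene_summary(summary_response, gene_id):
    """Look up one gene in an NCBIGene_get_summary response (keyed by ID string)."""
    result = summary_response.get("data", {}).get("result", {})
    return result.get(str(gene_id))  # str() guards against an integer ID


# Canned responses shaped like the documented return values (not real API output).
search = {"status": "success", "data": {"esearchresult": {"idlist": ["7157"]}}}
summary = {"status": "success",
           "data": {"result": {"7157": {"name": "TP53", "chromosome": "17"}}}}

gene_id = first_gene_id(search)
print(gene_id, gene_summary(summary, gene_id)["name"])
```

Converting the ID with `str()` before the lookup reflects the note that the summary result is keyed by gene ID string.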
Phase 2: Nucleotide Sequence Search and Retrieval
- NCBI_search_nucleotide: `query` (free-form), `organism` (string), `gene` (string), `strain` (string), `keywords` (string), `seq_type` ("complete_genome"/"mRNA"/"refseq"), `limit` (int, default 20). Returns `{status, data: {uids: [...], accessions: [...]}}`.
- NCBI_fetch_accessions: `uids` (array REQUIRED, e.g., ["545778205"]). Returns `{status, data: ["U00096.3"], count: 1}`.
- NCBI_get_sequence: `accession` (string REQUIRED, e.g., "NM_007294"), `format` ("fasta"/"gb"/"embl"). Returns `{status, data: "FASTA string...", accession, format, length}`.
- EnsemblSeq_get_region_sequence: `region` (string REQUIRED, "chr:start-end", e.g., "17:7668421-7668520"), `species` (default "homo_sapiens"). Returns `{status, data: {sequence, sequence_length}}`.
- ensembl_get_sequence: `id` (string REQUIRED, Ensembl ID), `type` ("genomic"/"cds"/"cdna"/"protein"), `multiple_sequences` (bool). Returns sequence data.
Gotchas:
- NCBI_search_nucleotide returns UIDs, not accessions. Use NCBI_fetch_accessions to convert.
- NCBI_fetch_accessions requires `uids` (NOT `accessions`).
- ensembl_get_sequence with gene ID (ENSG) + type != "genomic" requires `multiple_sequences=true`. Use transcript IDs (ENST) for specific sequences.
Recipe: Get mRNA for a human gene
- `NCBI_search_nucleotide(organism="Homo sapiens", gene="BRCA1", seq_type="mRNA", limit=5)`
- `NCBI_fetch_accessions(uids=[first_uid])` -> accession
- `NCBI_get_sequence(accession="NM_007294", format="fasta")`
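The three recipe steps can be chained as in the following sketch. How tools are actually invoked is not specified here, so the code takes a hypothetical `call_tool(name, **params)` callable as an assumption. `fake_call_tool` is a stand-in that returns made-up responses shaped like the documented return values.

```python
def fetch_mrna_fasta(call_tool, gene, organism="Homo sapiens"):
    """Search -> UID -> accession -> FASTA, following the three recipe steps."""
    search = call_tool("NCBI_search_nucleotide",
                       organism=organism, gene=gene, seq_type="mRNA", limit=5)
    uids = search["data"]["uids"]
    if not uids:
        return None  # nothing found; the caller may broaden the query
    accessions = call_tool("NCBI_fetch_accessions", uids=[uids[0]])["data"]
    if not accessions:
        return None
    return call_tool("NCBI_get_sequence", accession=accessions[0], format="fasta")


def fake_call_tool(name, **params):
    """Hypothetical dispatcher returning canned, made-up responses for the demo."""
    canned = {
        "NCBI_search_nucleotide": {
            "status": "success", "data": {"uids": ["0000001"], "accessions": []}},
        "NCBI_fetch_accessions": {
            "status": "success", "data": ["NM_007294"], "count": 1},
        "NCBI_get_sequence": {
            "status": "success", "data": ">NM_007294 placeholder\nACGT",
            "accession": params.get("accession"), "format": "fasta", "length": 4},
    }
    return canned[name]


record = fetch_mrna_fasta(fake_call_tool, "BRCA1")
print(record["accession"], record["length"])
```

Passing the dispatcher in as an argument keeps the pipeline testable without network access. A real run would swap in whatever MCP or SDK call the environment provides.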
Phase 3: Protein Sequence Retrieval
- UniProt_get_sequence_by_accession: `accession` (string REQUIRED, e.g., "P04637"). Returns `{result: "MEEPQSDP..."}`. Note: response key is `result`, NOT `data`.
- EnsemblSeq_get_id_sequence: `ensembl_id` (string REQUIRED, e.g., "ENSP00000269305"), `type` ("protein"/"cdna"/"cds"). Returns `{status, data: {ensembl_id, molecule, sequence, sequence_length}}`.
- UniProt_get_entry_by_accession: `accession` (string REQUIRED). Full protein annotation.
Gotchas:
- UniProt_get_sequence_by_accession returns `{result: "..."}`, not `{status, data}`.
- For Ensembl protein seqs, use ENSP IDs. For cDNA/CDS, use ENST IDs.
- To find UniProt accession from gene: use NCBIDatasets_get_gene_by_symbol (has cross-refs).
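Because the sequence tools return differently shaped responses, a small normaliser can help avoid the `result` versus `data` mistake. This is a minimal sketch that assumes the three shapes documented in Phases 2 and 3. `extract_sequence` is a hypothetical helper name.

```python
def extract_sequence(response):
    """Pull the raw sequence string out of the documented response shapes."""
    if "result" in response:
        # UniProt_get_sequence_by_accession: {result: "MEEPQSDP..."}
        return response["result"]
    data = response.get("data")
    if isinstance(data, dict) and "sequence" in data:
        # EnsemblSeq_* tools: {status, data: {sequence, sequence_length, ...}}
        return data["sequence"]
    if isinstance(data, str):
        # NCBI_get_sequence: FASTA text; drop header lines and join the rest
        body = [ln.strip() for ln in data.splitlines() if not ln.startswith(">")]
        return "".join(body)
    raise ValueError("unrecognised response shape")


print(extract_sequence({"result": "MEEPQSDP"}))
print(extract_sequence({"status": "success", "data": ">demo\nAT\nGC\n"}))
```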
Phase 4: Ortholog and Comparative Analysis
- NCBIDatasets_get_orthologs: `gene_id` (string REQUIRED, NCBI Gene ID e.g., "7157"), `page_size` (int, default 20, max 100). Returns `{status, data: [{gene_id, symbol, description, taxname, common_name, chromosomes}]}`.
- NCBIProtein_get_summary: `id` (string REQUIRED, GI number or accession). Returns protein title, organism, length.
Gotcha: NCBIDatasets_get_orthologs requires NCBI Gene ID (numeric string), not gene symbol or Ensembl ID. Resolve via Phase 1 first.
Recipe: Compare orthologs
- `NCBIGene_search(term="TP53[Symbol] AND Homo sapiens[Organism]")` -> "7157"
- `NCBIDatasets_get_orthologs(gene_id="7157", page_size=10)` -> mouse Trp53, rat Tp53, etc.
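The two recipe steps, plus the numeric-ID check from the gotcha, might look like the sketch below. As before, `call_tool` is an assumed dispatcher signature and `fake_call_tool` returns made-up data for illustration only.

```python
def orthologs_for_symbol(call_tool, symbol, organism="Homo sapiens", page_size=10):
    """Resolve a gene symbol to an NCBI Gene ID, then list (taxname, symbol) pairs."""
    term = f"{symbol}[Symbol] AND {organism}[Organism]"
    idlist = call_tool("NCBIGene_search", term=term)["data"]["esearchresult"]["idlist"]
    if not idlist:
        return []
    gene_id = idlist[0]
    if not gene_id.isdigit():
        # NCBIDatasets_get_orthologs needs a numeric Gene ID string
        raise ValueError(f"expected numeric NCBI Gene ID, got {gene_id!r}")
    response = call_tool("NCBIDatasets_get_orthologs",
                         gene_id=gene_id, page_size=page_size)
    return [(hit["taxname"], hit["symbol"]) for hit in response["data"]]


def fake_call_tool(name, **params):
    """Hypothetical dispatcher with canned, made-up responses."""
    if name == "NCBIGene_search":
        return {"status": "success", "data": {"esearchresult": {"idlist": ["7157"]}}}
    if name == "NCBIDatasets_get_orthologs":
        return {"status": "success", "data": [
            {"symbol": "Trp53", "taxname": "Mus musculus"},
            {"symbol": "Tp53", "taxname": "Rattus norvegicus"},
        ]}
    raise KeyError(name)


pairs = orthologs_for_symbol(fake_call_tool, "TP53")
print(pairs)
```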
Phase 5: Domain Architecture and Homology
- InterPro_get_entries_for_protein: `accession` (UniProt ID). Returns InterPro domain/family/superfamily entries with positions.
- Pfam_get_protein_annotations: `accession` (UniProt ID). Returns Pfam domain hits with exact residue coordinates and E-values.
- BLAST_protein_search: `sequence` (amino acid string), `database` (default "swissprot"), `limit`. Returns homologs with alignment scores, identity, E-values.
- EnsemblCompara_get_orthologues: `gene` (gene symbol, e.g., "CFTR"), `species` (e.g., "human"). User-friendly alternative to NCBIDatasets_get_orthologs: it accepts gene symbols directly.
Phase 6: Variant and Clinical Context
- EnsemblVEP_annotate_hgvs: `hgvs_notation` (e.g., "NM_000492.4:c.1521_1523del"). Returns consequence, protein impact, genomic coordinates.
- ClinVar_search_variants: `gene` (gene symbol). Returns variant count and IDs for clinical significance lookup.
- PubMed_search_articles: `query`, `limit`. Literature context for gene/variant findings.
Tool Parameter Quick Reference
| Tool | Correct Param | Common Mistake |
|---|---|---|
| NCBIGene_search | `term` | |
| NCBIGene_get_summary | `id` | Integer type |
| NCBI_fetch_accessions | `uids` | `accessions` |
| NCBI_get_sequence | `accession` | Passing UID |
| NCBIDatasets_get_orthologs | `gene_id` | Gene symbol |
| EnsemblSeq_get_id_sequence | `ensembl_id` | |
| ensembl_get_sequence | `id` | Omitting multiple_sequences for gene+CDS |
| UniProt_get_sequence_by_accession | `accession` | Response is `{result}`, not `{status, data}` |
Fallbacks
- Gene not found -> try NCBIDatasets_get_gene_by_symbol with explicit taxon
- No accessions from search -> broaden query (remove strain/seq_type filters)
- Ensembl error for gene+CDS -> use transcript ID (ENST) or set multiple_sequences=true
- UniProt accession unknown -> NCBIDatasets_get_gene or UniProt_search for cross-refs
- Ortholog search empty -> verify gene_id is numeric NCBI Gene ID
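The "broaden query" fallback can be sketched as a retry loop that drops filters one at a time. The order in which filters are removed (`strain` first, then `seq_type`) is an assumption for illustration, as are the `call_tool` dispatcher and the canned stub.

```python
def search_with_broadening(call_tool, **filters):
    """Retry NCBI_search_nucleotide, dropping strain, then seq_type, until hits appear."""
    attempts = [dict(filters)]
    for key in ("strain", "seq_type"):
        if key in attempts[-1]:
            relaxed = dict(attempts[-1])
            relaxed.pop(key)
            attempts.append(relaxed)
    for params in attempts:
        uids = call_tool("NCBI_search_nucleotide", **params)["data"]["uids"]
        if uids:
            return uids, params  # report which filter set finally matched
    return [], attempts[-1]


def fake_call_tool(name, **params):
    """Hypothetical stub: pretends the strain filter is too restrictive."""
    found = [] if "strain" in params else ["545778205"]
    return {"status": "success", "data": {"uids": found, "accessions": []}}


uids, used = search_with_broadening(
    fake_call_tool, organism="Escherichia coli", strain="K-12",
    seq_type="complete_genome")
print(uids, sorted(used))
```

Returning the parameter set that matched makes it clear to the caller that the result came from a relaxed query.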
Sequence Analysis Reasoning (CRITICAL)
LOOK UP DON'T GUESS -- always fetch sequences, coordinates, and domain boundaries from databases. Do not reconstruct them from memory.
When to Use Which Tool
| Question Type | Tool Choice | Why |
|---|---|---|
| "Find similar sequences" | BLAST_protein_search | Homology search against databases; returns E-values and identity |
| "What domains does this protein have?" | InterPro_get_entries_for_protein or Pfam_get_protein_annotations | Domain architecture with exact residue coordinates |
| "Get the sequence of gene X" | NCBI_search_nucleotide -> NCBI_get_sequence | Nucleotide retrieval by gene name |
| "Compare orthologs" | NCBIDatasets_get_orthologs or EnsemblCompara_get_orthologues | Cross-species gene comparison |
| "What is the protein impact of variant X?" | EnsemblVEP_annotate_hgvs | Consequence prediction with protein coordinates |
| "Align two sequences" | BLAST (pairwise) | Quick pairwise comparison with scoring |
Reading Frame Selection Strategy
When translating a DNA sequence to protein:
- Do NOT guess the reading frame -- preferred: use `DNA_translate_reading_frames` tool; fallback: `translate_dna.py`, which tries all 3 frames automatically
- The correct frame is the one with the LONGEST open reading frame (no premature stops)
- If the sequence starts with ATG, frame 1 is likely correct -- but verify
- If all 3 frames have early stop codons, the sequence may be: (a) non-coding, (b) reversed, or (c) contains sequencing errors. Try reverse complement first.
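To make the frame-selection rule concrete, here is an illustrative sketch of "translate all 3 forward frames and keep the longest stop-free stretch". It is not the implementation of `DNA_translate_reading_frames` or `translate_dna.py`, which remain the recommended route. It uses the standard genetic code, ignores start codons and the reverse strand, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(protein):
    """Longest stretch between stop codons in a translated frame."""
    return max(protein.split("*"), key=len)


def best_frame(dna):
    """Return (1-based frame, longest stop-free peptide) over the 3 forward frames."""
    candidates = [(frame + 1, longest_orf(translate(dna, frame))) for frame in range(3)]
    return max(candidates, key=lambda item: len(item[1]))


print(best_frame("ATGAAACCTAAGGGCTGA"))  # frame 1 wins; frames 2 and 3 hit stops
```

On ties `max` keeps the lowest-numbered frame, which is one reason the rule above says to verify rather than trust frame 1 blindly.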
Protein Domain Interpretation
When asked about protein function or structure:
- Get domain architecture first: `InterPro_get_entries_for_protein` returns all annotated domains with positions
- Domain families indicate function: Kinase domain = phosphorylation activity; SH2 domain = phosphotyrosine binding; zinc finger = DNA binding
- Variants in conserved domains are more likely pathogenic than those in linker regions
- LOOK UP domain boundaries from the database -- do not estimate positions from memory
Reasoning for Protein Feature Questions
When asked "how many X residues in region Y of protein Z":
- Identify the correct protein: gene names are ambiguous. GABAA has many subunits (GABRA1, GABRB2, GABRR1...). Read the question carefully for the specific subunit. Use `proteins_api_search` with gene name + "human" to find the right accession.
- Find the region boundaries: use `proteins_api_get_features` with the accession to get annotated features (TRANSMEM, DOMAIN, REGION). Don't guess positions; get them from the database.
- Count residues in the region: fetch the sequence, extract the region, count. WRITE Python code for this; don't try to count manually.
  - Residue Counting Strategy: `python3 skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py --type count_region --accession P24046 --start 318 --end 440 --residue C`
  - For residue counting questions, ALWAYS use the script or `sequence[start-1:end].count('C')` (with 1-based inclusive `start` and `end`, matching the script's convention). Do NOT estimate or count from memory.
- Account for multimers: READ THE QUESTION for "homomeric", "pentamer", "tetramer", "dimer". If the question asks about a homomeric receptor (e.g., "homomeric GABAAρ1"), every subunit is identical. Count the residues in ONE subunit, then multiply:
  - Homomeric pentamer (Cys-loop ligand-gated ion channels such as GABAA ρ1): × 5
  - Homotetramer (many ion channels): × 4
  - Homodimer: × 2
  If the question says "in the TM3-TM4 linker domains" (plural), it means across all subunits in the complex.
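The counting and multimer steps reduce to a slice, a count, and a multiplication. The sketch below assumes 1-based inclusive coordinates, as database feature annotations report them. `count_in_region` is a hypothetical helper and the sequence is made up.

```python
def count_in_region(sequence, start, end, residue, subunits=1):
    """Count `residue` in the 1-based inclusive region [start, end], times `subunits`."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must satisfy 1 <= start <= end <= len(sequence)")
    region = sequence.upper()[start - 1:end]  # 1-based inclusive -> 0-based half-open
    return region.count(residue.upper()) * subunits


toy = "MACDCCKLCW"  # made-up 10-residue sequence, not a real protein
print(count_in_region(toy, 3, 6, "C"))              # one subunit
print(count_in_region(toy, 3, 6, "C", subunits=5))  # homopentamer
```

The `start - 1` offset is the usual source of off-by-one errors: a plain `sequence[start:end]` would silently skip the first residue of the region.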
Bundled Computation Scripts
Never manually count residues, compute GC%, or write reverse-complement logic inline. Run these scripts instead — they are tested and handle edge cases.
biology_facts.py — Biology reference lookup
Script: `skills/tooluniverse-sequence-analysis/scripts/biology_facts.py`
Use this script to look up commonly-confused biology facts instead of relying on memory. It covers receptor types, ion channel stoichiometry, neurotransmitters, immune cell markers, and gene naming confusions.
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor --name "GABAA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type ion_channel --name "NMDA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type gene_confusion --name "GABRA1"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor # list all entries
Types: `receptor` (stoichiometry, pharmacology), `ion_channel` (subunit arrangement), `neurotransmitter` (synthesis, receptors), `immune_cell` (markers, lineage), `gene_confusion` (commonly mixed-up genes like GABRA1 vs GABRR1).
Mandatory use: any question about receptor type/stoichiometry, immune cell markers, or gene name disambiguation.
amino_acids.py — Codon table, amino acid properties, wobble pairing
Script: `skills/tooluniverse-sequence-analysis/scripts/amino_acids.py`
Use this script for any question about the genetic code, codon degeneracy, amino acid chemistry, codon usage bias, or tRNA wobble pairing. All outputs are JSON.
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type codon_table
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --name "Cysteine"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code C
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code TRP
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid # list all 20
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type count_codons --sequence "ATGCCCAAATTT..."
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "GAU"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"
Modes:
| `--type` | What it returns | Key fields |
|---|---|---|
| `codon_table` | All 64 codons grouped by amino acid | degeneracy, codons, human codon usage %, stop codon names, degeneracy distribution (1/2/3/4/6) |
| `amino_acid` | Properties of one or all amino acids | name, one_letter, three_letter, mw_da, pKa_side_chain, polarity, charge_ph7, hydrophobicity_index (Kyte-Doolittle), backbone_pKa, codons, degeneracy, rare_codons_le15pct |
| `count_codons` | Codon frequency analysis for a DNA sequence | codon_counts with AA annotation and human usage freq, amino_acid_composition, rare_codons_present |
| `wobble` | Codons recognised by a given anticodon | recognised_codons (RNA+DNA form, AA), synonymous_only, wobble rule explanation |
When to use (mandatory):
- Any question about how many codons encode a given amino acid (degeneracy)
- Any question about rare vs. common codons for protein expression optimisation
- Any question about tRNA anticodon recognition / wobble base pairing
- Any question about amino acid physical-chemical properties (MW, pKa, hydrophobicity, polarity, charge)
- Any question about the names of stop codons (Amber/Ochre/Opal)
- Before manually stating codon degeneracy, verify with `codon_table`
Wobble rules: I pairs U/C/A (3 codons); G pairs U/C; U pairs A/G; C pairs G only; A pairs U only (rare). Use `--type wobble --anticodon "GAU"` to verify.
Amino acid lookup: accepts full name (`--name "Cysteine"`), 1-letter (`--code C`), or 3-letter (`--code CYS`).
Codon-Anticodon Matching Reasoning (CRITICAL for tRNA problems)
When solving "which codons does this tRNA recognize" or "which tRNA reads this codon":
- Anticodons are conventionally written 5'->3', even though they appear 3'->5' when drawn aligned beneath a 5'->3' codon. The FIRST position of the anticodon (5' end) is the WOBBLE position and pairs with the THIRD position of the codon (3' end).
- Anticodon-codon pairing is ANTIPARALLEL: anticodon 5'-X-Y-Z-3' pairs with codon 3'-X'-Y'-Z'-5' (i.e., codon 5'-Z'-Y'-X'-3').
- Wobble position rules (anticodon 5' base -> codon 3' base it can pair with):
- C -> G only (1 codon)
- A -> U only (1 codon; rare in bacteria, common in mitochondria)
- U -> A or G (2 codons)
- G -> C or U (2 codons)
- I (inosine, deaminated A) -> U, C, or A (3 codons)
- Minimum tRNA set: Because I reads 3 bases and G/U each read 2, a 4-codon family (e.g., GCN = Ala) needs only 2 tRNAs: one with I at wobble position (reads 3 of 4 codons) and one with C or U at wobble (reads the remaining 1-2).
- ALWAYS use the script: `python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"` to verify rather than reasoning from memory.
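As a cross-check on the rules above, the antiparallel pairing and wobble table can be written out in a few lines. This is only an illustrative sketch of the stated rules and does not replace the `amino_acids.py --type wobble` script. `recognised_codons` is a hypothetical function name and modified bases other than inosine are not handled.

```python
# Wobble position: anticodon 5' base -> codon 3' bases it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}
# Strict Watson-Crick pairing for the other two positions.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def recognised_codons(anticodon):
    """List the codons (5'->3', RNA) read by an anticodon given 5'->3'."""
    bases = anticodon.upper().replace("T", "U")
    if len(bases) != 3:
        raise ValueError("anticodon must have exactly 3 bases")
    wobble, middle, last = bases  # anticodon positions 1, 2, 3 (5'->3')
    first = WATSON_CRICK[last]     # anticodon 3' base pairs with codon position 1
    second = WATSON_CRICK[middle]  # middle bases pair with each other
    return [first + second + third for third in WOBBLE[wobble]]


print(recognised_codons("GAU"))
print(recognised_codons("IAU"))
print(recognised_codons("CAU"))
```

Reading the anticodon right to left to build the codon left to right is the antiparallel step. Only the last codon position fans out through the wobble table.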
translate_dna.py — DNA to protein translation
Preferred: use `DNA_translate_reading_frames` tool (via MCP/SDK) with `sequence` parameter. Fallback: run `translate_dna.py` directly.
python3 skills/tooluniverse-sequence-analysis/scripts/translate_dna.py "ATGCCC..."
Tries all 3 reading frames, picks longest ORF automatically.
sequence_tools.py — Residue counting, GC content, reverse complement, stats
Script: `skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py`
Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script:
- `Sequence_count_residues` tool -- Count residues in a sequence or region. Fallback: `sequence_tools.py --type count_residues` or `--type count_region`
- `Sequence_gc_content` tool -- GC% of DNA. Fallback: `sequence_tools.py --type gc_content`
- `Sequence_reverse_complement` tool -- DNA reverse complement. Fallback: `sequence_tools.py --type reverse_complement`
- `Sequence_stats` tool -- Auto-detect type, length, MW. Fallback: `sequence_tools.py --type stats`
Fallback script modes (use `--type`):
- `count_residues`: Count residue in full sequence. `--sequence "ACDE..." --residue C`
- `count_region`: Count in region (1-based inclusive). `--sequence "MAC..." --start 5 --end 20 --residue C` OR `--accession P24046 --start 318 --end 440 --residue C` (fetches from UniProt live)
- `gc_content`: GC% of DNA. `--sequence "ATGCGATCG"`
- `reverse_complement`: DNA reverse complement. `--sequence "ATGCGATCG"`
- `stats`: Auto-detect DNA/RNA/Protein, compute length, MW for protein. `--sequence "ATGCG..."`
ALWAYS use `count_region --accession` when the user gives a UniProt accession + region -- do not count manually.
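When the fallback script is driven from Python, building the argument list explicitly avoids shell-quoting mistakes. The sketch below only assembles the command from the flags listed above. `sequence_tools_cmd` is a hypothetical helper, and the script's output format is not assumed.

```python
import shlex

SCRIPT = "skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py"


def sequence_tools_cmd(mode, **options):
    """Build the argv list for one sequence_tools.py mode (flags as documented)."""
    cmd = ["python3", SCRIPT, "--type", mode]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = sequence_tools_cmd("count_region",
                         accession="P24046", start=318, end=440, residue="C")
print(shlex.join(cmd))
# To execute (requires the script on disk and, for --accession, network access):
#   subprocess.run(cmd, capture_output=True, text=True, check=True)
```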
Interpretation Framework
Sequence Quality Assessment
| Indicator | High Quality | Acceptable | Caution |
|---|---|---|---|
| RefSeq status | NM_/NP_ (curated) | XM_/XP_ (predicted) | No RefSeq (GenBank only) |
| Sequence version | Latest version (.N) | Previous version | Removed/replaced |
| Annotation | Reviewed (UniProt Swiss-Prot) | Unreviewed (TrEMBL) | No annotation |
| Gene symbol | HGNC approved | Alias/synonym | Locus tag only |
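The RefSeq row of the table can be applied mechanically from the accession prefix. This sketch covers only the four prefixes named in the table; anything else, including GenBank accessions and other RefSeq prefixes, falls into a catch-all bucket for manual review. `refseq_tier` is a hypothetical helper name.

```python
def refseq_tier(accession):
    """Map an accession prefix to the RefSeq-status tier from the table."""
    prefix = accession.strip().upper()[:3]
    if prefix in ("NM_", "NP_"):
        return "high quality (curated)"
    if prefix in ("XM_", "XP_"):
        return "acceptable (predicted)"
    return "other (check manually)"


for acc in ("NM_007294", "XP_000000", "U00096.3"):
    print(acc, "->", refseq_tier(acc))
```

Here `XP_000000` is a made-up accession used only to exercise the predicted branch.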
Synthesis Questions
- Is this the correct sequence? (verify organism, gene symbol, isoform)
- Is it the canonical isoform? (RefSeq MANE Select or UniProt canonical)
- How well-annotated is it? (SwissProt > TrEMBL > GenBank predicted)
- Are there known variants? (ClinVar pathogenic variants in this sequence)
Answer Formatting (CRITICAL)
TRIM YOUR ANSWER: If the question asks "what protein", answer with JUST the protein name. Do not add parenthetical abbreviations, descriptions, or qualifications. Example: answer "Glucose-6-phosphate 1-dehydrogenase", NOT "Glucose-6-phosphate 1-dehydrogenase (G6PD, EC 1.1.1.49)". When identifying a protein from a sequence, use BLAST/UniProt and report the top hit name exactly as it appears in the database — no embellishment.
Peptide & Foldamer Structure
- Alpha-peptide helices: alpha-helix (3.6 res/turn, i->i+4 H-bonds), 3_10-helix (3 res/turn, i->i+3), pi-helix (4.4 res/turn, i->i+5).
- Beta-peptide helices: named by H-bond ring size. 14-helix (i->i+2, 14-membered rings), 12-helix, 10-helix, 8-helix.
- Beta-amino acid ring size determines helix type: 4-membered cyclic constraint -> 10-helix; 5-membered (e.g., ACPC) -> 12-helix; 6-membered (e.g., ACHC) -> 14-helix. Acyclic beta3-residues default to 14-helix.
- Mixed alpha/beta foldamers (1:1 alternation): form 11-helix (i->i+3, 11-atom rings) or 14/15-helix (i->i+4, alternating 14- and 15-atom rings). Longer sequences prefer the 14/15-helix.
- Key rule: the number in the helix name = number of atoms in the hydrogen-bonded ring.
- Cyclic beta-amino acids (ACPC, ACHC) constrain backbone torsion angles, favoring specific helix types over acyclic residues.
Limitations
- ensembl_get_sequence gene IDs + non-genomic type need `multiple_sequences=true`
- NCBIDatasets_get_orthologs requires NCBI Gene ID (not symbol); UniProt returns canonical isoform only