tooluniverse-sequence-analysis

Biological Sequence Analysis

Retrieve, annotate, and compare biological sequences from NCBI, Ensembl, and UniProt. Covers nucleotide search, sequence fetching, gene summaries, ortholog discovery, and protein sequence extraction.

When to Use

  • "Get the mRNA sequence for BRCA1"
  • "Search NCBI for E. coli K-12 complete genome"
  • "Find orthologs of TP53 across species"
  • "Fetch the protein sequence for UniProt P04637"
  • "Get the CDS sequence for Ensembl transcript ENST00000269305"

Workflow

Input -> Phase 1: Gene ID resolution -> Phase 2: Nucleotide retrieval
      -> Phase 3: Protein sequences -> Phase 4: Orthologs -> Output

Phase 1: Gene Identification and Summary

  • NCBIGene_search: term (string, REQUIRED; format "TP53[Symbol] AND Homo sapiens[Organism]"), retmax (int, default 10). Returns {status, data: {esearchresult: {idlist: ["7157"]}}}.
  • NCBIGene_get_summary: id (string, REQUIRED; e.g., "7157"). Returns {status, data: {result: {"7157": {name, description, summary, chromosome, maplocation, genomicinfo, mim}}}}. The result is keyed by the gene ID string.
  • NCBIDatasets_get_gene_by_symbol: symbol (string, REQUIRED; e.g., "BRCA1"), taxon (string, e.g., "human"). Returns gene ID, description, location, and cross-references.
  • NCBIDatasets_get_gene: gene_id (string, REQUIRED; e.g., "7157"). Returns comprehensive gene info.
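
As a minimal sketch of how the two response shapes above fit together, the helpers below pull the first gene ID out of an NCBIGene_search-style response and use it as the key into an NCBIGene_get_summary-style response. The function names are hypothetical, and the sample dictionaries (including the "success" status value) are hand-written stand-ins that only mirror the documented shapes.

```python
def first_gene_id(search_response):
    """Return the first ID from an NCBIGene_search-style response, or None."""
    idlist = (
        search_response.get("data", {})
        .get("esearchresult", {})
        .get("idlist", [])
    )
    return idlist[0] if idlist else None


def summary_for(summary_response, gene_id):
    """Look up one gene in an NCBIGene_get_summary-style response.

    The result mapping is keyed by the gene ID as a string, so an
    integer ID is converted before the lookup.
    """
    result = summary_response.get("data", {}).get("result", {})
    return result.get(str(gene_id))


# Hand-written sample payloads (assumed values, not live NCBI output).
search = {"status": "success", "data": {"esearchresult": {"idlist": ["7157"]}}}
summary = {
    "status": "success",
    "data": {"result": {"7157": {"name": "TP53", "chromosome": "17"}}},
}

gene_id = first_gene_id(search)
print(gene_id, summary_for(summary, gene_id)["name"])
```

Converting the ID with str() guards against the integer-versus-string mistake, since the summary result is keyed by the ID string.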

Phase 2: Nucleotide Sequence Search and Retrieval

  • NCBI_search_nucleotide: query (free-form), organism (string), gene (string), strain (string), keywords (string), seq_type ("complete_genome"/"mRNA"/"refseq"), limit (int, default 20). Returns {status, data: {uids: [...], accessions: [...]}}.
  • NCBI_fetch_accessions: uids (array, REQUIRED; e.g., ["545778205"]). Returns {status, data: ["U00096.3"], count: 1}.
  • NCBI_get_sequence: accession (string, REQUIRED; e.g., "NM_007294"), format ("fasta"/"gb"/"embl"). Returns {status, data: "FASTA string...", accession, format, length}.
  • EnsemblSeq_get_region_sequence: region (string, REQUIRED; "chr:start-end", e.g., "17:7668421-7668520"), species (default "homo_sapiens"). Returns {status, data: {sequence, sequence_length}}.
  • ensembl_get_sequence: id (string, REQUIRED; Ensembl ID), type ("genomic"/"cds"/"cdna"/"protein"), multiple_sequences (bool). Returns sequence data.
Gotchas:
  • NCBI_search_nucleotide returns UIDs, not accessions. Use NCBI_fetch_accessions to convert.
  • NCBI_fetch_accessions requires uids (NOT accessions).
  • ensembl_get_sequence with a gene ID (ENSG) and type != "genomic" requires multiple_sequences=true. Use transcript IDs (ENST) for specific sequences.
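
The last gotcha can be encoded as a small argument builder. This is a sketch only: the function name is hypothetical, and the regular expression assumes that Ensembl gene IDs look like "ENS", an optional species code, "G", then digits (for example ENSG... for human), which is not stated in this document.

```python
import re

# Assumed layout of an Ensembl gene ID: ENS + optional species letters + G + digits.
GENE_ID_PATTERN = re.compile(r"^ENS[A-Z]*G\d+$")


def ensembl_sequence_args(ensembl_id, seq_type="genomic"):
    """Build keyword arguments for an ensembl_get_sequence call."""
    args = {"id": ensembl_id, "type": seq_type}
    # A gene can map to several transcripts, so any non-genomic type
    # requested for a gene ID needs multiple_sequences=True.
    if GENE_ID_PATTERN.match(ensembl_id.upper()) and seq_type != "genomic":
        args["multiple_sequences"] = True
    return args


print(ensembl_sequence_args("ENSG00000141510", "cds"))
print(ensembl_sequence_args("ENST00000269305", "cds"))
```

Transcript IDs pass through unchanged, which matches the advice to prefer ENST IDs when one specific sequence is wanted.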

Recipe: Get mRNA for a human gene

  1. NCBI_search_nucleotide(organism="Homo sapiens", gene="BRCA1", seq_type="mRNA", limit=5)
  2. NCBI_fetch_accessions(uids=[first_uid]) -> accession
  3. NCBI_get_sequence(accession=accession, format="fasta"), e.g., accession="NM_007294"
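
The recipe can be sketched as one function. The dispatcher `call_tool(name, **params)` is a hypothetical stand-in for however the tools are actually invoked (MCP, SDK, or otherwise), and `fake_call_tool` returns canned, made-up responses so the flow can be followed without network access.

```python
def fetch_first_mrna(call_tool, gene, organism="Homo sapiens"):
    """Search -> UID -> accession -> FASTA, following the recipe.

    `call_tool` is a caller-supplied function (hypothetical here) that
    returns the documented response dictionaries.
    """
    search = call_tool(
        "NCBI_search_nucleotide",
        organism=organism, gene=gene, seq_type="mRNA", limit=5,
    )
    uids = search.get("data", {}).get("uids", [])
    if not uids:
        return None  # nothing found: broaden the query

    # UIDs are not accessions, so convert before fetching the sequence.
    accessions = call_tool("NCBI_fetch_accessions", uids=uids[:1]).get("data", [])
    if not accessions:
        return None

    record = call_tool("NCBI_get_sequence", accession=accessions[0], format="fasta")
    return record.get("data")


def fake_call_tool(name, **params):
    """Canned responses for demonstration only; not real NCBI output."""
    canned = {
        "NCBI_search_nucleotide": {"status": "success", "data": {"uids": ["0000001"]}},
        "NCBI_fetch_accessions": {"status": "success", "data": ["NM_007294"], "count": 1},
        "NCBI_get_sequence": {"status": "success", "data": ">NM_007294 demo\nATGC"},
    }
    return canned[name]


print(fetch_first_mrna(fake_call_tool, "BRCA1"))
```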

Phase 3: Protein Sequence Retrieval

  • UniProt_get_sequence_by_accession: accession (string, REQUIRED; e.g., "P04637"). Returns {result: "MEEPQSDP..."}. Note: the response key is result, NOT data.
  • EnsemblSeq_get_id_sequence: ensembl_id (string, REQUIRED; e.g., "ENSP00000269305"), type ("protein"/"cdna"/"cds"). Returns {status, data: {ensembl_id, molecule, sequence, sequence_length}}.
  • UniProt_get_entry_by_accession: accession (string, REQUIRED). Returns the full protein annotation.
Gotchas:
  • UniProt_get_sequence_by_accession returns {result: "..."}, not {status, data}.
  • For Ensembl protein sequences, use ENSP IDs. For cDNA/CDS, use ENST IDs.
  • To find a UniProt accession from a gene: use NCBIDatasets_get_gene_by_symbol (has cross-refs).
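
Because the two sequence tools answer in different shapes, a small normaliser can keep downstream code uniform. The sketch below assumes only the two shapes documented above; the function name and sample values are illustrative.

```python
def extract_sequence(response):
    """Return the sequence string from either documented response shape."""
    # UniProt_get_sequence_by_accession: {"result": "MEEPQSDP..."}
    if "result" in response:
        return response["result"]
    # EnsemblSeq_get_id_sequence: {"status": ..., "data": {"sequence": ...}}
    data = response.get("data")
    if isinstance(data, dict):
        return data.get("sequence")
    return None


uniprot_style = {"result": "MEEPQSDP"}
ensembl_style = {"status": "success", "data": {"sequence": "MEEPQSDP", "sequence_length": 8}}
print(extract_sequence(uniprot_style) == extract_sequence(ensembl_style))
```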

Phase 4: Ortholog and Comparative Analysis

  • NCBIDatasets_get_orthologs: gene_id (string, REQUIRED; NCBI Gene ID, e.g., "7157"), page_size (int, default 20, max 100). Returns {status, data: [{gene_id, symbol, description, taxname, common_name, chromosomes}]}.
  • NCBIProtein_get_summary: id (string, REQUIRED; GI number or accession). Returns protein title, organism, length.
Gotcha: NCBIDatasets_get_orthologs requires an NCBI Gene ID (numeric string), not a gene symbol or Ensembl ID. Resolve via Phase 1 first.

Recipe: Compare orthologs

  1. NCBIGene_search(term="TP53[Symbol] AND Homo sapiens[Organism]") -> "7157"
  2. NCBIDatasets_get_orthologs(gene_id="7157", page_size=10) -> mouse Trp53, rat Tp53, etc.

Phase 5: Domain Architecture and Homology

  • InterPro_get_entries_for_protein: accession (UniProt ID). Returns InterPro domain/family/superfamily entries with positions.
  • Pfam_get_protein_annotations: accession (UniProt ID). Returns Pfam domain hits with exact residue coordinates and E-values.
  • BLAST_protein_search: sequence (amino acid string), database (default "swissprot"), limit. Returns homologs with alignment scores, identity, E-values.
  • EnsemblCompara_get_orthologues: gene (gene symbol, e.g., "CFTR"), species (e.g., "human"). User-friendly alternative to NCBIDatasets_get_orthologs; it accepts gene symbols directly.

Phase 6: Variant and Clinical Context

  • EnsemblVEP_annotate_hgvs: hgvs_notation (e.g., "NM_000492.4:c.1521_1523del"). Returns consequence, protein impact, genomic coordinates.
  • ClinVar_search_variants: gene (gene symbol). Returns variant count and IDs for clinical significance lookup.
  • PubMed_search_articles: query, limit. Literature context for gene/variant findings.

Tool Parameter Quick Reference

| Tool | Correct Param | Common Mistake |
|---|---|---|
| NCBIGene_search | term (with [Symbol] syntax) | query or gene |
| NCBIGene_get_summary | id (string) | Integer type |
| NCBI_fetch_accessions | uids (array) | accessions |
| NCBI_get_sequence | accession (string) | Passing UID |
| NCBIDatasets_get_orthologs | gene_id (string) | Gene symbol |
| EnsemblSeq_get_id_sequence | ensembl_id | id |
| ensembl_get_sequence | id + multiple_sequences | Omitting multiple_sequences for gene+CDS |
| UniProt_get_sequence_by_accession | accession | Response is result not data |
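
One way to catch these mistakes before a call is a small pre-flight check. The sketch below is an illustration built only from the quick reference above; `check_call` and `REQUIRED_PARAMS` are hypothetical names, not part of ToolUniverse.

```python
# Required parameter names taken from the quick reference above.
REQUIRED_PARAMS = {
    "NCBIGene_search": {"term"},
    "NCBIGene_get_summary": {"id"},
    "NCBI_fetch_accessions": {"uids"},
    "NCBI_get_sequence": {"accession"},
    "NCBIDatasets_get_orthologs": {"gene_id"},
    "EnsemblSeq_get_id_sequence": {"ensembl_id"},
    "ensembl_get_sequence": {"id"},
    "UniProt_get_sequence_by_accession": {"accession"},
}

# Parameters that must be passed as strings, not integers.
STRING_PARAMS = {"NCBIGene_get_summary": "id", "NCBIDatasets_get_orthologs": "gene_id"}


def check_call(tool, params):
    """Return a list of problems with a planned tool call (empty if none)."""
    problems = []
    missing = REQUIRED_PARAMS.get(tool, set()) - set(params)
    for name in sorted(missing):
        problems.append(f"{tool}: missing required parameter '{name}'")
    if tool == "NCBI_fetch_accessions" and not isinstance(params.get("uids", []), list):
        problems.append("NCBI_fetch_accessions: 'uids' must be an array")
    key = STRING_PARAMS.get(tool)
    if key in params and not isinstance(params[key], str):
        problems.append(f"{tool}: '{key}' must be a string")
    return problems


print(check_call("NCBIGene_search", {"query": "TP53"}))
print(check_call("NCBIGene_get_summary", {"id": 7157}))
```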

Fallbacks

  • Gene not found -> try NCBIDatasets_get_gene_by_symbol with explicit taxon
  • No accessions from search -> broaden query (remove strain/seq_type filters)
  • Ensembl error for gene+CDS -> use transcript ID (ENST) or set multiple_sequences=true
  • UniProt accession unknown -> NCBIDatasets_get_gene or UniProt_search for cross-refs
  • Ortholog search empty -> verify gene_id is numeric NCBI Gene ID

Sequence Analysis Reasoning (CRITICAL)

LOOK UP DON'T GUESS -- always fetch sequences, coordinates, and domain boundaries from databases. Do not reconstruct them from memory.

When to Use Which Tool

| Question Type | Tool Choice | Why |
|---|---|---|
| "Find similar sequences" | BLAST_protein_search | Homology search against databases; returns E-values and identity |
| "What domains does this protein have?" | InterPro_get_entries_for_protein or Pfam_get_protein_annotations | Domain architecture with exact residue coordinates |
| "Get the sequence of gene X" | NCBI_search_nucleotide -> NCBI_get_sequence | Nucleotide retrieval by gene name |
| "Compare orthologs" | NCBIDatasets_get_orthologs or EnsemblCompara_get_orthologues | Cross-species gene comparison |
| "What is the protein impact of variant X?" | EnsemblVEP_annotate_hgvs | Consequence prediction with protein coordinates |
| "Align two sequences" | BLAST (pairwise) | Quick pairwise comparison with scoring |

Reading Frame Selection Strategy

When translating a DNA sequence to protein:
  1. Do NOT guess the reading frame. Preferred: use the DNA_translate_reading_frames tool. Fallback: translate_dna.py, which tries all 3 frames automatically.
  2. The correct frame is usually the one with the LONGEST open reading frame (no premature stops).
  3. If the sequence starts with ATG, frame 1 is likely correct -- but verify.
  4. If all 3 frames have early stop codons, the sequence may be (a) non-coding, (b) reverse-oriented, or (c) affected by sequencing errors. Try the reverse complement first.
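
To make the "longest open reading frame" rule concrete, here is a minimal sketch of three-frame translation with the standard genetic code. It is an illustration of the selection rule, not the bundled translate_dna.py, whose internals are not shown here; it covers forward frames only, and ties go to the lowest-numbered frame. For real work, use the tool or script named above.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order map onto AMINO_ACIDS.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, offset):
    """Translate `dna` from a 0-based offset (0, 1, or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    # Codons with ambiguous bases (e.g. N) are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf_frame(dna):
    """Return (frame number 1-3, longest stop-free stretch) for forward frames."""
    best = (1, "")
    for offset in range(3):
        longest = max(translate(dna, offset).split("*"), key=len)
        if len(longest) > len(best[1]):
            best = (offset + 1, longest)
    return best


print(longest_orf_frame("ATGGTGACTAAGTTAGCC"))   # coding frame starts at base 1
print(longest_orf_frame("CATGGTGACTAAGTTAGCC"))  # same ORF shifted by one base
```

On very short inputs the longest stop-free stretch is a weak signal, which is why the list above still asks for verification.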

Protein Domain Interpretation

When asked about protein function or structure:
  1. Get the domain architecture first: InterPro_get_entries_for_protein returns all annotated domains with positions.
  2. Domain families indicate function: kinase domain = phosphorylation activity; SH2 domain = phosphotyrosine binding; zinc finger = DNA binding.
  3. Variants in conserved domains are more likely to be pathogenic than those in linker regions.
  4. LOOK UP domain boundaries from the database -- do not estimate positions from memory.

Reasoning for Protein Feature Questions

When asked "how many X residues in region Y of protein Z":
  1. Identify the correct protein. Gene names are ambiguous: GABAA has many subunits (GABRA1, GABRB2, GABRR1...). Read the question carefully for the specific subunit. Use proteins_api_search with gene name + "human" to find the right accession.
  2. Find the region boundaries. Use proteins_api_get_features with the accession to get annotated features (TRANSMEM, DOMAIN, REGION). Don't guess positions; get them from the database.
  3. Count residues in the region. Fetch the sequence, extract the region, count. WRITE Python code for this; don't try to count manually.
    • Residue counting strategy:
      python3 skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py --type count_region --accession P24046 --start 318 --end 440 --residue C
    • For residue counting questions, ALWAYS use the script or, with 1-based inclusive coordinates, sequence[start-1:end].count('C'). Do NOT estimate or count from memory.
  4. Account for multimers. READ THE QUESTION for "homomeric", "pentamer", "tetramer", "dimer". If the question asks about a homomeric receptor (e.g., "homomeric GABAAρ1"), every subunit is identical. Count the residues in ONE subunit, then multiply:
    • Homomeric pentamer (most ligand-gated ion channels like GABAA ρ1): × 5
    • Homotetramer (many ion channels): × 4
    • Homodimer: × 2
    If the question says "in the TM3-TM4 linker domains" (plural), it means across all subunits in the complex.
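
The counting and multiplication steps can be sketched as follows, assuming region boundaries are 1-based and inclusive, as feature annotations and the count_region mode use. The function names are illustrative and the sequence is a toy string, not a real protein.

```python
def count_in_region(sequence, start, end, residue):
    """Count `residue` between 1-based inclusive positions start..end."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must satisfy 1 <= start <= end <= len(sequence)")
    # 1-based inclusive [start, end] corresponds to the Python slice [start-1:end].
    return sequence[start - 1:end].count(residue)


def count_in_complex(sequence, start, end, residue, subunits=1):
    """Scale a single-subunit count to a homomeric complex."""
    return count_in_region(sequence, start, end, residue) * subunits


toy = "MACCDCKC"  # toy sequence for illustration only
print(count_in_region(toy, 2, 4, "C"))       # positions 2-4 are "ACC"
print(count_in_complex(toy, 2, 4, "C", 5))   # same region across a homopentamer
```

Slicing with `sequence[start:end]` on 1-based coordinates would silently drop the first residue of the region, so the `start - 1` adjustment matters.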

Bundled Computation Scripts

Never manually count residues, compute GC%, or write reverse-complement logic inline. Run these scripts instead — they are tested and handle edge cases.

biology_facts.py — Biology reference lookup

Script: skills/tooluniverse-sequence-analysis/scripts/biology_facts.py
Use this script to look up commonly-confused biology facts instead of relying on memory. It covers receptor types, ion channel stoichiometry, neurotransmitters, immune cell markers, and gene naming confusions.
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor --name "GABAA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type ion_channel --name "NMDA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type gene_confusion --name "GABRA1"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor  # list all entries
Types: receptor (stoichiometry, pharmacology), ion_channel (subunit arrangement), neurotransmitter (synthesis, receptors), immune_cell (markers, lineage), gene_confusion (commonly mixed-up genes like GABRA1 vs GABRR1).
Mandatory use: any question about receptor type/stoichiometry, immune cell markers, or gene name disambiguation.

amino_acids.py — Codon table, amino acid properties, wobble pairing

Script: skills/tooluniverse-sequence-analysis/scripts/amino_acids.py
Use this script for any question about the genetic code, codon degeneracy, amino acid chemistry, codon usage bias, or tRNA wobble pairing. All outputs are JSON.
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type codon_table
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --name "Cysteine"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code C
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code TRP
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid            # list all 20
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type count_codons --sequence "ATGCCCAAATTT..."
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "GAU"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"
Modes:
| --type | What it returns | Key fields |
|---|---|---|
| codon_table | All 64 codons grouped by amino acid | degeneracy, codons, human codon usage %, stop codon names, degeneracy distribution (1/2/3/4/6) |
| amino_acid | Properties of one or all amino acids | name, one_letter, three_letter, mw_da, pKa_side_chain, polarity, charge_ph7, hydrophobicity_index (Kyte-Doolittle), backbone_pKa, codons, degeneracy, rare_codons_le15pct |
| count_codons | Codon frequency analysis for a DNA sequence | codon_counts with AA annotation and human usage freq, amino_acid_composition, rare_codons_present |
| wobble | Codons recognised by a given anticodon | recognised_codons (RNA+DNA form, AA), synonymous_only, wobble rule explanation |
When to use (mandatory):
  • Any question about how many codons encode a given amino acid (degeneracy)
  • Any question about rare vs. common codons for protein expression optimisation
  • Any question about tRNA anticodon recognition / wobble base pairing
  • Any question about amino acid physical-chemical properties (MW, pKa, hydrophobicity, polarity, charge)
  • Any question about the names of stop codons (Amber/Ochre/Opal)
  • Before manually stating codon degeneracy, verify with codon_table
Wobble rules: I pairs U/C/A (3 codons); G pairs U/C; U pairs A/G; C pairs G only; A pairs U only (rare). Use --type wobble --anticodon "GAU" to verify.
Amino acid lookup: accepts full name (--name "Cysteine"), 1-letter (--code C), or 3-letter (--code CYS).

Codon-Anticodon Matching Reasoning (CRITICAL for tRNA problems)

When solving "which codons does this tRNA recognize" or "which tRNA reads this codon":
  1. Anticodons are conventionally written 5'->3', even though they pair with the codon in the opposite (antiparallel) orientation. The FIRST position of the anticodon (5' end) is the WOBBLE position and pairs with the THIRD position of the codon (3' end).
  2. Anticodon-codon pairing is ANTIPARALLEL: anticodon 5'-X-Y-Z-3' pairs with codon 3'-X'-Y'-Z'-5' (i.e., codon 5'-Z'-Y'-X'-3').
  3. Wobble position rules (anticodon 5' base -> codon 3' base it can pair with):
    • C -> G only (1 codon)
    • A -> U only (1 codon; rare in bacteria, common in mitochondria)
    • U -> A or G (2 codons)
    • G -> C or U (2 codons)
    • I (inosine, deaminated A) -> U, C, or A (3 codons)
  4. Minimum tRNA set: Because I reads 3 bases and G/U each read 2, a 4-codon family (e.g., GCN = Ala) needs only 2 tRNAs: one with I at the wobble position (reads the U-, C-, and A-ending codons) and one with C or U at the wobble position (reads the remaining G-ending codon).
  5. ALWAYS verify with the script rather than reasoning from memory: python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"
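
The pairing rules above can be written out as a short function. This is an illustrative sketch of the stated rules only, not the logic of amino_acids.py; it expects a three-base anticodon given 5'->3' and allows inosine (I) only at the wobble position.

```python
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}
# Anticodon 5' (wobble) base -> codon 3' bases it can read, per the rules above.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def recognized_codons(anticodon):
    """Return the codons (5'->3', RNA) read by an anticodon written 5'->3'."""
    wobble_base, middle, last = anticodon.upper()
    # Antiparallel pairing: anticodon position 3 pairs with codon position 1,
    # and anticodon position 2 pairs with codon position 2.
    first = WATSON_CRICK[last]
    second = WATSON_CRICK[middle]
    return [first + second + third for third in WOBBLE[wobble_base]]


print(recognized_codons("GAU"))  # G at wobble reads C or U in codon position 3
print(recognized_codons("IAU"))  # I at wobble reads U, C, or A
```

For anticodon GAU this gives AUC and AUU; swapping G for I adds AUA, which is the "reads 3 of 4" case from the minimum tRNA set rule.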

translate_dna.py — DNA to protein translation

Preferred: use the DNA_translate_reading_frames tool (via MCP/SDK) with the sequence parameter. Fallback: run translate_dna.py directly.
python3 skills/tooluniverse-sequence-analysis/scripts/translate_dna.py "ATGCCC..."
Tries all 3 reading frames and picks the longest ORF automatically.

sequence_tools.py — Residue counting, GC content, reverse complement, stats

Script: skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py
Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script:
  • Sequence_count_residues tool -- Count residues in a sequence or region. Fallback: sequence_tools.py --type count_residues or --type count_region
  • Sequence_gc_content tool -- GC% of DNA. Fallback: sequence_tools.py --type gc_content
  • Sequence_reverse_complement tool -- DNA reverse complement. Fallback: sequence_tools.py --type reverse_complement
  • Sequence_stats tool -- Auto-detect type, length, MW. Fallback: sequence_tools.py --type stats
Fallback script modes (use --type):
  • count_residues: Count a residue in the full sequence. --sequence "ACDE..." --residue C
  • count_region: Count in a region (1-based inclusive). --sequence "MAC..." --start 5 --end 20 --residue C OR --accession P24046 --start 318 --end 440 --residue C (fetches from UniProt live)
  • gc_content: GC% of DNA. --sequence "ATGCGATCG"
  • reverse_complement: DNA reverse complement. --sequence "ATGCGATCG"
  • stats: Auto-detect DNA/RNA/Protein; compute length, and MW for protein. --sequence "ATGCG..."
ALWAYS use count_region --accession when the user gives a UniProt accession + region -- do not count manually.
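
When the fallback script has to be driven from Python, building the argument list explicitly avoids quoting mistakes. The sketch below only assembles the command from the flags listed above; the helper name is hypothetical, and the script's output format is not documented here, so nothing is parsed.

```python
import shlex

SCRIPT = "skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py"


def count_region_command(start, end, residue, accession=None, sequence=None):
    """Build the argv list for the count_region mode (does not run it)."""
    if (accession is None) == (sequence is None):
        raise ValueError("pass exactly one of accession or sequence")
    source = ["--accession", accession] if accession else ["--sequence", sequence]
    return [
        "python3", SCRIPT, "--type", "count_region", *source,
        "--start", str(start), "--end", str(end), "--residue", residue,
    ]


cmd = count_region_command(318, 440, "C", accession="P24046")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` from the repository root; the --accession form fetches the sequence from UniProt live, so it needs network access.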

Interpretation Framework

Sequence Quality Assessment

| Indicator | High Quality | Acceptable | Caution |
|---|---|---|---|
| RefSeq status | NM_/NP_ (curated) | XM_/XP_ (predicted) | No RefSeq (GenBank only) |
| Sequence version | Latest version (.N) | Previous version | Removed/replaced |
| Annotation | Reviewed (UniProt Swiss-Prot) | Unreviewed (TrEMBL) | No annotation |
| Gene symbol | HGNC approved | Alias/synonym | Locus tag only |
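
The RefSeq status row can be applied mechanically from the accession prefix. This sketch handles only the four prefixes named in the table; any other accession, including other RefSeq prefixes not listed here, is flagged for a manual check rather than classified. The function name is illustrative.

```python
def refseq_tier(accession):
    """Classify an accession by the RefSeq prefixes listed in the table."""
    prefix = accession.upper()[:3]
    if prefix in ("NM_", "NP_"):
        return "high quality (curated)"
    if prefix in ("XM_", "XP_"):
        return "acceptable (predicted)"
    # Not one of the four prefixes in the table: do not guess.
    return "check manually"


for acc in ("NM_007294", "XM_000001", "U00096.3"):  # XM_000001 is a made-up example
    print(acc, "->", refseq_tier(acc))
```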

Synthesis Questions

  1. Is this the correct sequence? (verify organism, gene symbol, isoform)
  2. Is it the canonical isoform? (RefSeq MANE Select or UniProt canonical)
  3. How well-annotated is it? (SwissProt > TrEMBL > GenBank predicted)
  4. Are there known variants? (ClinVar pathogenic variants in this sequence)

Answer Formatting (CRITICAL)

TRIM YOUR ANSWER: If the question asks "what protein", answer with JUST the protein name. Do not add parenthetical abbreviations, descriptions, or qualifications. Example: answer "Glucose-6-phosphate 1-dehydrogenase", NOT "Glucose-6-phosphate 1-dehydrogenase (G6PD, EC 1.1.1.49)". When identifying a protein from a sequence, use BLAST/UniProt and report the top hit name exactly as it appears in the database — no embellishment.

Peptide & Foldamer Structure

  • Alpha-peptide helices: alpha-helix (3.6 res/turn, i->i+4 H-bonds), 3_10-helix (3 res/turn, i->i+3), pi-helix (4.4 res/turn, i->i+5).
  • Beta-peptide helices: named by H-bond ring size. 14-helix (i->i+2, 14-membered rings), 12-helix, 10-helix, 8-helix.
  • Beta-amino acid ring size determines helix type: 4-membered cyclic constraint -> 10-helix; 5-membered (e.g., ACPC) -> 12-helix; 6-membered (e.g., ACHC) -> 14-helix. Acyclic beta3-residues default to 14-helix.
  • Mixed alpha/beta foldamers (1:1 alternation): form 11-helix (i->i+3, 11-atom rings) or 14/15-helix (i->i+4, alternating 14- and 15-atom rings). Longer sequences prefer the 14/15-helix.
  • Key rule: the number in the helix name = number of atoms in the hydrogen-bonded ring.
  • Cyclic beta-amino acids (ACPC, ACHC) constrain backbone torsion angles, favoring specific helix types over acyclic residues.

Limitations

  • ensembl_get_sequence: gene IDs with a non-genomic type need multiple_sequences=true
  • NCBIDatasets_get_orthologs requires an NCBI Gene ID (not a symbol)
  • UniProt returns the canonical isoform only