tooluniverse-sequence-analysis

Biological Sequence Analysis

Retrieve, annotate, and compare biological sequences from NCBI, Ensembl, and UniProt. Covers nucleotide search, sequence fetching, gene summaries, ortholog discovery, and protein sequence extraction.

When to Use

"Get the mRNA sequence for BRCA1"
"Search NCBI for E. coli K-12 complete genome"
"Find orthologs of TP53 across species"
"Fetch the protein sequence for UniProt P04637"
"Get the CDS sequence for Ensembl transcript ENST00000269305"

Workflow

Input -> Phase 1: Gene ID resolution -> Phase 2: Nucleotide retrieval
      -> Phase 3: Protein sequences -> Phase 4: Orthologs
      -> Phase 5: Domains and homology -> Phase 6: Variant context -> Output

Phase 1: Gene Identification and Summary

NCBIGene_search: `term` (string REQUIRED, format `"TP53[Symbol] AND Homo sapiens[Organism]"`), `retmax` (int, default 10). Returns `{status, data: {esearchresult: {idlist: ["7157"]}}}`.

NCBIGene_get_summary: `id` (string REQUIRED, e.g., "7157"). Returns `{status, data: {result: {"7157": {name, description, summary, chromosome, maplocation, genomicinfo, mim}}}}`. Result is keyed by gene ID string.

NCBIDatasets_get_gene_by_symbol: `symbol` (string REQUIRED, e.g., "BRCA1"), `taxon` (string, e.g., "human"). Returns gene ID, description, location, cross-references.

NCBIDatasets_get_gene: `gene_id` (string REQUIRED, e.g., "7157"). Returns comprehensive gene info.
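
The response shapes above can be unpacked as in the following minimal sketch. The helper names and the `"success"` status value are assumptions for illustration, and the payloads are hand-written stand-ins shaped like the documented responses, not live data.

```python
from typing import Optional


def first_gene_id(search_response: dict) -> Optional[str]:
    """Return the first Gene ID from an NCBIGene_search-style response, or None."""
    id_list = (
        search_response.get("data", {}).get("esearchresult", {}).get("idlist", [])
    )
    return str(id_list[0]) if id_list else None


def gene_summary(summary_response: dict, gene_id) -> dict:
    """Look up one gene in an NCBIGene_get_summary-style response.

    The result mapping is keyed by the gene ID as a string, so an integer
    ID is converted before the lookup.
    """
    return summary_response.get("data", {}).get("result", {}).get(str(gene_id), {})


# Hand-written payloads shaped like the documented responses (assumed values).
search = {"status": "success", "data": {"esearchresult": {"idlist": ["7157"]}}}
summary = {"status": "success", "data": {"result": {"7157": {"name": "TP53"}}}}

gene_id = first_gene_id(search)
print(gene_id, gene_summary(summary, 7157)["name"])
```

Converting the ID with `str()` before the lookup avoids the integer-key mistake listed in the quick reference.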

Phase 2: Nucleotide Sequence Search and Retrieval

NCBI_search_nucleotide: `query` (free-form), `organism` (string), `gene` (string), `strain` (string), `keywords` (string), `seq_type` ("complete_genome"/"mRNA"/"refseq"), `limit` (int, default 20). Returns `{status, data: {uids: [...], accessions: [...]}}`.

NCBI_fetch_accessions: `uids` (array REQUIRED, e.g., ["545778205"]). Returns `{status, data: ["U00096.3"], count: 1}`.

NCBI_get_sequence: `accession` (string REQUIRED, e.g., "NM_007294"), `format` ("fasta"/"gb"/"embl"). Returns `{status, data: "FASTA string...", accession, format, length}`.

EnsemblSeq_get_region_sequence: `region` (string REQUIRED, "chr:start-end", e.g., "17:7668421-7668520"), `species` (default "homo_sapiens"). Returns `{status, data: {sequence, sequence_length}}`.

ensembl_get_sequence: `id` (string REQUIRED, Ensembl ID), `type` ("genomic"/"cds"/"cdna"/"protein"), `multiple_sequences` (bool). Returns sequence data.

Gotchas:

NCBI_search_nucleotide returns UIDs, not accessions. Use NCBI_fetch_accessions to convert.
NCBI_fetch_accessions requires `uids` (NOT `accessions`).
ensembl_get_sequence with gene ID (ENSG) + type != "genomic" requires `multiple_sequences=true`. Use transcript IDs (ENST) for specific sequences.
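
The `multiple_sequences` gotcha can be encoded once so callers do not have to remember it. This is a sketch only: `ensembl_sequence_args` is a hypothetical helper, and the `ENSG` prefix check assumes human-style IDs (other species carry a species infix).

```python
def ensembl_sequence_args(ensembl_id: str, seq_type: str = "genomic") -> dict:
    """Build keyword arguments for an ensembl_get_sequence-style call."""
    args = {"id": ensembl_id, "type": seq_type}
    # A gene has several transcripts, so any non-genomic request on a gene ID
    # can return more than one sequence and needs multiple_sequences=True.
    if ensembl_id.upper().startswith("ENSG") and seq_type != "genomic":
        args["multiple_sequences"] = True
    return args


print(ensembl_sequence_args("ENSG00000141510", "cds"))
print(ensembl_sequence_args("ENST00000269305", "cds"))
```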

Recipe: Get mRNA for a human gene

`NCBI_search_nucleotide(organism="Homo sapiens", gene="BRCA1", seq_type="mRNA", limit=5)` -> UIDs
`NCBI_fetch_accessions(uids=[first_uid])` -> accession
`NCBI_get_sequence(accession="NM_007294", format="fasta")`
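
The three steps chain naturally. The sketch below assumes a hypothetical `call_tool(tool_name, **params)` dispatcher that returns the documented response dicts; the stub stands in for a real client and returns made-up placeholder payloads.

```python
from typing import Any, Callable

# Hypothetical dispatcher signature: call_tool(tool_name, **params) -> dict
ToolCaller = Callable[..., dict]


def fetch_mrna_fasta(call_tool: ToolCaller, gene: str,
                     organism: str = "Homo sapiens") -> str:
    """Search -> UID -> accession -> FASTA, following the recipe order."""
    search = call_tool("NCBI_search_nucleotide", organism=organism,
                       gene=gene, seq_type="mRNA", limit=5)
    uids = search["data"]["uids"]
    if not uids:
        raise LookupError(f"no nucleotide records found for {gene}")
    # The parameter is `uids`, not `accessions`.
    accessions = call_tool("NCBI_fetch_accessions", uids=[uids[0]])["data"]
    record = call_tool("NCBI_get_sequence", accession=accessions[0], format="fasta")
    return record["data"]


def fake_call_tool(tool_name: str, **params: Any) -> dict:
    """Stub client with placeholder payloads (not real database content)."""
    canned = {
        "NCBI_search_nucleotide": {"status": "success", "data": {"uids": ["000000001"]}},
        "NCBI_fetch_accessions": {"status": "success", "data": ["NM_007294"], "count": 1},
        "NCBI_get_sequence": {"status": "success", "data": ">NM_007294 placeholder\nATGC"},
    }
    return canned[tool_name]


print(fetch_mrna_fasta(fake_call_tool, "BRCA1").splitlines()[0])
```

Passing the accession returned by step 2 into step 3, rather than a hard-coded one, keeps the chain valid for genes other than BRCA1.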

Phase 3: Protein Sequence Retrieval

UniProt_get_sequence_by_accession: `accession` (string REQUIRED, e.g., "P04637"). Returns `{result: "MEEPQSDP..."}`. Note: response key is `result`, NOT `data`.

EnsemblSeq_get_id_sequence: `ensembl_id` (string REQUIRED, e.g., "ENSP00000269305"), `type` ("protein"/"cdna"/"cds"). Returns `{status, data: {ensembl_id, molecule, sequence, sequence_length}}`.

UniProt_get_entry_by_accession: `accession` (string REQUIRED). Full protein annotation.

Gotchas:

UniProt_get_sequence_by_accession returns `{result: "..."}`, not `{status, data}`.
For Ensembl protein seqs, use ENSP IDs. For cDNA/CDS, use ENST IDs.
To find UniProt accession from gene: use NCBIDatasets_get_gene_by_symbol (has cross-refs).
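
Because the sequence tools return different envelopes, a small normaliser avoids the `result` vs `data` mix-up. This is a sketch under the response shapes documented above; `extract_sequence` is a hypothetical helper name.

```python
def extract_sequence(response: dict) -> str:
    """Pull the sequence string out of any of the documented response shapes."""
    if "result" in response:
        # UniProt_get_sequence_by_accession: {result: "MEEPQSDP..."}
        return response["result"]
    data = response.get("data")
    if isinstance(data, dict) and "sequence" in data:
        # EnsemblSeq_* tools: {status, data: {sequence, sequence_length, ...}}
        return data["sequence"]
    if isinstance(data, str):
        # NCBI_get_sequence: {status, data: "FASTA string..."}
        return data
    raise KeyError("no sequence found in response")


print(extract_sequence({"result": "MEEPQSDP"}))
print(extract_sequence({"status": "success", "data": {"sequence": "MEEP", "sequence_length": 4}}))
```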

Phase 4: Ortholog and Comparative Analysis

NCBIDatasets_get_orthologs: `gene_id` (string REQUIRED, NCBI Gene ID e.g., "7157"), `page_size` (int, default 20, max 100). Returns `{status, data: [{gene_id, symbol, description, taxname, common_name, chromosomes}]}`.

NCBIProtein_get_summary: `id` (string REQUIRED, GI number or accession). Returns protein title, organism, length.

Gotcha: NCBIDatasets_get_orthologs requires NCBI Gene ID (numeric string), not gene symbol or Ensembl ID. Resolve via Phase 1 first.
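
A cheap pre-flight check catches the symbol-instead-of-ID mistake before the call is made. A minimal sketch; `is_ncbi_gene_id` is a hypothetical helper, not part of any tool.

```python
def is_ncbi_gene_id(value) -> bool:
    """True only for a string made of ASCII digits, e.g. "7157"."""
    return isinstance(value, str) and value.isascii() and value.isdigit()


for candidate in ("7157", "TP53", "ENSG00000141510", 7157):
    print(repr(candidate), is_ncbi_gene_id(candidate))
```

An integer is rejected on purpose, since the parameter is documented as a string.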

Recipe: Compare orthologs

`NCBIGene_search(term="TP53[Symbol] AND Homo sapiens[Organism]")` -> "7157"
`NCBIDatasets_get_orthologs(gene_id="7157", page_size=10)` -> mouse Trp53, rat Tp53, etc.

Phase 5: Domain Architecture and Homology

InterPro_get_entries_for_protein: `accession` (UniProt ID). Returns InterPro domain/family/superfamily entries with positions.

Pfam_get_protein_annotations: `accession` (UniProt ID). Returns Pfam domain hits with exact residue coordinates and E-values.

BLAST_protein_search: `sequence` (amino acid string), `database` (default "swissprot"), `limit`. Returns homologs with alignment scores, identity, E-values.

EnsemblCompara_get_orthologues: `gene` (gene symbol, e.g., "CFTR"), `species` (e.g., "human"). User-friendly alternative to NCBIDatasets_get_orthologs: accepts gene symbols directly.

Phase 6: Variant and Clinical Context

EnsemblVEP_annotate_hgvs: `hgvs_notation` (e.g., "NM_000492.4:c.1521_1523del"). Returns consequence, protein impact, genomic coordinates.

ClinVar_search_variants: `gene` (gene symbol). Returns variant count and IDs for clinical significance lookup.

PubMed_search_articles: `query`, `limit`. Literature context for gene/variant findings.

Tool Parameter Quick Reference

Tool	Correct Param	Common Mistake
NCBIGene_search	`term` (with [Symbol] syntax)	`query` or `gene`
NCBIGene_get_summary	`id` (string)	Integer type
NCBI_fetch_accessions	`uids` (array)	`accessions`
NCBI_get_sequence	`accession` (string)	Passing UID
NCBIDatasets_get_orthologs	`gene_id` (string)	Gene symbol
EnsemblSeq_get_id_sequence	`ensembl_id`	`id`
ensembl_get_sequence	`id` + `multiple_sequences`	Omitting multiple_sequences for gene+CDS
UniProt_get_sequence_by_accession	`accession`	Response is `result` not `data`
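
Several rows of the table can be turned into a pre-call lint. The sketch below covers only the pure parameter-name and type mistakes; `check_params` and `RENAMES` are hypothetical names introduced for illustration.

```python
from typing import List

# Wrong parameter name -> correct name, taken from the table above.
RENAMES = {
    "NCBIGene_search": {"query": "term", "gene": "term"},
    "NCBI_fetch_accessions": {"accessions": "uids"},
    "EnsemblSeq_get_id_sequence": {"id": "ensembl_id"},
}


def check_params(tool: str, params: dict) -> List[str]:
    """Return human-readable warnings for known parameter mistakes."""
    problems = []
    for wrong, right in RENAMES.get(tool, {}).items():
        if wrong in params:
            problems.append(f"{tool}: use `{right}` instead of `{wrong}`")
    if tool == "NCBIGene_get_summary" and not isinstance(params.get("id"), str):
        problems.append("NCBIGene_get_summary: `id` must be a string")
    return problems


print(check_params("NCBI_fetch_accessions", {"accessions": ["U00096.3"]}))
print(check_params("NCBIGene_get_summary", {"id": 7157}))
```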

Fallbacks

Gene not found -> try NCBIDatasets_get_gene_by_symbol with explicit taxon
No accessions from search -> broaden query (remove strain/seq_type filters)
Ensembl error for gene+CDS -> use transcript ID (ENST) or set multiple_sequences=true
UniProt accession unknown -> NCBIDatasets_get_gene or UniProt_search for cross-refs
Ortholog search empty -> verify gene_id is numeric NCBI Gene ID

Sequence Analysis Reasoning (CRITICAL)

LOOK UP DON'T GUESS -- always fetch sequences, coordinates, and domain boundaries from databases. Do not reconstruct them from memory.

When to Use Which Tool

Question Type	Tool Choice	Why
"Find similar sequences"	BLAST_protein_search	Homology search against databases; returns E-values and identity
"What domains does this protein have?"	InterPro_get_entries_for_protein or Pfam_get_protein_annotations	Domain architecture with exact residue coordinates
"Get the sequence of gene X"	NCBI_search_nucleotide -> NCBI_get_sequence	Nucleotide retrieval by gene name
"Compare orthologs"	NCBIDatasets_get_orthologs or EnsemblCompara_get_orthologues	Cross-species gene comparison
"What is the protein impact of variant X?"	EnsemblVEP_annotate_hgvs	Consequence prediction with protein coordinates
"Align two sequences"	BLAST (pairwise)	Quick pairwise comparison with scoring

Reading Frame Selection Strategy

When translating a DNA sequence to protein:

Do NOT guess the reading frame -- preferred: use the `DNA_translate_reading_frames` tool; fallback: `translate_dna.py`, which tries all 3 frames automatically
The correct frame is the one with the LONGEST open reading frame (no premature stops)
If the sequence starts with ATG, frame 1 is likely correct -- but verify
If all 3 frames have early stop codons, the sequence may be (a) non-coding, (b) given in reverse orientation, or (c) affected by sequencing errors. Try the reverse complement first.
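
To make the frame-selection rule concrete, here is a minimal sketch of a three-frame translation using the standard genetic code. It is not the implementation of `DNA_translate_reading_frames` or `translate_dna.py`; it assumes "longest ORF" means the longest stop-free stretch in a frame, and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, offset: int = 0) -> str:
    """Translate from `offset` (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf_by_frame(dna: str) -> dict:
    """Map frame number (1-3) to its longest stretch without a stop codon."""
    return {
        offset + 1: max(translate(dna, offset).split("*"), key=len)
        for offset in range(3)
    }


def pick_frame(dna: str) -> tuple:
    """Return (frame, peptide) for the frame with the longest stop-free stretch."""
    orfs = longest_orf_by_frame(dna)
    frame = max(orfs, key=lambda f: len(orfs[f]))
    return frame, orfs[frame]


print(longest_orf_by_frame("ATGCTGACTGAAGGCTAA"))
print(pick_frame("CATGCTGACTGAAGGCTAA"))
```

The second call shifts the same coding sequence by one base, so the winning frame moves from 1 to 2. Ties between frames resolve to the lowest frame number in this sketch, which is one reason to verify the result against the tool.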

Protein Domain Interpretation

When asked about protein function or structure:

Get domain architecture first: `InterPro_get_entries_for_protein` returns all annotated domains with positions
Domain families indicate function: Kinase domain = phosphorylation activity; SH2 domain = phosphotyrosine binding; zinc finger = DNA binding
Variants in conserved domains are more likely pathogenic than those in linker regions
LOOK UP domain boundaries from the database -- do not estimate positions from memory

Reasoning for Protein Feature Questions

When asked "how many X residues in region Y of protein Z":

Identify the correct protein: gene names are ambiguous. GABAA has many subunits (GABRA1, GABRB2, GABRR1...). Read the question carefully for the specific subunit. Use `proteins_api_search` with gene name + "human" to find the right accession.
Find the region boundaries: use `proteins_api_get_features` with the accession to get annotated features (TRANSMEM, DOMAIN, REGION). Don't guess positions; get them from the database.
Count residues in the region: fetch the sequence, extract the region, count. WRITE Python code for this; don't try to count manually.
- Residue Counting Strategy:
```
python3 skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py --type count_region --accession P24046 --start 318 --end 440 --residue C
```
- For residue counting questions, ALWAYS use the script or `sequence[start - 1:end].count('C')`. Database coordinates are 1-based and inclusive, while Python slices are 0-based and end-exclusive, so the start must be shifted by one. Do NOT estimate or count from memory.
Account for multimers: READ THE QUESTION for "homomeric", "pentamer", "tetramer", "dimer". If the question asks about a homomeric receptor (e.g., "homomeric GABAAρ1"), every subunit is identical. Count the residues in ONE subunit, then multiply:
- Homomeric pentamer (most ligand-gated ion channels like GABAA ρ1): × 5
- Homotetramer (many ion channels): × 4
- Homodimer: × 2
If the question says "in the TM3-TM4 linker domains" (plural), it means across all subunits in the complex.
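
The counting and multimer steps can be sketched in a few lines. The helper names and the toy sequence are illustrative assumptions; real boundaries and sequences should come from the database lookups described above.

```python
SUBUNITS = {"monomer": 1, "homodimer": 2, "homotetramer": 4, "homopentamer": 5}


def count_in_region(sequence: str, start: int, end: int, residue: str) -> int:
    """Count `residue` between 1-based inclusive positions `start` and `end`."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region must satisfy 1 <= start <= end <= len(sequence)")
    # 1-based inclusive [start, end] maps to the Python slice [start - 1:end].
    return sequence[start - 1:end].upper().count(residue.upper())


def count_in_complex(sequence: str, start: int, end: int, residue: str,
                     assembly: str = "monomer") -> int:
    """Multiply the single-subunit count by the number of identical subunits."""
    return count_in_region(sequence, start, end, residue) * SUBUNITS[assembly]


toy = "MACDCCKC"  # made-up sequence, positions 2-5 are "ACDC"
print(count_in_region(toy, 2, 5, "C"))
print(count_in_complex(toy, 2, 5, "C", "homopentamer"))
```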

Bundled Computation Scripts

Never manually count residues, compute GC%, or write reverse-complement logic inline. Run these scripts instead — they are tested and handle edge cases.

biology_facts.py — Biology reference lookup

Script: `skills/tooluniverse-sequence-analysis/scripts/biology_facts.py`

Use this script to look up commonly-confused biology facts instead of relying on memory. It covers receptor types, ion channel stoichiometry, neurotransmitters, immune cell markers, and gene naming confusions.

python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor --name "GABAA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type ion_channel --name "NMDA"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type gene_confusion --name "GABRA1"
python3 skills/tooluniverse-sequence-analysis/scripts/biology_facts.py --type receptor  # list all entries

Types: `receptor` (stoichiometry, pharmacology), `ion_channel` (subunit arrangement), `neurotransmitter` (synthesis, receptors), `immune_cell` (markers, lineage), `gene_confusion` (commonly mixed-up genes like GABRA1 vs GABRR1).

Mandatory use: any question about receptor type/stoichiometry, immune cell markers, or gene name disambiguation.

amino_acids.py — Codon table, amino acid properties, wobble pairing

Script: `skills/tooluniverse-sequence-analysis/scripts/amino_acids.py`

Use this script for any question about the genetic code, codon degeneracy, amino acid chemistry, codon usage bias, or tRNA wobble pairing. All outputs are JSON.

python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type codon_table
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --name "Cysteine"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code C
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid --code TRP
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type amino_acid            # list all 20
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type count_codons --sequence "ATGCCCAAATTT..."
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "GAU"
python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"

Modes:

`--type`	What it returns	Key fields
`codon_table`	All 64 codons grouped by amino acid	degeneracy, codons, human codon usage %, stop codon names, degeneracy distribution (1/2/3/4/6)
`amino_acid`	Properties of one or all amino acids	name, one_letter, three_letter, mw_da, pKa_side_chain, polarity, charge_ph7, hydrophobicity_index (Kyte-Doolittle), backbone_pKa, codons, degeneracy, rare_codons_le15pct
`count_codons`	Codon frequency analysis for a DNA sequence	codon_counts with AA annotation and human usage freq, amino_acid_composition, rare_codons_present
`wobble`	Codons recognised by a given anticodon	recognised_codons (RNA+DNA form, AA), synonymous_only, wobble rule explanation

When to use (mandatory):

Any question about how many codons encode a given amino acid (degeneracy)
Any question about rare vs. common codons for protein expression optimisation
Any question about tRNA anticodon recognition / wobble base pairing
Any question about amino acid physical-chemical properties (MW, pKa, hydrophobicity, polarity, charge)
Any question about the names of stop codons (Amber/Ochre/Opal)
Before manually stating codon degeneracy, verify with `codon_table`

Wobble rules: I pairs U/C/A (3 codons); G pairs U/C; U pairs A/G; C pairs G only; A pairs U only (rare). Use `--type wobble --anticodon "GAU"` to verify.

Amino acid lookup: accepts full name (`--name "Cysteine"`), 1-letter (`--code C`), or 3-letter (`--code CYS`).

Codon-Anticodon Matching Reasoning (CRITICAL for tRNA problems)

When solving "which codons does this tRNA recognize" or "which tRNA reads this codon":

Anticodons are conventionally written 5'->3', although they appear 3'->5' when drawn aligned under the codon. The FIRST position of the anticodon (5' end) is the WOBBLE position and pairs with the THIRD position of the codon (3' end).
Anticodon-codon pairing is ANTIPARALLEL: anticodon 5'-X-Y-Z-3' pairs with codon 3'-X'-Y'-Z'-5' (i.e., codon 5'-Z'-Y'-X'-3').
Wobble position rules (anticodon 5' base -> codon 3' base it can pair with):
- C -> G only (1 codon)
- A -> U only (1 codon; rare in bacteria, common in mitochondria)
- U -> A or G (2 codons)
- G -> C or U (2 codons)
- I (inosine, deaminated A) -> U, C, or A (3 codons)
Minimum tRNA set: Because I reads 3 bases and G/U each read 2, a 4-codon family (e.g., GCN = Ala) needs only 2 tRNAs: one with I at wobble position (reads 3 of 4 codons) and one with C or U at wobble (reads the remaining 1-2).

ALWAYS use the script `python3 skills/tooluniverse-sequence-analysis/scripts/amino_acids.py --type wobble --anticodon "IAU"` to verify rather than reasoning from memory.
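
The pairing rules above can be written out as a short sketch, useful for seeing how the antiparallel orientation and the wobble table combine. It encodes only the rules listed in this section and is not the script's implementation; `recognised_codons` is a hypothetical helper and the bundled script remains the reference.

```python
# Anticodon 5' (wobble) base -> codon 3' bases it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}
# Strict Watson-Crick pairing for the other two positions (RNA alphabet).
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def recognised_codons(anticodon: str) -> list:
    """Codons (5'->3', RNA) read by an anticodon given 5'->3'."""
    wobble, middle, last = anticodon.upper()
    # Antiparallel pairing: anticodon position 3 pairs with codon position 1,
    # position 2 with position 2, and the wobble base with codon position 3.
    first = WATSON_CRICK[last]
    second = WATSON_CRICK[middle]
    return [first + second + third for third in WOBBLE[wobble]]


print(recognised_codons("GAU"))  # ['AUC', 'AUU']
print(recognised_codons("IAU"))  # ['AUU', 'AUC', 'AUA']
```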

translate_dna.py — DNA to protein translation

Preferred: use the `DNA_translate_reading_frames` tool (via MCP/SDK) with the `sequence` parameter. Fallback: run `translate_dna.py` directly.

python3 skills/tooluniverse-sequence-analysis/scripts/translate_dna.py "ATGCCC..."

Tries all 3 reading frames, picks longest ORF automatically.

sequence_tools.py — Residue counting, GC content, reverse complement, stats

Script: `skills/tooluniverse-sequence-analysis/scripts/sequence_tools.py`

Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script:

`Sequence_count_residues` tool -- Count residues in a sequence or region. Fallback: `sequence_tools.py --type count_residues` or `--type count_region`
`Sequence_gc_content` tool -- GC% of DNA. Fallback: `sequence_tools.py --type gc_content`
`Sequence_reverse_complement` tool -- DNA reverse complement. Fallback: `sequence_tools.py --type reverse_complement`
`Sequence_stats` tool -- Auto-detect type, length, MW. Fallback: `sequence_tools.py --type stats`

Fallback script modes (use `--type`):

`count_residues`: Count residue in full sequence. `--sequence "ACDE..." --residue C`
`count_region`: Count in region (1-based inclusive). `--sequence "MAC..." --start 5 --end 20 --residue C` or `--accession P24046 --start 318 --end 440 --residue C` (fetches from UniProt live)
`gc_content`: GC% of DNA. `--sequence "ATGCGATCG"`
`reverse_complement`: DNA reverse complement. `--sequence "ATGCGATCG"`
`stats`: Auto-detect DNA/RNA/Protein, compute length, MW for protein. `--sequence "ATGCG..."`

ALWAYS use `count_region --accession` when the user gives a UniProt accession + region -- do not count manually.

Interpretation Framework

Sequence Quality Assessment

Indicator	High Quality	Acceptable	Caution
RefSeq status	NM_/NP_ (curated)	XM_/XP_ (predicted)	No RefSeq (GenBank only)
Sequence version	Latest version (.N)	Previous version	Removed/replaced
Annotation	Reviewed (UniProt Swiss-Prot)	Unreviewed (TrEMBL)	No annotation
Gene symbol	HGNC approved	Alias/synonym	Locus tag only
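
The RefSeq row of the table reduces to a prefix check. The sketch below covers only the prefixes named in that row; `refseq_tier` is a hypothetical helper, other RefSeq prefixes fall through to the last branch, and `XM_000000` is a made-up accession used purely as input.

```python
def refseq_tier(accession: str) -> str:
    """Classify an accession by the RefSeq prefixes listed in the table."""
    prefix = accession.upper()[:3]
    if prefix in ("NM_", "NP_"):
        return "high quality (curated RefSeq)"
    if prefix in ("XM_", "XP_"):
        return "acceptable (predicted RefSeq)"
    return "caution (prefix not covered by the table)"


for acc in ("NM_007294", "XM_000000", "U00096.3"):
    print(acc, "->", refseq_tier(acc))
```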

Synthesis Questions

Is this the correct sequence? (verify organism, gene symbol, isoform)
Is it the canonical isoform? (RefSeq MANE Select or UniProt canonical)
How well-annotated is it? (Swiss-Prot > TrEMBL > GenBank predicted)
Are there known variants? (ClinVar pathogenic variants in this sequence)

Answer Formatting (CRITICAL)

TRIM YOUR ANSWER: If the question asks "what protein", answer with JUST the protein name. Do not add parenthetical abbreviations, descriptions, or qualifications. Example: answer "Glucose-6-phosphate 1-dehydrogenase", NOT "Glucose-6-phosphate 1-dehydrogenase (G6PD, EC 1.1.1.49)". When identifying a protein from a sequence, use BLAST/UniProt and report the top hit name exactly as it appears in the database — no embellishment.
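
As a rough illustration of the trimming rule, a trailing parenthetical can be stripped mechanically. This is a sketch only: `trim_protein_name` is a hypothetical helper and it removes a single trailing parenthetical, nothing more.

```python
import re


def trim_protein_name(name: str) -> str:
    """Drop one trailing parenthetical such as "(G6PD, EC 1.1.1.49)"."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", name).strip()


print(trim_protein_name("Glucose-6-phosphate 1-dehydrogenase (G6PD, EC 1.1.1.49)"))
```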

Peptide & Foldamer Structure

Alpha-peptide helices: alpha-helix (3.6 res/turn, i->i+4 H-bonds), 3_10-helix (3 res/turn, i->i+3), pi-helix (4.4 res/turn, i->i+5).
Beta-peptide helices: named by H-bond ring size. 14-helix (i->i+2, 14-membered rings), 12-helix, 10-helix, 8-helix.
Beta-amino acid ring size determines helix type: 4-membered cyclic constraint -> 10-helix; 5-membered (e.g., ACPC) -> 12-helix; 6-membered (e.g., ACHC) -> 14-helix. Acyclic beta3-residues default to 14-helix.
Mixed alpha/beta foldamers (1:1 alternation): form 11-helix (i->i+3, 11-atom rings) or 14/15-helix (i->i+4, alternating 14- and 15-atom rings). Longer sequences prefer the 14/15-helix.
Key rule: the number in the helix name = number of atoms in the hydrogen-bonded ring.
Cyclic beta-amino acids (ACPC, ACHC) constrain backbone torsion angles, favoring specific helix types over acyclic residues.

Limitations

ensembl_get_sequence with a gene ID and a non-genomic type needs `multiple_sequences=true`
NCBIDatasets_get_orthologs requires NCBI Gene ID (not symbol); UniProt returns canonical isoform only
