bio-fasta

Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

When to Use This Skill

This skill should be used when:

Reading or writing sequence files (FASTA, GenBank, FASTQ)
Converting between sequence file formats
Manipulating sequences (complement, reverse complement, translate)
Extracting sequences from large indexed FASTA files (faidx)
Calculating sequence statistics (GC content, molecular weight, Tm)

When NOT to Use This Skill

NGS alignment files (SAM/BAM/VCF) → Use `pysam`
BLAST searches → Use `gget` (quick) or `blat-integration` (large-scale)
Multiple sequence alignment → Use `msa-advanced`
Phylogenetic analysis → Use `etetoolkit`
NCBI database queries → Use `pubmed-database` or `gene-database`

Tool Selection Guide

Task	Tool	Reference
Parse FASTA/GenBank/FASTQ	`Bio.SeqIO`	`biopython_seqio.md`
Convert file formats	`Bio.SeqIO.convert()`	`biopython_seqio.md`
Sequence operations	`Bio.Seq`	`biopython_seqio.md`
Large FASTA random access	`pysam.FastaFile` + faidx	`faidx.md`
GC%, Tm, molecular weight	`Bio.SeqUtils`	`utilities.md`

Quick Start

Installation

```bash
uv pip install biopython pysam
```

Read FASTA

```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

Convert GenBank to FASTA

```python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```
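
`SeqIO.convert()` also accepts open handles and returns the number of records it converted, which makes for a cheap sanity check. The sketch below assumes an in-memory, made-up FASTQ record (`read1`) rather than a real file, and converts it to FASTA:

```python
from io import StringIO

from Bio import SeqIO

# One made-up FASTQ record: title, sequence, separator, quality string.
fastq_text = "@read1\nACGT\n+\nIIII\n"

in_handle = StringIO(fastq_text)
out_handle = StringIO()

# convert() returns how many records were written to the output.
n_converted = SeqIO.convert(in_handle, "fastq", out_handle, "fasta")
fasta_text = out_handle.getvalue()

print(n_converted)
print(fasta_text)
```

Going from FASTQ to FASTA discards the quality scores; the reverse direction is not possible without supplying qualities from somewhere.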

Random Access with faidx

```python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based coordinates
fasta.close()
```

Sequence Operations

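```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
print(seq.complement())          # complementary bases, same orientation
print(seq.reverse_complement())  # complementary strand, reversed
print(seq.translate())           # protein translation using the standard table
```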
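
The example sequence is 13 nt long, so `translate()` meets a trailing partial codon and Biopython emits a warning. One way to avoid that, sketched here under the assumption that the reading frame starts at the first base, is to trim to a whole number of codons before translating:

```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")  # 13 nt: four full codons plus one extra base

# Drop the trailing partial codon so the length is a multiple of three.
usable = len(seq) - (len(seq) % 3)
trimmed = seq[:usable]

protein = trimmed.translate()
print(protein)  # MRSI
```

If the frame is not known in advance, trimming from the 3' end like this is only a guess; the offset would need to come from annotation.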

Reference Documentation

Consult the appropriate reference file for detailed documentation:

references/biopython_seqio.md

`Bio.Seq` object and sequence operations
`Bio.SeqIO` for file parsing and writing
`SeqRecord` object and annotations
Supported file formats
Format conversion patterns
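
As a rough illustration of how `SeqRecord` and `SeqIO` fit together, the sketch below builds one record by hand, writes it as FASTA to an in-memory handle, and parses it back. The record ID `demo1` and its description are hypothetical values chosen for the example.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record; the id and description are illustrative only.
record = SeqRecord(
    Seq("ATGCGATCGATC"),
    id="demo1",
    description="hypothetical example record",
)

# SeqIO.write() takes an iterable of records and returns how many it wrote.
handle = StringIO()
n_written = SeqIO.write([record], handle, "fasta")
fasta_text = handle.getvalue()

# Round trip: parse the text back into a SeqRecord.
parsed = next(SeqIO.parse(StringIO(fasta_text), "fasta"))
print(n_written, parsed.id, len(parsed.seq))
```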

references/faidx.md

Creating FASTA index with `pysam.faidx()`
`pysam.FastaFile` for random access
Coordinate systems (0-based vs 1-based)
Performance considerations for large files
Common patterns (variant context, gene extraction)
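
For the variant-context pattern, the part that tends to go wrong is the arithmetic rather than the fetch itself. The helper below is a hypothetical sketch (the name `context_window` is not part of pysam) that turns a 1-based variant position and a flank size into the 0-based, half-open window that numeric `fetch()` arguments expect, clamped to the contig bounds:

```python
def context_window(pos_1based: int, flank: int, contig_length: int) -> tuple[int, int]:
    """Return a 0-based, half-open (start, end) window around a 1-based position."""
    if not 1 <= pos_1based <= contig_length:
        raise ValueError("position lies outside the contig")
    center = pos_1based - 1                       # convert to 0-based
    start = max(0, center - flank)                # clamp at the contig start
    end = min(contig_length, center + flank + 1)  # half-open, clamp at the end
    return start, end


start, end = context_window(pos_1based=1000, flank=5, contig_length=1_000_000)
print(start, end)  # 994 1005

# Near the start of a contig the window is truncated rather than negative.
edge_start, edge_end = context_window(pos_1based=3, flank=5, contig_length=100)
print(edge_start, edge_end)  # 0 8
```

The resulting pair can be passed as the numeric start and end of a `fetch()` call on an open `pysam.FastaFile`.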

references/utilities.md

GC content calculation (`gc_fraction`)
Molecular weight (`molecular_weight`)
Melting temperature (`MeltingTemp`)
Codon usage analysis
Restriction enzyme sites
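
A minimal sketch of the first three utilities, assuming a Biopython release recent enough to provide `gc_fraction` (it replaced the older percentage-based `GC` function). The Wallace rule used here is only one of several melting-temperature methods in `MeltingTemp`, and it is a rough estimate intended for short oligos.

```python
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import gc_fraction, molecular_weight

seq = "ATGCGATCGATCG"  # 7 G/C and 6 A/T bases

gc = gc_fraction(seq)                        # fraction between 0 and 1, not a percentage
mw = molecular_weight(seq, seq_type="DNA")   # single-stranded DNA
tm = mt.Tm_Wallace(seq)                      # 4 * (G+C) + 2 * (A+T)

print(f"GC fraction: {gc:.3f}")
print(f"Molecular weight: {mw:.1f}")
print(f"Tm (Wallace rule): {tm:.1f}")
```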

references/formats.md

FASTA format specification
GenBank format specification
FASTQ format and quality scores
Format detection and validation
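
Format detection can be approximated by looking at the first non-blank line. The function below is a hypothetical heuristic, not something Biopython provides: it returns the format string that `SeqIO.parse()` would expect, or `None` when the content is not recognized. It does not validate the rest of the file.

```python
from typing import Optional


def guess_format(text: str) -> Optional[str]:
    """Guess a SeqIO format name from the first non-blank line (heuristic only)."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("@"):
            return "fastq"
        if stripped.startswith("LOCUS"):
            return "genbank"
        return None  # first real line matched nothing we know
    return None  # empty input


print(guess_format(">seq1\nACGT\n"))         # fasta
print(guess_format("@read1\nACGT\n+\nIIII"))  # fastq
print(guess_format("LOCUS       demo"))       # genbank
```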

Coordinate Systems

Biopython: Uses Python-style 0-based, half-open intervals for slicing.

pysam.FastaFile.fetch():

Numeric arguments: 0-based, half-open (`fetch("chr1", 999, 2000)` = 0-based positions 999-1999, i.e. 1-based positions 1000-2000)
Region strings: 1-based, inclusive (`fetch(region="chr1:1000-2000")` = 1-based positions 1000-2000)
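
The two conventions above describe the same bases, and converting between them is a one-line shift of the start coordinate. The helper below is a hypothetical sketch (not a pysam function) that parses a 1-based, inclusive region string into the 0-based, half-open triple used by numeric arguments:

```python
def region_to_zero_based(region: str) -> tuple[str, int, int]:
    """Convert 'contig:start-end' (1-based, inclusive) to (contig, start, end) 0-based, half-open."""
    contig, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    if not contig or not start_text or not end_text:
        raise ValueError(f"expected 'contig:start-end', got {region!r}")
    start_1based = int(start_text.replace(",", ""))
    end_1based = int(end_text.replace(",", ""))
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError(f"invalid coordinates in {region!r}")
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return contig, start_1based - 1, end_1based


contig, start, end = region_to_zero_based("chr1:1000-2000")
print(contig, start, end)  # chr1 999 2000
print(end - start)         # 1001 bases, matching 1000..2000 inclusive
```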

Common Pitfalls

Coordinate confusion: Remember which tool uses 0-based vs 1-based coordinates
Missing faidx index: Random access requires a `.fai` file
Format mismatch: Verify that the file format matches the format string passed to `SeqIO.parse()`
Iterator exhaustion: `SeqIO.parse()` returns an iterator; convert it to a list if multiple passes are needed
Large files: Use iterators, not `list()`, for memory efficiency
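
The last two pitfalls pull in opposite directions, so a small demonstration may help. This sketch assumes a made-up two-record FASTA held in memory: it shows that a second pass over the same `SeqIO.parse()` iterator yields nothing, and that a streaming aggregate avoids building a list at all.

```python
from io import StringIO

from Bio import SeqIO

# Made-up two-record FASTA, kept in memory for the example.
fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n"

records = SeqIO.parse(StringIO(fasta_text), "fasta")
first_pass = sum(1 for _ in records)
second_pass = sum(1 for _ in records)  # the iterator is already exhausted
print(first_pass, second_pass)  # 2 0

# Streaming alternative: aggregate while iterating instead of calling list().
total_bases = sum(len(rec.seq) for rec in SeqIO.parse(StringIO(fasta_text), "fasta"))
print(total_bases)  # 8
```

For files that do not fit in memory, calling `SeqIO.parse()` again for each pass is usually the safer choice than materializing a list.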
