bio-fasta

Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

When to Use This Skill

This skill should be used when:
  • Reading or writing sequence files (FASTA, GenBank, FASTQ)
  • Converting between sequence file formats
  • Manipulating sequences (complement, reverse complement, translate)
  • Extracting sequences from large indexed FASTA files (faidx)
  • Calculating sequence statistics (GC content, molecular weight, Tm)

When NOT to Use This Skill

  • NGS alignment files (SAM/BAM/VCF) → Use pysam
  • BLAST searches → Use gget (quick) or blat-integration (large-scale)
  • Multiple sequence alignment → Use msa-advanced
  • Phylogenetic analysis → Use etetoolkit
  • NCBI database queries → Use pubmed-database or gene-database

Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | Bio.SeqIO | biopython_seqio.md |
| Convert file formats | Bio.SeqIO.convert() | biopython_seqio.md |
| Sequence operations | Bio.Seq | biopython_seqio.md |
| Large FASTA random access | pysam.FastaFile + faidx | faidx.md |
| GC%, Tm, molecular weight | Bio.SeqUtils | utilities.md |

Quick Start

Installation

bash
uv pip install biopython pysam

Read FASTA

python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")

Convert GenBank to FASTA

python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")

Random Access with faidx

python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based coordinates
fasta.close()

Sequence Operations

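python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
print(seq.complement())          # base-by-base complement, same orientation
print(seq.reverse_complement())  # complement read in the opposite direction
# 13 nt is not a multiple of three, so Biopython warns about the partial codon
print(seq.translate())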

Reference Documentation

Consult the appropriate reference file for detailed documentation:

references/biopython_seqio.md

  • Bio.Seq object and sequence operations
  • Bio.SeqIO for file parsing and writing
  • SeqRecord object and annotations
  • Supported file formats
  • Format conversion patterns

references/faidx.md

  • Creating FASTA index with pysam.faidx()
  • pysam.FastaFile for random access
  • Coordinate systems (0-based vs 1-based)
  • Performance considerations for large files
  • Common patterns (variant context, gene extraction)

references/utilities.md

  • GC content calculation (gc_fraction)
  • Molecular weight (molecular_weight)
  • Melting temperature (MeltingTemp)
  • Codon usage analysis
  • Restriction enzyme sites
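
The first three helpers listed above can be tried together in a few lines. This is a minimal sketch, assuming a Biopython release recent enough to provide gc_fraction (1.80 or later). The example sequence is arbitrary, and Tm_NN is only one of several melting-temperature methods in MeltingTemp, chosen here for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import gc_fraction, molecular_weight

# Arbitrary example sequence (assumption: any short DNA oligo would do)
seq = Seq("ATGCGATCGATCG")

# GC content as a fraction between 0 and 1 (multiply by 100 for a percentage)
gc = gc_fraction(seq)

# Molecular weight of the single-stranded DNA
mw = molecular_weight(seq, seq_type="DNA")

# Nearest-neighbor melting temperature with Biopython's default parameters
tm = mt.Tm_NN(seq)

print(f"GC fraction: {gc:.3f}")
print(f"Molecular weight: {mw:.1f}")
print(f"Tm (nearest neighbor): {tm:.1f}")
```

Melting temperature depends strongly on salt and primer concentrations, so the defaults should be treated as a starting point and not as a final value.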

references/formats.md

  • FASTA format specification
  • GenBank format specification
  • FASTQ format and quality scores
  • Format detection and validation

Coordinate Systems

Biopython: Uses Python-style 0-based, half-open intervals for slicing.
pysam.FastaFile.fetch():
  • Numeric arguments: 0-based, half-open (fetch("chr1", 999, 2000) = 0-based positions 999-1999)
  • Region strings: 1-based, inclusive (fetch(region="chr1:1000-2000") = 1-based positions 1000-2000, the same 1001 bases)
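
The relationship between the two conventions can be made concrete with a small conversion helper. This is a sketch using only the standard library. The function name region_to_zero_based is hypothetical and is not part of pysam or Biopython.

```python
import re


def region_to_zero_based(region: str) -> tuple[str, int, int]:
    """Convert a 1-based inclusive region string such as "chr1:1000-2000"
    into a (contig, start, end) tuple in 0-based, half-open coordinates."""
    # Allow thousands separators such as "chr1:1,000-2,000"
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized region string: {region!r}")
    contig = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval in {region!r}")
    # Shift the start down by one; the inclusive 1-based end is already
    # equal to the exclusive 0-based end.
    return contig, start - 1, end


contig, start, end = region_to_zero_based("chr1:1000-2000")
print(contig, start, end, end - start)
```

For this input, the interval length end - start equals the number of bases in the 1-based inclusive range, which is a quick sanity check after any conversion.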

Common Pitfalls

  1. Coordinate confusion: Check whether each call expects 0-based or 1-based coordinates
  2. Missing faidx index: Random access requires a .fai index file
  3. Format mismatch: Verify that the file format matches the format string passed to SeqIO.parse()
  4. Iterator exhaustion: SeqIO.parse() returns an iterator that can be consumed only once; convert it to a list if multiple passes are needed and the file fits in memory
  5. Large files: Iterate over records directly rather than calling list(), for memory efficiency
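
Pitfalls 4 and 5 pull in opposite directions, so a short demonstration may help. The sketch below parses a tiny in-memory FASTA (the FASTA_TEXT constant is made up for illustration) so that it runs without any files on disk.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA held in memory for the demonstration
FASTA_TEXT = ">seq1\nATGC\n>seq2\nGGCCTA\n"

# Pitfall 4: the iterator is consumed by the first pass
records = SeqIO.parse(StringIO(FASTA_TEXT), "fasta")
first_pass = [rec.id for rec in records]
second_pass = [rec.id for rec in records]  # nothing left to iterate over

# Pitfall 5: for large files, stream the records and keep only the summary
total_bases = sum(
    len(rec.seq) for rec in SeqIO.parse(StringIO(FASTA_TEXT), "fasta")
)

# Small files only: materialize a list when several passes are needed
cached = list(SeqIO.parse(StringIO(FASTA_TEXT), "fasta"))

print(first_pass, second_pass, total_bases, len(cached))
```

Re-calling SeqIO.parse() on the file is usually the safer way to get a second pass over a large input, since it keeps memory use flat.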