bio-fasta
Sequence I/O
Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).
When to Use This Skill
This skill should be used when:
- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)
When NOT to Use This Skill
- NGS alignment files (SAM/BAM/VCF) → Use `pysam`
- BLAST searches → Use `gget` (quick) or `blat-integration` (large-scale)
- Multiple sequence alignment → Use `msa-advanced`
- Phylogenetic analysis → Use `etetoolkit`
- NCBI database queries → Use `pubmed-database` or `gene-database`
Tool Selection Guide
| Task | Tool | Reference |
|---|---|---|
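| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `references/biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO` (`SeqIO.convert()`) | `references/biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `references/biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` | `references/faidx.md` |
| GC%, Tm, molecular weight | `gc_fraction`, `MeltingTemp`, `molecular_weight` | `references/utilities.md` |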
Quick Start
Installation
```bash
uv pip install biopython pysam
```

Read FASTA
```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

Convert GenBank to FASTA
```python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

Random Access with faidx
```python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based coordinates
fasta.close()
```

Sequence Operations
```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
print(seq.complement())
print(seq.reverse_complement())
# 13 nt is not a multiple of three, so translate() warns about the trailing partial codon
print(seq.translate())
```

Reference Documentation
Consult the appropriate reference file for detailed documentation:
references/biopython_seqio.md
- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns
references/faidx.md
- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)
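To illustrate why an index makes random access cheap, here is a minimal sketch of the arithmetic a `.fai` row enables. It assumes the standard five-column `.fai` layout (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) and works on an in-memory toy record rather than calling pysam; `byte_offset` and `fetch_from_bytes` are hypothetical helper names introduced only for this example.

```python
# Toy FASTA record held in memory: header line, then 4 bases per line
FASTA_BYTES = b">chr1\nACGT\nACGT\nAC\n"

# The matching .fai row: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
# OFFSET is 6 because the header ">chr1\n" occupies bytes 0-5
FAI_ROW = "chr1\t10\t6\t4\t5"


def byte_offset(pos: int, offset: int, linebases: int, linewidth: int) -> int:
    """Map a 0-based sequence position to its byte position in the file."""
    return offset + (pos // linebases) * linewidth + pos % linebases


def fetch_from_bytes(data: bytes, fai_row: str, start: int, end: int) -> str:
    """Return the bases in the 0-based, half-open interval [start, end)."""
    _name, _length, offset, linebases, linewidth = fai_row.split("\t")
    offset, linebases, linewidth = int(offset), int(linebases), int(linewidth)
    return "".join(
        chr(data[byte_offset(p, offset, linebases, linewidth)])
        for p in range(start, end)
    )


print(fetch_from_bytes(FASTA_BYTES, FAI_ROW, 2, 7))  # GTACG
```

Because the byte position is computed directly, a reader can seek to the region instead of scanning the whole file. This is a sketch of the idea, not of pysam's actual implementation.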
references/utilities.md
- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites
references/formats.md
- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation
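As a small illustration of the FASTQ quality scores mentioned above, the sketch below decodes a quality string by hand. It assumes the Phred+33 encoding (Sanger and Illumina 1.8+); older Illumina files used an offset of 64 and would need a different constant. `decode_phred33` and `error_probability` are hypothetical helper names. When parsing with Biopython, the decoded scores are available in `record.letter_annotations["phred_quality"]`, so manual decoding is only needed outside that workflow.

```python
def decode_phred33(quality: str) -> list[int]:
    """Convert a Phred+33 quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


scores = decode_phred33("II5+!")
print(scores)  # [40, 40, 20, 10, 0]

# A score of 20 corresponds to roughly a 1-in-100 chance of error
print(error_probability(20))
```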
Coordinate Systems
Biopython: Uses Python-style 0-based, half-open intervals for slicing.
pysam.FastaFile.fetch():
- Numeric arguments: 0-based, half-open (`fetch("chr1", 999, 2000)` = 0-based positions 999-1999)
- Region strings: 1-based, inclusive (`fetch(region="chr1:1000-2000")` = 1-based positions 1000-2000, the same bases)
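The two conventions can be reconciled with a small conversion helper. The sketch below is plain Python and does not call pysam; `region_to_interval` and `interval_to_region` are hypothetical names introduced for illustration, and the parser assumes a simple `contig:start-end` string with no thousands separators.

```python
def region_to_interval(region: str) -> tuple[str, int, int]:
    """Convert a 1-based, inclusive region string to a 0-based, half-open interval."""
    contig, _, span = region.rpartition(":")
    start_1, _, end_1 = span.partition("-")
    # Subtract 1 from the start only; the inclusive 1-based end
    # equals the exclusive 0-based end
    return contig, int(start_1) - 1, int(end_1)


def interval_to_region(contig: str, start: int, end: int) -> str:
    """Convert a 0-based, half-open interval back to a 1-based region string."""
    return f"{contig}:{start + 1}-{end}"


contig, start, end = region_to_interval("chr1:1000-2000")
print(contig, start, end)  # chr1 999 2000
print(end - start)         # 1001 bases in the interval
print(interval_to_region(contig, start, end))  # chr1:1000-2000
```

Keeping every internal coordinate in one convention (for example 0-based, half-open) and converting only at input and output is one way to avoid off-by-one errors.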
Common Pitfalls
- Coordinate confusion: Remember which tool uses 0-based vs 1-based
- Missing faidx index: Random access requires a `.fai` file
- Format mismatch: Verify file format matches the format string in `SeqIO.parse()`
- Iterator exhaustion: `SeqIO.parse()` returns an iterator; convert to list if multiple passes needed
- Large files: Use iterators, not `list()`, for memory efficiency
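The iterator-exhaustion and large-file pitfalls can be seen in a small sketch. To keep it runnable without Biopython, it uses `read_fasta`, a hypothetical stdlib-only generator standing in for `SeqIO.parse()`. It is not Biopython's parser, but it is single-pass in the same way.

```python
import io

FASTA_TEXT = ">seq1\nACGT\nAC\n>seq2\nGGCC\n"


def read_fasta(handle):
    """Yield (record_id, sequence) pairs one at a time (stand-in parser)."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            record_id, chunks = line[1:].split(" ", 1)[0], []
        elif line:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


records = read_fasta(io.StringIO(FASTA_TEXT))
first_pass = [rid for rid, _ in records]
second_pass = [rid for rid, _ in records]  # iterator is already exhausted
print(first_pass)   # ['seq1', 'seq2']
print(second_pass)  # []

# Streaming aggregate: only one record is held in memory at a time
total_bases = sum(len(seq) for _, seq in read_fasta(io.StringIO(FASTA_TEXT)))
print(total_bases)  # 10
```

For small files, wrapping the parser in `list()` is the simplest way to allow several passes. For large files, re-opening the file and iterating again keeps memory use flat.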