biopython


Biopython: Python Tools for Computational Biology


Summary


Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy and provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.

Applicable Scenarios


This skill applies when you need to:

| Task Category | Examples |
| --- | --- |
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |

Module Organization


| Module | Purpose | Reference |
| --- | --- | --- |
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | references/sequence-io.md |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | references/alignment.md |
| Bio.Entrez | NCBI database programmatic access | references/databases.md |
| Bio.Blast | BLAST execution and result parsing | references/blast.md |
| Bio.PDB | 3D structure manipulation | references/structure.md |
| Bio.Phylo | Phylogenetic tree operations | references/phylogenetics.md |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | references/advanced.md |

Setup


Install with uv (plain pip works too):
bash
uv pip install biopython
Configure NCBI access (required before any Entrez operation):
python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key"  # Optional: raises the rate limit from 3 to 10 req/s
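To confirm that the installed package meets the v1.85+ requirement stated in the summary, one option is to inspect Bio.__version__. This is a minimal sketch: the MINIMUM tuple and the variable names are illustrative and not part of Biopython.

```python
import re

import Bio

# Hypothetical minimum, taken from the "v1.85+" note in the summary.
MINIMUM = (1, 85)

# Bio.__version__ is a string such as "1.85"; read the leading major.minor pair.
match = re.match(r"(\d+)\.(\d+)", Bio.__version__)
installed = tuple(int(group) for group in match.groups())

print(f"Biopython {Bio.__version__} installed; meets minimum: {installed >= MINIMUM}")
```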

Quick Reference


Parse Sequences


python
from Bio import SeqIO

records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
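SeqIO.parse also accepts an open handle, which makes it easy to try the parser without a file on disk. The sketch below uses two made-up records in an in-memory StringIO buffer and shows SeqIO.to_dict for lookup by record id; the sequence data and variable names are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records used purely for illustration.
fasta_text = """>seq1 first example
ATGGCCATTGTAATG
>seq2 second example
GGGCCGCTGAAAGGGTGC
"""

# SeqIO.parse accepts an open handle as well as a filename.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
lengths = {rec.id: len(rec) for rec in records}
print(lengths)

# SeqIO.to_dict indexes records by id (it raises ValueError on duplicate ids).
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))
print(by_id["seq2"].description)
```

Building a dictionary loads every record into memory, so this suits small files; for large inputs, iterating over SeqIO.parse directly is the lighter choice.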

Translate DNA


python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
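The same sequence contains two in-frame stop codons, so a plain translate() call returns them as "*" characters. A short sketch of the related Seq methods, using the sequence from the example above (variable names are illustrative):

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

full = dna.translate()                    # stop codons appear as "*"
first_orf = dna.translate(to_stop=True)   # stop at the first in-frame stop codon
mrna = dna.transcribe()                   # T -> U
revcomp = dna.reverse_complement()

print(full)       # MAIVMGR*KGAR*
print(first_orf)  # MAIVMGR
print(mrna)
print(revcomp)
```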

Query NCBI


python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()

Run BLAST


python
from Bio.Blast import NCBIWWW, NCBIXML

result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)

Parse Protein Structure


python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
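Once a structure is parsed, atoms are reached through the model, chain, and residue levels, and subtracting two atoms gives their distance. The sketch below builds a toy two-atom structure in memory so it runs without a PDB file; the atom_line helper and the coordinates are made up for illustration and are not part of Biopython.

```python
from io import StringIO

from Bio.PDB import PDBParser


def atom_line(serial, name, x, y, z, element, resseq=1):
    """Format one fixed-column PDB ATOM record for an ALA residue of chain A."""
    return (
        f"ATOM  {serial:5d} {name:<4s} ALA A{resseq:4d}{'':4s}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10s}{element:>2s}"
    )


# Two made-up atoms placed 5 Å apart (a 3-4-5 triangle).
pdb_text = "\n".join([
    atom_line(1, " N", 0.0, 0.0, 0.0, "N"),
    atom_line(2, " CA", 3.0, 4.0, 0.0, "C"),
    "END",
]) + "\n"

parser = PDBParser(QUIET=True)
structure = parser.get_structure("toy", StringIO(pdb_text))

residue = structure[0]["A"][1]            # model 0, chain A, residue 1
distance = residue["N"] - residue["CA"]   # subtracting atoms gives the distance in Å
print(f"N-CA distance: {distance:.2f} Å")
```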

Build Phylogenetic Tree


python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
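To see what the distance matrix holds before the tree is built, the same steps can be run on a tiny in-memory alignment. The four sequences below are made up for illustration; with the "identity" model, each distance is the fraction of mismatched columns.

```python
from io import StringIO

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Four made-up, equal-length sequences standing in for a real alignment.
aligned_fasta = """>A
ACGTACGT
>B
ACGTACGA
>C
ACGAACTA
>D
TCGAACTA
"""

alignment = AlignIO.read(StringIO(aligned_fasta), "fasta")
dm = DistanceCalculator("identity").get_distance(alignment)

# A and B differ at 1 of 8 columns, so their identity distance is 0.125.
print(dm["A", "B"])

tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
leaf_names = sorted(leaf.name for leaf in tree.get_terminals())
```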

Reference Files


| File | Contents |
| --- | --- |
| references/sequence-io.md | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| references/alignment.md | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| references/databases.md | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| references/blast.md | Remote/local BLAST, XML parsing, result filtering, batch queries |
| references/structure.md | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| references/phylogenetics.md | Tree I/O, distance matrices, tree construction, consensus, visualization |
| references/advanced.md | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |

Implementation Patterns


Retrieve and Analyze GenBank Record


python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "researcher@institution.edu"

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")

Batch Sequence Processing


python
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)

SeqIO.write(output_records, "filtered.fasta", "fasta")
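SeqIO.write accepts any iterable of records, so the filter can also be written as a generator that never holds the full list in memory. The sketch below uses made-up records and an in-memory buffer in place of real files; the passes helper and its thresholds mirror the example above and are otherwise hypothetical.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction

# Made-up records: only "keep" is both long enough and GC-rich enough.
records = [
    SeqRecord(Seq("ATGC" * 60), id="keep", description=""),      # 240 bp, GC 0.50
    SeqRecord(Seq("AT" * 150), id="low_gc", description=""),     # 300 bp, GC 0.00
    SeqRecord(Seq("GC" * 50), id="too_short", description=""),   # 100 bp, GC 1.00
]


def passes(record, min_length=200, min_gc=0.4):
    """Return True when the record meets both the length and GC thresholds."""
    return len(record) >= min_length and gc_fraction(record.seq) > min_gc


out = StringIO()  # stands in for an open output file
written = SeqIO.write((rec for rec in records if passes(rec)), out, "fasta")
print(f"{written} record(s) written")
```

SeqIO.write returns the number of records written, which is a cheap sanity check after filtering.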

BLAST with Result Filtering


python
from Bio.Blast import NCBIWWW, NCBIXML

query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)

for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")

Phylogeny from Alignment


python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt

alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()

fig, ax = plt.subplots(figsize=(12, 8))
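# do_show=False keeps Phylo.draw from calling plt.show(), so the figure can be saved without blocking
Phylo.draw(tree, axes=ax, do_show=False)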
fig.savefig("phylogeny.png", dpi=150)

Guidelines


**Imports**: Use explicit imports
```python
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
```
**File Handling**: Always close handles or use context managers
```python
# process() stands in for your own per-record logic
with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)
```
**Memory Efficiency**: Use iterators for large datasets
```python
# Correct: iterate without loading all records
# (meets_criteria() stands in for your own filter)
def matching_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):
            yield record

# Avoid: loading the entire file into memory
all_records = list(SeqIO.parse("huge.fasta", "fasta"))
```
**Error Handling**: Wrap network operations
```python
from urllib.error import HTTPError

from Bio import Entrez, SeqIO

accession = "NM_001301717"  # same accession as the GenBank example above
try:
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: {e.code}")
```
**NCBI Compliance**: Set email, respect rate limits, cache downloads locally
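One way to combine local caching with request spacing is a small wrapper around whatever function performs the download. The sketch below is an assumption-laden illustration, not a Biopython API: CachedFetcher, its parameters, the ".gb" file naming, and the 0.34 s default interval (roughly 3 requests per second) are all hypothetical, and a fake fetch function stands in for the network call so the example runs offline.

```python
import tempfile
import time
from pathlib import Path


class CachedFetcher:
    """Cache downloads on disk and space out calls to the remote service."""

    def __init__(self, fetch, cache_dir, min_interval=0.34):
        self.fetch = fetch                 # callable: accession -> str
        self.cache_dir = Path(cache_dir)
        self.min_interval = min_interval   # seconds between remote calls
        self._last_call = float("-inf")
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get(self, accession):
        path = self.cache_dir / f"{accession}.gb"
        if path.exists():                  # cache hit: no network traffic
            return path.read_text()
        wait = self.min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        text = self.fetch(accession)
        self._last_call = time.monotonic()
        path.write_text(text)
        return text


# Fake fetcher so the sketch runs without network access.
calls = []

def fake_fetch(accession):
    calls.append(accession)
    return f"LOCUS {accession}\n"

with tempfile.TemporaryDirectory() as tmp:
    fetcher = CachedFetcher(fake_fetch, tmp, min_interval=0.0)
    first = fetcher.get("NM_001301717")
    second = fetcher.get("NM_001301717")   # served from disk, no second fetch

print(f"Remote calls made: {len(calls)}")
```

In real use, the fetch argument would be a function that calls Entrez.efetch with rettype="gb" and retmode="text" and returns the text read from the handle.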

Troubleshooting


| Issue | Resolution |
| --- | --- |
| "No handlers could be found for logger 'Bio.Entrez'" | Set Entrez.email before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use PDBParser(QUIET=True) or check structure quality |
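The alignment length mismatch can be detected before calling AlignIO by collecting the distinct sequence lengths with SeqIO. A minimal sketch with a made-up two-record FASTA in which the second sequence is one base short:

```python
from io import StringIO

from Bio import AlignIO, SeqIO

# Made-up FASTA whose second sequence is one base shorter than the first.
ragged_fasta = """>A
ACGTACGT
>B
ACGTACG
"""

# Pre-check: more than one distinct length means the input is not an alignment.
lengths = {len(rec) for rec in SeqIO.parse(StringIO(ragged_fasta), "fasta")}
print(f"Distinct lengths: {sorted(lengths)}")

try:
    AlignIO.read(StringIO(ragged_fasta), "fasta")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print(f"AlignIO refused the input: {error_message}")
```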

External Resources
