biopython

Biopython: Python Tools for Computational Biology

Summary

Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy and provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.

Applicable Scenarios

This skill applies when you need to:

| Task Category | Examples |
|---|---|
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |

Module Organization

| Module | Purpose | Reference |
|---|---|---|
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | `references/sequence-io.md` |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | `references/alignment.md` |
| Bio.Entrez | NCBI database programmatic access | `references/databases.md` |
| Bio.Blast | BLAST execution and result parsing | `references/blast.md` |
| Bio.PDB | 3D structure manipulation | `references/structure.md` |
| Bio.Phylo | Phylogenetic tree operations | `references/phylogenetics.md` |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | `references/advanced.md` |

Setup

Install with pip (the command below runs it through `uv`; plain `pip install biopython` also works):

```bash
uv pip install biopython
```

Configure NCBI access (mandatory for Entrez operations):

```python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key"  # Optional: increases rate limit to 10 req/s
```
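To keep contact details and keys out of source files, the same settings can be read from the environment. The sketch below is one possible approach; the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `configure_entrez` are assumptions made for illustration and are not defined by Biopython or NCBI.

```python
import os

from Bio import Entrez


def configure_entrez(env=os.environ):
    """Set Entrez contact details from a mapping of environment variables.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical names chosen for this sketch.
    """
    email = env.get("NCBI_EMAIL")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL before running Entrez queries")
    Entrez.email = email

    # The API key is optional, so only set it when one is provided.
    api_key = env.get("NCBI_API_KEY")
    if api_key:
        Entrez.api_key = api_key
    return Entrez.email
```

Calling `configure_entrez()` once at program start, before any query, keeps the configuration in a single place.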

Quick Reference

Parse Sequences

```python
from Bio import SeqIO

records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
```

Translate DNA

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
```
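The sequence above contains an internal stop codon, so it may help to see how `translate()` treats it. This short sketch reuses the same sequence and contrasts the default behaviour with `to_stop=True`; the variable names are illustrative.

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

full = dna.translate()                 # stop codons are kept as "*"
to_stop = dna.translate(to_stop=True)  # ends before the first in-frame stop
revcomp = dna.reverse_complement()     # opposite strand, read 5' to 3'

print(full)     # MAIVMGR*KGAR*
print(to_stop)  # MAIVMGR
print(revcomp)  # CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
```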

Query NCBI

```python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()
```

Run BLAST

```python
from Bio.Blast import NCBIWWW, NCBIXML

result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)
```

Parse Protein Structure

```python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
```

Build Phylogenetic Tree

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
```

Reference Files

| File | Contents |
|---|---|
| `references/sequence-io.md` | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| `references/alignment.md` | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| `references/databases.md` | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| `references/blast.md` | Remote/local BLAST, XML parsing, result filtering, batch queries |
| `references/structure.md` | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| `references/phylogenetics.md` | Tree I/O, distance matrices, tree construction, consensus, visualization |
| `references/advanced.md` | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |

Implementation Patterns

Retrieve and Analyze GenBank Record

```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "researcher@institution.edu"

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")
```

Batch Sequence Processing

```python
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)

SeqIO.write(output_records, "filtered.fasta", "fasta")
```
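For inputs too large to hold in a list, the same filter can be expressed as a generator handed straight to `SeqIO.write`, which accepts any iterable of records and returns the number written. The following is a minimal sketch; `filter_records`, `filter_fasta`, and the toy sequences are illustrative names and data, not part of Biopython.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction


def filter_records(records, min_length=200, min_gc=0.4):
    """Yield records that pass both thresholds, one at a time."""
    for record in records:
        if len(record) >= min_length and gc_fraction(record.seq) > min_gc:
            yield record


def filter_fasta(source, destination, **thresholds):
    """Stream FASTA records from source to destination; return the count written."""
    kept = filter_records(SeqIO.parse(source, "fasta"), **thresholds)
    return SeqIO.write(kept, destination, "fasta")


# Tiny in-memory demonstration with made-up sequences.
fasta_text = ">keep\nGGCCGGCCAT\n>short\nGC\n>low_gc\nATATATATAT\n"
out = StringIO()
written = filter_fasta(StringIO(fasta_text), out, min_length=5, min_gc=0.4)
print(written)  # 1
```

Both `source` and `destination` may be file paths or open handles, so the same function works on files such as `input.fasta` and `filtered.fasta`.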

BLAST with Result Filtering

```python
from Bio.Blast import NCBIWWW, NCBIXML

query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)

for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")
```
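The filtering loop above can be pulled into a reusable function so that thresholds are easy to change and the hits can be sorted. This sketch relies only on the attribute names used in the loop (`alignments`, `hsps`, `expect`, `identities`, `align_length`, `accession`). The function name `filter_hits` and the stand-in record with made-up values are assumptions that let the example run offline.

```python
from types import SimpleNamespace


def filter_hits(record, max_evalue=1e-10, min_identity_pct=0.0):
    """Return (accession, identity %, E-value) tuples sorted by E-value."""
    hits = []
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            identity_pct = 100.0 * hsp.identities / hsp.align_length
            if hsp.expect < max_evalue and identity_pct >= min_identity_pct:
                hits.append((alignment.accession, identity_pct, hsp.expect))
    return sorted(hits, key=lambda hit: hit[2])


# Stand-in for a parsed BLAST record; accessions and scores are invented.
fake_record = SimpleNamespace(alignments=[
    SimpleNamespace(accession="HYPO_1", hsps=[
        SimpleNamespace(expect=1e-30, identities=45, align_length=50)]),
    SimpleNamespace(accession="HYPO_2", hsps=[
        SimpleNamespace(expect=1e-3, identities=20, align_length=50)]),
    SimpleNamespace(accession="HYPO_3", hsps=[
        SimpleNamespace(expect=1e-50, identities=50, align_length=50)]),
])

hits = filter_hits(fake_record)
for accession, identity_pct, evalue in hits:
    print(f"{accession}: {identity_pct:.1f}% identity, E={evalue:.2e}")
```

With a real search, the `record` returned by `NCBIXML.read` would be passed in place of `fake_record`.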

Phylogeny from Alignment

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt

alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()

fig, ax = plt.subplots(figsize=(12, 8))
# do_show=False stops Phylo.draw from opening a window before the figure is saved
Phylo.draw(tree, axes=ax, do_show=False)
fig.savefig("phylogeny.png", dpi=150)
```
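To try the same pipeline without an alignment file or matplotlib, a tree can be built from a small in-memory alignment and exported as Newick text. This is a sketch under stated assumptions: the four toy sequences and their names are invented, and the `identity` model is used because they are nucleotide sequences.

```python
from io import StringIO

from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up, pre-aligned toy sequences (all the same length).
toy = {
    "seq_a": "ACGTACGTAC",
    "seq_b": "ACGTACGTAA",
    "seq_c": "ACGAACGTTA",
    "seq_d": "TCGAACCTTA",
}
alignment = MultipleSeqAlignment(
    [SeqRecord(Seq(letters), id=name) for name, letters in toy.items()]
)

# Distance = 1 - fraction of identical columns between each pair.
dm = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)

# Serialize the tree to Newick text instead of drawing it.
buffer = StringIO()
Phylo.write(tree, buffer, "newick")
newick = buffer.getvalue()
leaf_names = sorted(leaf.name for leaf in tree.get_terminals())
print(leaf_names)
```

Newick output is plain text, which makes it convenient for passing the tree to other tools or storing it alongside the alignment.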

Guidelines

**Imports**: Use explicit imports

```python
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
```

**File Handling**: Always close handles or use context managers

```python
with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)  # process() is a placeholder for your own logic
```

**Memory Efficiency**: Use iterators for large datasets

```python
# Correct: iterate without loading all
def matching_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):  # meets_criteria() is a placeholder
            yield record

# Avoid: loading entire file
all_records = list(SeqIO.parse("huge.fasta", "fasta"))
```

**Error Handling**: Wrap network operations

```python
from urllib.error import HTTPError

try:
    # Request GenBank text explicitly so it matches the "genbank" parser
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: {e.code}")
```

**NCBI Compliance**: Set email, respect rate limits, cache downloads locally
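One way to combine rate limiting with local caching is a small wrapper around whatever function performs the download. The sketch below is not part of Biopython: `PoliteCache`, its parameters, and the `.gb` file naming are assumptions for illustration. The default interval of 0.34 s corresponds to roughly three requests per second, the limit NCBI applies without an API key.

```python
import time
from pathlib import Path


class PoliteCache:
    """Cache fetched text on disk and space out real network requests."""

    def __init__(self, fetch, cache_dir, min_interval=0.34):
        self.fetch = fetch                # callable: accession -> text
        self.cache_dir = Path(cache_dir)
        self.min_interval = min_interval  # seconds between network calls
        self._last_call = None
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get(self, accession):
        path = self.cache_dir / f"{accession}.gb"
        if path.exists():
            # Cache hit: no network traffic at all.
            return path.read_text()

        # Cache miss: wait if the previous request was too recent.
        if self._last_call is not None:
            wait = self.min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)

        text = self.fetch(accession)
        self._last_call = time.monotonic()
        path.write_text(text)
        return text


# In real use, fetch could wrap Entrez.efetch, for example:
#   lambda acc: Entrez.efetch(db="nucleotide", id=acc,
#                             rettype="gb", retmode="text").read()
```

Injecting the fetch function keeps the wrapper testable offline and lets the same cache serve other download functions.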

Troubleshooting

故障排除

| Issue | Resolution |
|---|---|
| "No handlers could be found for logger 'Bio.Entrez'" | Set `Entrez.email` before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use `PDBParser(QUIET=True)` or check structure quality |

External Resources

Biopython Documentation: https://biopython.org/docs/latest/
Biopython Tutorial: https://biopython.org/docs/latest/Tutorial/
GitHub Repository: https://github.com/biopython/biopython
