pysam
Pysam
Overview
Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.
When to Use This Skill
This skill should be used when:
- Working with sequencing alignment files (BAM/CRAM)
- Analyzing genetic variants (VCF/BCF)
- Extracting reference sequences or gene regions
- Processing raw sequencing data (FASTQ)
- Calculating coverage or read depth
- Implementing bioinformatics analysis pipelines
- Quality control of sequencing data
- Variant calling and annotation workflows
Quick Start
Installation
```bash
uv pip install pysam
```
Basic Examples
**Read alignment file:**
```python
import pysam

# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
```
**Read variant file:**
```python
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
```
**Query reference sequence:**
```python
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
```
Core Capabilities
1. Alignment File Operations (SAM/BAM/CRAM)
Use the `AlignmentFile` class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.
Common operations:
- Open and read BAM/SAM/CRAM files
- Fetch reads from specific genomic regions
- Filter reads by mapping quality, flags, or other criteria
- Write filtered or modified alignments
- Calculate coverage statistics
- Perform pileup analysis (base-by-base coverage)
- Access read sequences, quality scores, and alignment information
Reference: See `references/alignment_files.md` for detailed documentation on:
- Opening and reading alignment files
- AlignedSegment attributes and methods
- Region-based fetching with `fetch()`
- Pileup analysis for coverage
- Writing and creating BAM files
- Coordinate systems and indexing
- Performance optimization tips
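As a minimal sketch of the "filter reads by mapping quality, flags, or other criteria" operation, the predicate below keeps mapped, primary, non-duplicate reads above a mapping-quality threshold. The function names `keep_read` and `write_filtered` and the default threshold of 30 are illustrative assumptions, not part of pysam. The attributes it reads (`is_unmapped`, `is_secondary`, `is_supplementary`, `is_duplicate`, `mapping_quality`) are standard `AlignedSegment` properties.

```python
def keep_read(read, min_mapq=30):
    """Return True for mapped, primary, non-duplicate reads with MAPQ >= min_mapq."""
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        return False
    if read.is_duplicate:
        return False
    return read.mapping_quality >= min_mapq


def write_filtered(in_path, out_path, min_mapq=30):
    """Copy reads that pass keep_read() into a new BAM file; return how many were kept."""
    import pysam  # imported lazily so keep_read() can be used and tested on its own

    kept = 0
    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        # until_eof=True streams the whole file, so no index is needed here
        for read in src.fetch(until_eof=True):
            if keep_read(read, min_mapq):
                dst.write(read)
                kept += 1
    return kept
```

Keeping the predicate separate from the I/O loop makes the filter rule easy to change or unit-test without touching any files.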
2. Variant File Operations (VCF/BCF)
Use the `VariantFile` class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.
Common operations:
- Read and write VCF/BCF files
- Query variants in specific regions
- Access variant information (position, alleles, quality)
- Extract genotype data for samples
- Filter variants by quality, allele frequency, or other criteria
- Annotate variants with additional information
- Subset samples or regions
Reference: See `references/variant_files.md` for detailed documentation on:
- Opening and reading variant files
- VariantRecord attributes and methods
- Accessing INFO and FORMAT fields
- Working with genotypes and samples
- Creating and writing VCF files
- Filtering and subsetting variants
- Multi-sample VCF operations
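The quality filtering and genotype extraction listed above might look roughly like the sketch below. The helper names (`passes_filters`, `is_variant_genotype`, `carriers`) and the QUAL threshold of 30 are assumptions made for illustration. The code relies only on `VariantRecord.qual`, `VariantRecord.filter.keys()`, and per-sample `["GT"]` lookups through `VariantRecord.samples`.

```python
def passes_filters(record, min_qual=30.0):
    """True if QUAL is present and >= min_qual, and FILTER is empty or PASS."""
    if record.qual is None or record.qual < min_qual:
        return False
    filters = list(record.filter.keys())
    return filters == [] or filters == ["PASS"]


def is_variant_genotype(gt):
    """True if a GT tuple such as (0, 1) carries at least one non-reference allele."""
    # Missing alleles are reported as None and are ignored here
    return any(allele is not None and allele > 0 for allele in gt)


def carriers(record):
    """Names of the samples whose genotype includes a non-reference allele."""
    return [name for name, call in record.samples.items()
            if is_variant_genotype(call["GT"])]
```

With a real file, these helpers would be applied to each record while iterating over `pysam.VariantFile("variants.vcf")`, as in the Quick Start example.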
3. Sequence File Operations (FASTA/FASTQ)
Use `FastaFile` for random access to reference sequences and `FastxFile` for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against reference, or processing raw reads.
Common operations:
- Query reference sequences by genomic coordinates
- Extract sequences for genes or regions of interest
- Read FASTQ files with quality scores
- Validate variant reference alleles
- Calculate sequence statistics
- Filter reads by quality or length
- Convert between FASTA and FASTQ formats
Reference: See `references/sequence_files.md` for detailed documentation on:
- FASTA file access and indexing
- Extracting sequences by region
- Handling reverse complement for genes
- Reading FASTQ files sequentially
- Quality score conversion and filtering
- Working with tabix-indexed files (BED, GTF, GFF)
- Common sequence processing patterns
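Two of the operations above, reverse-complementing a gene sequence and validating a variant's reference allele, can be sketched as follows. The names `reverse_complement` and `ref_allele_matches` are hypothetical helpers, not pysam functions. The `fasta` argument is assumed to be any object with a `fetch(contig, start, end)` method that takes 0-based, half-open coordinates, which is how `FastaFile.fetch` behaves.

```python
# Translation table for the four bases plus N, in both cases
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def ref_allele_matches(fasta, chrom, pos, ref):
    """Check a VCF-style REF allele at 1-based position pos against the reference."""
    start = pos - 1  # VCF positions are 1-based; fetch() expects 0-based
    observed = fasta.fetch(chrom, start, start + len(ref))
    # Compare case-insensitively because references may be soft-masked
    return observed.upper() == ref.upper()
```

For a minus-strand gene, the fetched interval would be passed through `reverse_complement()` to obtain the transcribed orientation.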
4. Integrated Bioinformatics Workflows
Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.
Common workflows:
- Calculate coverage statistics for specific regions
- Validate variants against aligned reads
- Annotate variants with coverage information
- Extract sequences around variant positions
- Filter alignments or variants based on multiple criteria
- Generate coverage tracks for visualization
- Quality control across multiple data types
Reference: See `references/common_workflows.md` for detailed examples of:
- Quality control workflows (BAM statistics, reference consistency)
- Coverage analysis (per-base coverage, low coverage detection)
- Variant analysis (annotation, filtering by read support)
- Sequence extraction (variant contexts, gene sequences)
- Read filtering and subsetting
- Integration patterns (BAM+VCF, VCF+BED, etc.)
- Performance optimization for complex workflows
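A small sketch of the per-base coverage and low-coverage detection workflow, assuming `bam` is an open, indexed `AlignmentFile`. It uses `count_coverage()`, which returns four per-position arrays of A, C, G, and T counts. The helper names and the depth threshold of 10 are illustrative assumptions. Depth here counts only A/C/G/T bases that meet the base-quality threshold, so it can be lower than the raw number of overlapping reads.

```python
def per_base_depth(bam, contig, start, stop, min_base_quality=15):
    """Depth at each position of the 0-based, half-open interval [start, stop)."""
    a, c, g, t = bam.count_coverage(contig, start, stop,
                                    quality_threshold=min_base_quality)
    # Sum the four nucleotide counts position by position
    return [sum(counts) for counts in zip(a, c, g, t)]


def low_coverage_positions(depths, start, min_depth=10):
    """0-based positions whose depth falls below min_depth."""
    return [start + offset for offset, depth in enumerate(depths)
            if depth < min_depth]
```

The resulting positions could then be merged into intervals or used to annotate variants with their local depth.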
Key Concepts
Coordinate Systems
Critical: Pysam uses 0-based, half-open coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)
Exception: Region strings in `fetch()` follow samtools convention (1-based, inclusive):
```python
samfile.fetch("chr1", 999, 2000)        # 0-based, half-open: positions 999-1999
samfile.fetch(region="chr1:1000-2000")  # 1-based string: positions 1000-2000 (the same bases)
```
VCF files: Use 1-based coordinates in the file format, but `VariantRecord.start` is 0-based (`VariantRecord.pos` keeps the 1-based value).
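To make the two conventions concrete, here is a short pure-Python sketch that converts between a samtools-style region string and a 0-based, half-open interval. The function names are hypothetical, and the parser assumes the string always carries both a start and an end.

```python
def region_to_interval(region):
    """Convert 'contig:start-end' (1-based, inclusive) to (contig, start, end), 0-based half-open."""
    # rpartition splits on the last colon, so contig names containing ':' still work
    contig, _, span = region.rpartition(":")
    start_text, end_text = span.replace(",", "").split("-")
    return contig, int(start_text) - 1, int(end_text)


def interval_to_region(contig, start, end):
    """Convert a 0-based, half-open interval back to a 1-based, inclusive region string."""
    return f"{contig}:{start + 1}-{end}"
```

Only the start shifts by one. The end is unchanged because an exclusive 0-based end and an inclusive 1-based end refer to the same number.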
Indexing Requirements
Random access to specific genomic regions requires index files:
- BAM files: Require `.bai` index (create with `pysam.index()`)
- CRAM files: Require `.crai` index
- FASTA files: Require `.fai` index (create with `pysam.faidx()`)
- VCF.gz files: Require `.tbi` tabix index (create with `pysam.tabix_index()`)
- BCF files: Require `.csi` index
Without an index, use `fetch(until_eof=True)` for sequential reading.
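One way to check for an index before attempting random access is sketched below. The `INDEX_SUFFIX` table and both function names are assumptions for illustration. The sketch covers only the common convention of appending the index suffix to the data file name, so an index stored under a different name (for example `sample.bai` next to `sample.bam`) would not be detected.

```python
import os

# Conventional index suffix for each file extension listed above
INDEX_SUFFIX = {
    ".bam": ".bai",
    ".cram": ".crai",
    ".fa": ".fai",
    ".fasta": ".fai",
    ".vcf.gz": ".tbi",
    ".bcf": ".csi",
}


def expected_index(path):
    """Return the conventional index path for a data file, or None if the type is unknown."""
    for extension, suffix in INDEX_SUFFIX.items():
        if path.endswith(extension):
            return path + suffix
    return None


def has_index(path):
    """True if the conventionally named index file exists next to the data file."""
    index_path = expected_index(path)
    return index_path is not None and os.path.exists(index_path)
```

When `has_index()` returns False, the matching creation call from the list above (such as `pysam.index()` for a BAM file) can be run before fetching by region.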
File Modes
Specify format when opening files:
- `"rb"` - Read BAM (binary)
- `"r"` - Read SAM (text)
- `"rc"` - Read CRAM
- `"wb"` - Write BAM
- `"w"` - Write SAM
- `"wc"` - Write CRAM
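If file names follow the usual extensions, the mode string can be derived rather than hard-coded. This is only a sketch: `alignment_mode` is a hypothetical helper, and it assumes the extension reflects the actual file format.

```python
# Format letter appended to "r" or "w" for each alignment file extension
_FORMAT_FLAG = {".bam": "b", ".sam": "", ".cram": "c"}


def alignment_mode(path, write=False):
    """Return the AlignmentFile mode string for a path, e.g. 'rb' or 'wc'."""
    lowered = path.lower()
    for extension, flag in _FORMAT_FLAG.items():
        if lowered.endswith(extension):
            return ("w" if write else "r") + flag
    raise ValueError(f"Unrecognized alignment file extension: {path}")
```

The result can be passed directly as the second argument when opening a file, for example `pysam.AlignmentFile(path, alignment_mode(path))`.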
Performance Considerations
- Always use indexed files for random access operations
- Use `pileup()` for column-wise analysis instead of repeated fetch operations
- Use `count()` for counting instead of iterating and counting manually
- Process regions in parallel when analyzing independent genomic regions
- Close files explicitly to free resources
- Use `until_eof=True` for sequential processing without index
- Avoid multiple iterators unless necessary (use `multiple_iterators=True` if needed)
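The advice on parallel processing and on `count()` can be combined in a sketch like the one below. `windows` and `count_window` are hypothetical names, and the worker assumes an indexed BAM file. Each worker opens its own file handle rather than sharing one, which avoids the shared-iterator concerns noted above.

```python
def windows(contig, length, size):
    """Yield (contig, start, end) tiles covering [0, length), 0-based half-open."""
    for start in range(0, length, size):
        yield contig, start, min(start + size, length)


def count_window(task):
    """Worker: open a private handle and count reads overlapping one window."""
    import pysam  # imported lazily so windows() can be used without pysam installed

    path, contig, start, end = task
    with pysam.AlignmentFile(path, "rb") as bam:
        # count() uses the index and avoids materializing read objects in Python
        return contig, start, end, bam.count(contig, start, end)
```

The tasks could be dispatched with `concurrent.futures.ProcessPoolExecutor().map(count_window, tasks)`, where each task is `(path, contig, start, end)`. A read that spans a window boundary is counted in both windows, so the per-window counts should not be summed into a total read count.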
Common Pitfalls
- Coordinate confusion: Remember 0-based vs 1-based systems in different contexts
- Missing indices: Many operations require index files, so create them first
- Partial overlaps: `fetch()` returns reads overlapping region boundaries, not just those fully contained
- Iterator scope: Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
- Quality score editing: Assigning `query_sequence` discards the existing `query_qualities`, so copy the qualities first and reassign them afterward
- Stream limitations: Only stdin/stdout are supported for streaming, not arbitrary Python file objects
- Thread safety: While the GIL is released during I/O, comprehensive thread safety has not been fully validated
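For the partial-overlap pitfall, a small generator can narrow the output of `fetch()` to reads that lie entirely inside the region. `fully_contained` is a hypothetical helper. It relies on `reference_start` (0-based) and `reference_end` (one past the last aligned base, or `None` when no alignment is available).

```python
def fully_contained(reads, start, end):
    """Yield reads whose alignment lies entirely within [start, end), 0-based half-open."""
    for read in reads:
        if read.reference_end is None:
            # Unmapped reads, or reads without a CIGAR, have no alignment end
            continue
        if read.reference_start >= start and read.reference_end <= end:
            yield read
```

A typical call would wrap the iterator directly, as in `fully_contained(samfile.fetch("chr1", 1000, 2000), 1000, 2000)`.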
Command-Line Tools
Pysam provides access to samtools and bcftools commands:
```python
# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")

# Index BAM
pysam.samtools.index("sorted.bam")

# View specific region
pysam.samtools.view("-b", "-o", "region.bam", "input.bam", "chr1:1000-2000")

# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
```
**Error handling:**
```python
try:
    pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
    print(f"Error: {e}")
```
Resources
references/
Detailed documentation for each major capability:
- `alignment_files.md` - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments
- `variant_files.md` - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations
- `sequence_files.md` - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access
- `common_workflows.md` - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction
Getting Help
For detailed information on specific operations, refer to the appropriate reference document:
- Working with BAM files or calculating coverage → `alignment_files.md`
- Analyzing variants or genotypes → `variant_files.md`
- Extracting sequences or processing FASTQ → `sequence_files.md`
- Complex workflows integrating multiple file types → `common_workflows.md`
Official documentation: https://pysam.readthedocs.io/