cosmic-database

COSMIC Database

Overview

COSMIC (Catalogue of Somatic Mutations in Cancer) is the world's largest and most comprehensive database for exploring somatic mutations in human cancer. This skill provides programmatic access to COSMIC's cancer genomics data, including millions of mutations across thousands of cancer types, curated gene lists, mutational signatures, and clinical annotations.

When to Use This Skill

This skill should be used when:
  • Downloading cancer mutation data from COSMIC
  • Accessing the Cancer Gene Census for curated cancer gene lists
  • Retrieving mutational signature profiles
  • Querying structural variants, copy number alterations, or gene fusions
  • Analyzing drug resistance mutations
  • Working with cancer cell line genomics data
  • Integrating cancer mutation data into bioinformatics pipelines
  • Researching specific genes or mutations in cancer contexts

Prerequisites

Account Registration

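COSMIC requires authentication for data downloads. Register an account at cancer.sanger.ac.uk/cosmic before using the download script; academic access is free but requires registration, while commercial users must license COSMIC through QIAGEN.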
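Because every download needs an email and password, one option is to read them from environment variables instead of hardcoding them in scripts. The sketch below is only an illustration: the variable names `COSMIC_EMAIL` and `COSMIC_PASSWORD` and the helper `get_cosmic_credentials` are assumptions made here, not part of COSMIC or of the download script.

```python
import os


def get_cosmic_credentials(env=None):
    """Return (email, password) read from environment variables.

    COSMIC_EMAIL and COSMIC_PASSWORD are hypothetical names chosen for
    this sketch; any naming scheme works as long as it is consistent.
    """
    env = os.environ if env is None else env
    email = env.get("COSMIC_EMAIL")
    password = env.get("COSMIC_PASSWORD")
    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (("COSMIC_EMAIL", email), ("COSMIC_PASSWORD", password))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return email, password
```

The returned pair can then be passed as the `email` and `password` arguments of `download_cosmic_file`.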

Python Requirements

bash
uv pip install requests pandas
# The VCF examples in this document also import pysam:
uv pip install pysam

Quick Start

1. Basic File Download

Use the `scripts/download_cosmic.py` script to download COSMIC data files:
python
from scripts.download_cosmic import download_cosmic_file

# Download mutation data
download_cosmic_file(
    email="your_email@institution.edu",
    password="your_password",
    filepath="GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz",
    output_filename="cosmic_mutations.tsv.gz"
)

2. Command-Line Usage

bash
# Download using shorthand data type
python scripts/download_cosmic.py user@email.com --data-type mutations

# Download specific file
python scripts/download_cosmic.py user@email.com \
    --filepath GRCh38/cosmic/latest/cancer_gene_census.csv

# Download for specific genome assembly
python scripts/download_cosmic.py user@email.com \
    --data-type gene_census --assembly GRCh37 -o cancer_genes.csv

3. Working with Downloaded Data

python
import pandas as pd

# Read mutation data
mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')

# Read Cancer Gene Census
gene_census = pd.read_csv('cancer_gene_census.csv')

# Read VCF format
import pysam
vcf = pysam.VariantFile('CosmicCodingMuts.vcf.gz')
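COSMIC exports can be several GB, so loading a whole file into memory is not always practical. As a rough sketch of an alternative, the function below streams a gzip-compressed TSV with the standard library and yields only the rows for one gene. The helper name `iter_rows_for_gene` is made up for this example, and the default column name `Gene name` is taken from the analysis patterns in this document; check it against the header of the file actually downloaded.

```python
import csv
import gzip


def iter_rows_for_gene(path, gene, gene_column="Gene name"):
    """Yield rows (as dicts) of a gzip-compressed TSV whose gene column equals `gene`."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text, which suits
        # tab-separated exports where fields are not quoted.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row.get(gene_column) == gene:
                yield row
```

Because the function is a generator, only one row is held in memory at a time; wrap the call in `list(...)` when the filtered result is known to be small.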

Available Data Types

Core Mutations

Download comprehensive mutation data including point mutations, indels, and genomic annotations.
Common data types:
  • `mutations` - Complete coding mutations (TSV format)
  • `mutations_vcf` - Coding mutations in VCF format
  • `sample_info` - Sample metadata and tumor information
python
# Download all coding mutations
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz"
)

Cancer Gene Census

Access the expert-curated list of more than 700 cancer genes with substantial evidence of cancer involvement.
python
# Download Cancer Gene Census
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/cancer_gene_census.csv"
)

**Use cases**:
- Identifying known cancer genes
- Filtering variants by cancer relevance
- Understanding gene roles (oncogene vs tumor suppressor)
- Target gene selection for research
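To illustrate the "filtering variants by cancer relevance" use case, here is a small standard-library sketch. It assumes the census CSV has a `Gene Symbol` column (verify this against the header of the downloaded file) and that variants are plain dicts with a `gene` key; both helper names and the variant layout are hypothetical.

```python
import csv


def load_census_genes(census_csv_path, symbol_column="Gene Symbol"):
    """Return the set of gene symbols listed in a Cancer Gene Census CSV."""
    with open(census_csv_path, newline="", encoding="utf-8") as handle:
        return {
            row[symbol_column].strip()
            for row in csv.DictReader(handle)
            if row.get(symbol_column)
        }


def keep_census_variants(variants, census_genes, gene_key="gene"):
    """Keep only the variants whose gene appears in the census set."""
    return [v for v in variants if v.get(gene_key) in census_genes]
```

Loading the symbols into a set keeps each membership check constant-time, which matters when the variant list is long.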

Mutational Signatures

Download signature profiles for mutational signature analysis.
python
# Download signature definitions
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="signatures/signatures.tsv"
)

**Signature types**:
- Single Base Substitution (SBS) signatures
- Doublet Base Substitution (DBS) signatures
- Insertion/Deletion (ID) signatures
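A common first step with signature profiles is to compare a sample's mutation spectrum with a reference signature using cosine similarity. The sketch below assumes both profiles are available as dicts mapping a mutation-channel label to a count or probability; the function name and the toy three-channel data are illustrative only (SBS profiles normally use 96 trinucleotide channels), and the layout of the downloaded signature file is not assumed here.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two profiles given as {channel: value} dicts.

    Channels missing from one profile are treated as zero.
    """
    channels = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(c, 0.0) * profile_b.get(c, 0.0) for c in channels)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile has no direction to compare
    return dot / (norm_a * norm_b)


# Toy data: the sample counts are proportional to the signature weights,
# so the two profiles point in the same direction.
sample = {"A[C>A]A": 10, "A[C>G]A": 0, "A[C>T]A": 30}
signature = {"A[C>A]A": 0.25, "A[C>G]A": 0.0, "A[C>T]A": 0.75}
similarity = cosine_similarity(sample, signature)
```

Cosine similarity ignores overall scale, which is why raw counts can be compared directly with a probability-normalized signature.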

Structural Variants and Fusions

Access gene fusion data and structural rearrangements.
Available data types:
  • `structural_variants` - Structural breakpoints
  • `fusion_genes` - Gene fusion events
python
# Download gene fusions
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicFusionExport.tsv.gz"
)

Copy Number and Expression

Retrieve copy number alterations and gene expression data.
Available data types:
  • `copy_number` - Copy number gains/losses
  • `gene_expression` - Over/under-expression data
python
# Download copy number data
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicCompleteCNA.tsv.gz"
)

Resistance Mutations

Access drug resistance mutation data with clinical annotations.
python
# Download resistance mutations
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicResistanceMutations.tsv.gz"
)

Working with COSMIC Data

Genome Assemblies

COSMIC provides data for two reference genomes:
  • GRCh38 (recommended, current standard)
  • GRCh37 (legacy, for older pipelines)
Specify the assembly in file paths:
python
# GRCh38 (recommended)
filepath="GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz"

# GRCh37 (legacy)
filepath="GRCh37/cosmic/latest/CosmicMutantExport.tsv.gz"

Versioning

  • Use `latest` in file paths to always get the most recent release
  • COSMIC is updated quarterly (current version: v102, May 2025)
  • Specific versions can be used for reproducibility: `v102`, `v101`, etc.
File Formats

  • TSV/CSV: Tab- or comma-separated text, usually gzip-compressed (.gz); read with pandas
  • VCF: Standard variant format, use with pysam, bcftools, or GATK
  • All files include headers describing column contents

Common Analysis Patterns

Filter mutations by gene:
python
import pandas as pd

mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')
tp53_mutations = mutations[mutations['Gene name'] == 'TP53']

Identify cancer genes by role:
python
gene_census = pd.read_csv('cancer_gene_census.csv')
oncogenes = gene_census[gene_census['Role in Cancer'].str.contains('oncogene', na=False)]
tumor_suppressors = gene_census[gene_census['Role in Cancer'].str.contains('TSG', na=False)]

Extract mutations by cancer type:
python
mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')
lung_mutations = mutations[mutations['Primary site'] == 'lung']

Work with VCF files:
python
import pysam

# fetch() by region needs a tabix index (.tbi) next to the VCF
vcf = pysam.VariantFile('CosmicCodingMuts.vcf.gz')
# These coordinates fall within TP53 on GRCh37; GRCh38 coordinates differ
for record in vcf.fetch('17', 7577000, 7579000):
    print(record.id, record.ref, record.alts, record.info)
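Building on the gene filter above, a frequent follow-up is to count which amino-acid changes recur most often in a gene. The sketch below works on any iterable of row dicts; the `Mutation AA` column name is an assumption about the mutation export and should be checked against the file header, and `top_recurrent_changes` is a hypothetical helper name.

```python
from collections import Counter


def top_recurrent_changes(rows, gene, n=3,
                          gene_column="Gene name", change_column="Mutation AA"):
    """Return the n most frequent amino-acid changes for `gene` as (change, count) pairs."""
    counts = Counter(
        row[change_column]
        for row in rows
        if row.get(gene_column) == gene and row.get(change_column)
    )
    return counts.most_common(n)


# Toy rows standing in for records read from the mutation export.
example_rows = [
    {"Gene name": "TP53", "Mutation AA": "p.R175H"},
    {"Gene name": "TP53", "Mutation AA": "p.R248Q"},
    {"Gene name": "TP53", "Mutation AA": "p.R175H"},
    {"Gene name": "KRAS", "Mutation AA": "p.G12D"},
]
hotspots = top_recurrent_changes(example_rows, "TP53", n=1)
```

With pandas, the same function can be fed `mutations.to_dict("records")`, although for very large tables a streaming reader is likely the better fit.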

Data Reference

For comprehensive information about COSMIC data structure, available files, and field descriptions, see `references/cosmic_data_reference.md`. This reference includes:
  • Complete list of available data types and files
  • Detailed field descriptions for each file type
  • File format specifications
  • Common file paths and naming conventions
  • Data update schedule and versioning
  • Citation information
Use this reference when:
  • Exploring what data is available in COSMIC
  • Understanding specific field meanings
  • Determining the correct file path for a data type
  • Planning analysis workflows with COSMIC data

Helper Functions

The download script includes helper functions for common operations:

Get Common File Paths

python
from scripts.download_cosmic import get_common_file_path

# Get path for mutations file
path = get_common_file_path('mutations', genome_assembly='GRCh38')
# Returns: 'GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz'

# Get path for gene census
path = get_common_file_path('gene_census')
# Returns: 'GRCh38/cosmic/latest/cancer_gene_census.csv'

**Available shortcuts**:
- `mutations` - Core coding mutations
- `mutations_vcf` - VCF format mutations
- `gene_census` - Cancer Gene Census
- `resistance_mutations` - Drug resistance data
- `structural_variants` - Structural variants
- `gene_expression` - Expression data
- `copy_number` - Copy number alterations
- `fusion_genes` - Gene fusions
- `signatures` - Mutational signatures
- `sample_info` - Sample metadata
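To make the idea of a shortcut concrete, here is one way such a lookup could work. This is not the script's actual implementation: `EXAMPLE_SHORTCUTS` and `resolve_shortcut` are hypothetical, and the table lists only those shortcuts whose file paths appear elsewhere in this document.

```python
# Hypothetical mapping built only from file paths shown in this document;
# the real script may cover more shortcuts and resolve them differently.
EXAMPLE_SHORTCUTS = {
    "mutations": "{assembly}/cosmic/latest/CosmicMutantExport.tsv.gz",
    "gene_census": "{assembly}/cosmic/latest/cancer_gene_census.csv",
    "fusion_genes": "{assembly}/cosmic/latest/CosmicFusionExport.tsv.gz",
    "copy_number": "{assembly}/cosmic/latest/CosmicCompleteCNA.tsv.gz",
    "resistance_mutations": "{assembly}/cosmic/latest/CosmicResistanceMutations.tsv.gz",
}


def resolve_shortcut(data_type, assembly="GRCh38"):
    """Translate a shortcut name into a file path, failing clearly on unknown names."""
    try:
        template = EXAMPLE_SHORTCUTS[data_type]
    except KeyError:
        known = ", ".join(sorted(EXAMPLE_SHORTCUTS))
        raise ValueError(f"Unknown data type {data_type!r}; known: {known}") from None
    return template.format(assembly=assembly)
```

Listing the known names in the error message gives a quick hint when a shortcut is mistyped.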

Troubleshooting

Authentication Errors

  • Verify email and password are correct
  • Ensure account is registered at cancer.sanger.ac.uk/cosmic
  • Check if commercial license is required for your use case

File Not Found

  • Verify the filepath is correct
  • Check that the requested version exists
  • Use `latest` for the most recent version
  • Confirm genome assembly (GRCh37 vs GRCh38) is correct

Large File Downloads

  • COSMIC files can be several GB in size
  • Ensure sufficient disk space
  • Download may take several minutes depending on connection
  • The script shows download progress for large files
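Since files can run to several GB, it may be worth checking free disk space before starting a download. A minimal sketch using the standard library follows; the helper name `has_free_space` and the 5 GB figure in the example are assumptions for illustration, not COSMIC file sizes.

```python
import shutil


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has at least `required_bytes` free."""
    return shutil.disk_usage(directory).free >= required_bytes


# Example: require roughly 5 GB (an arbitrary illustrative threshold).
enough_space = has_free_space(".", 5 * 1024**3)
```

Compressed exports grow when unpacked, so leave extra headroom if the file will be decompressed on disk.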

Commercial Use

  • Commercial users must license COSMIC through QIAGEN
  • Contact: cosmic-translation@sanger.ac.uk
  • Academic access is free but requires registration

Integration with Other Tools

COSMIC data integrates well with:
  • Variant annotation: VEP, ANNOVAR, SnpEff
  • Signature analysis: SigProfiler, deconstructSigs, MuSiCa
  • Cancer genomics: cBioPortal, OncoKB, CIViC
  • Bioinformatics: Bioconductor, TCGA analysis tools
  • Data science: pandas, scikit-learn, PyTorch

Additional Resources

Citation

When using COSMIC data, cite: Tate JG, Bamford S, Jubb HC, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research. 2019;47(D1):D941-D947.