imaging-data-commons

Imaging Data Commons

Overview

Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication is required for data access.

**Primary tool:** `idc-index` (GitHub)

**Check current data scale for the latest version:**

```python
from idc_index import IDCClient

client = IDCClient()

# Get IDC data version
print(client.get_idc_version())

# Get collection count and total series
stats = client.sql_query("""
    SELECT
        COUNT(DISTINCT collection_id) AS collections,
        COUNT(DISTINCT analysis_result_id) AS analysis_results,
        COUNT(DISTINCT PatientID) AS patients,
        COUNT(DISTINCT StudyInstanceUID) AS studies,
        COUNT(DISTINCT SeriesInstanceUID) AS series,
        SUM(instanceCount) AS instances,
        SUM(series_size_MB)/1000000 AS size_TB
    FROM index
""")
print(stats)
```

**Core workflow:**
1. Query metadata → `client.sql_query()`
2. Download DICOM files → `client.download_from_selection()`
3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)`

When to Use This Skill

  • Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
  • Selecting image subsets by cancer type, modality, anatomical site, or other metadata
  • Downloading DICOM data from IDC
  • Checking data licenses before use in research or commercial applications
  • Visualizing medical images in a browser without local DICOM viewer software

IDC Data Model

IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
  • `collection_id`: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection.
  • `analysis_result_id`: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

Use `collection_id` to find original imaging data, which may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations.

Key identifiers for queries:

| Identifier | Scope | Use for |
|------------|-------|---------|
| `collection_id` | Dataset grouping | Filtering by project/study |
| `PatientID` | Patient | Grouping images by patient |
| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization |
| `SeriesInstanceUID` | DICOM series | Grouping of related instances, visualization |

Index Tables

The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types; always query it when writing SQL or exploring data structure.

Available Tables

| Table | Row Granularity | Loaded | Description |
|-------|-----------------|--------|-------------|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
| `collections_index` | 1 row = 1 collection | `fetch_index()` | Collection-level metadata and descriptions |
| `analysis_results_index` | 1 row = 1 analysis result collection | `fetch_index()` | Metadata about derived datasets (annotations, segmentations) |
| `clinical_index` | 1 row = 1 clinical data column | `fetch_index()` | Dictionary mapping clinical table columns to collections |
| `sm_index` | 1 row = 1 slide microscopy series | `fetch_index()` | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | `fetch_index()` | Instance-level (SOPInstanceUID) metadata for slide microscopy |
| `seg_index` | 1 row = 1 DICOM Segmentation series | `fetch_index()` | Segmentation metadata: algorithm, segment count, reference to source image series |

**Auto** = loaded automatically when `IDCClient()` is instantiated

**fetch_index()** = requires `client.fetch_index("table_name")` to load
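
As a rough sketch of the Auto versus `fetch_index()` distinction, a small helper can fetch only the tables that are not loaded automatically. The helper name `ensure_indices` and the `FakeClient` stand-in are hypothetical and exist only so the sketch runs without network access; with the real package, an `IDCClient()` instance would be passed instead.

```python
# Tables that this section lists as loaded automatically.
AUTO_LOADED = {"index", "prior_versions_index"}


def ensure_indices(client, table_names):
    """Call client.fetch_index() for each on-demand table, once per table.

    Hypothetical helper: returns the list of table names it fetched.
    """
    fetched = []
    for name in table_names:
        if name in AUTO_LOADED or name in fetched:
            continue  # already available, or already requested in this call
        client.fetch_index(name)
        fetched.append(name)
    return fetched


class FakeClient:
    """Stand-in for IDCClient that only records fetch_index() calls."""

    def __init__(self):
        self.calls = []

    def fetch_index(self, name):
        self.calls.append(name)


fake = FakeClient()
fetched = ensure_indices(fake, ["index", "sm_index", "seg_index", "sm_index"])
print(fetched)
```

Skipping the auto-loaded tables keeps the helper safe to call with any mix of table names before writing a join.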

Joining Tables

Key columns are not explicitly labeled; the following is a subset that can be used in joins.

| Join Column | Tables | Use Case |
|-------------|--------|----------|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join `seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID`) |

**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

**Example joins:**
```python
from idc_index import IDCClient

client = IDCClient()

# Join index with collections_index to get cancer types
client.fetch_index("collections_index")
result = client.sql_query("""
    SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE i.Modality = 'MR'
    LIMIT 10
""")

# Join index with sm_index for slide microscopy details
client.fetch_index("sm_index")
result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
""")

# Join seg_index with index to find segmentations and their source images
client.fetch_index("seg_index")
result = client.sql_query("""
    SELECT
        s.SeriesInstanceUID AS seg_series,
        s.AlgorithmName,
        s.total_segments,
        src.collection_id,
        src.Modality AS source_modality,
        src.BodyPartExamined
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
    LIMIT 10
""")
```
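
To make the `seg_index` → `index` join concrete without downloading any IDC data, the following minimal sketch reproduces the same join condition on toy rows. It uses Python's built-in `sqlite3` purely as a stand-in for the SQL engine behind `client.sql_query()`; the table name `series_index` (used instead of `index`, which SQLite reserves), the UIDs, and the algorithm name are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE series_index (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT);
    CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT, AlgorithmName TEXT);

    -- Two source image series (toy UIDs) plus the segmentation series itself.
    INSERT INTO series_index VALUES ('1.2.3.1', 'toy_collection', 'CT');
    INSERT INTO series_index VALUES ('1.2.3.2', 'toy_collection', 'MR');
    INSERT INTO series_index VALUES ('1.2.3.9', 'toy_collection', 'SEG');

    -- One segmentation that points back at the CT series.
    INSERT INTO seg_index VALUES ('1.2.3.9', '1.2.3.1', 'toy-algorithm');
""")

# Same join condition as in the seg_index example: the segmentation row
# is matched to the image series it was derived from.
rows = conn.execute("""
    SELECT s.SeriesInstanceUID AS seg_series,
           s.AlgorithmName,
           src.Modality AS source_modality
    FROM seg_index s
    JOIN series_index src
      ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()

print(rows)
conn.close()
```

Joining on `SeriesInstanceUID` instead would return the segmentation's own row (modality `SEG`), which is why the two join columns appear as separate entries in the table of join columns.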

Accessing Index Tables

**Via SQL (recommended for filtering/aggregation):**

```python
from idc_index import IDCClient

client = IDCClient()

# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```

**As pandas DataFrames (direct access):**

```python
# Primary index (always available after client initialization)
df = client.index

# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```

Discovering Table Schemas (Essential for Query Writing)

The `indices_overview` dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.

**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or a reference to a DICOM tag). This allows leveraging DICOM knowledge when querying: standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, and `BodyPartExamined` work as expected.

```python
from idc_index import IDCClient

client = IDCClient()

# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f"  Installed: {info['installed']}")
    print(f"  Description: {info['description']}")

# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f"  {col['name']} ({col['type']}): {desc}")

# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```

**Alternative: use `get_index_schema()` method:**

```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```

Key Columns in Primary `index` Table

Most common columns for queries (use `indices_overview` for complete list and descriptions):

| Column | Type | DICOM | Description |
|--------|------|-------|-------------|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, indicates which analysis results collection the series is part of |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID; use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |

**DICOM = Yes:** Column value extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
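
As one example of applying standard DICOM knowledge to these columns: the DICOM Age String (AS) value representation encodes an age as three digits followed by a unit letter (`D`, `W`, `M`, or `Y`). The sketch below converts such a value to approximate years. It assumes `PatientAge` holds the raw AS string; the function name `parse_dicom_age` and the conversion factors (365.25 days and 12 months per year) are illustrative choices, not part of `idc-index`.

```python
import re

# Approximate number of each unit per year; used only for this illustration.
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12.0, "Y": 1.0}
_AS_PATTERN = re.compile(r"^(\d{3})([DWMY])$")


def parse_dicom_age(value):
    """Convert a DICOM Age String such as '045Y' to years, or None if malformed."""
    if not isinstance(value, str):
        return None
    match = _AS_PATTERN.match(value.strip())
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return count / _UNITS_PER_YEAR[unit]


print(parse_dicom_age("045Y"))
print(parse_dicom_age("018M"))
print(parse_dicom_age("unknown"))
```

Returning `None` for malformed or missing values keeps the helper usable with `DataFrame.apply`, since real-world metadata often has empty age fields.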

Clinical Data Access

```python
from idc_index import IDCClient

client = IDCClient()

# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")

# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")
```

See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging.

Data Access Options

| Method | Auth Required | Best For |
|--------|---------------|----------|
| `idc-index` | No | Key queries and downloads (recommended) |
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |

Cloud storage organization

IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.

| Bucket (AWS / GCS) | License | Content |
|--------------------|---------|---------|
| `idc-open-data` / `idc-open-data` | No commercial restriction | >90% of IDC data |
| `idc-open-data-two` / `idc-open-idc1` | No commercial restriction | Collections with potential head scans |
| `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data |

Files are stored as `<crdc_series_uuid>/<crdc_instance_uuid>.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure.

See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning.
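
Because the GCS buckets mirror the S3 buckets with the same path structure, an S3 series URL can in principle be rewritten into its GCS counterpart by swapping the bucket name. The sketch below does this using the AWS / GCS bucket pairs listed in this section. The function name `s3_to_gcs_url` is hypothetical, and the UUID in the example URL is only a placeholder.

```python
# AWS bucket -> GCS bucket, as listed in the bucket table.
S3_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}


def s3_to_gcs_url(s3_url):
    """Rewrite s3://<aws_bucket>/<path> as gs://<gcs_bucket>/<path>."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    bucket, _, path = s3_url[len(prefix):].partition("/")
    if bucket not in S3_TO_GCS_BUCKET:
        raise ValueError(f"unknown IDC bucket: {bucket!r}")
    return f"gs://{S3_TO_GCS_BUCKET[bucket]}/{path}"


print(s3_to_gcs_url("s3://idc-open-data-cr/cb09464a-c5cc-4428-9339-d7fa87cfe837/*"))
```

Raising on unknown buckets, rather than passing the name through, avoids silently producing a GCS URL that does not exist.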

DICOMweb access

IDC data is available via a DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.

| Endpoint | Auth | Use Case |
|----------|------|----------|
| Public proxy | No | Testing, moderate queries, daily quota |
| Google Healthcare | Yes (GCP) | Production use, higher quotas |

See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details.

Installation and Setup

**Required (for basic access):**

```bash
pip install --upgrade idc-index
```

**Important:** Each new IDC data release triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.

**Tested with:** idc-index 0.11.7 (IDC data version v23)

**Optional (for data analysis):**

```bash
pip install pandas numpy pydicom
```
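
To record which package version an analysis ran against (useful when pinning an older release for reproducibility), the installed version can be read with the standard library. This is a minimal sketch; the helper name `installed_idc_index_version` is hypothetical, and it returns `None` when the package is not installed.

```python
from importlib import metadata


def installed_idc_index_version():
    """Return the installed idc-index version string, or None if absent."""
    try:
        return metadata.version("idc-index")
    except metadata.PackageNotFoundError:
        return None


version = installed_idc_index_version()
if version is None:
    print("idc-index is not installed; run: pip install --upgrade idc-index")
else:
    print(f"idc-index {version}")
```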

Core Capabilities

1. Data Discovery and Exploration

Discover what imaging collections and data are available in IDC:

```python
from idc_index import IDCClient

client = IDCClient()

# Get summary statistics from primary index
query = """
SELECT
    collection_id,
    COUNT(DISTINCT PatientID) AS patients,
    COUNT(DISTINCT SeriesInstanceUID) AS series,
    SUM(series_size_MB) AS size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)

# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
""")
```

**`collections_index`** provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types, without needing to aggregate from the primary index.

**`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

2. Querying Metadata with SQL

Query the IDC mini-index using SQL to find specific datasets.

**First, explore available values for filter columns:**

```python
from idc_index import IDCClient

client = IDCClient()

# Check what Modality values exist
modalities = client.sql_query("""
    SELECT DISTINCT Modality, COUNT(*) AS series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
""")
print(modalities)

# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) AS series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
""")
print(body_parts)
```

**Then query with validated filter values:**

```python
# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
    SELECT
        collection_id,
        PatientID,
        SeriesInstanceUID,
        Modality,
        SeriesDescription,
        license_short_name
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined = 'BREAST'
    LIMIT 20
""")

# Access results as pandas DataFrame
for idx, row in results.iterrows():
    print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")
```

**To filter by cancer type, join with `collections_index`:**

```python
client.fetch_index("collections_index")
results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
      AND i.Modality = 'MR'
    LIMIT 20
""")
```

**Available metadata fields** (use `client.indices_overview` for complete list):
  • Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
  • Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
  • Clinical: PatientAge, PatientSex, StudyDate
  • Descriptions: StudyDescription, SeriesDescription
  • Licensing: license_short_name

**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.

3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage:

**Download entire collection:**

```python
from idc_index import IDCClient

client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
    collection_id="rider_pilot",
    downloadDir="./data/rider"
)
```

**Download specific series:**

```python
# First, query for series UIDs
series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND collection_id = 'nlst'
    LIMIT 5
""")

# Download only those series
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/lung_ct"
)
```

**Custom directory structure:**

Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`

```python
# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
    collection_id="tcga_luad",
    downloadDir="./data",
    dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/flat",
    dirTemplate=""
)
# Results in: ./data/flat/*.dcm
```
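
To illustrate how a `dirTemplate` string maps series metadata to a directory path, here is a minimal sketch of the substitution. It is not the package's implementation: the function `expand_dir_template` is hypothetical, it only knows the five placeholders that appear in the default template, and the UIDs are toy values.

```python
import re

# Placeholders that appear in the default dirTemplate.
KNOWN_FIELDS = (
    "collection_id",
    "PatientID",
    "StudyInstanceUID",
    "SeriesInstanceUID",
    "Modality",
)

# Match only known field names so that "%Modality_%SeriesInstanceUID"
# is read as "%Modality" + "_" + "%SeriesInstanceUID".
_PLACEHOLDER = re.compile("%(" + "|".join(KNOWN_FIELDS) + ")")


def expand_dir_template(template, series):
    """Replace each %field in the template with the value from `series`."""
    return _PLACEHOLDER.sub(lambda m: str(series[m.group(1)]), template)


series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",      # toy UID
    "SeriesInstanceUID": "1.2.3.4",   # toy UID
    "Modality": "CT",
}

print(expand_dir_template("%collection_id/%PatientID/%Modality", series))
print(expand_dir_template(
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID",
    series,
))
```

Matching against an explicit list of field names, instead of a generic `%\w+` pattern, is what lets the underscore in the default template act as a literal separator.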

Command-Line Download

The `idc download` command provides command-line access to download functionality without writing Python code. It is available after installing `idc-index`.

**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

```bash
# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data
```

**Options:**

| Option | Description |
|--------|-------------|
| `--download-dir` | Output directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |

**Manifest files:**

Manifest files contain S3 URLs (one per line) and can be:
- Exported from the IDC Portal after cohort selection
- Shared by collaborators for reproducible data access
- Generated programmatically from query results

Format (one S3 URL per line):
```
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
```
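
For reference, a manifest in this format can be parsed with a few lines of standard-library Python. The sketch below splits each line into a bucket and a series-level UUID; the function name `parse_manifest_line` is hypothetical, and it assumes every line follows the `s3://<bucket>/<crdc_series_uuid>/*` pattern shown above.

```python
def parse_manifest_line(line):
    """Split 's3://<bucket>/<crdc_series_uuid>/*' into (bucket, uuid)."""
    line = line.strip()
    if not line.startswith("s3://") or not line.endswith("/*"):
        raise ValueError(f"unexpected manifest line: {line!r}")
    bucket, _, series_uuid = line[len("s3://"):-len("/*")].partition("/")
    if not bucket or not series_uuid or "/" in series_uuid:
        raise ValueError(f"unexpected manifest line: {line!r}")
    return bucket, series_uuid


manifest_text = """\
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
"""

# Skip blank lines so a trailing newline in the file is harmless.
entries = [parse_manifest_line(l) for l in manifest_text.splitlines() if l.strip()]
for bucket, series_uuid in entries:
    print(bucket, series_uuid)
```

A check like this can catch a malformed manifest before a long download is started.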

**Example: Generate manifest from Python query:**

```python
from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')
```

Then download:
```bash
idc download ct_manifest.txt --download-dir ./ct_data
```

4. Visualizing IDC Images

View DICOM data in browser without downloading:

```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

# First query to get valid UIDs
results = client.sql_query("""
    SELECT SeriesInstanceUID, StudyInstanceUID
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
""")

# View single series
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)

# View all series in a study (useful for multi-series exams like MRI protocols)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)
```

The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).

5. Understanding and Checking Licenses

Check data licensing before use (critical for commercial applications):
```python
from idc_index import IDCClient

client = IDCClient()

# Check licenses for all collections
query = """
SELECT DISTINCT
  collection_id,
  license_short_name,
  COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""

licenses = client.sql_query(query)
print(licenses)
```

**License types in IDC:**
- **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution
- **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only
- **Custom licenses** (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

**Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
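
The license categories above can be turned into a simple filter. The following is a minimal sketch, not an official rule: `allows_commercial_use` is a hypothetical helper, and it conservatively assumes that anything other than a plain CC BY license (CC BY-NC, custom terms, missing values) needs manual review.

```python
from typing import Optional


def allows_commercial_use(license_short_name: Optional[str]) -> bool:
    """Return True only for CC BY licenses without the NC (non-commercial) clause.

    Illustrative assumption: custom licenses and missing values are treated
    as "not cleared" so that they get reviewed by a person.
    """
    if not license_short_name:
        return False
    name = license_short_name.strip().upper()
    return name.startswith("CC BY") and "NC" not in name


# Example values taken from the license list above
for name in ["CC BY 4.0", "CC BY-NC 4.0", "NLM Terms and Conditions"]:
    print(f"{name}: {allows_commercial_use(name)}")
```

Applied to the `licenses` DataFrame from the query above, the same function could be used as `licenses[licenses['license_short_name'].map(allows_commercial_use)]`.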

Generating Citations for Attribution

The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations:
```python
from idc_index import IDCClient

client = IDCClient()

# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
    print(citation)

# Get citations for specific series
results = client.sql_query("""
    SELECT SeriesInstanceUID FROM index
    WHERE collection_id = 'tcga_luad' LIMIT 5
""")
citations = client.citations_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)

# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
    collection_id="tcga_luad",
    citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)
```

**Parameters:**
- `collection_id`: Filter by collection(s)
- `patientId`: Filter by patient ID(s)
- `studyInstanceUID`: Filter by study UID(s)
- `seriesInstanceUID`: Filter by series UID(s)
- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants:
  - `CITATION_FORMAT_APA` (default) - APA style
  - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX
  - `CITATION_FORMAT_JSON` - CSL JSON
  - `CITATION_FORMAT_TURTLE` - RDF Turtle

**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.

6. Batch Processing and Filtering

Process large datasets efficiently with filtering:
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners
query = """
SELECT
  SeriesInstanceUID,
  PatientID,
  collection_id,
  ManufacturerModelName
FROM index
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
  AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""

results = client.sql_query(query)

# Save manifest for later
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=f"./data/batch_{i//batch_size}"
    )
```
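
The batching step can also be factored into a small reusable generator, which keeps the slicing arithmetic out of the download loop. This is only a sketch; `iter_batches` is a hypothetical helper name, not an idc-index function.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder identifiers standing in for real SeriesInstanceUID values
for number, batch in enumerate(iter_batches(["uid-a", "uid-b", "uid-c"], 2)):
    print(number, batch)
```

Each yielded `batch` could then be passed as `seriesInstanceUID=batch` to `client.download_from_selection()`, with `number` used to build the per-batch `downloadDir`.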

7. Advanced Queries with BigQuery

For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. This requires a GCP account with billing enabled.

Quick reference:
- Dataset: `bigquery-public-data.idc_current.*`
- Main table: `dicom_all` (combined metadata)
- Full metadata: `dicom_metadata` (all DICOM tags)
- Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values)

See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.

8. Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Programmatic queries & downloads | `idc-index` | This document |
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
| Complex metadata queries | BigQuery | `references/bigquery_guide.md` |
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |

**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads).

9. Integration with Analysis Pipelines

Integrate IDC data into imaging analysis workflows:

**Read downloaded DICOM files:**
```python
import pydicom
import os

# Read DICOM files from downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

# Load first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")
```

**Build 3D volume from CT series:**
```python
import pydicom
import numpy as np
from pathlib import Path

def load_ct_series(series_path):
    """Load CT series as 3D numpy array"""
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]

    # Sort by slice location
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

    # Stack into 3D array
    volume = np.stack([s.pixel_array for s in slices])

    return volume, slices[0]  # Return volume and first slice for metadata

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}")  # (z, y, x)
```

**Integrate with SimpleITK:**
```python
import SimpleITK as sitk
from pathlib import Path

# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

# CurvatureFlow expects a floating-point image; CT data is usually stored as integers
image = sitk.Cast(image, sitk.sitkFloat32)

# Apply processing
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
```

Common Use Cases

Use Case 1: Find and Download Lung CT Scans for Deep Learning

**Objective:** Build a training dataset of lung CT scans from the NLST collection

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# 1. Query for lung CT scans with specific criteria
query = """
SELECT
  PatientID,
  SeriesInstanceUID,
  SeriesDescription
FROM index
WHERE collection_id = 'nlst'
  AND Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""

results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

# 2. Download data organized by patient
client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)

# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```

Use Case 2: Query Brain MRI by Manufacturer for Quality Study

**Objective:** Compare image quality across different MRI scanner manufacturers

**Steps:**
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Query for brain MRI grouped by manufacturer
query = """
SELECT
  Manufacturer,
  ManufacturerModelName,
  COUNT(DISTINCT SeriesInstanceUID) as num_series,
  COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
  AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""

manufacturers = client.sql_query(query)
print(manufacturers)

# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']

    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
      AND ManufacturerModelName = '{model}'
      AND Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """

    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
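
The per-manufacturer query above interpolates metadata values directly into SQL with an f-string, so a manufacturer or model name containing a single quote would break the statement. A minimal sketch of one way to guard against that follows, assuming standard SQL string-literal escaping (doubling single quotes). `sql_literal` and `build_series_query` are hypothetical helpers, not part of idc-index.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_series_query(manufacturer: str, model: str, limit: int = 5) -> str:
    """Build the per-manufacturer brain MRI query with safely quoted values."""
    return f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = {sql_literal(manufacturer)}
      AND ManufacturerModelName = {sql_literal(model)}
      AND Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT {int(limit)}
    """


# The model name here is a made-up example value
print(build_series_query("GE MEDICAL SYSTEMS", "Example Model"))
```

Inside the loop, `client.sql_query(build_series_query(mfr, model))` would replace the inline f-string.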

Use Case 3: Visualize Series Without Downloading

**Objective:** Preview imaging data before committing to download
```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
""")

# Preview each in browser
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
```

For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.

Use Case 4: License-Aware Batch Download for Commercial Use

**Objective:** Download only CC-BY licensed data suitable for commercial applications

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
  SeriesInstanceUID,
  collection_id,
  PatientID,
  Modality
FROM index
WHERE license_short_name LIKE 'CC BY%'
  AND license_short_name NOT LIKE '%NC%'
  AND Modality IN ('CT', 'MR')
  AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""

cc_by_data = client.sql_query(query)

print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")

# Download with license verification
client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)

# Save license information
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```

Best Practices

- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- **Start with small queries** - Use a `LIMIT` clause when exploring to avoid long downloads and to understand the data structure
- **Use mini-index for simple queries** - Only use BigQuery when you need comprehensive metadata or complex JOINs
- **Organize downloads with dirTemplate** - Use meaningful directory structures like `%collection_id/%PatientID/%Modality`
- **Cache query results** - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility
- **Estimate size first** - Check collection size before downloading; some collections are terabytes in size
- **Save manifests** - Always save query results with Series UIDs for reproducibility and data provenance
- **Read documentation** - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/
- **Use IDC forum** - Search for questions and answers, and ask the IDC maintainers and users at https://discourse.canceridc.dev/

Troubleshooting

**Issue: `ModuleNotFoundError: No module named 'idc_index'`**
- **Cause:** idc-index package not installed
- **Solution:** Install with `pip install --upgrade idc-index`

**Issue: Download fails with connection timeout**
- **Cause:** Network instability or large download size
- **Solution:**
  - Download smaller batches (e.g., 10-20 series at a time)
  - Check network connection
  - Use `dirTemplate` to organize downloads by batch
  - Implement retry logic with delays

**Issue: `BigQuery quota exceeded` or billing errors**
- **Cause:** BigQuery requires billing-enabled GCP project
- **Solution:** Use idc-index mini-index for simple queries (no billing required), or see `references/bigquery_guide.md` for cost optimization tips

**Issue: Series UID not found or no data returned**
- **Cause:** Typo in UID, data not in current IDC version, or wrong field name
- **Solution:**
  - Check if data is in current IDC version (some old data may be deprecated)
  - Use `LIMIT 5` to test query first
  - Check field names against metadata schema documentation

**Issue: Downloaded DICOM files won't open**
- **Cause:** Corrupted download or incompatible viewer
- **Solution:**
  - Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
  - Verify file integrity (check file sizes)
  - Use pydicom to validate: `pydicom.dcmread(file, force=True)`
  - Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
  - Re-download the series
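
The "retry logic with delays" suggested for download timeouts can be sketched as a small generic wrapper. This is an illustration under stated assumptions, not idc-index functionality: `retry` is a hypothetical helper, the exponential delay schedule is one arbitrary choice, and it only helps if the wrapped call actually raises an exception on failure.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on failure with exponentially growing delays.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
    The last error is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A batch download could then be wrapped as `retry(lambda: client.download_from_selection(seriesInstanceUID=uids, downloadDir="./data"))`, where `uids` is a list of series UIDs from an earlier query.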

Common SQL Query Patterns

Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.

Discover available filter values

```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
""")
```

Find annotations and segmentations

**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Find TotalSegmentator results with source image context
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")
```

Query slide microscopy data

```python
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
           MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
""")
```

Estimate download size

```python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
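
Once a total in MB is known, it can be compared with the free space at the download location before starting. The sketch below is illustrative only: `format_size_mb` and `fits_on_disk` are hypothetical helpers, decimal units (1 GB = 1000 MB) are an assumption, and the 10% headroom is an arbitrary safety margin.

```python
import shutil


def format_size_mb(size_mb: float) -> str:
    """Format a size given in MB using decimal units (assumed: 1 GB = 1000 MB)."""
    if size_mb >= 1_000_000:
        return f"{size_mb / 1_000_000:.2f} TB"
    if size_mb >= 1_000:
        return f"{size_mb / 1_000:.2f} GB"
    return f"{size_mb:.2f} MB"


def fits_on_disk(size_mb: float, target_dir: str = ".", headroom: float = 1.1) -> bool:
    """Return True if size_mb, scaled by the headroom factor, fits in the free space."""
    free_mb = shutil.disk_usage(target_dir).free / 1_000_000
    return size_mb * headroom <= free_mb


print(format_size_mb(1500))
print(fits_on_disk(1500))
```

The `total_mb` value returned by the size query above could be passed to both functions before calling `download_from_selection()`.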

Link to clinical data

```python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")
```

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.

Related Skills

The following skills complement IDC workflows for downstream analysis and visualization:

DICOM Processing

- **pydicom** - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).

Pathology and Slide Microscopy

- **histolab** - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
- **pathml** - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.

Metadata Visualization

元数据可视化

  • matplotlib - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
  • seaborn - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
  • plotly - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.

Data Exploration

  • exploratory-data-analysis - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.

Resources

Schema Reference (Primary Source)

Always use `client.indices_overview` for current column schemas. This ensures accuracy with the installed idc-index version:
python
# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
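The schema lookup above can be wrapped in small helpers. The following is a minimal sketch, not part of idc-index: `list_columns` and `find_columns` are hypothetical names, and `MOCK_OVERVIEW` is an invented stand-in that only mimics the nesting used above (`[table]["schema"]["columns"]`, each entry carrying `name`, `type`, and an optional `description`). With an installed idc-index you would pass `client.indices_overview` instead.

```python
from typing import Any

# Hypothetical stand-in for client.indices_overview. Column types and
# descriptions here are illustrative, not copied from the real schema.
MOCK_OVERVIEW: dict[str, Any] = {
    "index": {
        "schema": {
            "columns": [
                {"name": "collection_id", "type": "STRING",
                 "description": "Collection identifier"},
                {"name": "SeriesInstanceUID", "type": "STRING",
                 "description": "DICOM series UID"},
                # Entry without a description, to exercise the .get() default
                {"name": "series_size_MB", "type": "FLOAT"},
            ]
        }
    }
}


def list_columns(overview: dict[str, Any], table: str) -> list[tuple[str, str, str]]:
    """Return (name, type, description) for every column of one table."""
    columns = overview[table]["schema"]["columns"]
    return [(c["name"], c["type"], c.get("description", "")) for c in columns]


def find_columns(overview: dict[str, Any], table: str, keyword: str) -> list[str]:
    """Return column names whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        name
        for name, _type, description in list_columns(overview, table)
        if needle in name.lower() or needle in description.lower()
    ]


# Print a compact listing of the mock "index" table
for name, col_type, description in list_columns(MOCK_OVERVIEW, "index"):
    print(f"{name:<20} {col_type:<8} {description}")
```

A keyword search such as `find_columns(overview, "index", "series")` is a quick way to locate candidate columns before writing a `sql_query()` call.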

Reference Documentation

  • clinical_data_guide.md - Clinical/tabular data navigation, value mapping, and joining with imaging data
  • cloud_storage_guide.md - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
  • cli_guide.md - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
  • bigquery_guide.md - Advanced BigQuery usage guide for complex metadata queries
  • dicomweb_guide.md - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
  • indices_reference - External documentation for index tables (may be ahead of the installed version)

External Links

Skill Updates

This skill version is available in skill metadata. To check for updates:
  • Visit the releases page
  • Watch the repository on GitHub (Watch → Custom → Releases)