h5py
h5py - Hierarchical Data Storage
h5py provides a seamless bridge between NumPy and HDF5. It allows you to organize data into groups (like folders) and datasets (like NumPy arrays), with rich metadata (attributes) attached to every object.
When to Use
- Storing datasets that are much larger than your computer's RAM.
- Organizing complex scientific data into a hierarchical "folder-like" structure.
- Storing numerical arrays (NumPy) with high-speed random access.
- Keeping metadata (units, experiment dates, parameters) attached directly to the data.
- Sharing data between different languages (C, C++, Fortran, Java, MATLAB), as HDF5 is a cross-platform standard.
- Reading/writing large datasets in chunks to optimize I/O performance.
Reference Documentation
Official docs: https://docs.h5py.org/
HDF Group: https://www.hdfgroup.org/
Search patterns: `h5py.File`, `create_dataset`, `h5py.Group`, `chunks=True`, `compression="gzip"`
Core Principles
The Hierarchy
HDF5 files contain two main types of objects:
- Datasets: Multidimensional arrays of data (NumPy-like).
- Groups: Container structures that can hold datasets or other groups (like directories).
Slicing
h5py datasets support standard NumPy slicing. When you slice a dataset, only that specific slice is read from the disk, keeping memory usage low.
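To illustrate, a minimal sketch (file and dataset names are illustrative): opening a dataset gives you a handle, and only the sliced region is actually read from disk.

```python
import h5py
import numpy as np

# Write a moderately large dataset to disk
with h5py.File('slices.h5', 'w') as f:
    f.create_dataset('big', data=np.arange(1_000_000).reshape(1000, 1000))

with h5py.File('slices.h5', 'r') as f:
    dset = f['big']          # No data read yet: this is a handle, not an array
    row = dset[0]            # Reads only the first row (1000 values)
    block = dset[10:20, :5]  # Reads only a 10x5 block
    print(row.shape, block.shape)  # (1000,) (10, 5)
```

The slice returns a plain NumPy array, so everything downstream works as usual.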
Attributes
Every group and dataset can have attributes (key-value pairs) for metadata.
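As a quick sketch (names are illustrative), attributes behave like a small dict attached to the object:

```python
import h5py
import numpy as np

with h5py.File('attrs_demo.h5', 'w') as f:
    dset = f.create_dataset('temperature', data=np.zeros(24))
    dset.attrs['units'] = 'celsius'   # Attach metadata directly to the dataset
    dset.attrs['sensor_id'] = 42

with h5py.File('attrs_demo.h5', 'r') as f:
    # Attributes support dict-style access and iteration
    meta = dict(f['temperature'].attrs)
    print(meta['units'], meta['sensor_id'])
```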
Quick Reference
Installation
```bash
pip install h5py
```

Standard Imports
```python
import h5py
import numpy as np
```

Basic Pattern - Writing and Reading
Writing data
```python
import h5py
import numpy as np

with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('main_data', data=np.random.rand(100, 100))
    dset.attrs['units'] = 'meters'
    grp = f.create_group('subgroup')
    grp.create_dataset('results', data=[1, 2, 3])
```
Reading data
```python
with h5py.File('data.h5', 'r') as f:
    data_slice = f['main_data'][0:10, 0:10]  # Only read 100 elements
    units = f['main_data'].attrs['units']
    print(f"Group content: {list(f['subgroup'].keys())}")
```
Critical Rules
✅ DO
- Use Context Managers - Always use `with h5py.File(...) as f:` to ensure files are closed even if errors occur.
- Use Chunking - For large datasets, specify `chunks=True` or a manual chunk shape to optimize access speed for specific slicing patterns.
- Enable Compression - Use `compression="gzip"` to save disk space for large numerical arrays.
- Use Descriptive Names - Use groups to organize data logically (e.g., `/experiment1/sensorA/raw`).
- Store Metadata in Attributes - Don't create separate text files for units or timestamps; attach them to the datasets.
- Check Membership - Use `"name" in group` before accessing to avoid a KeyError.
❌ DON'T
- Open files in 'w' by mistake - The 'w' mode overwrites existing files. Use 'a' (append/read-write) or 'r+' (read-write) instead.
- Load entire datasets into RAM - Avoid `data = f['large_dataset'][:]` unless you are sure it fits in memory.
- Store thousands of small datasets - HDF5 is optimized for large arrays. For millions of tiny scalars, use a single array or a different database.
- Forget to close files - An unclosed HDF5 file can become corrupted or locked.
Anti-Patterns (NEVER)
```python
import h5py
import numpy as np
```

❌ BAD: Manual file closing (unsafe)
```python
f = h5py.File('data.h5', 'w')
f.create_dataset('x', data=np.arange(10))
f.close()  # If an error happened above, this never runs!
```
✅ GOOD: Context manager
```python
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(10))
```
❌ BAD: Storing metadata as strings inside a dataset
```python
f.create_dataset('meta', data=np.array(['unit: meter', 'date: 2024']))
```
✅ GOOD: Using Attributes
```python
dset = f.create_dataset('data', data=np.random.rand(10))
dset.attrs['unit'] = 'meter'
dset.attrs['date'] = '2024'
```
❌ BAD: Inefficient chunking (one row at a time when you read columns)
```python
f.create_dataset('big', shape=(10000, 10000), chunks=(1, 10000))
```
Dataset Creation and Configuration
Advanced Options
```python
with h5py.File('optimized.h5', 'w') as f:
    # 1. Resizable dataset (maxshape)
    dset = f.create_dataset('growing',
                            shape=(100,),
                            maxshape=(None,),  # Allow growth in 1st dimension
                            dtype='float32')

    # 2. Compression and chunking
    f.create_dataset('compressed',
                     data=np.random.randn(1000, 1000),
                     chunks=(100, 100),
                     compression="gzip",
                     compression_opts=4)  # 4 is a good balance

    # 3. Filling with default values
    f.create_dataset('default', shape=(10, 10), fillvalue=-1.0)
```
Working with Groups
Navigation and Iteration
```python
with h5py.File('nested.h5', 'w') as f:
    f.create_group('raw/2024/january')
    f.create_group('raw/2024/february')
```

Recursive iteration
```python
def print_structure(name, obj):
    print(name)

with h5py.File('nested.h5', 'r') as f:
    f.visititems(print_structure)  # Visits every dataset and group
```
Accessing via path
```python
feb_data = f['/raw/2024/february']
```
Performance Optimization
1. Chunking Strategies
Chunks are the smallest unit of data that can be read or written.
- If you usually read row by row: `chunks=(1, n_cols)`.
- If you read blocks: `chunks=(100, 100)`.
- If unsure: `chunks=True` lets h5py guess.
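A short sketch of how these three choices are declared (file and dataset names are illustrative); `dset.chunks` reports what was actually chosen:

```python
import h5py
import numpy as np

with h5py.File('chunks_demo.h5', 'w') as f:
    # Row-oriented access pattern: one full row per chunk
    rows = f.create_dataset('rows', shape=(1000, 1000), chunks=(1, 1000))
    # Block-oriented access pattern: square tiles
    tiles = f.create_dataset('tiles', shape=(1000, 1000), chunks=(100, 100))
    # Let h5py pick a chunk shape automatically
    auto = f.create_dataset('auto', shape=(1000, 1000), chunks=True)
    print(rows.chunks, tiles.chunks, auto.chunks)
```

Reading a slice that crosses many chunks forces all of those chunks to be decompressed, so aligning chunk shape with your access pattern is the single biggest I/O lever.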
2. SWMR (Single Writer Multiple Reader)
Allows a writer to append to a file while other processes read from it in real-time.
undefinedWriter
```python
f = h5py.File('live.h5', 'w', libver='latest')
f.swmr_mode = True
```
Reader
```python
f = h5py.File('live.h5', 'r', libver='latest', swmr=True)
```
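Note the ordering SWMR requires: the writer must create all datasets before enabling `swmr_mode`, the writer flushes after appending, and readers call `refresh()` to see new data. A minimal end-to-end sketch (single process, arbitrary file name):

```python
import h5py
import numpy as np

# Writer: create datasets first, then switch on SWMR mode
writer = h5py.File('live.h5', 'w', libver='latest')
dset = writer.create_dataset('stream', shape=(0,), maxshape=(None,), dtype='f8')
writer.swmr_mode = True

# Reader: can open the same file while the writer still holds it
reader = h5py.File('live.h5', 'r', libver='latest', swmr=True)

# Writer appends and flushes; reader refreshes its view of the metadata
dset.resize((5,))
dset[:] = np.arange(5.0)
dset.flush()

r = reader['stream']
r.refresh()  # Pick up the new shape and data
print(r.shape)

reader.close()
writer.close()
```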
3. Core Driver (In-Memory HDF5)
Use the HDF5 structure but keep it entirely in RAM for speed, with an optional save to disk.
Create an HDF5 file in memory
```python
f = h5py.File('memfile.h5', 'w', driver='core', backing_store=True)
```
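With `backing_store=False` nothing is written to disk on close, which makes the core driver a convenient sketch for unit tests (the file name is then just an identifier):

```python
import h5py
import numpy as np

# Purely in-memory HDF5 file: discarded on close, never hits the disk
f = h5py.File('scratch.h5', 'w', driver='core', backing_store=False)
f.create_dataset('tmp', data=np.arange(10))
val = int(f['tmp'][-1])
f.close()
print(val)  # 9
```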
Practical Workflows
1. Storing Machine Learning Training Data
```python
def save_ml_dataset(X, y, filename):
    with h5py.File(filename, 'w') as f:
        # Create datasets for images and labels
        f.create_dataset('images', data=X, compression="lzf")  # LZF is fast
        f.create_dataset('labels', data=y)

        # Add metadata
        f.attrs['n_samples'] = X.shape[0]
        f.attrs['input_shape'] = X.shape[1:]
        f.attrs['classes'] = np.unique(y)
```
Use cases: training on data that exceeds RAM.
2. Large Simulation Logger
```python
def log_simulation_step(filename, step_idx, data_array):
    with h5py.File(filename, 'a') as f:
        if 'simulation' not in f:
            # Initialize resizable dataset
            f.create_dataset('simulation',
                             shape=(0, *data_array.shape),
                             maxshape=(None, *data_array.shape),
                             chunks=(1, *data_array.shape))
        dset = f['simulation']
        dset.resize(step_idx + 1, axis=0)
        dset[step_idx] = data_array
```
3. Batch Image Storage
```python
def store_images(image_files, h5_file):
    with h5py.File(h5_file, 'w') as f:
        grp = f.create_group('microscopy_data')
        for i, img_path in enumerate(image_files):
            # Load your image here
            img_data = np.random.rand(512, 512)
            dset = grp.create_dataset(f'img_{i:04d}', data=img_data)
            dset.attrs['original_path'] = img_path
```
Common Pitfalls and Solutions
The "Dataset Already Exists" Error
❌ Problem: `f.create_dataset('x', ...)` fails if 'x' exists.
✅ Solution: Delete first or use a check
```python
if 'x' in f:
    del f['x']
f.create_dataset('x', data=new_data)
```
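An alternative to delete-and-recreate is `require_dataset` / `require_group`, which create the object on the first call and return the existing one afterwards, provided shape and dtype match (a sketch; names are illustrative):

```python
import h5py
import numpy as np

with h5py.File('require_demo.h5', 'a') as f:
    # Creates 'x' on the first run; returns the same dataset on later runs
    dset = f.require_dataset('x', shape=(10,), dtype='f8')
    dset[:] = np.linspace(0, 1, 10)
    grp = f.require_group('results')  # Likewise idempotent for groups
```

This is handy in restartable scripts that open the same file many times.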
File Locking Issues
❌ Problem: "OSError: Unable to open file (file locking disabled on this file system)"
This often happens on network drives (NFS).
✅ Solution: Set environment variable before running script
```python
import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'  # Must be set before importing h5py
import h5py
```
Storing Unicode Strings
HDF5's support for strings is complex.
❌ Problem: Storing lists of strings can sometimes cause issues in older versions.
✅ Solution: Use special string types
```python
dt = h5py.string_dtype(encoding='utf-8')
dset = f.create_dataset('strings', (100,), dtype=dt)
dset[0] = "Научные данные"
```
h5py is the industrial-strength way to handle large numerical data. By combining the flexibility of NumPy with the power of HDF5, it ensures that your scientific data remains organized, accessible, and fast.