h5py

h5py - Hierarchical Data Storage

h5py provides a seamless bridge between NumPy and HDF5. It allows you to organize data into groups (like folders) and datasets (like NumPy arrays), with rich metadata (attributes) attached to every object.

When to Use

  • Storing datasets that are much larger than your computer's RAM.
  • Organizing complex scientific data into a hierarchical "folder-like" structure.
  • Storing numerical arrays (NumPy) with high-speed random access.
  • Keeping metadata (units, experiment dates, parameters) attached directly to the data.
  • Sharing data between different languages (C, C++, Fortran, Java, MATLAB), as HDF5 is a cross-platform standard.
  • Reading/writing large datasets in chunks to optimize I/O performance.

Reference Documentation

Official docs: https://docs.h5py.org/
HDF Group: https://www.hdfgroup.org/
Search patterns: h5py.File, create_dataset, h5py.Group, chunks=True, compression="gzip"

Core Principles

The Hierarchy

HDF5 files contain two main types of objects:
  • Datasets: Multidimensional arrays of data (NumPy-like).
  • Groups: Container structures that can hold datasets or other groups (like directories).

Slicing

h5py datasets support standard NumPy slicing. When you slice a dataset, only that specific slice is read from the disk, keeping memory usage low.
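A minimal sketch of this lazy behavior (file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write a 1000x1000 array once, then read back only a small window.
with h5py.File('slicing_demo.h5', 'w') as f:
    f.create_dataset('grid', data=np.arange(1_000_000).reshape(1000, 1000))

with h5py.File('slicing_demo.h5', 'r') as f:
    dset = f['grid']            # Lazy handle: nothing is read yet
    window = dset[10:12, 0:3]   # Only these 6 elements are read from disk
    print(window.shape)         # (2, 3)
```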

Attributes

Every group and dataset can have attributes (key-value pairs) for metadata.
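A short sketch (dataset and attribute names are illustrative); attributes behave like a small dictionary hanging off each object:

```python
import h5py
import numpy as np

with h5py.File('attrs_demo.h5', 'w') as f:
    dset = f.create_dataset('temperature', data=np.zeros(24))
    dset.attrs['units'] = 'celsius'   # String attribute
    dset.attrs['sensor_id'] = 7       # Integer attribute

with h5py.File('attrs_demo.h5', 'r') as f:
    attrs = dict(f['temperature'].attrs)
    print(attrs['units'], attrs['sensor_id'])
```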

Quick Reference

Installation

bash
pip install h5py

Standard Imports

python
import h5py
import numpy as np

Basic Pattern - Writing and Reading

python
import h5py
import numpy as np

Writing data

with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('main_data', data=np.random.rand(100, 100))
    dset.attrs['units'] = 'meters'
    grp = f.create_group('subgroup')
    grp.create_dataset('results', data=[1, 2, 3])

Reading data

with h5py.File('data.h5', 'r') as f:
    data_slice = f['main_data'][0:10, 0:10]  # Only read 100 elements
    units = f['main_data'].attrs['units']
    print(f"Group content: {list(f['subgroup'].keys())}")

Critical Rules

✅ DO

  • Use Context Managers - Always use with h5py.File(...) as f: to ensure files are closed even if errors occur.
  • Use Chunking - For large datasets, specify chunks=True or a manual chunk shape to optimize access speed for specific slicing patterns.
  • Enable Compression - Use compression="gzip" to save disk space for large numerical arrays.
  • Use Descriptive Names - Use groups to organize data logically (e.g., /experiment1/sensorA/raw).
  • Store Metadata in Attributes - Don't create separate text files for units or timestamps; attach them to the datasets.
  • Check Membership - Use "name" in group before accessing to avoid a KeyError.

❌ DON'T

  • Open files in 'w' by mistake - The 'w' mode overwrites existing files. Use 'a' (append/read-write) or 'r+' (read-write) instead.
  • Load entire datasets into RAM - Avoid data = f['large_dataset'][:] unless you are sure it fits in memory.
  • Store thousands of tiny datasets - HDF5 is optimized for large arrays. For millions of tiny scalars, use a single array or a different database.
  • Forget to close files - An unclosed HDF5 file can become corrupted or locked.
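Instead of loading a whole dataset into RAM, it can be reduced block by block; a sketch (file name, dataset name, and block size are illustrative):

```python
import h5py
import numpy as np

with h5py.File('big_demo.h5', 'w') as f:
    f.create_dataset('large_dataset', data=np.arange(10_000, dtype='float64'))

total = 0.0
with h5py.File('big_demo.h5', 'r') as f:
    dset = f['large_dataset']
    block = 1_000
    for start in range(0, dset.shape[0], block):
        total += dset[start:start + block].sum()  # One block in RAM at a time
print(total)  # 49995000.0
```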

Anti-Patterns (NEVER)

python
import h5py
import numpy as np

❌ BAD: Manual file closing (unsafe)

f = h5py.File('data.h5', 'w')
f.create_dataset('x', data=np.arange(10))
f.close()  # If an error happened above, this never runs!

✅ GOOD: Context manager

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(10))

❌ BAD: Storing metadata as strings inside a dataset

f.create_dataset('meta', data=np.array(['unit: meter', 'date: 2024']))

✅ GOOD: Using Attributes

dset = f.create_dataset('data', data=np.random.rand(10))
dset.attrs['unit'] = 'meter'
dset.attrs['date'] = '2024'

❌ BAD: Inefficient chunking (one row at a time when you read columns)

f.create_dataset('big', shape=(10000, 10000), chunks=(1, 10000))
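✅ GOOD: Align chunks with the access pattern. A hedged counterpart (file name is illustrative): if reads are column-wise, chunk along columns so each column read touches a single chunk:

```python
import h5py

with h5py.File('chunked_demo.h5', 'w') as f:
    # A column read f['big'][:, 42] now touches one 10,000-element chunk
    f.create_dataset('big', shape=(10000, 10000), chunks=(10000, 1), dtype='f4')
```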

Dataset Creation and Configuration

Advanced Options

python
with h5py.File('optimized.h5', 'w') as f:
    # 1. Resizable dataset (maxshape)
    dset = f.create_dataset('growing', 
                            shape=(100,), 
                            maxshape=(None,), # Allow growth in 1st dimension
                            dtype='float32')
    
    # 2. Compression and Chunking
    f.create_dataset('compressed', 
                     data=np.random.randn(1000, 1000),
                     chunks=(100, 100), 
                     compression="gzip", 
                     compression_opts=4) # 4 is a good balance
    
    # 3. Filling with default values
    f.create_dataset('default', shape=(10, 10), fillvalue=-1.0)

Working with Groups

Navigation and Iteration

python
with h5py.File('nested.h5', 'w') as f:
    f.create_group('raw/2024/january')
    f.create_group('raw/2024/february')

Recursive iteration

def print_structure(name, obj):
    print(name)

with h5py.File('nested.h5', 'r') as f:
    f.visititems(print_structure)  # Visits every dataset and group

Accessing via path

feb_data = f['/raw/2024/february']

Performance Optimization

1. Chunking Strategies

Chunks are the smallest unit of data that can be read or written.
  • If you usually read row by row: chunks=(1, n_cols).
  • If you read blocks: chunks=(100, 100).
  • If unsure: chunks=True lets h5py guess a reasonable shape.

2. SWMR (Single Writer Multiple Reader)

Allows a writer to append to a file while other processes read from it in real time.

Writer

f = h5py.File('live.h5', 'w', libver='latest')
f.swmr_mode = True

Reader

f = h5py.File('live.h5', 'r', libver='latest', swmr=True)
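A hedged end-to-end sketch (file and dataset names are illustrative; the reader would normally be a separate process): the writer must create all objects before enabling SWMR, and the reader calls refresh() on a dataset to see newly appended data:

```python
import h5py
import numpy as np

# Writer: create every dataset first, then switch on SWMR mode
w = h5py.File('live_demo.h5', 'w', libver='latest')
stream = w.create_dataset('stream', shape=(0,), maxshape=(None,), chunks=(1024,))
w.swmr_mode = True

# Reader (would normally run in another process)
r = h5py.File('live_demo.h5', 'r', libver='latest', swmr=True)
r_stream = r['stream']

# Writer appends and flushes; reader refreshes to pick up the resize
stream.resize((100,))
stream[:] = np.arange(100)
w.flush()
r_stream.refresh()            # Re-read metadata to see the new shape
print(r_stream.shape)         # (100,)

r.close()
w.close()
```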

3. Core Driver (In-Memory HDF5)

Use the HDF5 structure but keep it entirely in RAM for speed, with an optional save to disk.

Create an HDF5 file in memory

f = h5py.File('memfile.h5', 'w', driver='core', backing_store=True)

Practical Workflows

1. Storing Machine Learning Training Data

python
def save_ml_dataset(X, y, filename):
    with h5py.File(filename, 'w') as f:
        # Create datasets for images and labels
        f.create_dataset('images', data=X, compression="lzf") # LZF is fast
        f.create_dataset('labels', data=y)
        
        # Add metadata
        f.attrs['n_samples'] = X.shape[0]
        f.attrs['input_shape'] = X.shape[1:]
        f.attrs['classes'] = np.unique(y)

Use cases: training on data that exceeds RAM.
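A companion loader, sketched under the same file layout as save_ml_dataset above (the function name and batch size are illustrative); it streams one batch at a time so the full array never enters RAM:

```python
import h5py

def iter_batches(filename, batch_size=32):
    """Yield (X_batch, y_batch) pairs, reading one slice per iteration."""
    with h5py.File(filename, 'r') as f:
        n = int(f.attrs['n_samples'])
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            yield f['images'][start:stop], f['labels'][start:stop]
```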

2. Large Simulation Logger

python
def log_simulation_step(filename, step_idx, data_array):
    with h5py.File(filename, 'a') as f:
        if 'simulation' not in f:
            # Initialize resizable dataset
            f.create_dataset('simulation', 
                            shape=(0, *data_array.shape),
                            maxshape=(None, *data_array.shape),
                            chunks=(1, *data_array.shape))
            
        dset = f['simulation']
        dset.resize(step_idx + 1, axis=0)
        dset[step_idx] = data_array
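Example usage of the logger (the file name and array shapes are illustrative; the function is repeated here so the snippet runs standalone):

```python
import h5py
import numpy as np

def log_simulation_step(filename, step_idx, data_array):
    # Same logger as above, repeated so this snippet is self-contained
    with h5py.File(filename, 'a') as f:
        if 'simulation' not in f:
            f.create_dataset('simulation',
                             shape=(0, *data_array.shape),
                             maxshape=(None, *data_array.shape),
                             chunks=(1, *data_array.shape))
        dset = f['simulation']
        dset.resize(step_idx + 1, axis=0)
        dset[step_idx] = data_array

# Append five 3x3 snapshots, one step at a time
for step in range(5):
    log_simulation_step('sim_demo.h5', step, np.full((3, 3), float(step)))

with h5py.File('sim_demo.h5', 'r') as f:
    print(f['simulation'].shape)   # (5, 3, 3)
```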

3. Batch Image Storage

python
def store_images(image_files, h5_file):
    with h5py.File(h5_file, 'w') as f:
        grp = f.create_group('microscopy_data')
        for i, img_path in enumerate(image_files):
            # Load your image here
            img_data = np.random.rand(512, 512) 
            dset = grp.create_dataset(f'img_{i:04d}', data=img_data)
            dset.attrs['original_path'] = img_path

Common Pitfalls and Solutions

The "Dataset Already Exists" Error

"数据集已存在"错误


❌ Problem: f.create_dataset('x', ...) fails if 'x' exists

✅ Solution: Delete first or use a check

if 'x' in f:
    del f['x']
f.create_dataset('x', data=new_data)
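Alternatively, require_dataset returns the existing dataset when the requested shape and dtype match, and only creates it otherwise (a sketch; file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File('require_demo.h5', 'a') as f:
    # First call creates 'x'; later calls return the existing dataset
    dset = f.require_dataset('x', shape=(10,), dtype='float64')
    dset[:] = np.arange(10.0)
```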

File Locking Issues


❌ Problem: "OSError: Unable to open file (file locking disabled on this file system)"

❌ 问题:"OSError: Unable to open file (file locking disabled on this file system)"

This often happens on network drives (NFS).

✅ Solution: Set environment variable before running script

import os
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
import h5py

Storing Unicode Strings

HDF5's support for strings is complex.

❌ Problem: Storing lists of strings can sometimes cause issues in older versions

✅ Solution: Use special string types

dt = h5py.string_dtype(encoding='utf-8')
dset = f.create_dataset('strings', (100,), dtype=dt)
dset[0] = "Научные данные"

h5py is the industrial-strength way to handle large numerical data. By combining the flexibility of NumPy with the power of HDF5, it ensures that your scientific data remains organized, accessible, and fast.