Pooch - Data File Fetching
Quick Reference
```python
import pooch

# Download single file
file_path = pooch.retrieve(
    url="https://example.com/data.csv",
    known_hash="sha256:abc123...",  # None to skip verification
    fname="data.csv",
    path=pooch.os_cache("myproject")
)

# Create registry for multiple files
REGISTRY = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://example.com/data/",
    registry={"data.csv": "sha256:abc123...", "model.nc": "sha256:def456..."}
)
data_file = REGISTRY.fetch("data.csv")

# Generate hash for local file
file_hash = pooch.file_hash("/path/to/file.csv")
```
Key Functions
| Function | Purpose |
|---|---|
| `pooch.retrieve()` | Download single file with caching |
| `pooch.create()` | Create custom data registry |
| `pooch.file_hash()` | Generate SHA256/MD5 hash of file |
| `pooch.os_cache()` | Get OS-specific cache directory |
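The `known_hash` strings used throughout take the form `algorithm:hexdigest`. As a rough illustration of what such a value contains, the sketch below computes one with the standard library's `hashlib`. `sha256_known_hash` is a hypothetical helper, not part of pooch; in practice `pooch.file_hash()` returns the hex digest for you.

```python
import hashlib


def sha256_known_hash(path, chunk_size=65536):
    """Return 'sha256:<hexdigest>' for the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```

The returned string can be pasted directly into a `known_hash` argument or a registry entry.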
Essential Operations
Download Files
```python
# With hash verification
file_path = pooch.retrieve(
    url="https://example.com/data.nc",
    known_hash="sha256:abc123..."
)

# Without verification (development only)
file_path = pooch.retrieve(url="https://example.com/data.nc", known_hash=None)

# From Zenodo DOI
file_path = pooch.retrieve(
    url="doi:10.5281/zenodo.1234567/data.zip",
    known_hash="sha256:abc123..."
)
```
Extract Archives
```python
# ZIP archive (returns a list of the extracted file paths)
files = pooch.retrieve(
    url="https://example.com/data.zip",
    known_hash="sha256:abc123...",
    processor=pooch.Unzip()
)

# Decompress single gzip file
file_path = pooch.retrieve(
    url="https://example.com/data.csv.gz",
    known_hash="sha256:abc123...",
    processor=pooch.Decompress(name="data.csv")
)
```
Additional Options
```python
# Progress bar for large downloads (requires tqdm to be installed)
file_path = pooch.retrieve(url=url, known_hash=known_hash, progressbar=True)

# HTTP authentication
file_path = pooch.retrieve(
    url="https://example.com/protected/data.csv",
    known_hash=None,
    downloader=pooch.HTTPDownloader(auth=("user", "pass"))
)
```
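Hard-coding credentials as above is fine for a snippet but risky in shared code. One common pattern, sketched here under the assumption that the hypothetical environment variables `DATA_USER` and `DATA_PASS` hold the login, is to build the `auth` tuple at run time:

```python
import os


def auth_from_env(user_var="DATA_USER", pass_var="DATA_PASS"):
    """Return a (user, password) tuple from the environment, or None if unset.

    The variable names are placeholders chosen for this example.
    """
    user = os.environ.get(user_var)
    password = os.environ.get(pass_var)
    if user is None or password is None:
        return None  # caller can fall back to an anonymous download
    return (user, password)
```

The result can then be passed as `pooch.HTTPDownloader(auth=auth_from_env())`, which keeps secrets out of the analysis script.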
Processor Options
| Processor | Purpose |
|---|---|
| `pooch.Unzip()` | Extract ZIP archives |
| `pooch.Untar()` | Extract TAR/TAR.GZ archives |
| `pooch.Decompress()` | Decompress gzip, bz2, lzma, xz |
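Processors are, as far as pooch's documented interface goes, plain callables that receive the downloaded file name, an action string (`"download"`, `"update"`, or `"fetch"`), and the `Pooch` instance, and return the path handed back to the caller. A minimal sketch of a custom one, assuming that signature; `gunzip_once` is a hypothetical name, and the built-in `Decompress` already covers this case:

```python
import gzip
import os
import shutil


def gunzip_once(fname, action, pooch):
    """Decompress `fname` (a gzip file) next to itself and return the new path.

    `action` is "download" or "update" when the file was just (re)fetched and
    "fetch" when it was already in the cache; `pooch` is unused here.
    """
    target = fname[:-3] if fname.endswith(".gz") else fname + ".decomp"
    # Only redo the work when the archive changed or the output is missing.
    if action in ("download", "update") or not os.path.exists(target):
        with gzip.open(fname, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

It would be used as `processor=gunzip_once` in `pooch.retrieve()` or `REGISTRY.fetch()`. Checking `action` avoids decompressing again on every cache hit.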
Cache Locations
| OS | Default Path |
|---|---|
| Linux | `~/.cache/<AppName>` (or `$XDG_CACHE_HOME/<AppName>` if set) |
| macOS | `~/Library/Caches/<AppName>` |
| Windows | `C:\Users\<user>\AppData\Local\<AppAuthor>\<AppName>\Cache` |
Error Handling
```python
import pooch
import requests

# HTTP downloads go through requests, so its exceptions propagate to the caller
try:
    file_path = pooch.retrieve(url=url, known_hash=known_hash)
except requests.exceptions.HTTPError:
    print("Download failed - check URL")
except requests.exceptions.RequestException:
    print("Network issue")
except ValueError:
    print("Hash mismatch - file is corrupted or known_hash is outdated")
```
When to Use vs Alternatives
| Tool | Best For | Limitations |
|---|---|---|
| pooch | Reproducible data downloads, hash verification, caching | Not a version control system |
| urllib/requests | Simple one-off downloads, custom HTTP logic | No caching, no hash verification |
| DVC | Data version control alongside git | Heavier setup, requires remote storage |
| wget | Quick command-line downloads | No Python integration, no caching logic |
Use pooch when you need reproducible data downloads with automatic caching and
integrity verification, especially for scientific data registries.
Consider alternatives when you need full data version control with git integration
(use DVC), simple one-off downloads without caching needs (use requests), or
command-line batch downloads (use wget).
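To make the comparison concrete, the sketch below shows roughly what caching plus integrity checking looks like when written by hand with the standard library, which is the bookkeeping `pooch.retrieve()` takes over. `fetch_cached` and its parameters are hypothetical, and the logic is a simplification, not pooch's actual implementation.

```python
import hashlib
import os
import urllib.request


def fetch_cached(url, sha256, cache_dir, fname):
    """Download `url` into `cache_dir/fname` unless a verified copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, fname)

    def digest(path):
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()

    # Cache hit: reuse the file only if its hash still matches.
    if os.path.exists(target) and digest(target) == sha256:
        return target

    # Cache miss or stale file: download to a temporary name first.
    tmp = target + ".part"
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        out.write(response.read())
    if digest(tmp) != sha256:
        os.remove(tmp)
        raise ValueError(f"SHA256 mismatch for {url}")
    os.replace(tmp, target)
    return target
```

Even this short version leaves out retries, progress reporting, archive handling, and DOI resolution, which is where a dedicated tool starts to pay off.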
Common Workflows
Set up reproducible data download with registry
- Identify all required data files and their URLs
- Generate SHA256 hashes with `pooch.file_hash()` for each file
- Create registry with `pooch.create()` specifying base URL and file hashes
- Fetch files with `REGISTRY.fetch()` in analysis scripts
- Add processors for compressed files (`Unzip()`, `Untar()`, `Decompress()`)
- Test registry by clearing cache and re-fetching
- Document registry in project README for collaborators
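The hashing step in this workflow can be scripted. The sketch below walks a local directory and builds the `{file name: "sha256:..."}` mapping that `pooch.create(registry=...)` expects. `build_registry` is a hypothetical helper written with the standard library only; pooch's own `make_registry()` writes comparable information to a registry text file.

```python
import hashlib
from pathlib import Path


def build_registry(data_dir):
    """Map each file under `data_dir` to a 'sha256:<hexdigest>' string."""
    root = Path(data_dir)
    registry = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keys are paths relative to the data directory, with forward
            # slashes so they can be appended to the base URL.
            name = path.relative_to(root).as_posix()
            registry[name] = "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()
    return registry
```

Printing the result and pasting it into the `registry` argument keeps the hashes under version control alongside the analysis code. For very large files, a chunked hash would be kinder to memory than `read_bytes()`.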
References
- Registry Configuration - Create and manage file registries
Scripts
- scripts/create_registry.py - Generate registry from local files