pooch

Pooch - Data File Fetching

Quick Reference

```python
import pooch

# Download single file
file_path = pooch.retrieve(
    url="https://example.com/data.csv",
    known_hash="sha256:abc123...",  # None to skip verification
    fname="data.csv",
    path=pooch.os_cache("myproject"),
)

# Create registry for multiple files
REGISTRY = pooch.create(
    path=pooch.os_cache("myproject"),
    base_url="https://example.com/data/",
    registry={"data.csv": "sha256:abc123...", "model.nc": "sha256:def456..."},
)
data_file = REGISTRY.fetch("data.csv")

# Generate hash for local file
file_hash = pooch.file_hash("/path/to/file.csv")
```

Key Functions

| Function | Purpose |
| --- | --- |
| `pooch.retrieve()` | Download single file with caching |
| `pooch.create()` | Create custom data registry |
| `pooch.file_hash()` | Generate SHA256/MD5 hash of file |
| `pooch.os_cache()` | Get OS-specific cache directory |

Essential Operations

Download Files

```python
# With hash verification
file_path = pooch.retrieve(
    url="https://example.com/data.nc",
    known_hash="sha256:abc123...",
)

# Without verification (development only)
file_path = pooch.retrieve(url="https://example.com/data.nc", known_hash=None)

# From Zenodo DOI
file_path = pooch.retrieve(
    url="doi:10.5281/zenodo.1234567/data.zip",
    known_hash="sha256:abc123...",
)
```

Extract Archives

```python
# ZIP archive
files = pooch.retrieve(
    url="https://example.com/data.zip",
    known_hash="sha256:abc123...",
    processor=pooch.Unzip(),
)

# Decompress single gzip file
file_path = pooch.retrieve(
    url="https://example.com/data.csv.gz",
    known_hash="sha256:abc123...",
    processor=pooch.Decompress(name="data.csv"),
)
```

Additional Options

```python
# Progress bar for large downloads (requires tqdm to be installed)
file_path = pooch.retrieve(url=url, known_hash=known_hash, progressbar=True)

# HTTP authentication
file_path = pooch.retrieve(
    url="https://example.com/protected/data.csv",
    known_hash=None,
    downloader=pooch.HTTPDownloader(auth=("user", "pass")),
)
```

Processor Options

| Processor | Purpose |
| --- | --- |
| `Unzip()` | Extract ZIP archives |
| `Untar()` | Extract TAR/TAR.GZ archives |
| `Decompress()` | Decompress gzip, bz2, lzma, xz |

Cache Locations

| OS | Default Path |
| --- | --- |
| Linux | `~/.cache/<project>` (or under `XDG_CACHE_HOME` if that variable is set) |
| macOS | `~/Library/Caches/<project>` |
| Windows | `C:\Users\<user>\AppData\Local\<AppAuthor>\<project>\Cache` |

Error Handling

```python
import requests

# pooch's HTTP downloader is built on requests, so its exceptions apply here
try:
    file_path = pooch.retrieve(url=url, known_hash=known_hash)
except requests.exceptions.HTTPError:
    print("Download failed - check URL")
except requests.exceptions.ConnectionError:
    print("Network issue")
```

When to Use vs Alternatives

| Tool | Best For | Limitations |
| --- | --- | --- |
| pooch | Reproducible data downloads, hash verification, caching | Not a version control system |
| urllib/requests | Simple one-off downloads, custom HTTP logic | No caching, no hash verification |
| DVC | Data version control alongside git | Heavier setup, requires remote storage |
| wget | Quick command-line downloads | No Python integration, no caching logic |

Use pooch when you need reproducible data downloads with automatic caching and integrity verification, especially for scientific data registries.

Consider alternatives when you need full data version control with git integration (use DVC), simple one-off downloads without caching needs (use requests), or command-line batch downloads (use wget).

Common Workflows

Set up reproducible data download with registry

  • Identify all required data files and their URLs
  • Generate SHA256 hashes with `pooch.file_hash()` for each file
  • Create registry with `pooch.create()` specifying base URL and file hashes
  • Fetch files with `REGISTRY.fetch()` in analysis scripts
  • Add processors for compressed files (`Unzip()`, `Untar()`, `Decompress()`)
  • Test registry by clearing cache and re-fetching
  • Document registry in project README for collaborators

References

  • Registry Configuration - Create and manage file registries

Scripts

  • scripts/create_registry.py - Generate registry from local files