datapackage

Frictionless Data Package Guide

This skill covers any dataset described by a Frictionless Data Package descriptor file (`datapackage.json`). It is intentionally generic: it works for any conforming datapackage, regardless of who published it or what the data contains.
For PUDL-specific knowledge (S3 bucket paths, table tier conventions, data source context, usage warnings), also use the `pudl` skill on top of this one.

What is a datapackage.json?

A `datapackage.json` is a JSON file that describes a collection of tabular data resources. Each resource represents one table (or file) and includes:
  • `name`: machine-readable identifier
  • `description`: human-readable description, often including processing notes, primary keys, and usage warnings
  • `path`: filename or URL of the actual data file
  • `schema.fields`: list of columns, each with a `name` and `description`
The file can be large (hundreds of resources, megabytes of JSON). Always query it selectively; never load it whole into context.
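As a minimal sketch of selective querying, the jq filters below print only resource names and paths, or one resource's column list, instead of the whole descriptor. The function names `list_resources` and `show_fields` are illustrative assumptions, not part of any tool, and the field layout is the one described above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative); $1 = path to a descriptor.

# Print one "name<TAB>path" line per resource.
# tostring leaves a string path unchanged and JSON-encodes an array path.
list_resources() {
  jq -r '.resources[] | [.name, (.path | tostring)] | @tsv' "$1"
}

# Print "column<TAB>description" for a single resource.
# $2 = resource name. Prints nothing if the resource has no inline schema.
show_fields() {
  jq -r --arg name "$2" '
    .resources[]
    | select(.name == $name)
    | .schema.fields[]?
    | [.name, (.description // "")]
    | @tsv
  ' "$1"
}
```

For example, `list_resources datapackage.json | head` gives a quick inventory without printing any descriptions.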

Dependency check

Before querying metadata, verify `jq` is available:
```bash
command -v jq
```
If not found, tell the user how to install it:
  • macOS: `brew install jq`
  • Linux (apt): `sudo apt install jq`
  • Linux (conda): `conda install jq`
  • Windows: `winget install jqlang.jq`
For data loading and SQL queries, the `attach-db` and `query` skills from `duckdb-skills` must be installed. Install them from `duckdb/duckdb-skills`.
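The check and the install hints above could be folded into one helper. This is only a sketch: the function names `have_cmd` and `check_jq` and the message wording are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names).

# Succeed if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Succeed if jq is available; otherwise print install hints to stderr
# and return 1 so the caller can stop before running any jq query.
check_jq() {
  if have_cmd jq; then
    return 0
  fi
  printf '%s\n' \
    'jq not found. Install it with one of:' \
    '  macOS:          brew install jq' \
    '  Linux (apt):    sudo apt install jq' \
    '  Linux (conda):  conda install jq' \
    '  Windows:        winget install jqlang.jq' >&2
  return 1
}
```

A caller might write `check_jq || exit 1` at the top of a script that goes on to query a descriptor.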

Workflow overview

  1. Locate the descriptor: find or download `datapackage.json` (see below).
  2. Query metadata selectively: use jq or DuckDB to extract only what you need. See Metadata Querying.
  3. Surface warnings: always check for usage warnings before presenting a resource.
  4. Validate (optional): if the user wants to know whether the data actually matches the descriptor, or if you're diagnosing a suspicious package, use `frictionless validate`. See Frictionless Validate.
  5. Load the data (optional): only if the user explicitly wants to query or explore the actual data. Data files can be large and remote access can be slow or costly. Don't initiate data loading as a follow-on to a metadata lookup without confirming the user wants it. See Storage Backends.
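Step 3 can be approximated with a keyword scan over a single resource's description. The sketch below assumes warnings are written as free text in `description`, as this guide describes. The keyword list and the function name `resource_warnings` are illustrative guesses, not a convention defined by the standard.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustrative name and keyword list).
# $1 = path to a descriptor, $2 = resource name.
# Prints the resource's description only if it mentions a warning-like
# keyword (case-insensitive); prints nothing otherwise.
resource_warnings() {
  jq -r --arg name "$2" '
    .resources[]
    | select(.name == $name)
    | (.description // "")
    | select(test("warning|caution|deprecated|do not use"; "i"))
  ' "$1"
}
```

An empty result only means none of the assumed keywords matched; it does not replace reading the full description before presenting a resource.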

Reference index

  • Metadata Querying: locate the descriptor, query it selectively with jq or DuckDB, surface usage warnings
  • Storage Backends: load data from Parquet, DuckDB, SQLite, or CSV files referenced by the descriptor
  • Frictionless Validate: use the `frictionless` CLI to validate packages, check data quality, infer schemas, and diagnose unfamiliar descriptors. Read it when the user wants to validate a descriptor, check whether data matches its schema, or understand what the `frictionless` tool can tell them about a package

Community patterns and recipes

The datapackage standard is permissive: publishers frequently add non-standard fields. Two conventions are worth knowing immediately:
  • Custom fields: non-standard keys added by publishers are common and valid. The `_` prefix convention marks system-generated or platform-specific keys (e.g. `_cache`, `_platformVersion`). Some publishers add custom keys without the prefix (e.g. PUDL adds `duckdb_table`, `sqlite_table` on database-backed resources). Treat unknown fields as informational metadata, not errors.
  • Compressed resources: a resource with a `.gz` or `.zip` path may have an explicit `"compression": "gz"` field. The `bytes` and `hash` fields apply to the compressed file, not the uncompressed original.
For other patterns (catalogs, versioning, external foreign keys, translation support, field relationships, etc.), fetch the relevant page on demand:
Both pages cover largely the same set of community conventions; consult whichever matches the descriptor version you're working with.
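The custom-field and compression conventions described above can be inspected without reading the whole descriptor. The following is a sketch under the assumption that custom keys sit at the top level and that `compression`, `path`, and `bytes` are resource-level fields; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names); $1 = path to a descriptor.

# Top-level keys that follow the "_" prefix convention.
underscore_keys() {
  jq -r 'keys[] | select(startswith("_"))' "$1"
}

# Resources that declare "compression" or whose path ends in .gz or .zip.
# Prints name, compression (or "unspecified"), and bytes (or "unknown").
# "strings" skips non-string paths (for example, arrays of paths).
compressed_resources() {
  jq -r '
    .resources[]
    | select(
        (.compression != null)
        or ((.path | strings | test("\\.(gz|zip)$")) // false)
      )
    | [.name, (.compression // "unspecified"), ((.bytes // "unknown") | tostring)]
    | @tsv
  ' "$1"
}
```

The `bytes` value printed here is the compressed size, per the convention above, so it should not be used to estimate how much memory the uncompressed table needs.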

Companion skills

This skill delegates actual data querying to:
  • `/duckdb-skills:attach-db`: attach a `.duckdb` or `.sqlite` database file and set up a persistent session for querying
  • `/duckdb-skills:query`: run SQL or natural language queries against attached databases, ad-hoc files (Parquet, CSV, remote HTTPS/S3), and JSON files including `datapackage.json` itself (via DuckDB's `read_json`)
These skills must be installed. See `skills-lock.json` in the project root.

Key constraints

  • Golden rule: never load the full datapackage.json into context. It may be megabytes with hundreds of resources. Always query selectively.
  • Read the full description before presenting a resource. Descriptions often contain important context: processing notes, primary key conventions, data provenance, or caveats about known limitations. Don't skip them.
  • Use `uv` to install Python packages: prefer `uv add <package>` over `pip install <package>`. `uv` is faster and installs into a virtual environment rather than globally. Fall back to `pip` only if `uv` is not available (`command -v uv` returns nothing).
  • Do not use Python to query descriptor metadata. Python is not the right tool here — it loads the full JSON into memory (violating the golden rule above), adds unnecessary dependencies, and can't easily handle remote descriptors. Use jq for metadata-only tasks; use DuckDB when you need to combine metadata queries with data queries. Python is only appropriate for loading data (via pandas or polars) after you already know which table and columns you need.
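The `uv`-first rule in the list above amounts to a single availability check. As a sketch, the hypothetical helper below prints the command the rule would choose rather than running it, so the choice can be shown to the user first. The name `install_cmd` is an assumption for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustrative name). $1 = Python package name.
# Prints "uv add <package>" when uv is on PATH, otherwise falls back
# to "pip install <package>". It only prints; it installs nothing.
install_cmd() {
  if command -v uv >/dev/null 2>&1; then
    printf 'uv add %s\n' "$1"
  else
    printf 'pip install %s\n' "$1"
  fi
}
```

For instance, `install_cmd polars` prints either `uv add polars` or `pip install polars`, depending on the machine.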

Schema reference and version detection

Two versions of the Frictionless Data Package standard are in common use. Identify the version from the top-level descriptor before parsing:
| Field present | Version | Example value |
| --- | --- | --- |
| `"$schema"` | v2.0 | `"https://datapackage.org/profiles/2.0/datapackage.json"` |
| `"profile"` | v1.0 | `"tabular-data-package"` or `"data-package"` |
| neither | ambiguous (treat as v1 baseline) | |
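The detection rule in the table above could be written as one jq conditional. This is a sketch with a hypothetical function name. It assumes `$schema` takes priority when both keys are present, which the table does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustrative name); $1 = path to a descriptor.
# Classifies the descriptor by which top-level key is present.
# Assumption: "$schema" is checked first if both keys exist.
detect_version() {
  jq -r '
    if has("$schema") then "v2"
    elif has("profile") then "v1"
    else "ambiguous (treat as v1 baseline)"
    end
  ' "$1"
}
```

Only the one-line classification is printed, so the result can be used to pick a bundled schema without echoing any of the descriptor.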
Key differences between versions that affect parsing:
  • Contributors: v1 has `"role": "author"` (singular string); v2 has `"roles": ["author"]` (array). Both may appear in the wild.
  • Name pattern: v1 enforces strictly lowercase `[-a-z0-9._/]`; v2 is unrestricted.
  • `version` field: present in v2, absent in v1.
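Because both contributor shapes appear in practice, a query can normalize them before use. The sketch below is one possible approach, not a prescribed one. It assumes contributors live in a top-level `contributors` array with a `title` key, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (illustrative name); $1 = path to a descriptor.
# Prints "title<TAB>comma-separated roles" per contributor, accepting
# either the v1 "role" string or the v2 "roles" array.
contributor_roles() {
  jq -r '
    .contributors[]?
    | [ (.title // ""),
        ((.roles // (if .role then [.role] else [] end)) | join(",")) ]
    | @tsv
  ' "$1"
}
```

A contributor with neither key yields an empty roles column instead of an error, which keeps the output usable for mixed-version catalogs.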
Bundled schemas:
  • `assets/datapackage-v1.schema.json`: v1.0 (JSON Schema draft-04). Used by FERC XBRL packages and many older datasets.
  • `assets/datapackage-v2.schema.json`: v2.0 (JSON Schema draft-07). The current standard. Canonical version always at: https://datapackage.org/profiles/2.0/datapackage.json
Read the appropriate schema when you need to understand which fields are valid in a descriptor or validate one programmatically.