sdmx-explorer

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

SDMX Explorer — Guided Dataset Discovery

SDMX Explorer — 引导式数据集发现

This skill uses the opensdmx CLI to explore any SDMX 2.1 REST endpoint: Eurostat, OECD, ECB, World Bank, ISTAT, and others. The primary reference provider is Eurostat (default in the opensdmx CLI). All examples use Eurostat unless stated otherwise.
Every
opensdmx
command supports
--help
— run it first to discover options and see usage examples:
bash
opensdmx --help                  # list all commands
opensdmx search --help           # options and examples for search
opensdmx constraints --help      # options and examples for constraints
opensdmx get --help              # options and examples for get
opensdmx run --help              # options and examples for run
该技能使用opensdmx CLI探索所有SDMX 2.1 REST端点: Eurostat、OECD、ECB、世界银行、ISTAT等。 默认优先使用的提供方是Eurostat(opensdmx CLI的默认配置)。 除非另有说明,所有示例均使用Eurostat数据源。
所有
opensdmx
命令都支持
--help
参数,先运行该命令可查看选项和使用示例:
bash
opensdmx --help                  # 列出所有命令
opensdmx search --help           # search命令的选项和示例
opensdmx constraints --help      # constraints命令的选项和示例
opensdmx get --help              # get命令的选项和示例
opensdmx run --help              # run命令的选项和示例

... same for info, values, constraints, embed, blacklist, plot

info、values、constraints、embed、blacklist、plot等命令同理


This skill runs a four-phase interactive loop. Always follow the phases in order.
The goal is to help the user understand the data landscape and make informed choices,
not to fetch data immediately.

---

该技能运行四阶段的交互式循环,请始终按顺序执行各阶段。目标是帮助用户了解数据全貌、做出合理选择,而非直接拉取数据。

---

Phase 1 — Discovery: find candidate dataflows

阶段1 — 发现:查找候选数据流

Identify which SDMX provider is relevant (ISTAT for Italian statistics, Eurostat for European statistics, OECD for international comparisons, etc.). If unclear, ask.
确定相关的SDMX提供方(意大利统计数据用ISTAT、欧洲统计数据用Eurostat、国际对比数据用OECD等),如果不明确可询问用户。

Step 1a — Extract keywords AND expected dimensions

步骤1a — 提取关键词和预期维度

Before searching, parse the user's question on two levels:
  1. Topic keywords (2–4 terms) for the
    opensdmx search
    call. Example: "unemployment", "labour force"
  2. Expected dimensions — the analytical angles the user wants to slice by. These are often NOT in the dataset title or description, but must appear as dimensions in the dataflow structure. Extract them explicitly:
    User saysExpected dimension
    "by age group" / "per fascia di età"
    age
    "by sex" / "per sesso"
    sex
    "by country" / "per paese"
    geo
    "by region"
    geo
    (NUTS level)
    "by education level"
    isced11
    or similar
    "quarterly" / "monthly"
    freq
    Example: "unemployment for EU countries, by age group and sex" → topic keywords:
    unemployment
    ; expected dimensions:
    age
    ,
    sex
    ,
    geo
    .
搜索前,从两个层面解析用户的问题:
  1. 主题关键词(2-4个词)用于
    opensdmx search
    调用,示例:"unemployment"、"labour force"
  2. 预期维度——用户想要拆分数据的分析角度,这些通常不会出现在数据集标题或描述中,但必须作为维度存在于数据流结构里,请明确提取:
用户表述预期维度
"按年龄分组" / "per fascia di età"
age
"按性别" / "per sesso"
sex
"按国家" / "per paese"
geo
"按地区"
geo
(NUTS层级)
"按教育水平"
isced11
或类似字段
"季度" / "月度"
freq
示例:"欧盟国家的失业数据,按年龄和性别分组" → 主题关键词:
unemployment
;预期维度:
age
sex
geo

Step 1b — Search and pre-filter candidates

步骤1b — 搜索和预过滤候选集

Search for dataflows:
  • Eurostat (default provider — no
    --provider
    flag needed):
    opensdmx search "<keyword>"
  • ISTAT:
    opensdmx search "<keyword>" --provider istat
  • Other providers:
    opensdmx search "<keyword>" --provider <name>
    (available:
    oecd
    ,
    ecb
    ,
    worldbank
    ,
    insee
    ,
    bundesbank
    ,
    abs
    )
To see the full list of built-in providers with their descriptions, run:
opensdmx providers
From the search results, pick 5–8 plausible candidates by title relevance. Then run
opensdmx info <id>
on each one in parallel to check their dimension list. Keep only the candidates that contain all expected dimensions. Discard candidates missing a required dimension — even if the title looks right.
If page 1 (20 results) yields no strong candidates, paginate before giving up:
bash
opensdmx search "unemployment" --page 2   # results 21-40
opensdmx search "unemployment" --page 3   # results 41-60
The title shows the total available (e.g.
21-40 of 114
), so you know how many pages exist. Keep paginating until you find at least 3 plausible candidates or exhaust the results. Only after exhausting pagination should you try a different keyword or provider. Use
--all
only as a last resort (may produce very long output).
If keyword search returns 0 or very few results (< 3), offer semantic search:
"I didn't find much with a keyword search. I can try a semantic search instead — it matches by meaning, not exact words, so it can find datasets even when the terminology differs. It requires Ollama to be running and is slower (10–30 s). Want me to try?"
If the user agrees:
bash
opensdmx search --semantic "<query>"
Semantic search returns the top 20 results ranked by similarity score. Pick the most relevant candidates (score > 0.5) and continue with Step 1c as normal.
bash
undefined
搜索数据流:
  • Eurostat(默认提供方,无需加
    --provider
    参数):
    opensdmx search "<keyword>"
  • ISTAT
    opensdmx search "<keyword>" --provider istat
  • 其他提供方
    opensdmx search "<keyword>" --provider <name>
    (可用值:
    oecd
    ecb
    worldbank
    insee
    bundesbank
    abs
要查看所有内置提供方的完整列表和描述,运行:
opensdmx providers
从搜索结果中按标题相关性筛选5-8个合理候选集,然后并行对每个候选集运行
opensdmx info <id>
检查维度列表,仅保留包含所有预期维度的候选集。即便标题看起来匹配,缺失必填维度的候选集也要丢弃。
如果第1页(20条结果)没有合适的候选集,请先翻页再放弃:
bash
opensdmx search "unemployment" --page 2   # 第21-40条结果
opensdmx search "unemployment" --page 3   # 第41-60条结果
标题会显示总结果数(例如
21-40 of 114
),由此可知总页数。持续翻页直到找到至少3个合理候选集或遍历完所有结果。只有遍历完所有分页后,再尝试更换关键词或提供方。仅在万不得已时使用
--all
参数(可能会输出非常长的内容)。
如果关键词搜索返回0条或极少结果(<3条),可提供语义搜索选项:
"我通过关键词搜索没找到太多匹配结果,还可以尝试语义搜索,它按含义匹配而非精确词汇,即便术语不同也能找到对应数据集。语义搜索需要运行Ollama,速度较慢(10-30秒),要我试试吗?"
如果用户同意:
bash
opensdmx search --semantic "<query>"
语义搜索会按相似度排名返回前20条结果,挑选最相关的候选集(相似度>0.5),照常进入步骤1c。
bash
undefined

Example: verify age and sex are present

示例:验证是否存在age和sex维度

opensdmx info UNE_RT_A # ✓ has age, sex, geo → keep opensdmx info TIPSUN20 # ✗ no age, no sex → discard
undefined
opensdmx info UNE_RT_A # ✓ 包含age、sex、geo → 保留 opensdmx info TIPSUN20 # ✗ 没有age、sex → 丢弃
undefined

Step 1c — Present verified candidates

步骤1c — 展示验证通过的候选集

From the verified candidates, select 3–5 and present them. For each, confirm which expected dimensions are present and note any extras or limitations.
Present them like this (use the conversation language; adapt as needed):
I found these datasets that could answer your question:

1. **UNE_RT_A** — Unemployment by sex and age – annual data (Eurostat)  ⭐ recommended
   Has all three dimensions you need: age (7 ranges), sex (F/M/total), geo (38 countries).
   Annual data from 2003 to 2025. Clean structure, no extra mandatory filters.

2. **LFSA_URGAED** — Unemployment rates by educational attainment level (Eurostat)
   Also has age (29 ranges!) and sex, but adds a mandatory education-level dimension
   (ISCED11). More granular age breakdown, but requires choosing an education filter.
   Best if you also want to break down by education.

3. **MET_LFU3RT** — Unemployment rates by sex, age and metropolitan region (Eurostat)
   Has age and sex, but geo is at metropolitan region level — not country level.
   Not suitable for country comparisons.

Which one would you like to explore? You can also say "the first one" or
describe more precisely what you need.
Wait for the user's choice before proceeding.

从验证通过的候选集中选择3-5个展示给用户。对每个候选集,确认包含哪些预期维度,说明额外特性或限制。
按如下格式展示(使用对话对应的语言,可按需调整):
我找到了这些可回答您问题的数据集:

1. **UNE_RT_A** — 按性别和年龄分组的失业数据(年度数据,Eurostat) ⭐ 推荐
   包含您需要的全部三个维度:年龄(7个区间)、性别(女/男/合计)、地理(38个国家)。
   覆盖2003年至2025年的年度数据,结构清晰,无额外必填筛选条件。

2. **LFSA_URGAED** — 按教育水平分组的失业率(Eurostat)
   也包含年龄(29个区间!)和性别维度,但额外增加了必填的教育水平维度(ISCED11)。
   年龄粒度更细,但需要选择教育筛选条件,如果您还需要按教育维度拆分可以选择该数据集。

3. **MET_LFU3RT** — 按性别、年龄和大都市区分组的失业率(Eurostat)
   包含年龄和性别维度,但地理维度是大都市区级别,而非国家级别,不适合做国家对比。

您想要探索哪个数据集?您也可以直接说“第一个”,或者更具体地描述您的需求。
等待用户选择后再继续后续步骤。

Phase 2 — Schema: explore the chosen dataflow

阶段2 — 结构:探索选中的数据流

Once the user has chosen, retrieve the structure and available codes for the dataflow.
用户做出选择后,拉取该数据流的结构和可用代码。

Default flow (Eurostat, OECD, ECB, etc.)

默认流程(Eurostat、OECD、ECB等)

Step 1 — get the codes actually present in the dataflow (real constraints):
bash
opensdmx constraints PRC_HICP_MANR
步骤1 — 获取数据流中实际存在的代码(真实约束):
bash
opensdmx constraints PRC_HICP_MANR

shows all dimensions with count and sample of codes

展示所有维度的代码数量和示例

opensdmx constraints PRC_HICP_MANR coicop
opensdmx constraints PRC_HICP_MANR coicop

shows full list of codes present in that dimension, with labels

展示该维度下所有实际存在的代码及标签


`opensdmx constraints` is the ground truth — it queries the `availableconstraint`
SDMX endpoint and returns only codes that actually exist in this specific dataflow.

Step 2 — get dimension order and structure:
```bash
opensdmx info PRC_HICP_MANR

`opensdmx constraints`返回的是真实数据,它会请求SDMX的`availableconstraint`端点,仅返回该特定数据流中实际存在的代码。

步骤2 — 获取维度顺序和结构:
```bash
opensdmx info PRC_HICP_MANR

(no --provider needed for Eurostat, it's the default)

Eurostat是默认提供方,无需加--provider参数


`opensdmx values` returns the **full codelist** (all theoretically possible codes),
not the codes actually present. Use it only when you need labels for codes you already
know are valid and `opensdmx constraints` doesn't provide enough detail.

**Never use `opensdmx values` to validate filter codes.** A code present in the codelist
may return no data if it doesn't exist in this specific dataflow.

`opensdmx values`返回的是**完整代码表**(所有理论上可能的代码),而非实际存在的代码。仅当你需要已知有效代码的标签、且`opensdmx constraints`没有提供足够细节时再使用该命令。

**绝对不要用`opensdmx values`验证筛选代码**,代码表中存在的代码如果不在该特定数据流里,查询可能返回空结果。

ISTAT fast flow (recommended)

ISTAT快速流程(推荐)

ISTAT's
availableconstraint
endpoint is extremely slow and often times out on large datasets (e.g. municipal-level data with thousands of territory codes). Use this faster flow instead:
Step 1 — get dimension order and structure:
bash
opensdmx info <dataflow_id> --provider istat
Step 2 — explore codelist values for the dimensions you need to filter:
bash
opensdmx values <dataflow_id> REF_AREA --provider istat
opensdmx values <dataflow_id> DATA_TYPE --provider istat
values
returns the full codelist (all theoretically possible codes). For ISTAT this is usually reliable enough because ISTAT codelists tend to be well-aligned with actual data. Use
grep -i
to find specific codes (e.g. city names, indicators).
Step 3 — go directly to
get
with filters and verify with a narrow query:
bash
opensdmx get <dataflow_id> --provider istat --REF_AREA <code> --last-n 1
If the query returns a 404 or empty result, the code may not be present in this dataflow. Only then fall back to
opensdmx constraints
to check which codes are actually available — but be aware it may take 30–60+ seconds or time out on large datasets.
ISTAT的
availableconstraint
端点速度极慢,处理大数据集时经常超时(例如包含数千个地区代码的市级数据),请使用以下更快的流程:
步骤1 — 获取维度顺序和结构:
bash
opensdmx info <dataflow_id> --provider istat
步骤2 — 探索需要筛选的维度的代码表值:
bash
opensdmx values <dataflow_id> REF_AREA --provider istat
opensdmx values <dataflow_id> DATA_TYPE --provider istat
values
返回完整代码表(所有理论上可能的代码),对ISTAT来说通常足够可靠,因为ISTAT的代码表和实际数据匹配度很高。使用
grep -i
查找特定代码(例如城市名、指标)。
步骤3 — 直接使用筛选条件调用
get
,用窄范围查询验证:
bash
opensdmx get <dataflow_id> --provider istat --REF_AREA <code> --last-n 1
如果查询返回404或空结果,说明该代码可能不在该数据流中。只有这种情况下再回退到
opensdmx constraints
检查实际可用的代码——但请注意,处理大数据集时该命令可能需要30-60秒以上甚至超时。

Extract from both flows

从两种流程中提取信息

Parse the output and extract:
  • Dimension list in order (position matters for URL construction later)
  • Available codes for each dimension, with descriptions
  • Time range (StartPeriod / EndPeriod)
  • Dimensions with more than one available value (these are the meaningful filters)

解析输出并提取以下内容:
  • 维度列表的顺序(后续构造URL时顺序很重要)
  • 每个维度的可用代码及描述
  • 时间范围(StartPeriod / EndPeriod)
  • 有多个可用值的维度(这些是有意义的筛选维度)

Phase 3 — Presentation: explain the dataset to the user

阶段3 — 展示:向用户解释数据集

Synthesize what you learned in Phase 2 into a clear, human-readable summary. The goal is for the user to understand the dataset without knowing SDMX.
Structure your summary like this:
将你在阶段2获取的信息整合成清晰易懂的摘要,目标是让用户无需了解SDMX就能明白数据集内容。
按以下结构组织摘要:

What the dataset contains

数据集内容

Describe the subject matter in plain language.
用平实的语言描述数据集主题。

Granularity

粒度

  • Geographic: national only? Regions? EU countries? Global?
  • Temporal: what years are available? Annual, monthly, quarterly?
  • 地理粒度:仅国家级?地区级?欧盟国家?全球范围?
  • 时间粒度:覆盖哪些年份?年度、月度还是季度?

Key dimensions to filter

核心筛选维度

List only the dimensions with more than one available value that are meaningful for the user's question. For each, show the options in plain language:
- **Country** (GEO): IT (Italy), DE (Germany), FR (France)… 35 countries
- **Indicator** (INDIC_DE): live births (GBIRTHS), deaths (DEATH),
  crude birth rate (CNBIRTHS), total fertility rate (TOTFERRT)…
- **Period**: 1960 to 2024, annual frequency
For dimensions with only one available value, mention them briefly: "Other dimensions have a single fixed value and are included automatically."
仅列出有多个可用值、且对用户问题有意义的维度,对每个维度用平实语言展示选项:
- **国家**(GEO):IT(意大利)、DE(德国)、FR(法国)… 共35个国家
- **指标**(INDIC_DE):活产数(GBIRTHS)、死亡数(DEATH)、粗出生率(CNBIRTHS)、总和生育率(TOTFERRT)…
- **时间段**:1960年至2024年,年度频率
对仅单个可用值的维度简要说明即可:“其他维度均为固定值,会自动包含在结果中。”

Estimated size

预估数据规模

Give a rough sense of scale: "Downloading everything (all countries + all indicators + all years) would give you approximately X rows." This helps the user decide how to filter.
给出大致的数据量级:“下载全量数据(所有国家+所有指标+所有年份)大约会得到X行数据”,帮助用户决定如何筛选。

Invitation to choose

引导用户选择

End with a clear prompt:
How would you like to proceed?
- Do you want data for a specific country or a European comparison?
- Which time period are you interested in?
- Are there any dimensions you want to filter?

Tell me what you want and I'll build the query.

以清晰的提示结束:
您希望如何继续?
- 您需要特定国家的数据还是欧盟范围的对比数据?
- 您感兴趣的时间段是?
- 您有没有需要筛选的维度?

告诉我您的需求,我会构建对应的查询。

Phase 4 — Data retrieval: after the user decides

阶段4 — 数据获取:用户确认需求后

Once the user has specified their choices, build the query and fetch the data.
用户明确需求后,构建查询并拉取数据。

Building the query — critical rules

构建查询 — 核心规则

  1. Dimension order must match the
    opensdmx info
    output exactly.
    Never guess the order.
  2. Use only codes confirmed by
    opensdmx constraints
    , never codes from
    opensdmx values
    or other sources. Providers often return 404 or empty results for invalid codes.
  3. For dimensions with a single available value, include that value — don't skip them.
  4. For unfiltered dimensions (user wants all values), use
    .
    as wildcard.
Note: Eurostat dimension flags are lowercase (
--geo
,
--coicop
,
--freq
). ISTAT dimension flags are uppercase (
--REF_AREA
,
--DATA_TYPE
,
--FREQ
).
  1. 维度顺序必须和
    opensdmx info
    的输出完全一致
    ,绝对不要猜测顺序。
  2. 仅使用
    opensdmx constraints
    确认过的代码
    ,不要使用
    opensdmx values
    或其他来源的代码,提供方通常会对无效代码返回404或空结果。
  3. 对于仅单个可用值的维度,必须包含该值,不要省略。
  4. 对于未筛选的维度(用户需要所有值),使用
    .
    作为通配符。
注意:Eurostat的维度参数是小写(
--geo
--coicop
--freq
),ISTAT的维度参数是大写(
--REF_AREA
--DATA_TYPE
--FREQ
)。

Step 1 — Verify with a preview (last observation)

步骤1 — 预览验证(最新观测值)

Before fetching everything, do a quick sanity check with
--last-n 1
to confirm the query is valid and the data looks correct:
bash
opensdmx get PRC_HICP_MANR --coicop CP00 --geo IT+DE+FR --start-period 2020 --end-period 2023 --last-n 1
--last-n 1
returns the most recent observation per series (one row per country/dimension combination), which is enough to verify the query structure without flooding the output. Prefer
--last-n 1
over
--first-n N
for previews: it shows the most recent data and produces far fewer rows when there are many series.
For ISTAT: use a narrow time range (1–2 years) as preview.
Show the user those few rows and confirm the data makes sense (right columns, right units, no unexpected flags). A one-line comment is enough: "Query works — here is the latest observation per series."
拉取全量数据前,先用
--last-n 1
做快速校验,确认查询有效、数据符合预期:
bash
opensdmx get PRC_HICP_MANR --coicop CP00 --geo IT+DE+FR --start-period 2020 --end-period 2023 --last-n 1
--last-n 1
会返回每个序列的最新观测值(每个国家/维度组合返回一行),足够验证查询结构,不会输出过多内容。预览优先使用
--last-n 1
而非
--first-n N
:它展示最新数据,且序列很多时输出的行数少得多。
对于ISTAT:使用窄时间范围(1-2年)作为预览。
向用户展示这几行数据,确认数据合理(列正确、单位正确、无意外标记),一行说明即可:“查询有效,以下是每个序列的最新观测值。”

Step 2 — Provide the download URL

步骤2 — 提供下载URL

Build and show the equivalent curl command so the user can download the full dataset independently, without relying on the CLI:
Eurostat URL pattern:
https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/{dataflow_id}/{dim1.dim2...}/ALL/?startPeriod={start}&endPeriod={end}&format=SDMX-CSV
Dimension values in the path must follow the exact order from
opensdmx info
, with
.
for unfiltered dimensions and
+
for multiple values.
Example:
bash
curl "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/APRO_CPNH1/A.I2200.AR./ALL/?startPeriod=2014&endPeriod=2023&format=SDMX-CSV"
ISTAT URL pattern:
https://esploradati.istat.it/SDMXWS/rest/data/{dataflow_id}/{dim1.dim2...}?startPeriod={start}&endPeriod={end}
构建并展示等效的curl命令,用户无需依赖CLI即可独立下载全量数据集:
Eurostat URL格式:
https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/{dataflow_id}/{dim1.dim2...}/ALL/?startPeriod={start}&endPeriod={end}&format=SDMX-CSV
路径中的维度值必须严格遵循
opensdmx info
返回的顺序,未筛选的维度用
.
,多个值用
+
拼接。
示例:
bash
curl "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/APRO_CPNH1/A.I2200.AR./ALL/?startPeriod=2014&endPeriod=2023&format=SDMX-CSV"
ISTAT URL格式:
https://esploradati.istat.it/SDMXWS/rest/data/{dataflow_id}/{dim1.dim2...}?startPeriod={start}&endPeriod={end}

Step 3 — Ask the user what to do next

步骤3 — 询问用户后续操作

End with a short, clear question:
Would you like me to download and save the full dataset locally for analysis?
If yes, tell me where to save it (e.g. /tmp/data.csv) or I'll use a default path.
I can then run a quick analysis: row count, top values, time range, flagged records.
If the user says yes, download with
--out <path>
and run a quick analysis (row count, top values, time range, any missing/flagged data worth noting).
以简短清晰的问题结束:
需要我下载全量数据集保存到本地用于分析吗?如果需要,请告诉我保存路径(例如/tmp/data.csv),我也可以使用默认路径。下载后我可以做快速分析:行数统计、top值、时间范围、标记记录等。
如果用户同意,使用
--out <path>
下载数据,并做快速分析(行数、top值、时间范围、值得注意的缺失/标记数据)。

Step 4 — Offer to save the query as a reusable template

步骤4 — 建议将查询保存为可复用模板

After a successful download, always ask the user if they want to save the query:
This query worked well. Would you like to save it as a YAML template so you can
re-run it later without remembering all the parameters?

  opensdmx get <id> [filters] --out data.csv --query-file my_query.yaml

The YAML captures provider, dataset, filters with human-readable descriptions,
and time range. To re-run it later:

  opensdmx run my_query.yaml
  opensdmx run my_query.yaml --out fresh_data.csv

The file is also useful for version control and sharing with colleagues.
Always suggest a meaningful filename (e.g.
unemployment_eu_2020_2024.yaml
,
gdp_annual_eurostat.yaml
) based on the dataset and filters used.
下载成功后,始终询问用户是否需要保存查询:
该查询运行正常,需要我将其保存为YAML模板吗?这样后续您无需记住所有参数即可重新运行:

  opensdmx get <id> [filters] --out data.csv --query-file my_query.yaml

YAML文件会记录提供方、数据集、带可读性描述的筛选条件和时间范围。后续重新运行只需执行:

  opensdmx run my_query.yaml
  opensdmx run my_query.yaml --out fresh_data.csv

该文件也适合版本控制和与同事共享。
始终根据使用的数据集和筛选条件推荐有意义的文件名(例如
unemployment_eu_2020_2024.yaml
gdp_annual_eurostat.yaml
)。

Step 5 — Offer metadata and README

步骤5 — 提供元数据和README

After downloading, always ask the user two optional extras:
Two optional extras:
1. **Full metadata**: do you want the complete list of codes and labels for each
   dimension (e.g. all country names, all flag meanings)? I can extract them from
   the opensdmx cache and save them as a companion file (e.g. `metadata.csv`).
2. **README**: do you want a `README.md` that documents the dataset schema —
   columns, dimension codes with labels, flag meanings, units, and source URL?
   Useful if you plan to share the data or revisit it later.
If the user says yes to metadata:
  • Run
    opensdmx constraints <dataflow_id> <dim>
    for each dimension with more than one value to get the full code → label mapping.
  • Combine all dimensions into a single metadata file with columns:
    dimension
    ,
    code
    ,
    label
    .
  • Save it alongside the data file (e.g.
    tomato_production_metadata.csv
    ).
If the user says yes to README:
  • Generate a
    README.md
    in the same folder as the data file.
  • The goal is to make the output verifiable (check values against the source), evaluable (judge quality and scope), and repeatable (reproduce from scratch).
  • Follow the full template in references/readme-template.md. In summary, include:
    • Files produced (table)
    • One section per source dataflow: ID, provider, filters with labels, unit, last update date, and the exact download URL
    • Derivations: join keys, filters applied after download, computed columns with explicit formulas (not prose)
    • Column schema: name, type, description, unit for every column in the output
    • Flag legend: only flags actually present in the data, with row counts
    • Coverage table when geographic or categorical gaps exist
    • Caveats: scope limitations, reporting lags, known biases
下载完成后,始终询问用户是否需要两个可选附加内容:
两个可选附加内容:
1. **完整元数据**:需要我提取每个维度的完整代码和标签列表(例如所有国家名称、所有标记的含义)吗?我可以从opensdmx缓存中提取这些内容,保存为配套文件(例如`metadata.csv`)。
2. **README**:需要我生成`README.md`记录数据集结构吗?包含列说明、带标签的维度代码、标记含义、单位和源URL,如果您后续要共享数据或者重新查看会很方便。
如果用户同意要元数据
  • 对每个有多个值的维度运行
    opensdmx constraints <dataflow_id> <dim>
    获取完整的代码→标签映射。
  • 将所有维度的映射合并为单个元数据文件,列名为:
    dimension
    code
    label
  • 保存到数据文件同目录下(例如
    tomato_production_metadata.csv
    )。
如果用户同意要README
  • 在数据文件同目录下生成
    README.md
  • 目标是让输出可验证(可对照源检查值)、可评估(可判断质量和范围)、可复现(可从零开始复现结果)。
  • 遵循references/readme-template.md的完整模板,简要来说包含:
    • 生成的文件列表(表格形式)
    • 每个源数据流的板块:ID、提供方、带标签的筛选条件、单位、最后更新日期、精确的下载URL
    • 衍生逻辑:关联键、下载后应用的筛选条件、带明确公式的计算列(而非文字描述)
    • 列结构:输出中每一列的名称、类型、描述、单位
    • 标记说明:仅包含数据中实际存在的标记,以及对应的行数
    • 存在地理或分类缺口时的覆盖范围表
    • 注意事项:范围限制、报告延迟、已知偏差

Step 6 — Visualization

步骤6 — 可视化

After downloading data, offer to create charts using
opensdmx plot
. The plot command uses plotnine (Python's ggplot2) and accepts both dataflow IDs and local files (.csv, .tsv, .parquet).
For the complete visualization reference — Grammar of Graphics concepts, data preparation rules, DuckDB examples, iterative chart quality loop, and common fixes — see references/visualization.md.
Supported chart types (via
--geom
):
  • line
    (default): line chart with points — best for time series
  • bar
    : vertical bar chart — best for comparing values across categories over time; with
    --color
    produces stacked bars
  • barh
    : horizontal bar chart — best for rankings; bars are automatically sorted by value (lowest at bottom, highest at top)
  • point
    : scatter plot — best for correlations between two numeric variables
For other chart types (heatmaps, faceted plots), write a short Python script using plotnine directly.
Key points:
  • Always prepare data with DuckDB before plotting (separate units, limit series, remove aggregates, use year strings for annual data)
  • After generating a chart, read the image and evaluate it — if it's not good, fix it yourself before showing the user
  • Multiple focused charts are better than one overloaded chart

下载数据后,可提议使用
opensdmx plot
创建图表。plot命令使用plotnine(Python版的ggplot2),支持数据流ID和本地文件(.csv、.tsv、.parquet)。
完整的可视化参考——图形语法概念、数据预处理规则、DuckDB示例、迭代优化图表质量的流程、常见问题修复——请查看references/visualization.md
支持的图表类型(通过
--geom
参数指定):
  • line
    (默认):带点的折线图——最适合时间序列
  • bar
    :垂直柱状图——最适合跨类别对比跨时间的数值;搭配
    --color
    可生成堆叠柱状图
  • barh
    :水平柱状图——最适合排名;柱状图会自动按值排序(最低在底部,最高在顶部)
  • point
    :散点图——最适合两个数值变量的相关性分析
其他图表类型(热力图、分面图)可直接用plotnine写短Python脚本实现。
核心要点:
  • 绘图前始终用DuckDB预处理数据(拆分单位、限制序列数量、移除聚合值、年度数据使用年份字符串)
  • 生成图表后读取图片评估质量,如果效果不好先自行优化再展示给用户
  • 多个聚焦的图表比一个负载过重的图表效果更好

Key principles

核心原则

Constraints vs codelists — always use constraints Use
opensdmx constraints
to get codes actually present in the data.
opensdmx values
returns the full codelist (all possible codes), which may include codes absent from a specific dataflow. A code valid in the codelist may return no data if it doesn't exist in that dataflow.
Proposals, not lists When presenting dataflow candidates, reason about each one: explain why it might or might not answer the question, what its limitations are, and which one you'd recommend. The user should feel guided, not overwhelmed.
Explain dimensions in plain language Translate SDMX dimension IDs into human concepts:
  • CITIZENSHIP_MOTHER
    → "mother's citizenship"
  • DATA_TYPE: LBIRTH
    → "live births (absolute count)"
  • GEO: IT
    → "Italy"
  • INDIC_DE: GBIRTHS
    → "live births"
  • INDIC_DE: CNBIRTHS
    → "crude birth rate (per 1,000 inhabitants)" Never show raw codes without an explanation.
Explore all columns, not just the value column When the preview arrives (Step 1 of Phase 4), look at all columns in the response, not just the observation value. SDMX datasets often include extra columns that affect interpretation: quality flags, confidentiality markers, unit multipliers, notes. For each non-obvious column, check what values are present and explain their meaning to the user. For example:
  • OBS_FLAG
    or
    OBS_STATUS
    : quality/availability flags — look up what each code means in the context of that provider (
    b
    = break in series,
    e
    = estimated,
    n
    = not significant,
    u
    = unreliable,
    p
    = provisional, etc.)
  • UNIT_MULT
    : multiplier applied to the value (e.g.
    3
    means values are in thousands)
  • CONF_STATUS
    : confidentiality status
  • NOTE_*
    : free-text annotations attached to specific dimensions
Don't hardcode these — inspect what columns are actually present in the data and explain the ones that are populated. Skip columns that are entirely empty.
Provider-specific quirks
ProviderNotes
ISTATUse
--provider istat
; 404 = "NoRecordsFound" (not a server error); rate limit ~13s; some IDs are parent containers (e.g.
25_74
) — use sub-dataflow IDs (e.g.
25_74_DF_DCIS_NATI2_1
); use the ISTAT fast flow (info → values → get) instead of constraints — the
availableconstraint
endpoint is very slow and often times out on large datasets
EurostatDefault provider (no
--provider
flag needed); dimension flags are lowercase (
--geo
,
--coicop
); country codes: ISO 3166-1 alpha-2 + EU aggregates like
EU27_2020
OECDUse
--provider oecd
; good for international comparisons
ECBUse
--provider ecb
; financial and monetary data
World BankUse
--provider worldbank
; single-dataflow architecture — all 1400+ indicators live inside one dataflow called
WDI
; do NOT use
opensdmx search
to find indicators (it only returns
WDI
); use
opensdmx values WDI SERIES --provider worldbank | grep -i <topic>
to find indicator codes;
availableconstraint
is not supported (returns 400) — skip
opensdmx constraints
and use
opensdmx values
directly; country codes are ISO 3166-1 alpha-3 (
USA
,
DEU
,
ITA
, not
US
/
DE
/
IT
); NOTE: data requests currently fail with HTTP 401/307 due to a known bug (see GitHub issue #5) — as a workaround, suggest equivalent OECD datasets for macro/GDP indicators
World Bank flow (different from all other providers)
World Bank exposes a single dataflow
WDI
containing all indicators. The exploration flow is different — do NOT follow the standard Phase 1 search:
bash
undefined
约束 vs 代码表 — 始终使用约束 使用
opensdmx constraints
获取数据中实际存在的代码。
opensdmx values
返回完整代码表(所有可能的代码),可能包含特定数据流中不存在的代码,代码表中有效的代码如果不在该数据流里,查询可能返回空结果。
给出建议,而非仅罗列选项 展示数据流候选集时,说明每个候选集的优劣:解释它是否能回答问题、有什么限制、你推荐哪一个。让用户感受到引导,而不是被大量选项淹没。
用平实语言解释维度 将SDMX维度ID转换为易懂的概念:
  • CITIZENSHIP_MOTHER
    → "母亲的国籍"
  • DATA_TYPE: LBIRTH
    → "活产数(绝对数量)"
  • GEO: IT
    → "意大利"
  • INDIC_DE: GBIRTHS
    → "活产数"
  • INDIC_DE: CNBIRTHS
    → "粗出生率(每千人口)" 绝对不要展示没有解释的原始代码。
探索所有列,而非仅数值列 拿到预览结果(阶段4步骤1)时,查看返回的所有列,而非仅观测值列。SDMX数据集通常包含影响数据解读的额外列:质量标记、保密标记、单位乘数、注释。对每个不直观的列,检查存在的值并向用户解释含义,例如:
  • OBS_FLAG
    OBS_STATUS
    :质量/可用性标记——查找该提供方上下文下每个代码的含义(
    b
    =序列中断、
    e
    =估算值、
    n
    =不显著、
    u
    =不可靠、
    p
    = provisional( provisional是临时的意思,哦不对,这里保留术语?不,翻译的话:
    p
    =临时值 等)
  • UNIT_MULT
    :数值的乘数(例如
    3
    表示数值单位为千)
  • CONF_STATUS
    :保密状态
  • NOTE_*
    :附加在特定维度上的自由文本注释
不要硬编码这些说明——检查数据中实际存在的列,解释有值的列,跳过全空的列。
各提供方的特殊注意事项
提供方说明
ISTAT使用
--provider istat
;404=“无记录”(不是服务器错误);接口限速~13秒;部分ID是父容器(例如
25_74
)——使用子数据流ID(例如
25_74_DF_DCIS_NATI2_1
);使用ISTAT快速流程(info→values→get)而非constraints——
availableconstraint
端点速度极慢,处理大数据集时经常超时
Eurostat默认提供方(无需加
--provider
参数);维度参数为小写(
--geo
--coicop
);国家代码:ISO 3166-1 alpha-2 + 欧盟聚合代码例如
EU27_2020
OECD使用
--provider oecd
;适合做国际对比
ECB使用
--provider ecb
;提供金融和货币数据
世界银行使用
--provider worldbank
单数据流架构——所有1400+指标都在名为
WDI
的单个数据流中;不要用
opensdmx search
查找指标(只会返回
WDI
);使用
opensdmx values WDI SERIES --provider worldbank | grep -i <主题>
查找指标代码;不支持
availableconstraint
(返回400)——跳过
opensdmx constraints
直接使用
opensdmx values
;国家代码是ISO 3166-1 alpha-3
USA
DEU
ITA
,不是
US
/
DE
/
IT
);注意:由于已知bug(见GitHub issue #5),目前数据请求会返回HTTP 401/307错误——作为替代方案,可推荐OECD对应的宏观/GDP指标数据集
世界银行流程(和其他所有提供方不同)
世界银行仅暴露单个数据流
WDI
包含所有指标,探索流程不同——不要遵循标准的阶段1搜索流程:
bash
undefined

Step 1 — find the indicator code (replaces opensdmx search)

步骤1 — 查找指标代码(替代opensdmx search)

opensdmx values WDI SERIES --provider worldbank 2>&1 | grep -i "gdp per capita"
opensdmx values WDI SERIES --provider worldbank 2>&1 | grep -i "gdp per capita"

→ NY_GDP_PCAP_KD (constant USD), NY_GDP_PCAP_PP_KD (PPP), etc.

→ NY_GDP_PCAP_KD (不变价美元)、NY_GDP_PCAP_PP_KD (购买力平价)等

Step 2 — get structure

步骤2 — 获取结构

opensdmx info WDI --provider worldbank
opensdmx info WDI --provider worldbank

→ 3 dimensions: FREQ · SERIES · REF_AREA

→ 3个维度:FREQ · SERIES · REF_AREA

Step 3 — find country codes (alpha-3, not alpha-2)

步骤3 — 查找国家代码(alpha-3,不是alpha-2)

opensdmx values WDI REF_AREA --provider worldbank 2>&1 | grep -i "italy|germany"
opensdmx values WDI REF_AREA --provider worldbank 2>&1 | grep -i "italy|germany"

→ ITA, DEU (not IT, DE)

→ ITA、DEU(不是IT、DE)

Step 4 — build query (skip constraints — endpoint returns 400)

步骤4 — 构建查询(跳过constraints——端点返回400)

opensdmx get WDI --provider worldbank --SERIES NY_GDP_PCAP_KD
--REF_AREA ITA+DEU+FRA --start-period 2000

If data retrieval fails with HTTP 401/307 (known bug, issue #5), offer the equivalent
OECD dataset as a workaround — OECD publishes most macro indicators (GDP, employment,
prices) with comparable coverage.

**Territory resolution (Eurostat)**
Country codes follow ISO 3166-1 alpha-2: `IT` (Italy), `DE` (Germany), `FR` (France),
`ES` (Spain). EU aggregates: `EU27_2020`, `EA20` (Euro area). Always verify against
`opensdmx constraints` before using — not all codes are present in every dataset.

**Territory resolution (ISTAT)**
ISTAT uses numeric REF_AREA codes (6-digit municipal codes, province codes, region codes,
and aggregate codes like `ITG12` for provinces or `SLL_*` for labour market areas).
Use `opensdmx values <dataflow_id> REF_AREA --provider istat` to browse the full
codelist — pipe through `grep -i` to find specific cities or territories:

```bash
opensdmx values <dataflow_id> REF_AREA --provider istat 2>&1 | grep -i "palermo\|matera"
Avoid
opensdmx constraints <dataflow_id> REF_AREA --provider istat
on municipal-level datasets — the
availableconstraint
endpoint is very slow with thousands of codes. Use
values
+
grep
to find codes, then verify with a narrow
get
query.
opensdmx get WDI --provider worldbank --SERIES NY_GDP_PCAP_KD
--REF_AREA ITA+DEU+FRA --start-period 2000

如果数据请求返回HTTP 401/307(已知bug,issue #5),推荐对应的OECD数据集作为替代——OECD发布大部分宏观指标(GDP、就业、物价),覆盖范围可比。

**地区代码解析(Eurostat)**
国家代码遵循ISO 3166-1 alpha-2:`IT`(意大利)、`DE`(德国)、`FR`(法国)、`ES`(西班牙)。欧盟聚合代码:`EU27_2020`、`EA20`(欧元区)。使用前始终通过`opensdmx constraints`验证——不是所有代码都存在于每个数据集中。

**地区代码解析(ISTAT)**
ISTAT使用数字REF_AREA代码(6位市级代码、省级代码、地区代码,以及聚合代码例如省份用`ITG12`、劳动力市场区域用`SLL_*`)。使用`opensdmx values <dataflow_id> REF_AREA --provider istat`浏览完整代码表——通过管道接`grep -i`查找特定城市或地区:

```bash
opensdmx values <dataflow_id> REF_AREA --provider istat 2>&1 | grep -i "palermo\|matera"
处理市级数据集时避免使用
opensdmx constraints <dataflow_id> REF_AREA --provider istat
——
availableconstraint
端点处理数千个代码时速度极慢。使用
values
+
grep
查找代码,然后用窄范围
get
查询验证。