sdmx-explorer
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseSDMX Explorer — Guided Dataset Discovery
SDMX Explorer — 引导式数据集发现
This skill uses the opensdmx CLI to explore any SDMX 2.1 REST endpoint:
Eurostat, OECD, ECB, World Bank, ISTAT, and others.
The primary reference provider is Eurostat (default in the opensdmx CLI).
All examples use Eurostat unless stated otherwise.
Every command supports — run it first to discover options and
see usage examples:
opensdmx--helpbash
opensdmx --help # list all commands
opensdmx search --help # options and examples for search
opensdmx constraints --help # options and examples for constraints
opensdmx get --help # options and examples for get
opensdmx run --help # options and examples for run该技能使用opensdmx CLI探索所有SDMX 2.1 REST端点:
Eurostat、OECD、ECB、世界银行、ISTAT等。
默认优先使用的提供方是Eurostat(opensdmx CLI的默认配置)。
除非另有说明,所有示例均使用Eurostat数据源。
所有命令都支持参数,先运行该命令可查看选项和使用示例:
opensdmx--helpbash
opensdmx --help # 列出所有命令
opensdmx search --help # search命令的选项和示例
opensdmx constraints --help # constraints命令的选项和示例
opensdmx get --help # get命令的选项和示例
opensdmx run --help # run命令的选项和示例... same for info, values, constraints, embed, blacklist, plot
info、values、constraints、embed、blacklist、plot等命令同理
This skill runs a four-phase interactive loop. Always follow the phases in order.
The goal is to help the user understand the data landscape and make informed choices,
not to fetch data immediately.
---
该技能运行四阶段的交互式循环,请始终按顺序执行各阶段。目标是帮助用户了解数据全貌、做出合理选择,而非直接拉取数据。
---Phase 1 — Discovery: find candidate dataflows
阶段1 — 发现:查找候选数据流
Identify which SDMX provider is relevant (ISTAT for Italian statistics, Eurostat for
European statistics, OECD for international comparisons, etc.). If unclear, ask.
确定相关的SDMX提供方(意大利统计数据用ISTAT、欧洲统计数据用Eurostat、国际对比数据用OECD等),如果不明确可询问用户。
Step 1a — Extract keywords AND expected dimensions
步骤1a — 提取关键词和预期维度
Before searching, parse the user's question on two levels:
-
Topic keywords (2–4 terms) for thecall. Example: "unemployment", "labour force"
opensdmx search -
Expected dimensions — the analytical angles the user wants to slice by. These are often NOT in the dataset title or description, but must appear as dimensions in the dataflow structure. Extract them explicitly:
User says Expected dimension "by age group" / "per fascia di età" age"by sex" / "per sesso" sex"by country" / "per paese" geo"by region" (NUTS level)geo"by education level" or similarisced11"quarterly" / "monthly" freqExample: "unemployment for EU countries, by age group and sex" → topic keywords:; expected dimensions:unemployment,age,sex.geo
搜索前,从两个层面解析用户的问题:
-
主题关键词(2-4个词)用于调用,示例:"unemployment"、"labour force"
opensdmx search -
预期维度——用户想要拆分数据的分析角度,这些通常不会出现在数据集标题或描述中,但必须作为维度存在于数据流结构里,请明确提取:
| 用户表述 | 预期维度 |
|---|---|
| "按年龄分组" / "per fascia di età" | |
| "按性别" / "per sesso" | |
| "按国家" / "per paese" | |
| "按地区" | |
| "按教育水平" | |
| "季度" / "月度" | |
示例:"欧盟国家的失业数据,按年龄和性别分组" → 主题关键词:;预期维度:、、。
unemploymentagesexgeoStep 1b — Search and pre-filter candidates
步骤1b — 搜索和预过滤候选集
Search for dataflows:
- Eurostat (default provider — no flag needed):
--provideropensdmx search "<keyword>" - ISTAT:
opensdmx search "<keyword>" --provider istat - Other providers: (available:
opensdmx search "<keyword>" --provider <name>,oecd,ecb,worldbank,insee,bundesbank)abs
To see the full list of built-in providers with their descriptions, run:
opensdmx providersFrom the search results, pick 5–8 plausible candidates by title relevance.
Then run on each one in parallel to check their dimension
list. Keep only the candidates that contain all expected dimensions.
Discard candidates missing a required dimension — even if the title looks right.
opensdmx info <id>If page 1 (20 results) yields no strong candidates, paginate before giving up:
bash
opensdmx search "unemployment" --page 2 # results 21-40
opensdmx search "unemployment" --page 3 # results 41-60The title shows the total available (e.g. ), so you know how many
pages exist. Keep paginating until you find at least 3 plausible candidates or
exhaust the results. Only after exhausting pagination should you try a different
keyword or provider. Use only as a last resort (may produce very long output).
21-40 of 114--allIf keyword search returns 0 or very few results (< 3), offer semantic search:
"I didn't find much with a keyword search. I can try a semantic search instead — it matches by meaning, not exact words, so it can find datasets even when the terminology differs. It requires Ollama to be running and is slower (10–30 s). Want me to try?"
If the user agrees:
bash
opensdmx search --semantic "<query>"Semantic search returns the top 20 results ranked by similarity score. Pick the
most relevant candidates (score > 0.5) and continue with Step 1c as normal.
bash
undefined搜索数据流:
- Eurostat(默认提供方,无需加参数):
--provideropensdmx search "<keyword>" - ISTAT:
opensdmx search "<keyword>" --provider istat - 其他提供方:(可用值:
opensdmx search "<keyword>" --provider <name>、oecd、ecb、worldbank、insee、bundesbank)abs
要查看所有内置提供方的完整列表和描述,运行:
opensdmx providers从搜索结果中按标题相关性筛选5-8个合理候选集,然后并行对每个候选集运行检查维度列表,仅保留包含所有预期维度的候选集。即便标题看起来匹配,缺失必填维度的候选集也要丢弃。
opensdmx info <id>如果第1页(20条结果)没有合适的候选集,请先翻页再放弃:
bash
opensdmx search "unemployment" --page 2 # 第21-40条结果
opensdmx search "unemployment" --page 3 # 第41-60条结果标题会显示总结果数(例如),由此可知总页数。持续翻页直到找到至少3个合理候选集或遍历完所有结果。只有遍历完所有分页后,再尝试更换关键词或提供方。仅在万不得已时使用参数(可能会输出非常长的内容)。
21-40 of 114--all如果关键词搜索返回0条或极少结果(<3条),可提供语义搜索选项:
"我通过关键词搜索没找到太多匹配结果,还可以尝试语义搜索,它按含义匹配而非精确词汇,即便术语不同也能找到对应数据集。语义搜索需要运行Ollama,速度较慢(10-30秒),要我试试吗?"
如果用户同意:
bash
opensdmx search --semantic "<query>"语义搜索会按相似度排名返回前20条结果,挑选最相关的候选集(相似度>0.5),照常进入步骤1c。
bash
undefinedExample: verify age and sex are present
示例:验证是否存在age和sex维度
opensdmx info UNE_RT_A # ✓ has age, sex, geo → keep
opensdmx info TIPSUN20 # ✗ no age, no sex → discard
undefinedopensdmx info UNE_RT_A # ✓ 包含age、sex、geo → 保留
opensdmx info TIPSUN20 # ✗ 没有age、sex → 丢弃
undefinedStep 1c — Present verified candidates
步骤1c — 展示验证通过的候选集
From the verified candidates, select 3–5 and present them. For each, confirm
which expected dimensions are present and note any extras or limitations.
Present them like this (use the conversation language; adapt as needed):
I found these datasets that could answer your question:
1. **UNE_RT_A** — Unemployment by sex and age – annual data (Eurostat) ⭐ recommended
Has all three dimensions you need: age (7 ranges), sex (F/M/total), geo (38 countries).
Annual data from 2003 to 2025. Clean structure, no extra mandatory filters.
2. **LFSA_URGAED** — Unemployment rates by educational attainment level (Eurostat)
Also has age (29 ranges!) and sex, but adds a mandatory education-level dimension
(ISCED11). More granular age breakdown, but requires choosing an education filter.
Best if you also want to break down by education.
3. **MET_LFU3RT** — Unemployment rates by sex, age and metropolitan region (Eurostat)
Has age and sex, but geo is at metropolitan region level — not country level.
Not suitable for country comparisons.
Which one would you like to explore? You can also say "the first one" or
describe more precisely what you need.Wait for the user's choice before proceeding.
从验证通过的候选集中选择3-5个展示给用户。对每个候选集,确认包含哪些预期维度,说明额外特性或限制。
按如下格式展示(使用对话对应的语言,可按需调整):
我找到了这些可回答您问题的数据集:
1. **UNE_RT_A** — 按性别和年龄分组的失业数据(年度数据,Eurostat) ⭐ 推荐
包含您需要的全部三个维度:年龄(7个区间)、性别(女/男/合计)、地理(38个国家)。
覆盖2003年至2025年的年度数据,结构清晰,无额外必填筛选条件。
2. **LFSA_URGAED** — 按教育水平分组的失业率(Eurostat)
也包含年龄(29个区间!)和性别维度,但额外增加了必填的教育水平维度(ISCED11)。
年龄粒度更细,但需要选择教育筛选条件,如果您还需要按教育维度拆分可以选择该数据集。
3. **MET_LFU3RT** — 按性别、年龄和大都市区分组的失业率(Eurostat)
包含年龄和性别维度,但地理维度是大都市区级别,而非国家级别,不适合做国家对比。
您想要探索哪个数据集?您也可以直接说“第一个”,或者更具体地描述您的需求。等待用户选择后再继续后续步骤。
Phase 2 — Schema: explore the chosen dataflow
阶段2 — 结构:探索选中的数据流
Once the user has chosen, retrieve the structure and available codes for the dataflow.
用户做出选择后,拉取该数据流的结构和可用代码。
Default flow (Eurostat, OECD, ECB, etc.)
默认流程(Eurostat、OECD、ECB等)
Step 1 — get the codes actually present in the dataflow (real constraints):
bash
opensdmx constraints PRC_HICP_MANR步骤1 — 获取数据流中实际存在的代码(真实约束):
bash
opensdmx constraints PRC_HICP_MANRshows all dimensions with count and sample of codes
展示所有维度的代码数量和示例
opensdmx constraints PRC_HICP_MANR coicop
opensdmx constraints PRC_HICP_MANR coicop
shows full list of codes present in that dimension, with labels
展示该维度下所有实际存在的代码及标签
`opensdmx constraints` is the ground truth — it queries the `availableconstraint`
SDMX endpoint and returns only codes that actually exist in this specific dataflow.
Step 2 — get dimension order and structure:
```bash
opensdmx info PRC_HICP_MANR
`opensdmx constraints`返回的是真实数据,它会请求SDMX的`availableconstraint`端点,仅返回该特定数据流中实际存在的代码。
步骤2 — 获取维度顺序和结构:
```bash
opensdmx info PRC_HICP_MANR(no --provider needed for Eurostat, it's the default)
Eurostat是默认提供方,无需加--provider参数
`opensdmx values` returns the **full codelist** (all theoretically possible codes),
not the codes actually present. Use it only when you need labels for codes you already
know are valid and `opensdmx constraints` doesn't provide enough detail.
**Never use `opensdmx values` to validate filter codes.** A code present in the codelist
may return no data if it doesn't exist in this specific dataflow.
`opensdmx values`返回的是**完整代码表**(所有理论上可能的代码),而非实际存在的代码。仅当你需要已知有效代码的标签、且`opensdmx constraints`没有提供足够细节时再使用该命令。
**绝对不要用`opensdmx values`验证筛选代码**,代码表中存在的代码如果不在该特定数据流里,查询可能返回空结果。ISTAT fast flow (recommended)
ISTAT快速流程(推荐)
ISTAT's endpoint is extremely slow and often times out on large
datasets (e.g. municipal-level data with thousands of territory codes). Use this faster
flow instead:
availableconstraintStep 1 — get dimension order and structure:
bash
opensdmx info <dataflow_id> --provider istatStep 2 — explore codelist values for the dimensions you need to filter:
bash
opensdmx values <dataflow_id> REF_AREA --provider istat
opensdmx values <dataflow_id> DATA_TYPE --provider istatvaluesgrep -iStep 3 — go directly to with filters and verify with a narrow query:
getbash
opensdmx get <dataflow_id> --provider istat --REF_AREA <code> --last-n 1If the query returns a 404 or empty result, the code may not be present in this dataflow.
Only then fall back to to check which codes are actually
available — but be aware it may take 30–60+ seconds or time out on large datasets.
opensdmx constraintsISTAT的端点速度极慢,处理大数据集时经常超时(例如包含数千个地区代码的市级数据),请使用以下更快的流程:
availableconstraint步骤1 — 获取维度顺序和结构:
bash
opensdmx info <dataflow_id> --provider istat步骤2 — 探索需要筛选的维度的代码表值:
bash
opensdmx values <dataflow_id> REF_AREA --provider istat
opensdmx values <dataflow_id> DATA_TYPE --provider istatvaluesgrep -i步骤3 — 直接使用筛选条件调用,用窄范围查询验证:
getbash
opensdmx get <dataflow_id> --provider istat --REF_AREA <code> --last-n 1如果查询返回404或空结果,说明该代码可能不在该数据流中。只有这种情况下再回退到检查实际可用的代码——但请注意,处理大数据集时该命令可能需要30-60秒以上甚至超时。
opensdmx constraintsExtract from both flows
从两种流程中提取信息
Parse the output and extract:
- Dimension list in order (position matters for URL construction later)
- Available codes for each dimension, with descriptions
- Time range (StartPeriod / EndPeriod)
- Dimensions with more than one available value (these are the meaningful filters)
解析输出并提取以下内容:
- 维度列表的顺序(后续构造URL时顺序很重要)
- 每个维度的可用代码及描述
- 时间范围(StartPeriod / EndPeriod)
- 有多个可用值的维度(这些是有意义的筛选维度)
Phase 3 — Presentation: explain the dataset to the user
阶段3 — 展示:向用户解释数据集
Synthesize what you learned in Phase 2 into a clear, human-readable summary.
The goal is for the user to understand the dataset without knowing SDMX.
Structure your summary like this:
将你在阶段2获取的信息整合成清晰易懂的摘要,目标是让用户无需了解SDMX就能明白数据集内容。
按以下结构组织摘要:
What the dataset contains
数据集内容
Describe the subject matter in plain language.
用平实的语言描述数据集主题。
Granularity
粒度
- Geographic: national only? Regions? EU countries? Global?
- Temporal: what years are available? Annual, monthly, quarterly?
- 地理粒度:仅国家级?地区级?欧盟国家?全球范围?
- 时间粒度:覆盖哪些年份?年度、月度还是季度?
Key dimensions to filter
核心筛选维度
List only the dimensions with more than one available value that are meaningful
for the user's question. For each, show the options in plain language:
- **Country** (GEO): IT (Italy), DE (Germany), FR (France)… 35 countries
- **Indicator** (INDIC_DE): live births (GBIRTHS), deaths (DEATH),
crude birth rate (CNBIRTHS), total fertility rate (TOTFERRT)…
- **Period**: 1960 to 2024, annual frequencyFor dimensions with only one available value, mention them briefly:
"Other dimensions have a single fixed value and are included automatically."
仅列出有多个可用值、且对用户问题有意义的维度,对每个维度用平实语言展示选项:
- **国家**(GEO):IT(意大利)、DE(德国)、FR(法国)… 共35个国家
- **指标**(INDIC_DE):活产数(GBIRTHS)、死亡数(DEATH)、粗出生率(CNBIRTHS)、总和生育率(TOTFERRT)…
- **时间段**:1960年至2024年,年度频率对仅单个可用值的维度简要说明即可:“其他维度均为固定值,会自动包含在结果中。”
Estimated size
预估数据规模
Give a rough sense of scale: "Downloading everything (all countries + all indicators +
all years) would give you approximately X rows." This helps the user decide how to filter.
给出大致的数据量级:“下载全量数据(所有国家+所有指标+所有年份)大约会得到X行数据”,帮助用户决定如何筛选。
Invitation to choose
引导用户选择
End with a clear prompt:
How would you like to proceed?
- Do you want data for a specific country or a European comparison?
- Which time period are you interested in?
- Are there any dimensions you want to filter?
Tell me what you want and I'll build the query.以清晰的提示结束:
您希望如何继续?
- 您需要特定国家的数据还是欧盟范围的对比数据?
- 您感兴趣的时间段是?
- 您有没有需要筛选的维度?
告诉我您的需求,我会构建对应的查询。Phase 4 — Data retrieval: after the user decides
阶段4 — 数据获取:用户确认需求后
Once the user has specified their choices, build the query and fetch the data.
用户明确需求后,构建查询并拉取数据。
Building the query — critical rules
构建查询 — 核心规则
- Dimension order must match the output exactly. Never guess the order.
opensdmx info - Use only codes confirmed by , never codes from
opensdmx constraintsor other sources. Providers often return 404 or empty results for invalid codes.opensdmx values - For dimensions with a single available value, include that value — don't skip them.
- For unfiltered dimensions (user wants all values), use as wildcard.
.
Note: Eurostat dimension flags are lowercase (, , ).
ISTAT dimension flags are uppercase (, , ).
--geo--coicop--freq--REF_AREA--DATA_TYPE--FREQ- 维度顺序必须和的输出完全一致,绝对不要猜测顺序。
opensdmx info - 仅使用确认过的代码,不要使用
opensdmx constraints或其他来源的代码,提供方通常会对无效代码返回404或空结果。opensdmx values - 对于仅单个可用值的维度,必须包含该值,不要省略。
- 对于未筛选的维度(用户需要所有值),使用作为通配符。
.
注意:Eurostat的维度参数是小写(、、),ISTAT的维度参数是大写(、、)。
--geo--coicop--freq--REF_AREA--DATA_TYPE--FREQStep 1 — Verify with a preview (last observation)
步骤1 — 预览验证(最新观测值)
Before fetching everything, do a quick sanity check with to confirm
the query is valid and the data looks correct:
--last-n 1bash
opensdmx get PRC_HICP_MANR --coicop CP00 --geo IT+DE+FR --start-period 2020 --end-period 2023 --last-n 1--last-n 1--last-n 1--first-n NFor ISTAT: use a narrow time range (1–2 years) as preview.
Show the user those few rows and confirm the data makes sense (right columns, right units,
no unexpected flags). A one-line comment is enough: "Query works — here is the latest observation per series."
拉取全量数据前,先用做快速校验,确认查询有效、数据符合预期:
--last-n 1bash
opensdmx get PRC_HICP_MANR --coicop CP00 --geo IT+DE+FR --start-period 2020 --end-period 2023 --last-n 1--last-n 1--last-n 1--first-n N对于ISTAT:使用窄时间范围(1-2年)作为预览。
向用户展示这几行数据,确认数据合理(列正确、单位正确、无意外标记),一行说明即可:“查询有效,以下是每个序列的最新观测值。”
Step 2 — Provide the download URL
步骤2 — 提供下载URL
Build and show the equivalent curl command so the user can download the full dataset
independently, without relying on the CLI:
Eurostat URL pattern:
https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/{dataflow_id}/{dim1.dim2...}/ALL/?startPeriod={start}&endPeriod={end}&format=SDMX-CSVDimension values in the path must follow the exact order from ,
with for unfiltered dimensions and for multiple values.
opensdmx info.+Example:
bash
curl "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/APRO_CPNH1/A.I2200.AR./ALL/?startPeriod=2014&endPeriod=2023&format=SDMX-CSV"ISTAT URL pattern:
https://esploradati.istat.it/SDMXWS/rest/data/{dataflow_id}/{dim1.dim2...}?startPeriod={start}&endPeriod={end}构建并展示等效的curl命令,用户无需依赖CLI即可独立下载全量数据集:
Eurostat URL格式:
https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/{dataflow_id}/{dim1.dim2...}/ALL/?startPeriod={start}&endPeriod={end}&format=SDMX-CSV路径中的维度值必须严格遵循返回的顺序,未筛选的维度用,多个值用拼接。
opensdmx info.+示例:
bash
curl "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/APRO_CPNH1/A.I2200.AR./ALL/?startPeriod=2014&endPeriod=2023&format=SDMX-CSV"ISTAT URL格式:
https://esploradati.istat.it/SDMXWS/rest/data/{dataflow_id}/{dim1.dim2...}?startPeriod={start}&endPeriod={end}Step 3 — Ask the user what to do next
步骤3 — 询问用户后续操作
End with a short, clear question:
Would you like me to download and save the full dataset locally for analysis?
If yes, tell me where to save it (e.g. /tmp/data.csv) or I'll use a default path.
I can then run a quick analysis: row count, top values, time range, flagged records.If the user says yes, download with and run a quick analysis
(row count, top values, time range, any missing/flagged data worth noting).
--out <path>以简短清晰的问题结束:
需要我下载全量数据集保存到本地用于分析吗?如果需要,请告诉我保存路径(例如/tmp/data.csv),我也可以使用默认路径。下载后我可以做快速分析:行数统计、top值、时间范围、标记记录等。如果用户同意,使用下载数据,并做快速分析(行数、top值、时间范围、值得注意的缺失/标记数据)。
--out <path>Step 4 — Offer to save the query as a reusable template
步骤4 — 建议将查询保存为可复用模板
After a successful download, always ask the user if they want to save the query:
This query worked well. Would you like to save it as a YAML template so you can
re-run it later without remembering all the parameters?
opensdmx get <id> [filters] --out data.csv --query-file my_query.yaml
The YAML captures provider, dataset, filters with human-readable descriptions,
and time range. To re-run it later:
opensdmx run my_query.yaml
opensdmx run my_query.yaml --out fresh_data.csv
The file is also useful for version control and sharing with colleagues.Always suggest a meaningful filename (e.g. ,
) based on the dataset and filters used.
unemployment_eu_2020_2024.yamlgdp_annual_eurostat.yaml下载成功后,始终询问用户是否需要保存查询:
该查询运行正常,需要我将其保存为YAML模板吗?这样后续您无需记住所有参数即可重新运行:
opensdmx get <id> [filters] --out data.csv --query-file my_query.yaml
YAML文件会记录提供方、数据集、带可读性描述的筛选条件和时间范围。后续重新运行只需执行:
opensdmx run my_query.yaml
opensdmx run my_query.yaml --out fresh_data.csv
该文件也适合版本控制和与同事共享。始终根据使用的数据集和筛选条件推荐有意义的文件名(例如、)。
unemployment_eu_2020_2024.yamlgdp_annual_eurostat.yamlStep 5 — Offer metadata and README
步骤5 — 提供元数据和README
After downloading, always ask the user two optional extras:
Two optional extras:
1. **Full metadata**: do you want the complete list of codes and labels for each
dimension (e.g. all country names, all flag meanings)? I can extract them from
the opensdmx cache and save them as a companion file (e.g. `metadata.csv`).
2. **README**: do you want a `README.md` that documents the dataset schema —
columns, dimension codes with labels, flag meanings, units, and source URL?
Useful if you plan to share the data or revisit it later.If the user says yes to metadata:
- Run for each dimension with more than one value to get the full code → label mapping.
opensdmx constraints <dataflow_id> <dim> - Combine all dimensions into a single metadata file with columns:
,
dimension,code.label - Save it alongside the data file (e.g. ).
tomato_production_metadata.csv
If the user says yes to README:
- Generate a in the same folder as the data file.
README.md - The goal is to make the output verifiable (check values against the source), evaluable (judge quality and scope), and repeatable (reproduce from scratch).
- Follow the full template in references/readme-template.md.
In summary, include:
- Files produced (table)
- One section per source dataflow: ID, provider, filters with labels, unit, last update date, and the exact download URL
- Derivations: join keys, filters applied after download, computed columns with explicit formulas (not prose)
- Column schema: name, type, description, unit for every column in the output
- Flag legend: only flags actually present in the data, with row counts
- Coverage table when geographic or categorical gaps exist
- Caveats: scope limitations, reporting lags, known biases
下载完成后,始终询问用户是否需要两个可选附加内容:
两个可选附加内容:
1. **完整元数据**:需要我提取每个维度的完整代码和标签列表(例如所有国家名称、所有标记的含义)吗?我可以从opensdmx缓存中提取这些内容,保存为配套文件(例如`metadata.csv`)。
2. **README**:需要我生成`README.md`记录数据集结构吗?包含列说明、带标签的维度代码、标记含义、单位和源URL,如果您后续要共享数据或者重新查看会很方便。如果用户同意要元数据:
- 对每个有多个值的维度运行获取完整的代码→标签映射。
opensdmx constraints <dataflow_id> <dim> - 将所有维度的映射合并为单个元数据文件,列名为:、
dimension、code。label - 保存到数据文件同目录下(例如)。
tomato_production_metadata.csv
如果用户同意要README:
- 在数据文件同目录下生成。
README.md - 目标是让输出可验证(可对照源检查值)、可评估(可判断质量和范围)、可复现(可从零开始复现结果)。
- 遵循references/readme-template.md的完整模板,简要来说包含:
- 生成的文件列表(表格形式)
- 每个源数据流的板块:ID、提供方、带标签的筛选条件、单位、最后更新日期、精确的下载URL
- 衍生逻辑:关联键、下载后应用的筛选条件、带明确公式的计算列(而非文字描述)
- 列结构:输出中每一列的名称、类型、描述、单位
- 标记说明:仅包含数据中实际存在的标记,以及对应的行数
- 存在地理或分类缺口时的覆盖范围表
- 注意事项:范围限制、报告延迟、已知偏差
Step 6 — Visualization
步骤6 — 可视化
After downloading data, offer to create charts using .
The plot command uses plotnine (Python's ggplot2) and accepts both dataflow IDs
and local files (.csv, .tsv, .parquet).
opensdmx plotFor the complete visualization reference — Grammar of Graphics concepts, data
preparation rules, DuckDB examples, iterative chart quality loop, and common
fixes — see references/visualization.md.
Supported chart types (via ):
--geom- (default): line chart with points — best for time series
line - : vertical bar chart — best for comparing values across categories over time; with
barproduces stacked bars--color - : horizontal bar chart — best for rankings; bars are automatically sorted by value (lowest at bottom, highest at top)
barh - : scatter plot — best for correlations between two numeric variables
point
For other chart types (heatmaps, faceted plots), write a short Python script
using plotnine directly.
Key points:
- Always prepare data with DuckDB before plotting (separate units, limit series, remove aggregates, use year strings for annual data)
- After generating a chart, read the image and evaluate it — if it's not good, fix it yourself before showing the user
- Multiple focused charts are better than one overloaded chart
下载数据后,可提议使用创建图表。plot命令使用plotnine(Python版的ggplot2),支持数据流ID和本地文件(.csv、.tsv、.parquet)。
opensdmx plot完整的可视化参考——图形语法概念、数据预处理规则、DuckDB示例、迭代优化图表质量的流程、常见问题修复——请查看references/visualization.md。
支持的图表类型(通过参数指定):
--geom- (默认):带点的折线图——最适合时间序列
line - :垂直柱状图——最适合跨类别对比跨时间的数值;搭配
bar可生成堆叠柱状图--color - :水平柱状图——最适合排名;柱状图会自动按值排序(最低在底部,最高在顶部)
barh - :散点图——最适合两个数值变量的相关性分析
point
其他图表类型(热力图、分面图)可直接用plotnine写短Python脚本实现。
核心要点:
- 绘图前始终用DuckDB预处理数据(拆分单位、限制序列数量、移除聚合值、年度数据使用年份字符串)
- 生成图表后读取图片评估质量,如果效果不好先自行优化再展示给用户
- 多个聚焦的图表比一个负载过重的图表效果更好
Key principles
核心原则
Constraints vs codelists — always use constraints
Use to get codes actually present in the data.
returns the full codelist (all possible codes), which may include
codes absent from a specific dataflow. A code valid in the codelist may return no data
if it doesn't exist in that dataflow.
opensdmx constraintsopensdmx valuesProposals, not lists
When presenting dataflow candidates, reason about each one: explain why it might or
might not answer the question, what its limitations are, and which one you'd recommend.
The user should feel guided, not overwhelmed.
Explain dimensions in plain language
Translate SDMX dimension IDs into human concepts:
- → "mother's citizenship"
CITIZENSHIP_MOTHER - → "live births (absolute count)"
DATA_TYPE: LBIRTH - → "Italy"
GEO: IT - → "live births"
INDIC_DE: GBIRTHS - → "crude birth rate (per 1,000 inhabitants)" Never show raw codes without an explanation.
INDIC_DE: CNBIRTHS
Explore all columns, not just the value column
When the preview arrives (Step 1 of Phase 4), look at all columns in the response,
not just the observation value. SDMX datasets often include extra columns that affect
interpretation: quality flags, confidentiality markers, unit multipliers, notes.
For each non-obvious column, check what values are present and explain their meaning
to the user. For example:
- or
OBS_FLAG: quality/availability flags — look up what each code means in the context of that provider (OBS_STATUS= break in series,b= estimated,e= not significant,n= unreliable,u= provisional, etc.)p - : multiplier applied to the value (e.g.
UNIT_MULTmeans values are in thousands)3 - : confidentiality status
CONF_STATUS - : free-text annotations attached to specific dimensions
NOTE_*
Don't hardcode these — inspect what columns are actually present in the data and
explain the ones that are populated. Skip columns that are entirely empty.
Provider-specific quirks
| Provider | Notes |
|---|---|
| ISTAT | Use |
| Eurostat | Default provider (no |
| OECD | Use |
| ECB | Use |
| World Bank | Use |
World Bank flow (different from all other providers)
World Bank exposes a single dataflow containing all indicators. The exploration
flow is different — do NOT follow the standard Phase 1 search:
WDIbash
undefined约束 vs 代码表 — 始终使用约束
使用获取数据中实际存在的代码。返回完整代码表(所有可能的代码),可能包含特定数据流中不存在的代码,代码表中有效的代码如果不在该数据流里,查询可能返回空结果。
opensdmx constraintsopensdmx values给出建议,而非仅罗列选项
展示数据流候选集时,说明每个候选集的优劣:解释它是否能回答问题、有什么限制、你推荐哪一个。让用户感受到引导,而不是被大量选项淹没。
用平实语言解释维度
将SDMX维度ID转换为易懂的概念:
- → "母亲的国籍"
CITIZENSHIP_MOTHER - → "活产数(绝对数量)"
DATA_TYPE: LBIRTH - → "意大利"
GEO: IT - → "活产数"
INDIC_DE: GBIRTHS - → "粗出生率(每千人口)" 绝对不要展示没有解释的原始代码。
INDIC_DE: CNBIRTHS
探索所有列,而非仅数值列
拿到预览结果(阶段4步骤1)时,查看返回的所有列,而非仅观测值列。SDMX数据集通常包含影响数据解读的额外列:质量标记、保密标记、单位乘数、注释。对每个不直观的列,检查存在的值并向用户解释含义,例如:
- 或
OBS_FLAG:质量/可用性标记——查找该提供方上下文下每个代码的含义(OBS_STATUS=序列中断、b=估算值、e=不显著、n=不可靠、u= provisional( provisional是临时的意思,哦不对,这里保留术语?不,翻译的话:p=临时值 等)p - :数值的乘数(例如
UNIT_MULT表示数值单位为千)3 - :保密状态
CONF_STATUS - :附加在特定维度上的自由文本注释
NOTE_*
不要硬编码这些说明——检查数据中实际存在的列,解释有值的列,跳过全空的列。
各提供方的特殊注意事项
| 提供方 | 说明 |
|---|---|
| ISTAT | 使用 |
| Eurostat | 默认提供方(无需加 |
| OECD | 使用 |
| ECB | 使用 |
| 世界银行 | 使用 |
世界银行流程(和其他所有提供方不同)
世界银行仅暴露单个数据流包含所有指标,探索流程不同——不要遵循标准的阶段1搜索流程:
WDIbash
undefinedStep 1 — find the indicator code (replaces opensdmx search)
步骤1 — 查找指标代码(替代opensdmx search)
opensdmx values WDI SERIES --provider worldbank 2>&1 | grep -i "gdp per capita"
opensdmx values WDI SERIES --provider worldbank 2>&1 | grep -i "gdp per capita"
→ NY_GDP_PCAP_KD (constant USD), NY_GDP_PCAP_PP_KD (PPP), etc.
→ NY_GDP_PCAP_KD (不变价美元)、NY_GDP_PCAP_PP_KD (购买力平价)等
Step 2 — get structure
步骤2 — 获取结构
opensdmx info WDI --provider worldbank
opensdmx info WDI --provider worldbank
→ 3 dimensions: FREQ · SERIES · REF_AREA
→ 3个维度:FREQ · SERIES · REF_AREA
Step 3 — find country codes (alpha-3, not alpha-2)
步骤3 — 查找国家代码(alpha-3,不是alpha-2)
opensdmx values WDI REF_AREA --provider worldbank 2>&1 | grep -i "italy|germany"
opensdmx values WDI REF_AREA --provider worldbank 2>&1 | grep -i "italy|germany"
→ ITA, DEU (not IT, DE)
→ ITA、DEU(不是IT、DE)
Step 4 — build query (skip constraints — endpoint returns 400)
步骤4 — 构建查询(跳过constraints——端点返回400)
opensdmx get WDI --provider worldbank --SERIES NY_GDP_PCAP_KD
--REF_AREA ITA+DEU+FRA --start-period 2000
--REF_AREA ITA+DEU+FRA --start-period 2000
If data retrieval fails with HTTP 401/307 (known bug, issue #5), offer the equivalent
OECD dataset as a workaround — OECD publishes most macro indicators (GDP, employment,
prices) with comparable coverage.
**Territory resolution (Eurostat)**
Country codes follow ISO 3166-1 alpha-2: `IT` (Italy), `DE` (Germany), `FR` (France),
`ES` (Spain). EU aggregates: `EU27_2020`, `EA20` (Euro area). Always verify against
`opensdmx constraints` before using — not all codes are present in every dataset.
**Territory resolution (ISTAT)**
ISTAT uses numeric REF_AREA codes (6-digit municipal codes, province codes, region codes,
and aggregate codes like `ITG12` for provinces or `SLL_*` for labour market areas).
Use `opensdmx values <dataflow_id> REF_AREA --provider istat` to browse the full
codelist — pipe through `grep -i` to find specific cities or territories:
```bash
opensdmx values <dataflow_id> REF_AREA --provider istat 2>&1 | grep -i "palermo\|matera"Avoid on municipal-level
datasets — the endpoint is very slow with thousands of codes.
Use + to find codes, then verify with a narrow query.
opensdmx constraints <dataflow_id> REF_AREA --provider istatavailableconstraintvaluesgrepgetopensdmx get WDI --provider worldbank --SERIES NY_GDP_PCAP_KD
--REF_AREA ITA+DEU+FRA --start-period 2000
--REF_AREA ITA+DEU+FRA --start-period 2000
如果数据请求返回HTTP 401/307(已知bug,issue #5),推荐对应的OECD数据集作为替代——OECD发布大部分宏观指标(GDP、就业、物价),覆盖范围可比。
**地区代码解析(Eurostat)**
国家代码遵循ISO 3166-1 alpha-2:`IT`(意大利)、`DE`(德国)、`FR`(法国)、`ES`(西班牙)。欧盟聚合代码:`EU27_2020`、`EA20`(欧元区)。使用前始终通过`opensdmx constraints`验证——不是所有代码都存在于每个数据集中。
**地区代码解析(ISTAT)**
ISTAT使用数字REF_AREA代码(6位市级代码、省级代码、地区代码,以及聚合代码例如省份用`ITG12`、劳动力市场区域用`SLL_*`)。使用`opensdmx values <dataflow_id> REF_AREA --provider istat`浏览完整代码表——通过管道接`grep -i`查找特定城市或地区:
```bash
opensdmx values <dataflow_id> REF_AREA --provider istat 2>&1 | grep -i "palermo\|matera"处理市级数据集时避免使用——端点处理数千个代码时速度极慢。使用+查找代码,然后用窄范围查询验证。
opensdmx constraints <dataflow_id> REF_AREA --provider istatavailableconstraintvaluesgrepget