# Concepts & Design
This page explains the design choices behind sdmxflow: append-only artifacts, incremental refresh logic, and why the outputs are shaped for warehouse ingestion.
## The mental model
sdmxflow is not a “query SDMX interactively” tool. It is a scheduled ingestion tool that turns an SDMX dataset into stable artifacts you can load repeatedly.
Key idea: you don’t mutate history. You append a new upstream version as a new slice in `dataset.csv` and keep the lineage in `metadata.json`.
## Append-only artifacts (and why)
sdmxflow writes an append-only facts file:
- `dataset.csv` keeps all downloaded versions over time.
- Each row is tagged with a `last_updated` value (UTC ISO-8601) that identifies the upstream version it came from.
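The append step can be sketched with the standard library. Only `dataset.csv` and the `last_updated` column come from the contract above; the helper name and the example columns (`geo`, `value`) are illustrative, not sdmxflow's actual internals:

```python
import csv
import os

def append_slice(path, rows, last_updated):
    """Append a downloaded slice to the facts file, tagging every row
    with the upstream version's last_updated timestamp (UTC ISO-8601)."""
    fieldnames = list(rows[0].keys()) + ["last_updated"]
    write_header = not os.path.exists(path)  # header only on first write
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        for row in rows:
            writer.writerow({**row, "last_updated": last_updated})

append_slice("dataset.csv",
             [{"geo": "BE", "value": "1.2"}],
             "2024-01-15T11:00:00Z")
```

Because the file is only ever opened in append mode, earlier slices are never rewritten.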
Why this is useful:
- Reproducibility: older versions remain available.
- Idempotent scheduling: a run that finds no upstream change does not rewrite the facts.
- Warehouse-friendly: you can load once and query “latest version” using `last_updated`.
**Gotcha:** Append-only means duplicates across versions are expected. The same “business key” row may appear multiple times (one per upstream version). Your downstream models should filter to the latest `last_updated` (or build a snapshot dimension).
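Downstream, “latest version per business key” can be computed with a plain dictionary; the column names here are illustrative. This works because UTC ISO-8601 strings sort chronologically:

```python
def latest_rows(rows, business_key, ts_col="last_updated"):
    """Keep only the newest version of each business-key row."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in business_key)
        # ISO-8601 UTC timestamps compare correctly as strings
        if key not in latest or row[ts_col] > latest[key][ts_col]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"geo": "BE", "value": "1.2", "last_updated": "2024-01-15T11:00:00Z"},
    {"geo": "BE", "value": "1.3", "last_updated": "2024-03-01T09:30:00Z"},
    {"geo": "FR", "value": "2.0", "last_updated": "2024-01-15T11:00:00Z"},
]
latest_rows(rows, business_key=["geo"])
```

In a warehouse you would express the same logic as a window function (`row_number() over (partition by <business key> order by last_updated desc)`).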
## Incremental refresh logic (what “changed?” means)
A single `fetch()` run:
- Ensures the output folder exists.
- Queries the provider for an upstream “last updated” timestamp.
- Loads `metadata.json` if present (otherwise initializes it).
- Compares the upstream timestamp to the most recently recorded `last_updated_data_at`.
- If different: downloads a new dataset slice and appends it; then updates metadata and codelists.
- If unchanged: skips the dataset download/append, but still ensures metadata and codelists are up to date.
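The steps above can be condensed into a sketch. The helper callables and every metadata field except `last_updated_data_at` are placeholders, not sdmxflow's real internals:

```python
import json
import os

def fetch(out_dir, get_upstream_ts, download_slice):
    """One scheduled run: append only when the upstream timestamp changed."""
    os.makedirs(out_dir, exist_ok=True)           # 1. ensure output folder
    upstream_ts = get_upstream_ts()               # 2. provider "last updated"
    meta_path = os.path.join(out_dir, "metadata.json")
    if os.path.exists(meta_path):                 # 3. load or initialize metadata
        with open(meta_path) as f:
            meta = json.load(f)
    else:
        meta = {"last_updated_data_at": None, "versions": []}
    if upstream_ts != meta["last_updated_data_at"]:   # 4. strict string comparison
        download_slice(upstream_ts)                   # 5. changed: append new slice
        meta["last_updated_data_at"] = upstream_ts
        meta["versions"].append({"last_updated": upstream_ts})
    with open(meta_path, "w") as f:               # 6. metadata refreshed either way
        json.dump(meta, f)
    return meta
```

Because step 4 is a strict equality check, an unchanged upstream timestamp makes the run a no-op on the facts file, which is what makes scheduled runs idempotent.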
In the current implementation:
- For `source_id="ESTAT"`, upstream change detection uses Eurostat’s SDMX metadata (“annotations”) via `sdmxflow.query.last_updated_data.eurostat_last_updated()`.
- The comparison is strict string equality on the canonical UTC ISO-8601 timestamp stored in `metadata.json`.
**Operational note:** If the upstream provider republishes data without updating their “last updated” signal, `sdmxflow` will not append a new slice. That’s a provider semantics issue, not a local-state issue. See Provider Support.
## Metadata history (lineage & reproducibility)
`metadata.json` is the audit trail for your ingestion:
- dataset identity (`agency_id`, `dataset_id`, `key`, `params`)
- timestamps: `created_at`, `last_fetched_at`, `last_updated_at`, `last_updated_data_at`
- `versions[]`: an append-only list of versions, with HTTP details and `rows_appended`
- `files`: relative paths for artifacts
- `codelists[]`: mapping from dataset columns to exported codelist CSVs
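For orientation, an abridged `metadata.json` might look like this. The field names follow the list above; the values and exact nesting are invented for illustration:

```json
{
  "agency_id": "ESTAT",
  "dataset_id": "nama_10_gdp",
  "key": null,
  "params": {},
  "created_at": "2024-01-15T11:00:00Z",
  "last_fetched_at": "2024-03-01T09:30:00Z",
  "last_updated_at": "2024-03-01T09:30:00Z",
  "last_updated_data_at": "2024-02-28T23:00:00Z",
  "versions": [
    {"last_updated": "2024-02-28T23:00:00Z", "rows_appended": 1234}
  ],
  "files": {"dataset": "dataset.csv"},
  "codelists": [{"column": "geo", "file": "codelists/GEO.csv"}]
}
```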
Why it exists:
- Lineage: where data came from (URL/status/headers when available)
- Governance: what changed and when
- Debuggability: what a run did without digging through logs
## Codelists and dataset dimensions
Many SDMX datasets encode dimensions/attributes as short codes (e.g., `SEX=M`). To interpret or join those columns in a warehouse model, you need the corresponding code → label tables.
sdmxflow:
- downloads SDMX structures (Dataflow + DSD)
- extracts the codelists referenced by dimensions/attributes
- writes one CSV per codelist under `codelists/` with columns `code`, `name`
- stores a column → codelist mapping in `metadata.json`
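In Python, a codelist CSV is a simple decode table. The `code,name` columns follow the convention above; the file contents and the `SEX` example are illustrative:

```python
import csv
import io

def load_codelist(f):
    """Read a code -> name mapping from a codelist CSV with columns code,name."""
    return {row["code"]: row["name"] for row in csv.DictReader(f)}

# Inline stand-in for a file like codelists/SEX.csv
sex = load_codelist(io.StringIO("code,name\nM,Male\nF,Female\nT,Total\n"))
sex["M"]
```

In a warehouse, the equivalent is loading each `codelists/*.csv` as a dimension table and joining on the code column.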
Practical implication: treat `codelists/*.csv` as reference tables in your warehouse.
## Deterministic artifacts (principles and gotchas)
sdmxflow aims for stable, deterministic outputs:
- Stable filenames (`dataset.csv`, `metadata.json`, `codelists/<ID>.csv`)
- Stable “slice tagging” via the `last_updated` column
- Strict schema matching when appending: if the provider CSV columns change, `sdmxflow` raises an error instead of silently corrupting the append history.
**Warning:** If the provider changes the dataset schema (columns), appends will fail with a metadata/schema error. In that case, treat it as a breaking upstream change and rebuild into a fresh `out_dir` (see FAQ & Troubleshooting).
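The strict-schema guard amounts to comparing the incoming header to the recorded one before each append; a minimal sketch, with an invented function name and exception message:

```python
def check_schema(existing_header, new_header):
    """Refuse to append when the provider's columns changed."""
    if existing_header != new_header:
        raise ValueError(
            f"schema mismatch: expected {existing_header}, got {new_header}"
        )

check_schema(["geo", "time", "value"], ["geo", "time", "value"])  # no error
```

Failing loudly here is the design choice: a column rename or addition silently appended would make old and new slices incomparable, so the tool refuses and leaves the history intact.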
## Glossary (SDMX terms)
- SDMX: Statistical Data and Metadata eXchange, a standard for statistical data exchange.
- Dataflow: A dataset definition/endpoint identifier in SDMX.
- DSD (Data Structure Definition): The schema describing dimensions, attributes, measures.
- Dimension: A column used to identify a slice of data (e.g., time, geography).
- Attribute: Additional descriptor columns (may also use codelists).
- Codelist: Reference list mapping a code to a human-readable label.
- Last updated: Provider-specific signal for when data changed; used by `sdmxflow` to decide whether to append.
Next:
- See Output Artifacts (Contract) for the stable on-disk contract.
- See Integration Patterns for warehouse loading examples.