# Configuration Reference
This page documents the current user-facing configuration surface: SdmxDataset parameters, defaults, outputs, logging, and error behavior.
## SdmxDataset
SdmxDataset is the primary entrypoint.
### Constructor

```python
from pathlib import Path

from sdmxflow import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/my_dataset"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
)
```
Parameters:

- `out_dir` (required, `str | Path`): output directory for all artifacts. It is expanded and resolved.
- `source_id` (required, `str`): provider/source identifier.
- `dataset_id` (required, `str`): dataset identifier within the provider.
Optional parameters:

- `agency_id` (`str | None`, default: `None`)
  - For ESTAT, `agency_id` defaults to the upper-cased `source_id`.
- `key` (`str | dict[str, object] | None`, default: `""`)
  - Provider-specific dataset restriction.
  - For ESTAT, use `None` to request the full dataset. The empty string `""` means “provider default” (which is currently also the full dataset).
  - Dict keys are supported for Eurostat bulk downloads and are converted deterministically to an SDMX key string.
- `params` (`dict[str, object] | None`, default: `None`)
  - Provider-specific passthrough parameters.
  - For Eurostat bulk CSV downloads, `sdmxflow` recognizes common SDMX 3.0 bulk params such as time-window filters.
- `logger` (`logging.Logger | None`, default: `None`)
  - If omitted, `sdmxflow` uses a default library logger.
- `save_logs` (`bool`, default: `False`)
  - If `True`, writes a per-run log file under `<out_dir>/logs/`.
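To illustrate the dict-to-key conversion, the sketch below shows one deterministic way a dict key can be flattened into an SDMX key string. This is not `sdmxflow`'s actual implementation: the `dimension_order` argument, the `.` dimension separator, and the `+` multi-code separator follow common SDMX REST conventions and are assumptions here.

```python
def dict_to_sdmx_key(key: dict, dimension_order: list[str]) -> str:
    """Flatten a {dimension: code(s)} dict into an SDMX key string.

    Illustrative sketch only: dimensions are emitted in a fixed order,
    a missing dimension becomes an empty wildcard position, and lists
    of codes are joined with "+" as in SDMX REST key syntax.
    """
    parts = []
    for dim in dimension_order:
        value = key.get(dim, "")  # absent dimension -> wildcard
        if isinstance(value, (list, tuple)):
            value = "+".join(str(v) for v in value)  # multiple codes
        parts.append(str(value))
    return ".".join(parts)

dims = ["FREQ", "GEO", "SEX"]
print(dict_to_sdmx_key({"GEO": ["DE", "FR"], "FREQ": "A"}, dims))
# -> "A.DE+FR."
```

Because the emission order comes from `dimension_order` rather than dict insertion order, the same dict always yields the same key string.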
### Output paths

`sdmxflow` uses fixed names under `out_dir`:

- `dataset.csv`
- `metadata.json`
- `codelists/`
See Output Artifacts (Contract).
### setup()

Creates the output directory structure (safe to call multiple times):

- `<out_dir>/`
- `<out_dir>/codelists/`

`fetch()` calls `setup()` internally.
### fetch()

Runs one refresh cycle:

- Ensures local directories exist.
- Resolves the upstream “last updated” timestamp.
- Loads or initializes `metadata.json`.
- Appends a new slice only if upstream changed.
- Ensures codelists and updates metadata.
Return value:
- a `FetchResult` pointing at artifact paths and a boolean `appended`.
### Refresh semantics

- On “no change”: `dataset.csv` is not modified, but `metadata.json` is updated (e.g., `last_fetched_at`) and codelists are ensured.
- On “changed”: a new slice is downloaded and appended to `dataset.csv` with a leading `last_updated` column value.
## FetchResult

`FetchResult` fields:

- `out_dir`: output directory used for this fetch
- `dataset_csv`: path to the facts CSV (`<out_dir>/dataset.csv`)
- `metadata_json`: path to metadata (`<out_dir>/metadata.json`)
- `codelists_dir`: path to codelists directory (`<out_dir>/codelists/`)
- `appended`: whether a new upstream version was appended
Example:
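A sketch of typical usage, built only from the constructor and fields documented on this page (the printed messages are illustrative, not library output):

```python
from pathlib import Path

from sdmxflow import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/my_dataset"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
)
result = ds.fetch()

if result.appended:
    # A new upstream version was downloaded and appended.
    print(f"appended new slice: {result.dataset_csv}")
else:
    # No upstream change; metadata was still refreshed.
    print(f"already up to date: {result.metadata_json}")
```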
## Logging configuration

`sdmxflow` uses Python’s standard `logging` module and does not configure handlers.
Minimal configuration:
At INFO level, each fetch() emits exactly three user-facing messages:
- intention (what will be fetched and where)
- decision (download vs already up to date)
- completion summary (paths to artifacts)
For detailed diagnostics:
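Raising the library's logger to DEBUG is the usual approach. The logger name `"sdmxflow"` assumes the conventional package-named logger; adjust if the library documents a different name:

```python
import logging

# Assumes sdmxflow logs under its package name (the common convention).
logging.getLogger("sdmxflow").setLevel(logging.DEBUG)
```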
For per-run capture:
- `save_logs=True` writes `<out_dir>/logs/<agency>__<dataset>__<timestamp>.log`.
See Logging.
## Timeouts, retries, and network settings

Current behavior:

- `fetch()` does not expose a top-level `timeout_seconds` or retry policy.
- Some lower-level download components support timeouts internally, but they are not yet part of the stable public API.
Operational guidance:
- implement retries/backoff in your scheduler (Airflow/Prefect/Kubernetes)
- use `save_logs=True` to capture diagnostics
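Until a public retry policy exists, a small backoff wrapper in the calling code can stand in. This generic sketch is not part of `sdmxflow`:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff between failures."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical): with_retries(ds.fetch, attempts=3, base_delay=30.0)
```

In a scheduler such as Airflow or Prefect, prefer the platform's own retry settings over an in-process loop, so failures stay visible in the orchestrator.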
## Errors and exceptions

`fetch()` raises typed `sdmxflow` errors for common operational failure modes:

- `SdmxDownloadError`: unsupported provider or a failed download
- `SdmxTimeoutError`: upstream request timed out
- `SdmxUnreachableError`: DNS/connection failures
- `SdmxInterruptedError`: user interruption (Ctrl+C)
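A sketch of distinguishing these failure modes in calling code; the assumption that the exception classes are importable from the top-level `sdmxflow` package is not confirmed by this page:

```python
from sdmxflow import (
    SdmxDataset,
    SdmxDownloadError,
    SdmxTimeoutError,
    SdmxUnreachableError,
)

ds = SdmxDataset(
    out_dir="./out/my_dataset",
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
)
try:
    ds.fetch()
except SdmxTimeoutError:
    pass  # transient: let the scheduler retry later
except SdmxUnreachableError:
    pass  # network problem: check DNS/connectivity first
except SdmxDownloadError:
    pass  # provider/config problem: inspect the run logs
```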
Troubleshooting guidance is in FAQ & Troubleshooting.