Versioning & Publish¶
Publishing is the final step in the DBPort workflow. It writes data from DuckDB to the Iceberg warehouse with full metadata, codelist, and version tracking.
Basic publish¶
Publish modes¶
| Mode | Behavior |
|---|---|
| Default | Idempotent. Skips silently if the version is already completed. Resumes from checkpoint if interrupted. |
| Dry run | Schema validation only. No data is written. Useful for CI checks. |
| Refresh | Overwrites an existing version unconditionally. |
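The three modes can be sketched as follows. This is a minimal illustration of the semantics in the table above, not the real dbport API: the `publish` function and its keyword arguments are assumptions, and table properties are modeled as a plain dict.

```python
# Illustrative sketch of the three publish modes; names are assumptions.
COMPLETED_KEY = "dbport.upload.v2.{version}.completed"  # checkpoint property from the docs

def publish(table_props, version, *, dry_run=False, refresh=False):
    """Return what a publish of `version` would do, given existing table properties."""
    key = COMPLETED_KEY.format(version=version)
    if dry_run:
        return "validated"                    # schema checks only, nothing written
    if not refresh and table_props.get(key) == "true":
        return "skipped"                      # default mode: idempotent no-op
    table_props[key] = "true"                 # first run, or refresh: write the data
    return "published"

props = {}
assert publish(props, "2026-03-09") == "published"
assert publish(props, "2026-03-09") == "skipped"                   # already completed
assert publish(props, "2026-03-09", refresh=True) == "published"   # unconditional overwrite
assert publish(props, "2026-03-09", dry_run=True) == "validated"   # no data written
```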
Full lifecycle in one command¶
The CLI provides `dbp model run` to execute the hook and publish in a single step. This syncs state, executes the configured hook, and publishes — equivalent to calling `port.run(version="2026-03-09")` in Python. See Hooks & Execution for details.
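The lifecycle described above amounts to three ordered steps. A sketch with illustrative function names (these are not dbport internals):

```python
# Sketch of the run lifecycle: sync state, execute the hook, publish.
def run(version, hook, calls):
    calls.append("sync_state")              # refresh local state from the warehouse
    hook(version)                           # user-configured hook builds the data
    calls.append("publish")                 # write to Iceberg with version tracking

calls = []
run("2026-03-09", lambda v: calls.append(f"hook:{v}"), calls)
assert calls == ["sync_state", "hook:2026-03-09", "publish"]
```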
Pre-publish checks¶
These checks must pass before any data is written:
- Schema defined — fails if no schema has been declared.
- Version idempotency — if the version already completed, returns immediately (skipped in refresh mode).
- Schema drift — compares local schema to warehouse schema; fails with a diff if incompatible. Catalog connection failures also block the publish.
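A sketch of the gate these checks form, under assumed names. Schemas are modeled here as plain dicts and the drift diff as a set difference; the real validation is richer:

```python
# Illustrative pre-publish gate: each check can abort before any data is written.
def pre_publish_checks(local_schema, warehouse_schema, *, completed, refresh=False):
    if local_schema is None:
        raise ValueError("no schema declared")        # Schema defined check
    if completed and not refresh:
        return "skip"                                 # idempotent early return
    if warehouse_schema is not None and warehouse_schema != local_schema:
        diff = sorted(set(local_schema) ^ set(warehouse_schema))
        raise ValueError(f"schema drift: {diff}")     # fail with a diff
    return "proceed"

assert pre_publish_checks({"id": "int"}, {"id": "int"}, completed=False) == "proceed"
assert pre_publish_checks({"id": "int"}, None, completed=True) == "skip"
```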
Idempotency and checkpoints¶
Every publish writes checkpoint properties to the Iceberg table:
- `dbport.upload.v2.<version>.completed` — marks the version as done
- `dbport.upload.v2.<version>.rows_appended` — tracks the row count
If a publish is interrupted, the next run detects the incomplete checkpoint and resumes from where it left off. Re-running a completed version is a no-op.
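The resume decision can be sketched from the two checkpoint properties. The property keys follow the docs; the helper itself is illustrative:

```python
# Illustrative resume logic driven by the two checkpoint properties.
def resume_offset(props, version):
    """Return the row offset to resume from, or None if the version is done."""
    if props.get(f"dbport.upload.v2.{version}.completed") == "true":
        return None                                   # completed: re-run is a no-op
    return int(props.get(f"dbport.upload.v2.{version}.rows_appended", 0))

# An interrupted publish left 150000 rows appended:
props = {"dbport.upload.v2.2026-03-09.rows_appended": "150000"}
assert resume_offset(props, "2026-03-09") == 150000   # resume from the checkpoint
props["dbport.upload.v2.2026-03-09.completed"] = "true"
assert resume_offset(props, "2026-03-09") is None     # completed: no-op
```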
What happens on success¶
- Data is written to the Iceberg table at `<agency>.<dataset_id>`
- Codelists are auto-generated (or fetched from attached tables)
- `metadata.json` is materialized and embedded in Iceberg table properties
- A `VersionRecord` is appended to `dbport.lock`
- `created_at` is set on first publish only; `last_updated_at` is updated on every publish
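The timestamp rule in the last bullet can be sketched as follows; the record shape is an assumption:

```python
# Illustrative bookkeeping: created_at is written once, last_updated_at every time.
def record_publish(record, now):
    record.setdefault("created_at", now)   # first publish only
    record["last_updated_at"] = now        # every publish
    return record

rec = record_publish({}, "2026-03-09T10:00:00Z")
rec = record_publish(rec, "2026-03-10T10:00:00Z")
assert rec["created_at"] == "2026-03-09T10:00:00Z"       # unchanged by the second publish
assert rec["last_updated_at"] == "2026-03-10T10:00:00Z"  # always the latest
```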
Write strategy¶
The DuckDB iceberg extension is the primary write path.
Automatic streaming Arrow fallback¶
When the catalog does not support multi-table commits (e.g., returns 404 on the transactions endpoint), the adapter auto-switches to streaming Arrow:
- Streams 50K-row Arrow batches from DuckDB
- Each batch committed in a single pyiceberg transaction
- Checkpoint properties updated per batch
- On commit conflict: reload metadata, resume from remote checkpoint (max 5 retries)
- Peak memory: ~50K rows per batch (bounded)
This session-level fallback is transparent and sticky — once a DuckDB write fails, all subsequent writes in the session use Arrow without retrying the DuckDB path.
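The batch-and-retry loop above can be sketched as follows. The commit and conflict machinery is simulated here with plain functions rather than real pyiceberg calls, and the names are illustrative:

```python
# Illustrative streaming loop: fixed-size batches, one commit per batch,
# resume from the remote checkpoint on conflict, bounded retries.
class ConflictError(Exception):
    pass

BATCH = 50_000
MAX_RETRIES = 5

def stream_publish(total_rows, commit, read_remote_checkpoint):
    """Append rows in BATCH-sized transactions, resuming on commit conflicts."""
    offset, retries = read_remote_checkpoint(), 0
    while offset < total_rows:
        n = min(BATCH, total_rows - offset)
        try:
            commit(offset, n)                  # one transaction per batch
            offset += n                        # checkpoint advances with each commit
            retries = 0
        except ConflictError:
            retries += 1
            if retries > MAX_RETRIES:
                raise                          # give up after 5 retries
            offset = read_remote_checkpoint()  # reload metadata, resume from remote
    return offset

# Simulated remote: the second batch conflicts once before succeeding.
state = {"rows": 0, "conflicted": False}

def commit(offset, n):
    if offset == BATCH and not state["conflicted"]:
        state["conflicted"] = True
        raise ConflictError
    state["rows"] = offset + n

assert stream_publish(120_000, commit, lambda: state["rows"]) == 120_000
```

Because each commit also advances the checkpoint, a conflict only costs re-reading the remote offset — no batch is ever double-applied.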