
docs: architecture overview — concepts, lifecycle, sync kinds, plugin contracts #128

Open

turban wants to merge 5 commits into main from docs/platform-concepts

Conversation

Contributor

@turban turban commented May 13, 2026

Summary

Adds docs/architecture.md — a developer-facing reference that explains the platform's design, the relationships between its core concepts, and the consequences of key architectural decisions. Wired into the mkdocs nav under Concepts.

Why this document

Before merging feature work (especially new sync kinds from #127), we need a shared understanding of what the platform is, why it is structured this way, and what the rules are. This document is the foundation for that.

What it covers

  • Core concepts: Template, Artifact, Managed Dataset, Extent — clear definitions, what each concept is, why it exists, how they relate
  • Data lifecycle: the complete path from YAML template to published managed dataset, with explicit ownership boundaries
  • Sync kinds: temporal, release, and static — a decision table with when to use each, what happens on each sync, the trade-offs between append and rematerialize (a rough strategy mapping is sketched after this list)
  • Plugin contract: the download function signature — what the plugin must write vs. what the framework handles automatically
  • Transform pipeline: when transforms run, what they see, what they must not touch
  • GeoZarr root attributes: why they exist, who writes them (the framework, not the plugin), what breaks if they are absent
  • CRS handling: how the instance CRS is configured and how downloaded datasets are reprojected
  • Consequences: explicit write-up of the trade-offs from each design decision (single extent, no temporal gaps, append avoids re-downloading history)
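One plausible way the three sync kinds map to an update strategy, shown here as an editor's sketch for illustration; the mapping, enum, and helper names are assumptions rather than text from the doc:

from enum import Enum

class SyncKind(Enum):
    TEMPORAL = "temporal"  # growing time series
    RELEASE = "release"    # versioned source that is replaced wholesale
    STATIC = "static"      # fixed data, synced once

def update_strategy(kind: SyncKind) -> str:
    if kind is SyncKind.TEMPORAL:
        return "append"         # fetch only new periods; avoids re-downloading history
    if kind is SyncKind.RELEASE:
        return "rematerialize"  # rebuild the managed dataset from the new release
    return "none"               # static data is not re-synced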

What it does not cover

sync.kind: derived and sync.kind: remote are covered in #127, which depends on this PR. API endpoint reference, setup instructions, and provider-specific dataset details remain in their existing docs.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new developer-facing architecture reference and links it into the MkDocs navigation to help contributors understand Climate API core concepts, ingestion lifecycle, sync behavior, and plugin extension points.

Changes:

  • Adds docs/architecture.md describing core concepts (templates, artifacts, managed datasets, extent), lifecycle, sync kinds, and plugin contracts.
  • Documents ingestion/transform pipeline responsibilities and GeoZarr metadata expectations.
  • Wires the new document into the MkDocs Concepts nav.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mkdocs.yml | Adds the new Architecture page under Concepts in the docs nav. |
| docs/architecture.md | Introduces a platform architecture overview, including lifecycle and plugin contract documentation. |
Comments suppressed due to low confidence (7)

docs/architecture.md:37

  • This path is not always data/artifacts/records.json: the artifact index path is derived from data_dir in CLIMATE_API_CONFIG (e.g. ./data/artifacts/records.json) and otherwise falls back to an XDG data directory. Consider describing it as “{data_dir}/artifacts/records.json (or XDG fallback)” to match the implementation.
Artifacts are stored in `data/artifacts/records.json`. This is an internal implementation detail — consumers should never depend on artifact IDs or artifact paths directly.
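A minimal sketch (not from the doc or the codebase) of how that path resolution could look, assuming data_dir comes from CLIMATE_API_CONFIG and the XDG fallback works as the comment describes; the helper name and the climate-api subdirectory are hypothetical:

import os
from pathlib import Path

def artifact_index_path(data_dir: str | None) -> Path:
    # If data_dir is configured, the artifact index lives under it.
    if data_dir:
        return Path(data_dir) / "artifacts" / "records.json"
    # Otherwise fall back to an XDG data directory (assumed layout).
    xdg = os.environ.get("XDG_DATA_HOME", str(Path.home() / ".local" / "share"))
    return Path(xdg) / "climate-api" / "artifacts" / "records.json"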

docs/architecture.md:145

  • The download function contract here omits country_code, but the framework may pass a country_code kwarg when the provider function declares it (see downloader.download_dataset). To avoid misleading plugin authors, please include country_code: str as an optional parameter (similar to bbox) or document that additional optional kwargs like country_code may be provided based on the function signature.
def download(
    *,
    start: str,       # ISO 8601 date or datetime
    end: str,
    dirname: Path,    # write output files here
    prefix: str,      # use as filename prefix, e.g. f"{prefix}_{year}.nc"
    overwrite: bool,
    bbox: list[float],  # optional — only if the source needs a spatial filter
    **kwargs,           # default_params from the YAML template
) -> None:
    ...

docs/architecture.md:158

  • Transform functions can receive extra keyword parameters: the transforms pipeline forwards any YAML-declared params via func(ds, dataset, **params). The signature shown here ((ds, dataset)) will raise TypeError if template authors pass params. Please document that transforms should accept **kwargs or explicit keyword-only parameters.
def my_transform(ds: xr.Dataset, dataset: dict) -> xr.Dataset:
    # Receive the dataset after download, return a modified dataset.
    # Modify ds[dataset["variable"]] values and variable attributes.
    # Do not modify dataset-level ds.attrs — the framework manages those.
    ...
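For illustration, a minimal sketch (an editor's example, not the doc's text) of a transform written the way the comment suggests, accepting extra keyword params forwarded from the YAML template; the scale_factor parameter is hypothetical:

import xarray as xr

def my_transform(ds: xr.Dataset, dataset: dict, **params) -> xr.Dataset:
    var = dataset["variable"]
    scale = params.get("scale_factor", 1.0)  # hypothetical param declared in the template
    ds[var] = ds[var] * scale                # touch variable values, not dataset-level ds.attrs
    return ds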
docs/architecture.md:193

  • GeoZarr root attributes are not always stored in `zarr.json`: flat Zarr writes typically produce Zarr v2 metadata (`.zattrs`), while pyramid writes use Zarr v3 (`zarr.json`). The code explicitly supports reading attrs from either file. Please update this text to mention both layouts so readers don’t assume `zarr.json` will always exist.

Every zarr artifact must have GeoZarr root attributes for map rendering to work correctly. These are written into zarr.json at the store root:

docs/architecture.md:213

  • The instance CRS is configured via a top-level `crs:` key in `climate-api.yaml` (see `climate-api.yaml.example` and `climate_api.config.get_crs()`), not as `extent.crs`. This example/config description should be updated to match the actual configuration shape.

The instance CRS is configured in climate-api.yaml:

extent:
  name: Norway
  bbox: [3.0, 57.0, 32.0, 72.5]
  crs: EPSG:32633 # optional; defaults to EPSG:4326
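As an aside, a minimal sketch of the lookup the comment describes: a top-level crs key read from climate-api.yaml with an EPSG:4326 default. This is an assumption about what climate_api.config.get_crs() does, not its actual code:

import yaml

def get_crs(config_path: str = "climate-api.yaml") -> str:
    # Assumed behaviour: top-level "crs" key, defaulting to EPSG:4326 when absent.
    with open(config_path) as f:
        cfg = yaml.safe_load(f) or {}
    return cfg.get("crs", "EPSG:4326")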
docs/architecture.md:228

  • Artifact reuse/deduplication is described as matching an “overlapping time range”, but the implementation only reuses artifacts when `request_scope` matches exactly and temporal coverage matches the requested start (and end, unless the request is open-ended). Please adjust these bullets to reflect exact-scope reuse rather than overlap.

When a new ingestion request arrives, the framework checks whether an existing artifact already covers the requested scope:

  • same dataset_id
  • same bbox (from the configured extent)
  • overlapping time range
docs/architecture.md:248

  • Two internal function names in this guarantee table don’t match the code: chunk sizing is implemented via `_compute_time_space_chunks` (plus xarray auto-chunking), and artifact persistence is handled by `_store_artifact_record` / `_mutate_records` rather than `_store_artifact`. Either update the names or remove the specific helper references to avoid sending readers to non-existent functions.

| Guarantee | Implemented by |
| --- | --- |
| Coordinate name normalisation (laty, etc.) | build_dataset_zarr |
| Reprojection to instance CRS | reproject_to_instance_crs |
| Zarr chunking (auto-sized per period_type) | _compute_chunk_sizes |
| Multiscale pyramid generation (when dims > 2048×2048) | build_dataset_zarr |
| GeoZarr root attributes (spatial:bbox, proj:code) | build_dataset_zarr |
| Artifact coverage computation | _coverage_from_dataset |
| Artifact record persistence | _store_artifact |
| pygeoapi publication | publish_artifact_record if publish=true |


Comment thread on docs/architecture.md (outdated)
Contributor Author

@turban turban left a comment


Fixed in 8cb6f6f — built-ins now reference climate_api/data/datasets/ (package path via importlib.resources) and artifacts reference {data_dir}/artifacts/records.json with a note that data_dir is configured in climate-api.yaml.
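For readers unfamiliar with package-relative paths, a small sketch of how built-in templates might be located via importlib.resources, assuming they ship inside the climate_api package under data/datasets/; the iteration logic is illustrative, not the project's code:

from importlib.resources import files

# Traversable pointing at the packaged built-in templates directory.
builtin_templates = files("climate_api") / "data" / "datasets"
for entry in builtin_templates.iterdir():
    if entry.name.endswith((".yaml", ".yml")):
        print(entry.name)  # list the available built-in dataset templates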

@turban turban requested review from abyot, mortenoh and yambanso May 13, 2026 12:46
Collaborator

@yambanso yambanso left a comment


The documentation refers to "download function" throughout, but this conflates two separate concerns. The function is actually responsible for ingestion (fetching source data + writing NetCDF files), while the framework handles the true download coordination (HTTP fetching, caching, retries). Additionally, there's no clear pattern for reusing ingestion logic across multiple YAML templates with similar source patterns.

Contributor Author

turban commented May 13, 2026

Good points, both addressed:

Terminology: "download function" was a poor name — the function is responsible for ingestion (fetching source data and writing NetCDF files), while the framework owns the HTTP coordination, caching, and retries. Renamed to "ingestion function" throughout the doc. This also aligns with the YAML key, which is already ingestion.function.

Reuse pattern: the intended pattern is to reference the same Python function from multiple YAML templates and differentiate via default_params. Added an explicit example to the plugin contract section showing two ERA5-Land templates sharing one ingestion function but specifying different variable params. No framework changes are needed for a new variable from the same source.
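To make the pattern concrete, here is a minimal sketch of one ingestion function shared by several templates, where each template supplies a different variable via default_params. The function name, the variable param, and the retrieval step are assumptions for illustration only:

from pathlib import Path

def ingest_era5_land(
    *,
    start: str,
    end: str,
    dirname: Path,
    prefix: str,
    overwrite: bool,
    bbox: list[float] | None = None,
    **kwargs,  # default_params from the YAML template land here
) -> None:
    variable = kwargs["variable"]  # e.g. "2m_temperature" vs "total_precipitation"
    out = dirname / f"{prefix}_{start[:4]}.nc"
    if out.exists() and not overwrite:
        return
    # Fetch `variable` for [start, end] (optionally clipped to bbox) and write NetCDF to `out`.
    # The actual source client is provider-specific and intentionally omitted here.
    ...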

One more clarification added: the ingestion function is called identically by both POST /ingestions and POST /sync — the trigger makes no difference to the function. This was implied but not stated, and is worth making explicit for plugin authors.

Collaborator

@yambanso yambanso left a comment


I think on the reuse pattern we can go for code once and reuse many, whereby we have a single generic ingest function with the default parameters shared (specified) across multiple YAML templates.

Collaborator

@yambanso yambanso left a comment


I think it would be good to add the function signature for "lagged_latest_available".

Contributor Author

turban commented May 15, 2026

@yambanso the actual implementation of the ingestion functions might change if we follow this route: #64

See "Plugin contract" and examples of implementations below in the comments.
