
docs: architecture overview — concepts, lifecycle, sync kinds, plugin contracts #128

Open

turban wants to merge 5 commits into main from docs/platform-concepts

Conversation

Contributor

@turban turban commented May 13, 2026

Summary

Adds docs/architecture.md — a developer-facing reference that explains the platform's design, the relationships between its core concepts, and the consequences of key architectural decisions. Wired into the mkdocs nav under Concepts.

Why this document

Before merging feature work (especially new sync kinds from #127), we need a shared understanding of what the platform is, why it is structured this way, and what the rules are. This document is the foundation for that.

What it covers

  • Core concepts: Template, Artifact, Managed Dataset, Extent — clear definitions, what each concept is, why it exists, how they relate
  • Data lifecycle: the complete path from YAML template to published managed dataset, with explicit ownership boundaries
  • Sync kinds: temporal, release, and static — a decision table with when to use each, what happens on each sync, the trade-offs between append and rematerialize (a rough strategy mapping is sketched after this list)
  • Plugin contract: the download function signature — what the plugin must write vs. what the framework handles automatically
  • Transform pipeline: when transforms run, what they see, what they must not touch
  • GeoZarr root attributes: why they exist, who writes them (the framework, not the plugin), what breaks if they are absent
  • CRS handling: how the instance CRS is configured and how downloaded datasets are reprojected
  • Consequences: explicit write-up of the trade-offs from each design decision (single extent, no temporal gaps, append avoids re-downloading history)
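One plausible way the three sync kinds map to an update strategy, shown here as an editor's sketch for illustration; the mapping, enum, and helper names are assumptions rather than text from the doc:

from enum import Enum

class SyncKind(Enum):
    TEMPORAL = "temporal"  # growing time series
    RELEASE = "release"    # versioned source that is replaced wholesale
    STATIC = "static"      # fixed data, synced once

def update_strategy(kind: SyncKind) -> str:
    if kind is SyncKind.TEMPORAL:
        return "append"         # fetch only new periods; avoids re-downloading history
    if kind is SyncKind.RELEASE:
        return "rematerialize"  # rebuild the managed dataset from the new release
    return "none"               # static data is not re-synced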

What it does not cover

sync.kind: derived and sync.kind: remote are covered in #127, which depends on this PR. API endpoint reference, setup instructions, and provider-specific dataset details remain in their existing docs.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new developer-facing architecture reference and links it into the MkDocs navigation to help contributors understand Climate API core concepts, ingestion lifecycle, sync behavior, and plugin extension points.

Changes:

  • Adds docs/architecture.md describing core concepts (templates, artifacts, managed datasets, extent), lifecycle, sync kinds, and plugin contracts.
  • Documents ingestion/transform pipeline responsibilities and GeoZarr metadata expectations.
  • Wires the new document into the MkDocs Concepts nav.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mkdocs.yml | Adds the new Architecture page under Concepts in the docs nav. |
| docs/architecture.md | Introduces a platform architecture overview, including lifecycle and plugin contract documentation. |
Comments suppressed due to low confidence (7)

docs/architecture.md:37

  • This path is not always data/artifacts/records.json: the artifact index path is derived from data_dir in CLIMATE_API_CONFIG (e.g. ./data/artifacts/records.json) and otherwise falls back to an XDG data directory. Consider describing it as “{data_dir}/artifacts/records.json (or XDG fallback)” to match the implementation.
Artifacts are stored in `data/artifacts/records.json`. This is an internal implementation detail — consumers should never depend on artifact IDs or artifact paths directly.
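A minimal sketch (not from the doc or the codebase) of how that path resolution could look, assuming data_dir comes from CLIMATE_API_CONFIG and the XDG fallback works as the comment describes; the helper name and the climate-api subdirectory are hypothetical:

import os
from pathlib import Path

def artifact_index_path(data_dir: str | None) -> Path:
    # If data_dir is configured, the artifact index lives under it.
    if data_dir:
        return Path(data_dir) / "artifacts" / "records.json"
    # Otherwise fall back to an XDG data directory (assumed layout).
    xdg = os.environ.get("XDG_DATA_HOME", str(Path.home() / ".local" / "share"))
    return Path(xdg) / "climate-api" / "artifacts" / "records.json"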

docs/architecture.md:145

  • The download function contract here omits country_code, but the framework may pass a country_code kwarg when the provider function declares it (see downloader.download_dataset). To avoid misleading plugin authors, please include country_code: str as an optional parameter (similar to bbox) or document that additional optional kwargs like country_code may be provided based on the function signature.
def download(
    *,
    start: str,       # ISO 8601 date or datetime
    end: str,
    dirname: Path,    # write output files here
    prefix: str,      # use as filename prefix, e.g. f"{prefix}_{year}.nc"
    overwrite: bool,
    bbox: list[float],  # optional — only if the source needs a spatial filter
    **kwargs,           # default_params from the YAML template
) -> None:
    ...

docs/architecture.md:158

  • Transform functions can receive extra keyword parameters: the transforms pipeline forwards any YAML-declared params via func(ds, dataset, **params). The signature shown here ((ds, dataset)) will raise TypeError if template authors pass params. Please document that transforms should accept **kwargs or explicit keyword-only parameters.
def my_transform(ds: xr.Dataset, dataset: dict) -> xr.Dataset:
    # Receive the dataset after download, return a modified dataset.
    # Modify ds[dataset["variable"]] values and variable attributes.
    # Do not modify dataset-level ds.attrs — the framework manages those.
    ...
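For illustration, a minimal sketch (an editor's example, not the doc's text) of a transform written the way the comment suggests, accepting extra keyword params forwarded from the YAML template; the scale_factor parameter is hypothetical:

import xarray as xr

def my_transform(ds: xr.Dataset, dataset: dict, **params) -> xr.Dataset:
    var = dataset["variable"]
    scale = params.get("scale_factor", 1.0)  # hypothetical param declared in the template
    ds[var] = ds[var] * scale                # touch variable values, not dataset-level ds.attrs
    return ds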
docs/architecture.md:193

  • GeoZarr root attributes are not always stored in `zarr.json`: flat Zarr writes typically produce Zarr v2 metadata (`.zattrs`), while pyramid writes use Zarr v3 (`zarr.json`). The code explicitly supports reading attrs from either file. Please update this text to mention both layouts so readers don’t assume `zarr.json` will always exist.

Every zarr artifact must have GeoZarr root attributes for map rendering to work correctly. These are written into zarr.json at the store root:

docs/architecture.md:213

  • The instance CRS is configured via a top-level `crs:` key in `climate-api.yaml` (see `climate-api.yaml.example` and `climate_api.config.get_crs()`), not as `extent.crs`. This example/config description should be updated to match the actual configuration shape.

The instance CRS is configured in climate-api.yaml:

extent:
  name: Norway
  bbox: [3.0, 57.0, 32.0, 72.5]
  crs: EPSG:32633 # optional; defaults to EPSG:4326
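As an aside, a minimal sketch of the lookup the comment describes: a top-level crs key read from climate-api.yaml with an EPSG:4326 default. This is an assumption about what climate_api.config.get_crs() does, not its actual code:

import yaml

def get_crs(config_path: str = "climate-api.yaml") -> str:
    # Assumed behaviour: top-level "crs" key, defaulting to EPSG:4326 when absent.
    with open(config_path) as f:
        cfg = yaml.safe_load(f) or {}
    return cfg.get("crs", "EPSG:4326")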
docs/architecture.md:228

  • Artifact reuse/deduplication is described as matching an “overlapping time range”, but the implementation only reuses artifacts when `request_scope` matches exactly and temporal coverage matches the requested start (and end, unless the request is open-ended). Please adjust these bullets to reflect exact-scope reuse rather than overlap.

When a new ingestion request arrives, the framework checks whether an existing artifact already covers the requested scope:

  • same dataset_id
  • same bbox (from the configured extent)
  • overlapping time range
docs/architecture.md:248

  • Two internal function names in this guarantee table don’t match the code: chunk sizing is implemented via `_compute_time_space_chunks` (plus xarray auto-chunking), and artifact persistence is handled by `_store_artifact_record` / `_mutate_records` rather than `_store_artifact`. Either update the names or remove the specific helper references to avoid sending readers to non-existent functions.

| Guarantee | Implemented by |
| --- | --- |
| Coordinate name normalisation (laty, etc.) | build_dataset_zarr |
| Reprojection to instance CRS | reproject_to_instance_crs |
| Zarr chunking (auto-sized per period_type) | _compute_chunk_sizes |
| Multiscale pyramid generation (when dims > 2048×2048) | build_dataset_zarr |
| GeoZarr root attributes (spatial:bbox, proj:code) | build_dataset_zarr |
| Artifact coverage computation | _coverage_from_dataset |
| Artifact record persistence | _store_artifact |
| pygeoapi publication | publish_artifact_record if publish=true |


Comment thread on docs/architecture.md (outdated)
Contributor Author

@turban turban left a comment


Fixed in 8cb6f6f — built-ins now reference climate_api/data/datasets/ (package path via importlib.resources) and artifacts reference {data_dir}/artifacts/records.json with a note that data_dir is configured in climate-api.yaml.
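For readers unfamiliar with package-relative paths, a small sketch of how built-in templates might be located via importlib.resources, assuming they ship inside the climate_api package under data/datasets/; the iteration logic is illustrative, not the project's code:

from importlib.resources import files

# Traversable pointing at the packaged built-in templates directory.
builtin_templates = files("climate_api") / "data" / "datasets"
for entry in builtin_templates.iterdir():
    if entry.name.endswith((".yaml", ".yml")):
        print(entry.name)  # list the available built-in dataset templates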

@turban turban requested review from abyot, mortenoh and yambanso May 13, 2026 12:46
Collaborator

@yambanso yambanso left a comment


The documentation refers to "download function" throughout, but this conflates two separate concerns. The function is actually responsible for ingestion (fetching source data + writing NetCDF files), while the framework handles the true download coordination (HTTP fetching, caching, retries). Additionally, there's no clear pattern for reusing ingestion logic across multiple YAML templates with similar source patterns.

Contributor Author

turban commented May 13, 2026

Good points, both addressed:

Terminology: "download function" was a poor name — the function is responsible for ingestion (fetching source data and writing NetCDF files), while the framework owns the HTTP coordination, caching, and retries. Renamed to "ingestion function" throughout the doc. This also aligns with the YAML key, which is already ingestion.function.

Reuse pattern: the intended pattern is to reference the same Python function from multiple YAML templates and differentiate via default_params. Added an explicit example to the plugin contract section showing two ERA5-Land templates sharing one ingestion function but specifying different variable params. No framework changes are needed for a new variable from the same source.
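To make the pattern concrete, here is a minimal sketch of one ingestion function shared by several templates, where each template supplies a different variable via default_params. The function name, the variable param, and the retrieval step are assumptions for illustration only:

from pathlib import Path

def ingest_era5_land(
    *,
    start: str,
    end: str,
    dirname: Path,
    prefix: str,
    overwrite: bool,
    bbox: list[float] | None = None,
    **kwargs,  # default_params from the YAML template land here
) -> None:
    variable = kwargs["variable"]  # e.g. "2m_temperature" vs "total_precipitation"
    out = dirname / f"{prefix}_{start[:4]}.nc"
    if out.exists() and not overwrite:
        return
    # Fetch `variable` for [start, end] (optionally clipped to bbox) and write NetCDF to `out`.
    # The actual source client is provider-specific and intentionally omitted here.
    ...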

One more clarification added: the ingestion function is called identically by both POST /ingestions and POST /sync — the trigger makes no difference to the function. This was implied but not stated, and is worth making explicit for plugin authors.

Collaborator

@yambanso yambanso left a comment


I think on the reuse pattern we can go for code once and reuse many, whereby we have a single generic ingest function with the default parameters shared (specified) across multiple YAML templates.

Collaborator

@yambanso yambanso left a comment


I think it would be good to add the function signature for "lagged_latest_available".

Contributor Author

turban commented May 15, 2026

@yambanso the actual implementation of the ingestion functions might change if we follow this route: #64

See "Plugin contract" and examples of implementations below in the comments.
