From 0c00254b330091e16264a989b39e02b1a494734a Mon Sep 17 00:00:00 2001
From: Daniel McIlvaney
Date: Thu, 4 Jun 2026 16:37:16 -0700
Subject: [PATCH 1/4] docs: add schema migration rfc

---
 docs/developer/rfc/lazy-schema-migration.md | 316 ++++++++++++++++++++
 1 file changed, 316 insertions(+)
 create mode 100644 docs/developer/rfc/lazy-schema-migration.md

diff --git a/docs/developer/rfc/lazy-schema-migration.md b/docs/developer/rfc/lazy-schema-migration.md
new file mode 100644
index 00000000..3f6c8f74
--- /dev/null
+++ b/docs/developer/rfc/lazy-schema-migration.md
@@ -0,0 +1,316 @@
+# RFC 002: Lazy Schema Migration for Lock-File Fingerprints
+
+- **Status**: Draft
+- **Author**: @damcilva
+- **Created**: 2026-06-04
+- **Related code**:
+  - [`internal/fingerprint/fingerprint.go`](../../../internal/fingerprint/fingerprint.go) — `ComputeIdentity`, `combineInputs`
+  - [`internal/lockfile/lockfile.go`](../../../internal/lockfile/lockfile.go) — `ComponentLock`, version gate
+  - [`internal/projectconfig/fingerprint_test.go`](../../../internal/projectconfig/fingerprint_test.go) — field-inclusion audit
+  - [`internal/app/azldev/core/components/resolver.go`](../../../internal/app/azldev/core/components/resolver.go) — `computeFreshnessStatus`
+
+## Background
+
+### Lock files and fingerprints
+
+`azldev` tracks the resolved state of each component in a per-component lock file under `locks/.lock`.
+A lock pins the upstream commit and records a content **fingerprint** of every input that affects the component's build output:
+
+```go
+// internal/lockfile/lockfile.go
+type ComponentLock struct {
+	Version             int    // lock file FORMAT version, currently 1
+	ImportCommit        string // write-once fork point
+	UpstreamCommit      string // resolved upstream commit
+	ManualBump          int    // mass-rebuild counter
+	InputFingerprint    string // sha256 of all render inputs
+	ResolutionInputHash string // sha256 of upstream-resolution inputs
+}
+```
+
+The fingerprint is computed by [`fingerprint.ComputeIdentity`](../../../internal/fingerprint/fingerprint.go). Its core is a single structural hash of the resolved component config:
+
+```go
+// hashstructure walks every exported field of ComponentConfig.
+// Fields tagged `fingerprint:"-"` are excluded; everything else is included.
+configHash, err := hashstructure.Hash(component, hashstructure.FormatV2,
+	&hashstructure.HashOptions{TagName: "fingerprint"})
+```
+
+`configHash` is then folded together with the source identity, overlay file hashes, manual bump, and distro release version into a domain-separated SHA256 (`combineInputs`). Field inclusion is policed by [`TestAllFingerprintedFieldsHaveDecision`](../../../internal/projectconfig/fingerprint_test.go): every field of every fingerprinted struct must be consciously categorized as **included** (no tag) or **excluded** (`fingerprint:"-"`). The safe default is *included* — a new field contributes to the hash unless told otherwise.
+
+Drift is detected in [`resolver.go`](../../../internal/app/azldev/core/components/resolver.go): `computeFreshnessStatus` → `checkFingerprintFreshness` recomputes the identity and compares it to `InputFingerprint`, yielding `FreshnessCurrent` or `FreshnessStale`. `component update` ([`update.go`](../../../internal/app/azldev/cmds/component/update.go)) re-stamps the lock and flips a user-visible `Changed` flag whenever the fingerprint moves.
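
The folding step and the freshness comparison described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `fingerprint.go`: the `combine` function, its labels, its field order, and the sample inputs are all assumptions, and overlay file hashes are left out for brevity.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"strconv"
)

// Freshness mirrors the two outcomes named in the text.
type Freshness int

const (
	FreshnessCurrent Freshness = iota
	FreshnessStale
)

// combine is a hypothetical stand-in for combineInputs. Each input is
// written as label, NUL, 8-byte length, value, so that adjacent inputs
// cannot run together (a simple form of domain separation).
func combine(configHash uint64, sourceIdentity string, manualBump int, releaseVer string) string {
	h := sha256.New()
	put := func(label string, value []byte) {
		var n [8]byte
		binary.BigEndian.PutUint64(n[:], uint64(len(value)))
		h.Write([]byte(label))
		h.Write([]byte{0})
		h.Write(n[:])
		h.Write(value)
	}

	var cfg [8]byte
	binary.BigEndian.PutUint64(cfg[:], configHash)
	put("config", cfg[:])
	put("source", []byte(sourceIdentity))
	put("bump", []byte(strconv.Itoa(manualBump)))
	put("releasever", []byte(releaseVer))

	return "sha256:" + hex.EncodeToString(h.Sum(nil))
}

// checkFreshness compares the stored fingerprint with a fresh recomputation.
func checkFreshness(stored, recomputed string) Freshness {
	if stored == recomputed {
		return FreshnessCurrent
	}
	return FreshnessStale
}

func main() {
	stored := combine(42, "upstream@abc123", 0, "3.0")

	same := combine(42, "upstream@abc123", 0, "3.0")
	bumped := combine(42, "upstream@abc123", 1, "3.0")

	fmt.Println(checkFreshness(stored, same) == FreshnessCurrent) // true
	fmt.Println(checkFreshness(stored, bumped) == FreshnessStale) // true
}
```

Because the comparison is a plain string match against one stored hash, any change to how `configHash` is produced shows up as `FreshnessStale`, whether or not the component's real inputs moved.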
+
+### The three version axes
+
+As the tool matures, three *independent* notions of "version" are emerging. Conflating them is the source of the problems in this RFC:
+
+| Axis | Versions what | Lives where | Exists today? |
+| ---- | ------------- | ----------- | ------------- |
+| **Config schema version** | on-disk TOML field shape | load / migration layer | No |
+| **Fingerprint algorithm version** | how inputs fold into the hash | `fingerprint` combiner | No (implicitly v1) |
+| **Lock file format version** | lock file serialization | `lockfile` | Yes (`Version = 1`) |
+
+### The problem
+
+Because field inclusion defaults to *included*, **adding any new fingerprinted config field re-hashes every component**, even components that never set the field. `hashstructure` hashes a zero-value field identically to a present-but-empty field — but *differently* from a field that does not exist in the struct at all. So the moment the Go struct gains `Foo string`, every component's `configHash` changes, every `InputFingerprint` changes, and every `*.lock` shows drift on the next `component update`.
+
+Concretely: we add field `foo` and set `foo = "baz"` on package `bar`. The desired outcome is that **only** `bar.lock` drifts. The actual outcome today is that **all** lock files drift.
+
+**The root concern is git churn, not rebuilds.** The mass rebuild is a knock-on effect; the thing we actually want to protect is the **lock-file diff in a PR**. A change that touches one package should produce exactly one changed `*.lock` — ideally zero changed bytes in any other lock file, in any way. Lock files should change *only* when there is a real, per-component change. Clean diffs keep PRs reviewable, keep `git blame` meaningful, and make "this lock moved" a trustworthy signal that *that component's* inputs actually changed. The rebuild fan-out follows for free once the diffs are clean.
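
A toy hasher makes the failure mode concrete. `toyHash` below is a hypothetical stand-in that, like the current inclusion policy, hashes every field whether or not it is set; it is not `hashstructure`'s algorithm. `configBefore` and `configAfter` are invented types that model the same config struct before and after a `Foo` field is added.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"reflect"
)

// toyHash hashes the name and value of every field of a struct, zero or
// not. It only imitates the include-everything policy described above.
func toyHash(v interface{}) string {
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	h := sha256.New()
	for i := 0; i < rt.NumField(); i++ {
		fmt.Fprintf(h, "%s=%#v;", rt.Field(i).Name, rv.Field(i).Interface())
	}
	return hex.EncodeToString(h.Sum(nil))
}

// configBefore is the (hypothetical) config shape before the change.
type configBefore struct {
	Name string
}

// configAfter is the same config once the struct has gained Foo.
type configAfter struct {
	Name string
	Foo  string
}

func main() {
	old := toyHash(configBefore{Name: "bar"})
	unset := toyHash(configAfter{Name: "bar"})           // never sets Foo
	set := toyHash(configAfter{Name: "bar", Foo: "baz"}) // sets Foo

	fmt.Println("unset component drifted:", old != unset) // true
	fmt.Println("setter drifted:", old != set)            // true
}
```

Only the second line reflects a real input change. The first is the spurious drift this RFC wants to eliminate: the component's values are unchanged, but the hash moved because the struct shape did.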
+
+There is a harder variant lurking behind the additive case: **non-additive** schema changes — renaming a field, removing one, changing a baked-in default, or fixing a bug in the hashing logic itself. These legitimately change the *meaning* of the config without changing user intent, and we will eventually need to absorb them without forcing every consumer to rebuild.
+
+### Goals
+
+- **G1 (primary, non-functional): no spurious lock-file diffs.** Landing a config-schema or hashing change must not rewrite `*.lock` files for components whose effective inputs are unchanged — not even to bump a version field. Soft requirement (strongly preferred, not a hard gate), but it shapes which solutions are acceptable: it rules out any eager "migrate everything" pass.
+- **G2: only real changes drift.** A lock changes iff that component's build-effective inputs changed.
+- **G3: piecemeal, lazy migration.** Schema/algorithm evolution rolls out per-component, riding along with independent changes, never as a big-bang.
+- **G4: additive fields are drift-neutral by construction.** Adding an unset field should be invisible to every existing lock with no author effort beyond declaring intent.
+- **G5: correctness backstop preserved.** Never silently under-rebuild: a genuine input change must always drift its lock.
+
+## Problem inventory
+
+| # | Problem | Root cause | Severity |
+| - | ------- | ---------- | -------- |
+| 1 | Adding a config field drifts every lock, even unaffected components | Field inclusion defaults to *included*; zero-value ≠ absent in struct hash | Mass rebuild |
+| 2 | No way to land a semantically no-op schema change (rename/move) without drift | Fingerprint hashes raw struct shape, not normalized intent | Mass rebuild |
+| 3 | No way to evolve the hashing algorithm (bugfix, input reorder) without drift | `combineInputs` has no version; old and new outputs are incomparable | Mass rebuild + lock churn |
+| 4 | No on-disk config schema version | `ConfigFile` has a `$schema` URL but no version field | Blocks managed migration |
+| 5 | Migration is all-or-nothing | Freshness check is binary match/no-match against one stored hash | No piecemeal rollout |
+
+Problems 1–3 share a shape: a change that *should* be invisible to most components is forced to be visible to all of them, because the fingerprint cannot distinguish "input changed" from "encoding changed." Problem 4 is the missing primitive for managed config evolution. Problem 5 is the property we actually want from any solution — **per-component, lazy** migration, where a lock upgrades only when something independently touches it.
+
+## How fingerprinting works today (detail)
+
+```text
+ComponentConfig ──hashstructure(TagName:"fingerprint")──► configHash (uint64)
+                                                              │
+SourceIdentity ───────────────────────────────────────────┐   │
+OverlayFileHashes ────────────────────────────────────────┤   │
+ManualBump ───────────────────────────────────────────────┤   ▼
+ReleaseVer ───────────────────────────────────────────► combineInputs ──► "sha256:…" (InputFingerprint)
+```
+
+Two properties of `hashstructure` v2.0.2 are load-bearing for this RFC:
+
+1. **No per-field `omitempty`.** The only field tags it recognizes are `-`/`ignore` (skip) and `set`/`string` (encoding).
+   A zero-value field is hashed; it is not skipped.
+2. **It honors the `Includable` interface.** If the value (or a pointer to it) implements `HashInclude(field string, v interface{}) (bool, error)`, the walker calls it per field and omits the field when it returns `false`. **An omitted field hashes identically to a field that was never declared.** There is also a global `IgnoreZeroValue` option that skips *all* zero-value fields.
+
+The struct's type name *is* part of the hash (`hashstructure` mixes in `reflect.Type.Name()`), but that name does not change when fields are added, so it is irrelevant to drift.
+
+One constraint: the top-level value passed to `hashstructure.Hash` is not addressable, so an `Includable` implementation must use a **value receiver** to be seen for the root struct.
+
+## Change taxonomy
+
+Not every config change should be treated the same way. The right mechanism depends on what kind of change it is. This taxonomy drives the design.
+
+| Class | Example | Should unaffected locks drift? | Mechanism |
+| ----- | ------- | ------------------------------ | --------- |
+| **Additive field** | new `foo` field, unset on most components | No — only setters drift | Default omitempty (Layer 1); no version bump |
+| **Additive with non-zero default** | new field defaulted to `"auto"` via defaults merge | No | Algorithm version + replay (Layer 2) |
+| **Rename / move** | `foo` → `bar`, same semantics | No | Schema migration → canonical hash (Layer 3) + Layer 2 |
+| **Semantic change** | meaning of `foo` changes; output differs | Yes — that's correct | None; drift is intended |
+| **Hashing bugfix** | overlay ordering bug in `combineInputs` | No | Algorithm version + replay (Layer 2) |
+| **Field removal** | drop deprecated `foo` | No, if nobody set it | Migration drops field; Layer 2 for setters |
+
+The recurring requirement across the "No" rows is the same: **distinguish a change in user intent from a change in encoding, and only drift on the former.** Note the first row: with omitempty as the *default* (Layer 1), additive fields need no version bump and no replay at all — they are hash-neutral by construction. Layer 2 then carries only the genuinely hard cases (rows 2, 5).
+
+## Research
+
+### `hashstructure` options
+
+- **`Includable` (per-field callback)** keeps existing hashes byte-identical: fields that don't opt into omission hash exactly as they do today. This is the only option that solves Problem 1 *without* itself triggering a mass rebuild.
+- **`IgnoreZeroValue` (global)** is simpler to wire but flips the hash of *every* struct that has any zero-value field — i.e. it is itself a mass-rebuild event, and it removes our ability to say "this empty field is meaningful." Rejected for the default path.
+
+### How other tools version lock state
+
+- **Cargo (`Cargo.lock`)** carries an explicit `version = 4` at the top of the lock and teaches `cargo` to read older versions, upgrading in place on the next write.
+  Migration is lazy — touching the lock upgrades it.
+- **npm (`package-lock.json`)** uses `lockfileVersion` and supports reading v1/v2/v3, rewriting to the current version on install.
+- **Terraform state** stores a `version` and a `terraform_version`; state is upgraded forward on use, never downgraded.
+- **Go modules** avoid the problem entirely by hashing *content* (`h1:` dirhashes) rather than a struct shape, so adding metadata fields never perturbs existing sums.
+
+The common pattern: an **integer version stamped into the persisted artifact**, plus the ability to **read and replay older versions**, plus **lazy forward-migration on write**. Our `ComponentLock.Version` already provides the slot; today we only ever reject mismatches instead of migrating.
+
+### Where the hashing logic should live
+
+A natural question (raised during design) is whether to move hashing onto the config types as a method. The hashing logic decomposes into two separable jobs:
+
+1. **Pure config hash** — `hashstructure.Hash(component, …)` plus field-inclusion policy. This is genuinely *about the config type*; `HashInclude` is already a method on it.
+2. **Combiner / orchestration** — reads overlay file contents (needs `opctx.FS`), folds in source identity / releasever / bump, applies domain separation, and (Layer 2) selects an algorithm version. None of these are config fields.
+
+Moving (1) onto the type improves cohesion and version-locality. Moving (2) onto the type would drag I/O and cross-cutting algorithm versioning into `projectconfig` (a pure data package that `lockfile` imports), and would scatter the centralized field-inclusion audit. The combiner must own algorithm versioning because "I changed how overlays fold in" is not a per-type concern. **Recommendation: a hybrid seam** — expose `ComponentConfig.ConfigHash()` on the type; keep the combiner in `fingerprint`.
+
+## Proposed approach
+
+The design is **layered**, not a single switch.
+Each layer is independently shippable and addresses a distinct row of the taxonomy. Layers 1 and 2 cover the immediate need (Problems 1–3); Layer 3 is the forward-looking config-schema-version axis (Problem 4) and can follow later.
+
+### Layer 1 — Omitempty as the default inclusion policy
+
+Today the safe default is *include-always*: a new field contributes to the hash even at zero value. We **flip the default to omitempty** (include only when non-zero) and make the inclusion policy an explicit, exhaustive, CI-enforced choice per field.
+
+Every fingerprinted field must carry one of three `fingerprint` tag values:
+
+| Tag | Meaning | When to use |
+| --- | ------- | ----------- |
+| `fingerprint:"omitempty"` | included **only when non-zero** (the new default) | almost all fields |
+| `fingerprint:"always"` | included even at zero value | fields whose **zero value is build-meaningful** (e.g. a `bool` that defaults true, where `false` must rebuild) |
+| `fingerprint:"-"` | excluded from the hash entirely | paths, publish routing, runtime state |
+
+There is no untagged state. `TestAllFingerprintedFieldsHaveDecision` is rewritten to assert that **every** field of every fingerprinted struct carries a valid tag value — failing CI on any bare field. This is *simpler* than today's audit: it no longer maintains an `expectedExclusions` registry, it just checks for tag presence and validity. The conscious decision moves to the point of field definition, where the author has the context to judge whether zero is meaningful.
+
+Implement `Includable` on each fingerprinted struct, delegating to one shared helper:
+
+```go
+// includeFingerprintField reports whether a field participates in the hash.
+// "-" fields never reach here (hashstructure skips them first). "always" fields
+// are included unconditionally; "omitempty" (the default) is included only when
+// the resolved value is non-zero.
+func includeFingerprintField(t reflect.Type, field string, v reflect.Value) (bool, error) {
+	sf, ok := t.FieldByName(field)
+	if !ok {
+		return true, nil
+	}
+	switch sf.Tag.Get("fingerprint") {
+	case "always":
+		return true, nil
+	default: // "omitempty"
+		return !v.IsZero(), nil
+	}
+}
+
+// Value receiver: the root struct passed to hashstructure.Hash is not addressable.
+func (c ComponentConfig) HashInclude(field string, v interface{}) (bool, error) {
+	return includeFingerprintField(reflect.TypeOf(c), field, reflect.ValueOf(v))
+}
+```
+
+**Why flipping the default is safe — fingerprints see the resolved config.** The usual objection to blanket omitempty is the false-negative footgun: a field whose zero is meaningful gets omitted and collides with "unset," so two semantically different configs hash the same and a rebuild is missed. That objection assumes we hash *raw user input*. We do not. `ComputeIdentity` runs on the **resolved, post-merge** config (`*result.config`, after defaults are applied). The omit predicate is therefore "the *resolved value* equals Go-zero," not "the user didn't type it." Consequences:
+
+- Two configs that both resolve a field to zero build identically → hashing them the same is **correct**, not a collision.
+- "Unset" never reaches the hasher — it has already been resolved to its default. If the default is non-zero, the field is non-zero and is included anyway. If the default *is* zero, then unset and explicit-zero resolve identically → same build → same hash → correct.
+
+So the classic false-negative requires absence ≠ zero-default *at the point of hashing*, and post-merge resolution closes that gap. The load-bearing invariant is **G5's guarantee restated structurally: the fingerprint must see exactly the build-effective resolved config.** That invariant must already hold, or fingerprinting is broken independently of this change.
+The `fingerprint:"always"` escape hatch (plus the mandatory-tag audit) is cheap insurance against the invariant silently drifting later — e.g. if someone applies a default *after* fingerprinting.
+
+**Result:** additive fields are drift-neutral **by construction** (G4) — an unset field omits identically to a field that never existed, with no version bump and no replay. Only setters drift (G2). The cost is one tag per field (verbose but mechanical) and two genuine edge cases (see below).
+
+#### Edge cases under default omitempty
+
+- **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). Post-merge: unset → `4` (included), explicit `0` → `0` (omitted-by-omitempty). These build differently *and* hash differently, so there is no collision — they are consistent. Such fields rarely trigger omission at all because the default keeps them non-zero. Tag them `always` only if a zero value must be distinguishable from a future change of default.
+- **nil vs empty slice.** `reflect.Value.IsZero` on a slice is `IsNil`. A missing TOML key → nil → omitted; `key = []` → non-nil empty → included. Default omitempty thus makes nil-vs-empty a hash distinction that include-always collapses. Almost never observable, but it is a real behavioral edge; `always` forces both to hash.
+
+**Adopting this flip is itself a fingerprint-algorithm change** (every config's hash moves), so it does not land for free — it is absorbed by Layer 2's versioned replay rather than by rewriting locks. See Layer 2.
+
+### Layer 2 — Versioned fingerprint with lazy replay (algorithm and default changes)
+
+Stamp the algorithm version into the lock and teach the freshness check to **replay** older versions:
+
+1. Add `FingerprintVersion int` (`toml:"fingerprint-version,omitempty"`) to `ComponentLock`. Old locks read as `0`, the baseline, which replays with the v1 algorithm (`computeV1`). The lock **format** `Version` stays `1`; this is a *content* version and is fully backward compatible.
+2. Turn `ComputeIdentity` into a thin dispatcher over a small registry of historical compute functions, keyed by version. Keep the last *N* versions:
+
+   ```go
+   var fingerprinters = map[int]computeFn{
+   	1: computeV1, // current algorithm
+   	2: computeV2, // e.g. fixes overlay-ordering bug, or absorbs a new default
+   }
+   const currentFingerprintVersion = 2
+   ```
+
+3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if `lock.FingerprintVersion < current`, recompute at the lock's recorded version (an absent or `0` version replays as v1). If *that* matches the stored hash, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`.
+4. `component update` always stamps `FingerprintVersion = current` when it writes. Migration is therefore **lazy and per-component**: a lock upgrades only when something independently touches it.
+
+This resolves Problems 2 (for default changes), 3 (hashing bugfixes), and 5 (piecemeal rollout). It is the same lazy-forward-migration pattern Cargo/npm use, specialized to a content hash.
+
+#### Churn-avoidance policies (G1)
+
+The version stamp is itself a potential source of spurious diffs — the exact thing G1 forbids. Two policies keep it invisible until a real change forces a write:
+
+- **`fingerprint-version` is `omitempty` in TOML.** A baseline (`version 0/absent`) lock that is never otherwise touched never materializes the field, so its bytes stay identical. The field only appears in a lock that was *already* being rewritten for an independent reason. Existing checked-in locks therefore produce **zero diff** on the day this lands.
+- **Re-stamp only on a real write; never write to advance the version.** The "silent re-stamp" in step 3 is *piggybacked* onto a write that is already happening — it must never be its own trigger.
+  `component update` must keep its existing write-on-change guard: if nothing else changed, the version bump alone does **not** dirty the lock. (Concretely, the equivalent of `if !result.Changed && !resHashChanged { return false, nil }` stays in force; the re-stamp rides the `Changed` path, it does not create one.)
+
+Together these make migration strictly opportunistic: a lock advances its version the next time its component changes for real, and not one commit sooner.
+
+#### First concrete use: the Layer 1 switchover
+
+Flipping the inclusion default to omitempty (Layer 1) moves every config's hash, so it cannot ship as a free additive change — it is **Layer 2's first real customer.** It registers as `computeV2` (omitempty default) alongside `computeV1` (include-always), bumps `currentFingerprintVersion`, and is absorbed by replay: every existing lock recomputes clean at v1, is recognized as unchanged-inputs, and re-stamps to v2 *only when next written* per the churn policy above. No mass regen, no flag day. And because omitempty makes all future additive changes hash-neutral by construction (G4), it permanently **shrinks** the set of changes that need a Layer 2 version event at all — Layer 1 is both the first user of Layer 2 and the thing that reduces Layer 2's future workload.
+
+### Layer 3 — Config schema version and canonical migration (future)
+
+This is the on-disk TOML axis. It is **independent** of the fingerprint axis and only needed once we make *non-additive* TOML changes (rename/move/remove fields in the file format itself).
+
+1. Add an explicit `schema-version` to the config file (distinct from the existing `$schema` URL, which is for editor validation).
+2. At **load time**, migrate older config shapes forward into the single latest canonical struct *before* anything hashes them. Fingerprinting stays blissfully unaware of file-format history.
+3. Pair with the **hybrid seam**: expose `ComponentConfig.ConfigHash()` on the type (pure struct hash + inclusion policy); keep the combiner in `fingerprint`.
+
+The critical invariant: **migrate old TOML → latest canonical struct, then hash once.** A semantically no-op migration (rename `foo`→`bar`) must produce the *same* canonical struct, hence the same hash, hence no drift — handled by Layer 2's replay only if the *encoding* changed, and by Layer 3's normalization for the *file shape*. Do **not** keep parallel `V1.Hash()`/`V2.Hash()` methods on versioned structs: that couples the lock to a Go type identity instead of a simple integer, and forces two independent code paths to agree on a hash forever.
+
+### Layer interaction
+
+```text
+TOML on disk ──Layer 3: migrate to canonical struct──► ComponentConfig
+                                                              │
+                        Layer 1: HashInclude omits zero fields (default omitempty)
+                                                              ▼
+                        Layer 2: ComputeIdentity[version] ──► InputFingerprint
+                                                              │
+                                            lazy replay + re-stamp on update
+                                                              ▼
+                                                         locks/.lock
+```
+
+## Design decisions
+
+### D1 — `Includable` vs `IgnoreZeroValue`
+
+Both omit zero values; the difference is **control granularity and escape hatches.**
+
+| | `Includable` per-field (chosen) | `IgnoreZeroValue` global |
+| --- | --- | --- |
+| Meaningful empties | Preserved via `fingerprint:"always"` | Lost — no opt-out |
+| Per-field intent | Explicit, CI-audited | Invisible |
+| Wiring | One helper + value-receiver method per struct | One option flag |
+
+`IgnoreZeroValue` is a blunt global switch with no way to keep a build-meaningful zero. `Includable` gives the same default behavior **plus** the `always` escape hatch and a point-of-definition audit. Both move every hash once on adoption — that cost is absorbed by Layer 2 either way (see the switchover note), so it is not a differentiator.
+
+### D2 — Mandatory explicit tags, default omitempty
+
+Every fingerprinted field must carry `fingerprint:"-"`, `"omitempty"`, or `"always"` — there is no untagged state.
+Rationale:
+
+- The *unsafe* failure direction is the false-negative (a meaningful field omitted → missed rebuild). Defaulting to omitempty tilts toward that direction, so the safety check must be loud, not implicit.
+- A mandatory tag forces the "is this field's zero value build-meaningful?" decision **at the point of definition**, where the author has the context — better locality than a far-away exclusions registry.
+- It *simplifies* the audit: assert every field has a valid tag value; delete the `expectedExclusions` map entirely.
+
+Fully implicit (omitempty default, no tags, no audit) was rejected — it removes the only guard against the unsafe direction. `fingerprint:"omitempty"` mirrors Go's own `json:",omitempty"`; `"always"` and `"-"` read unambiguously alongside it.
+
+### D3 — Content version vs format version in the lock
+
+Reusing `ComponentLock.Version` for the algorithm would force a format-version bump (and the strict `Parse` gate would reject old locks outright). A separate `FingerprintVersion` keeps the format stable and old locks readable, enabling lazy migration instead of hard rejection.
+
+### D4 — Method-on-type hashing
+
+Adopt the **hybrid seam**: pure `ConfigHash()` on the config type, combiner in `fingerprint`. A full move was rejected (layering regression: I/O + crypto + algorithm versioning do not belong on a data type). See [Research](#where-the-hashing-logic-should-live).
+
+## Alternatives considered
+
+- **Global `IgnoreZeroValue`** — see D1. Same default behavior but no per-field escape hatch for meaningful zeros and no point-of-definition audit. Rejected.
+- **Implicit omitempty (no mandatory tags, no audit)** — see D2. Removes the only guard against the unsafe false-negative direction. Rejected in favor of mandatory 3-way tags.
+- **Content-hash the rendered config** (Go-modules style) instead of struct-hashing — would sidestep field-shape sensitivity, but we deliberately exclude many fields (`paths`, `publish`, snapshots) from the fingerprint, so a blanket content hash over-captures. Rejected.
+- **Parallel versioned structs with per-struct `Hash()`** — couples locks to Go type identity and duplicates hashing logic per version. Rejected in favor of Layer 2's integer-versioned combiner + Layer 3 canonical migration.
+- **Bump lock format `Version` and migrate eagerly** — eager migration rewrites every lock at once, the exact mass-churn we are trying to avoid. Rejected in favor of lazy per-component re-stamp.
+
+## Incremental delivery
+
+1. **PR A (Layer 1)**: shared `includeFingerprintField` helper + `HashInclude` on `ComponentConfig` and `PackageConfig`; tag every fingerprinted field with one of `-`/`omitempty`/`always`; rewrite the field-decision audit to assert valid-tag presence and drop the `expectedExclusions` registry. **Note:** flipping the default moves every hash, so PR A must land *with or after* PR B's version machinery — it registers as `computeV2`, not as a standalone change. Unit test: an unset `omitempty` field is hash-invisible; setting it drifts; an `always` field drifts even at zero.
+2. **PR B (Layer 2)**: `FingerprintVersion` on `ComponentLock`; version-dispatched `ComputeIdentity`; replay + re-stamp in `checkFingerprintFreshness` and `update.go`. Unit test: old-version lock with unchanged inputs → `Current`; changed inputs → `Stale`; re-stamp on update.
+3. **PR C (validation)**: scenario test (in the style of `scenario/component_changed_test.go`) — set a new `omitempty` field on a single component and assert only that lock drifts.
+4. **PR D (Layer 3, later)**: `schema-version` field, load-time canonical migration, `ComponentConfig.ConfigHash()` seam. Gated on the first real non-additive TOML change.
+
+Each PR is independently revertible.
+Because the Layer 1 default flip is a hash-moving change, PRs A and B ship together (or B first); the `fingerprint-version` omitempty stamp and churn policies ensure existing locks see zero diff until independently touched. Layer 3 migrates lazily on next write.
+
+## Open questions
+
+1. How many historical fingerprint versions should the registry retain before dropping the oldest? (Trade-off: replay coverage vs. dead code.)
+2. Should a lazy re-stamp during a *read-only* command (`render`, `build` freshness check) write the lock back, or defer all writes to `component update`? Writing on read is surprising; deferring means freshness checks stay slightly slower until the next update.
+3. For Layer 3, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration.
+4. Should `omitempty` semantics use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? Pointers would make "set to empty" expressible but complicate the structs.
+5. Do we want a `component update --rehash` escape hatch that force-advances `FingerprintVersion` across the whole project (for when a change *is* intended to be global)?
+6. Can the audit go further than tag-presence and *statically* flag fields whose zero value is likely meaningful (e.g. a `bool` defaulting true) and nudge toward `always`? Or is the point-of-definition tag plus code review sufficient?
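
Open question 4 and the nil-vs-empty-slice edge case both turn on what `reflect.Value.IsZero` counts as zero. The sketch below only prints its verdict for a few representative values; `isZero` is a hypothetical helper for illustration, not part of the proposed code.

```go
package main

import (
	"fmt"
	"reflect"
)

// isZero reports what an IsZero-based omitempty predicate would see for v.
// v must be a typed value; an untyped nil has no reflect.Value to inspect.
func isZero(v interface{}) bool {
	return reflect.ValueOf(v).IsZero()
}

func main() {
	var nilSlice []string
	emptySlice := []string{}

	var nilPtr *string
	empty := ""
	ptrToEmpty := &empty

	fmt.Println("nil slice:       ", isZero(nilSlice))   // true  -> omitted
	fmt.Println("empty slice:     ", isZero(emptySlice)) // false -> included
	fmt.Println("empty string:    ", isZero(empty))      // true  -> omitted
	fmt.Println("nil *string:     ", isZero(nilPtr))     // true  -> omitted
	fmt.Println("ptr to \"\":       ", isZero(ptrToEmpty)) // false -> included
	fmt.Println("false bool:      ", isZero(false))      // true  -> omitted
}
```

A plain `string` field cannot distinguish "unset" from "set to empty" under this predicate, while a `*string` field can, at the cost of pointer fields in the config structs. That is the trade-off the question asks about.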
From 78870c3e39bdc3a9575d01a632b66ff6db4cc361 Mon Sep 17 00:00:00 2001
From: Daniel McIlvaney
Date: Fri, 5 Jun 2026 10:17:49 -0700
Subject: [PATCH 2/4] update

---
 docs/developer/rfc/lazy-schema-migration.md | 177 ++++++++++++++++----
 1 file changed, 147 insertions(+), 30 deletions(-)

diff --git a/docs/developer/rfc/lazy-schema-migration.md b/docs/developer/rfc/lazy-schema-migration.md
index 3f6c8f74..af24f6da 100644
--- a/docs/developer/rfc/lazy-schema-migration.md
+++ b/docs/developer/rfc/lazy-schema-migration.md
@@ -7,7 +7,9 @@
   - [`internal/fingerprint/fingerprint.go`](../../../internal/fingerprint/fingerprint.go) — `ComputeIdentity`, `combineInputs`
   - [`internal/lockfile/lockfile.go`](../../../internal/lockfile/lockfile.go) — `ComponentLock`, version gate
   - [`internal/projectconfig/fingerprint_test.go`](../../../internal/projectconfig/fingerprint_test.go) — field-inclusion audit
-  - [`internal/app/azldev/core/components/resolver.go`](../../../internal/app/azldev/core/components/resolver.go) — `computeFreshnessStatus`
+  - [`internal/app/azldev/core/components/resolver.go`](../../../internal/app/azldev/core/components/resolver.go) — `computeFreshnessStatus`, `BuildDirtyChange`
+  - [`internal/app/azldev/cmds/component/update.go`](../../../internal/app/azldev/cmds/component/update.go) — `Changed` decision, re-stamp write
+  - [`internal/app/azldev/core/sources/synthistory.go`](../../../internal/app/azldev/core/sources/synthistory.go) — `FindFingerprintChanges` (synthetic changelog/release)
 
 ## Background
 
@@ -47,7 +49,7 @@ As the tool matures, three *independent* notions of "version" are emerging. Conf
 | Axis | Versions what | Lives where | Exists today? |
 | ---- | ------------- | ----------- | ------------- |
 | **Config schema version** | on-disk TOML field shape | load / migration layer | No |
-| **Fingerprint algorithm version** | how inputs fold into the hash | `fingerprint` combiner | No (implicitly v1) |
+| **Lock content-hash version** | how inputs fold into the lock's stored hashes (`InputFingerprint` *and* `ResolutionInputHash`) | `fingerprint` combiner | No (implicitly v1) |
 | **Lock file format version** | lock file serialization | `lockfile` | Yes (`Version = 1`) |
 
 ### The problem
@@ -158,6 +160,8 @@ Every fingerprinted field must carry one of three `fingerprint` tag values:
 
 There is no untagged state. `TestAllFingerprintedFieldsHaveDecision` is rewritten to assert that **every** field of every fingerprinted struct carries a valid tag value — failing CI on any bare field. This is *simpler* than today's audit: it no longer maintains an `expectedExclusions` registry, it just checks for tag presence and validity. The conscious decision moves to the point of field definition, where the author has the context to judge whether zero is meaningful.
 
+**`Includable` is resolved per-struct — every fingerprinted struct needs the method.** `hashstructure` looks up `Includable` on each struct it walks (and the whole tree is non-addressable, since the root is passed by value), so a `HashInclude` on `ComponentConfig` alone governs only `ComponentConfig`'s own fields. On any nested struct that lacks its own value-receiver `HashInclude`, the `omitempty`/`always` tags are **decorative** — `hashstructure` natively understands only `-`/`ignore`/`set`/`string`, so the tag passes the CI audit while the field is still hashed at zero, and G4 silently holds only at the top level.
The audit (`fingerprint_test.go` registers ~10 fingerprinted structs: `ComponentConfig`, `ComponentBuildConfig`, `CheckConfig`, `PackageConfig`, `ComponentOverlay`, `SpecSource`, `DistroReference`, `SourceFileReference`, `ReleaseConfig`, `ComponentRenderConfig`) must therefore **also assert that every registered struct implements `Includable`** — so a new fingerprinted struct cannot ship with inert tags. All registered structs get the one-line delegating method. + Implement `Includable` on each fingerprinted struct, delegating to one shared helper: ```go @@ -165,7 +169,7 @@ Implement `Includable` on each fingerprinted struct, delegating to one shared he // "-" fields never reach here (hashstructure skips them first). "always" fields // are included unconditionally; "omitempty" (the default) is included only when // the resolved value is non-zero. -func includeFingerprintField(t reflect.Type, field string, v reflect.Value) (bool, error) { +func includeFingerprintField(t reflect.Type, field string, val reflect.Value) (bool, error) { sf, ok := t.FieldByName(field) if !ok { return true, nil @@ -174,13 +178,20 @@ func includeFingerprintField(t reflect.Type, field string, v reflect.Value) (boo case "always": return true, nil default: // "omitempty" - return !v.IsZero(), nil + return !val.IsZero(), nil } } // Value receiver: the root struct passed to hashstructure.Hash is not addressable. +// +// CRITICAL: hashstructure calls HashInclude(field, innerV) where innerV is +// ALREADY a reflect.Value (the field's value), boxed into the interface{}. +// So we must TYPE-ASSERT it, not reflect.ValueOf it. reflect.ValueOf(v) would +// describe the reflect.Value struct itself (always non-zero) → !IsZero() always +// true → omitempty silently never fires and Layer 1 no-ops. Verified against +// hashstructure v2.0.2 hashstructure.go:346 (`include.HashInclude(name, innerV)`). 
func (c ComponentConfig) HashInclude(field string, v interface{}) (bool, error) { - return includeFingerprintField(reflect.TypeOf(c), field, reflect.ValueOf(v)) + return includeFingerprintField(reflect.TypeOf(c), field, v.(reflect.Value)) } ``` @@ -196,42 +207,110 @@ So the classic false-negative requires absence ≠ zero-default *at the point of #### Edge cases under default omitempty - **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). Post-merge: unset → `4` (included), explicit `0` → `0` (omitted-by-omitempty). These build differently *and* hash differently, so there is no collision — they are consistent. Such fields rarely trigger omission at all because the default keeps them non-zero. Tag them `always` only if a zero value must be distinguishable from a future change of default. -- **nil vs empty slice.** `reflect.Value.IsZero` on a slice is `IsNil`. A missing TOML key → nil → omitted; `key = []` → non-nil empty → included. Default omitempty thus makes nil-vs-empty a hash distinction that include-always collapses. Almost never observable, but it is a real behavioral edge; `always` forces both to hash. +- **nil vs empty slice.** `reflect.Value.IsZero` on a slice is `IsNil`. A missing TOML key → nil → omitted; `key = []` → non-nil empty → included. Default omitempty thus makes nil-vs-empty a hash distinction that include-always collapses. Almost never observable — but a TOML formatter that strips empty arrays (or any round-trip that maps `[]`→absent) would flip hashes. **Tag rule: for any slice/map field where an explicit-empty value is reachable and build-meaningful, prefer `fingerprint:"always"`** so nil and empty both hash and the distinction can't silently move a fingerprint. **Adopting this flip is itself a fingerprint-algorithm change** (every config's hash moves), so it does not land for free — it is absorbed by Layer 2's versioned replay rather than by rewriting locks. See Layer 2. 
-### Layer 2 — Versioned fingerprint with lazy replay (algorithm and default changes) +### Layer 2 — Versioned lock content with lazy replay (algorithm and default changes) -Stamp the algorithm version into the lock and teach the freshness check to **replay** older versions: +Stamp one **lock content-hash version** into the lock and teach the freshness check to **replay** older versions. The version governs *both* stored hashes (`InputFingerprint` and `ResolutionInputHash`) — they live in one lock, share one write event, and a single integer is the natural fit (see [scope note](#both-hashes-share-one-version) for why one version, not two): -1. Add `FingerprintVersion int` (`toml:"fingerprint-version,omitempty"`) to `ComponentLock`. Old locks read as `0` = baseline. The lock **format** `Version` stays `1`; this is a *content* version and is fully backward compatible. -2. Turn `ComputeIdentity` into a thin dispatcher over a small registry of historical compute functions, keyed by version. Keep the last *N* versions: +1. Add `LockContentVersion int` (`toml:"lock-content-version,omitempty"`) to `ComponentLock`. **An absent field reads as `1`** — the current, pre-RFC algorithms — *not* `0`. (`0` is the Go zero value but no `v0` exists; map the zero to the baseline at read time: `ver := lock.LockContentVersion; if ver == 0 { ver = 1 }`.) The lock **format** `Version` stays `1`; this is a *content* version and is fully backward compatible. +2. Turn the combiner into a thin dispatcher over a small registry of historical algorithms, keyed by version. Each entry pairs the two compute functions; when only one algorithm changes, the other slot **reuses** the prior function (no version-neutral hash moves for the untouched one). Keep versions back to a declared floor (see [Registry floor](#registry-floor-and-forced-migration)): ```go - var fingerprinters = map[int]computeFn{ - 1: computeV1, // current algorithm - 2: computeV2, // e.g. 
fixes overlay-ordering bug, or absorbs a new default + type lockAlgo struct { + fingerprint computeFn // produces InputFingerprint + resolution resolveFn // produces ResolutionInputHash + } + var lockAlgos = map[int]lockAlgo{ + 1: {computeFP1, computeRes1}, // current (pre-RFC) algorithms — the implicit baseline + 2: {computeFP2, computeRes1}, // omitempty default (Layer 1); resolution UNCHANGED → reuse v1 fn } - const currentFingerprintVersion = 2 + const currentLockContentVersion = 2 + const minSupportedLockContentVersion = 1 ``` -3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if `lock.FingerprintVersion < current`, recompute at the lock's recorded version. If *that* matches the stored hash, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`. -4. `component update` always stamps `FingerprintVersion = current` when it writes. Migration is therefore **lazy and per-component**: a lock upgrades only when something independently touches it. +3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if the lock's recorded version `< current`, recompute at the lock's recorded version. If *that* matches the stored hash, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`. (Phase 1 wires this for the fingerprint hash; the resolution hash reuses `computeRes1` until its algorithm first changes — see scope note.) +4. `component update` stamps `LockContentVersion = current` **only when it is already writing for an independent reason** (see the churn policy below). Migration is therefore **lazy and per-component**: a lock upgrades only when something independently touches it. This resolves Problems 2 (for default changes), 3 (hashing bugfixes), and 5 (piecemeal rollout). 
It is the same lazy-forward-migration pattern Cargo/npm use, specialized to a content hash. +#### Both hashes share one version + +`ComponentLock` carries two persisted content hashes: `InputFingerprint` (render inputs, via `hashstructure` + `Includable`) and `ResolutionInputHash` (upstream-resolution inputs — a flat SHA256 over seven explicit fields in `ComputeResolutionHash`, *not* a struct walk, so the omitempty/`Includable` story does not apply to it). Both have the **same evolution problem**: appending an input or reordering the fold moves every lock's hash → G1 churn. + +We version them with **one shared integer**, not two axes, because: they co-locate in a single lock, they are written in the same `update` pass, and a paired registry lets either evolve independently while the other reuses its prior function. Two separate version fields would double the floor/replay/`--rehash` machinery for an input set (`ResolutionInputHash`) that changes rarely — YAGNI. + +**Phasing.** Naming the field `lock-content-version` *now* is the one expensive-to-reverse decision (it is baked into the on-disk TOML schema the moment Layer 2 ships; renaming a persisted key is itself a migration). The fingerprint replay is wired in the first Layer 2 PR. **Resolution-hash replay is reserved, not yet wired** — the registry slot exists and `computeRes1` is reused, so the day `ComputeResolutionHash` first changes we add `computeRes2` and extend replay to its one comparison site (`checkResolutionFreshness` + the `resHashChanged` silent-write guard in `update.go`), with no schema change. Critically, `ResolutionInputHash` does **not** feed the synthetic changelog path, so its churn is a one-line lock rewrite + a wasted re-resolution, never a phantom release (unlike `InputFingerprint`; see [Downstream consumers](#downstream-fingerprint-consumers-blast-radius)). + #### Churn-avoidance policies (G1) The version stamp is itself a potential source of spurious diffs — the exact thing G1 forbids. 
Two policies keep it invisible until a real change forces a write: -- **`fingerprint-version` is `omitempty` in TOML.** A baseline (`version 0/absent`) lock that is never otherwise touched never materializes the field, so its bytes stay identical. The field only appears in a lock that was *already* being rewritten for an independent reason. Existing checked-in locks therefore produce **zero diff** on the day this lands. -- **Re-stamp only on a real write; never write to advance the version.** The "silent re-stamp" in step 3 is *piggybacked* onto a write that is already happening — it must never be its own trigger. `component update` must keep its existing write-on-change guard: if nothing else changed, the version bump alone does **not** dirty the lock. (Concretely, the equivalent of `if !result.Changed && !resHashChanged { return false, nil }` stays in force; the re-stamp rides the `Changed` path, it does not create one.) +- **`lock-content-version` is `omitempty` in TOML.** A baseline (absent / version `1`) lock that is never otherwise touched never materializes the field, so its bytes stay identical. The field only appears in a lock that was *already* being rewritten for an independent reason. Existing checked-in locks therefore produce **zero diff** on the day this lands. +- **The `Changed` decision must replay *before* it compares — this is the subtle seam.** The naive read of the existing guard `if !result.Changed && !resHashChanged { return false, nil }` suggests the re-stamp harmlessly "rides the `Changed` path." **It does not.** In [`update.go`](../../../internal/app/azldev/cmds/component/update.go), `result.Changed` is set to `true` the instant `lock.InputFingerprint != identity.Fingerprint` — and `identity` is computed at the *current* version. That comparison sits **upstream** of the write guard. 
So after the v1→v2 switchover, the current-version hash differs from every stored v1 hash, `Changed` flips for ~every component, and we get exactly the mass auto-release-bump + mass lock rewrite G1 forbids. The fix is mandatory, not incidental: + + ```go + // Replay at the lock's recorded version BEFORE deciding Changed. + lockVer := lock.LockContentVersion + if lockVer == 0 { + lockVer = 1 + } + replayed, _ := fingerprint.ComputeIdentityAt(lockVer, *result.config, releaseVer, opts) + if lock.InputFingerprint != replayed.Fingerprint { + result.Changed = true // a REAL input change under the lock's own algorithm + } + // else: hashes match under the old algorithm → inputs unchanged, only the + // algorithm moved → NOT Changed. Advance the version only if some other real + // change is already dirtying this lock. + lock.InputFingerprint = identity.Fingerprint // current-version hash + if result.Changed { // re-stamp piggybacks a real write; never its own trigger + lock.LockContentVersion = currentLockContentVersion + } + ``` + + The principle: **"changed?" is judged under the lock's own algorithm version; the stored hash is only upgraded to the current version when the lock is already dirty for a real reason.** (When resolution replay is wired, the same replay-before-compare applies to the `resHashChanged` silent-write guard.) Together these make migration strictly opportunistic: a lock advances its version the next time its component changes for real, and not one commit sooner. +#### Registry floor and forced migration + +Lazy migration means an untouched lock can sit at an old version **indefinitely** (G3 by design). That makes "keep the last *N* versions" a **correctness cliff, not a tuning knob**: if pruning drops the compute function a lock still depends on, replay becomes impossible → forced `FreshnessStale` → the mass rebuild/rewrite (and, via the downstream-consumer analysis below, mass changelog churn) the whole design exists to avoid. 
So the floor must be explicit and paired with an escape hatch, decided now: + +- **`minSupportedLockContentVersion`** is a hard floor. A lock below it cannot be replayed and is treated as `Stale`. Dropping a registry entry is therefore a deliberate, breaking, announced act — never incidental cleanup. +- **`component update --rehash`** (Open Q#5, promoted to a requirement) force-advances every lock to the current version in one deliberate pass. This is the *only* sanctioned way to retire an old version: rehash the fleet first (one intentional, reviewed, fleet-wide commit), then raise the floor. Note this pass is a deliberate G1 exception — it *is* the eager migration G1 normally forbids, made safe by being explicit and operator-driven rather than a silent side effect. + +**Mixed-toolchain hazard.** `go-toml` silently drops unknown fields, so an *older* azldev binary that rewrites a lock a newer binary had stamped will strip `lock-content-version`, regressing it to the baseline. On the next new-binary run the stored (baseline-replayed) hash won't match the current algorithm → spurious `Changed` + bump. This is the classic down-migration trap. Mitigation is a documented invariant ("all writers of a given `locks/` tree must be ≥ the version that tree was last stamped at"), enforced in CI by pinning the azldev version; a hard guard (refuse to write a lock whose on-disk version exceeds the binary's `currentLockContentVersion`) is a possible belt-and-suspenders. + +#### Replaying across a changed input set — `{a,b,c}` → `{a,b,d}` + +A lock stores **one opaque hash string** plus its `LockContentVersion`; it does *not* store the individual inputs. 
So when the measured set changes — say the fingerprint stops measuring `c` and starts measuring `d` — an existing lock (whose stored hash was computed over `{a,b,c}` at v1) is reconciled the only way an opaque hash allows: **recompute and compare, at the lock's own version.** + +Split the change into its two halves; they are handled independently: + +- **Adding `d`** is the additive case — `d` is tagged `omitempty`, so for any component that doesn't set it the hash is byte-identical (G4). Free. No version bump. +- **Dropping `c`** is what forces the version bump, and it is reconciled by replay: + 1. `computeFP2` (measures `{a,b,d}`) ≠ stored hash → mismatch. + 2. lock version (1) < current (2) → **replay `computeFP1`** (still measures `{a,b,c}`). + 3. v1-replay == stored hash? **Yes** → `a,b,c` unchanged since the lock was written; only the *measurement* evolved → `FreshnessCurrent`, lazy re-stamp. **No** → a real input moved → `Stale`, rebuild. Both correct. + +So the bump is **not breaking**: replay answers "were the *old* inputs unchanged?" without rebuilding. + +**The load-bearing constraint the rest of Layer 2 assumes implicitly:** *a replay function reads the live config struct.* `computeFP1` is Go code in **today's** binary, reading fields off **today's** struct. That is fine when the struct shape is unchanged (the omitempty flip, a combiner bugfix, a changed default — all replay against the same fields). But **physically deleting field `c` from the struct breaks `computeFP1`** — it can no longer read `c`, cannot reproduce the `{a,b,c}` hash, and every lock that set `c` is forced `Stale`. Removal-from-the-struct is therefore the one edit that silently defeats replay. + +The way around it is a **deprecate-then-delete** two-step, both non-breaking: + +1. **Bump to v2 measuring `{a,b,d}` but keep field `c` in the struct**, tagged `fingerprint:"-"` so `computeFP2` ignores it while `computeFP1` can still read it for replay. 
Every old lock replays clean at v1, is recognized as unchanged, lazy re-stamps to v2. Zero forced rebuilds. +2. **Only after the floor passes v1** (`minSupportedLockContentVersion = 2`, ideally after a deliberate `--rehash`) physically delete field `c`. `computeFP1` is already retired, so nothing reads `c` anymore. + +> **Invariant:** a field may be physically removed from the config struct only after *every* registry entry that measured it has been retired below `minSupportedLockContentVersion`. Equivalently: retained replay functions and the struct they read must stay in sync — you cannot delete a field a live version still needs. + +This makes "drop an input" a lazy, per-component migration rather than a fleet-wide rebuild — at the cost of carrying a deprecated field on the struct until its replay function ages out. + #### First concrete use: the Layer 1 switchover -Flipping the inclusion default to omitempty (Layer 1) moves every config's hash, so it cannot ship as a free additive change — it is **Layer 2's first real customer.** It registers as `computeV2` (omitempty default) alongside `computeV1` (include-always), bumps `currentFingerprintVersion`, and is absorbed by replay: every existing lock recomputes clean at v1, is recognized as unchanged-inputs, and re-stamps to v2 *only when next written* per the churn policy above. No mass regen, no flag day. And because omitempty makes all future additive changes hash-neutral by construction (G4), it permanently **shrinks** the set of changes that need a Layer 2 version event at all — Layer 1 is both the first user of Layer 2 and the thing that reduces Layer 2's future workload. 
+Flipping the inclusion default to omitempty (Layer 1) moves every config's hash, so it cannot ship as a free additive change — it is **Layer 2's first real customer.** It registers as the `computeFP2` algorithm (omitempty default) alongside `computeFP1` (include-always), bumps `currentLockContentVersion` to 2, and is absorbed by replay: every existing lock recomputes clean at v1, is recognized as unchanged-inputs, and re-stamps to v2 *only when next written* per the churn policy above. (The resolution slot is unchanged across this bump — v2 reuses `computeRes1`.) No mass regen, no flag day. And because omitempty makes all future additive changes hash-neutral by construction (G4), it permanently **shrinks** the set of changes that need a Layer 2 version event at all — Layer 1 is both the first user of Layer 2 and the thing that reduces Layer 2's future workload. ### Layer 3 — Config schema version and canonical migration (future) @@ -243,6 +322,8 @@ This is the on-disk TOML axis. It is **independent** of the fingerprint axis and The critical invariant: **migrate old TOML → latest canonical struct, then hash once.** A semantically no-op migration (rename `foo`→`bar`) must produce the *same* canonical struct, hence the same hash, hence no drift — handled by Layer 2's replay only if the *encoding* changed, and by Layer 3's normalization for the *file shape*. Do **not** keep parallel `V1.Hash()`/`V2.Hash()` methods on versioned structs: that couples the lock to a Go type identity instead of a simple integer, and forces two independent code paths to agree on a hash forever. +**Caveat — `hashstructure` hashes the struct type name.** It mixes `reflect.Type.Name()` into the hash, so a Layer-3 migration that moves content into a *renamed* Go struct changes the fingerprint even when the content is byte-identical. 
"Rename is drift-neutral" therefore holds only if the canonical struct **keeps the original type name**, or the rename is shipped as a Layer-2 version bump that absorbs it. Prefer keeping the type name; reserve the version bump for when the type genuinely must be renamed. + ### Layer interaction ```text @@ -257,6 +338,39 @@ TOML on disk ──Layer 3: migrate to canonical struct──► ComponentConfig locks/.lock ``` +## Downstream fingerprint consumers (blast radius) + +The versioned-replay story in Layer 2 must hold for **every** reader of `InputFingerprint`, not just the two paths it grew up around. This is the migration blast-radius map; each consumer's behavior under a v1→v2 switchover is stated explicitly. + +| Consumer | Reads | Compares | Migration behavior required | +| -------- | ----- | -------- | --------------------------- | +| `checkFingerprintFreshness` (resolver) | recomputed identity | vs stored hash | Replay at lock version (Layer 2 core) | +| `component update` `Changed` decision | recomputed identity | vs stored hash | **Replay before `Changed`** (see churn policy / M2 seam) | +| `synthistory.FindFingerprintChanges` | stored hash strings across git history | adjacent commits | **No change needed — if migration stays lazy** | +| `synthistory.BuildDirtyChange` | recomputed (current ver) | vs stored `headLock` hash | **Replay at headLock version** before declaring dirty | +| `ResolutionInputHash` staleness/write | recomputed resolution hash | vs stored | **Shares the version; replay reserved, not yet wired** | + +### The synthetic changelog/release path is the real hazard + +[`synthistory.go`](../../../internal/app/azldev/core/sources/synthistory.go) turns fingerprint movement into **user-visible, shipped** package state — `%autochangelog` entries and `%autorelease` increments. There are two distinct comparators, and the design resolves them asymmetrically. 
+ +- **`FindFingerprintChanges` (historical walker)** does a raw, version-blind string compare of `InputFingerprint` across the lock's git history and emits a synthetic changelog/release entry on every change. Making it genuinely version-aware is hard-to-infeasible — it only has committed *strings*, no inputs to replay. **It does not need to be**, *provided migration stays strictly lazy.* Under the churn policy, a version bump only ever rides a commit where a real input also changed, so there is never a version-only commit in history for the walker to misread. The migration folds honestly into that real change's entry. **This is a design decision, not a code fix:** the v1→v2 conversion is an *accepted, per-component, notable* changelog event that piggybacks a real change. + - **Trap:** this only holds while migration is lazy. A fleet-wide `--rehash` (or the M2 bug where `Changed` flips for everyone) converts *phantom* → *honest-but-fleet-wide* — a truthful but fleet-wide release bump, i.e. **G1 is dead.** "Accept as notable" is therefore conditional on **migration never riding a version-only or fleet-wide write** (the `--rehash` floor pass excepted, because it is deliberate and operator-driven). +- **`BuildDirtyChange` (live dirty check)** compares a *recomputed* current-version (v2) hash against the *stored* (possibly v1) `headLock.InputFingerprint` and declares dirty on inequality. "Accept as notable" does **not** save this path: post-switchover an *unchanged* component would read **dirty on every `render`/`build`** until re-stamped — a persistent, recurring spurious signal, worse than a one-time entry. The fix is **free**: it is the *same replay Layer 2 already owes the freshness check* — replay at `headLock`'s recorded version before declaring dirty. One additional call site for logic already being written, no new mechanism. + +**Net:** M1 is not "make the changelog walker version-aware" (hard, maybe infeasible). 
It is two things already on the books — (1) the strict lazy churn policy, so the walker never sees a version-only commit; and (2) extend the freshness replay to `BuildDirtyChange`, one extra call site. + +### `ResolutionInputHash` — shares the version, replay deferred + +`ComponentLock` carries a *second* persisted content hash, `ResolutionInputHash`, with its own staleness logic and its own silent-write path (it writes when only `resHashChanged`, never flipping `Changed`). It has the **identical** evolution problem as `InputFingerprint`: any future change to `ComputeResolutionHash`'s algorithm moves every lock's hash — exactly the mass-churn this RFC exists to prevent. + +The single `lock-content-version` covers it (see [Both hashes share one version](#both-hashes-share-one-version)). What differs is **blast radius**, which is why we wire its replay later, not now: + +- `ResolutionInputHash` does **not** feed `synthistory` — so an algorithm change can never mint a phantom changelog/release (the M1 hazard is fingerprint-only). Worst case is a one-line `resolution-input-hash` rewrite per lock plus a wasted re-resolution that usually yields the same commit. Churn, not corruption. +- It is a flat seven-field SHA256, not a struct walk, so the Layer 1 omitempty flip leaves it untouched — it has no pending v1→v2 event. Its registry slot stays `computeRes1` until its inputs genuinely change. + +**Decision:** name the field for the general case now (`lock-content-version`); wire fingerprint replay in Layer 2's first PR; reserve resolution replay (slot present, prior fn reused) and wire it the day `ComputeResolutionHash` first changes — a localized follow-up with no schema change. This fixes the one irreversible thing (the persisted key name) without speculative code (KISS/YAGNI on the second replay). 
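The replay-before-declaring-dirty rule that the freshness check and `BuildDirtyChange` share can be illustrated with a toy registry. This is a sketch under stated assumptions, not the azldev code: `toyConfig`, `toyLock`, `algos`, `check`, and the two toy hash functions are all hypothetical stand-ins, and the "v2" algorithm only imitates an omitempty-style rule on a two-field struct.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// toyConfig and toyLock are hypothetical stand-ins for the real types.
type toyConfig struct{ Name, Patch string }

type toyLock struct {
	ContentVersion int // 0 means "absent on disk" and is read as 1
	Fingerprint    string
}

// hashParts folds length-prefixed parts into one SHA-256 hex digest.
func hashParts(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s;", len(p), p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// algos is a toy registry: v1 always hashes Patch, v2 omits it when empty.
var algos = map[int]func(toyConfig) string{
	1: func(c toyConfig) string { return hashParts("v1", c.Name, c.Patch) },
	2: func(c toyConfig) string {
		parts := []string{"v2", c.Name}
		if c.Patch != "" {
			parts = append(parts, c.Patch)
		}
		return hashParts(parts...)
	},
}

const (
	currentVersion = 2
	minSupported   = 1
)

type verdict string

const (
	fresh          verdict = "current"
	freshViaReplay verdict = "current (replayed; re-stamp on next real write)"
	stale          verdict = "stale"
)

// check compares at the current version first, then replays at the lock's
// own recorded version before declaring the component stale or dirty.
func check(lock toyLock, cfg toyConfig) verdict {
	if algos[currentVersion](cfg) == lock.Fingerprint {
		return fresh
	}
	ver := lock.ContentVersion
	if ver == 0 {
		ver = 1 // absent field reads as the baseline
	}
	if ver >= minSupported && ver < currentVersion {
		if replay, ok := algos[ver]; ok && replay(cfg) == lock.Fingerprint {
			return freshViaReplay
		}
	}
	return stale
}

func main() {
	cfg := toyConfig{Name: "pkg"}
	old := toyLock{Fingerprint: algos[1](cfg)} // written before the switchover
	fmt.Println(check(old, cfg))
	fmt.Println(check(old, toyConfig{Name: "pkg", Patch: "fix.patch"}))
}
```

In this sketch a live dirty check would treat `freshViaReplay` the same as `fresh`, which is the property that keeps an unchanged component from reading dirty on every `render` or `build` after a version bump.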
+ ## Design decisions ### D1 — `Includable` vs `IgnoreZeroValue` @@ -283,34 +397,37 @@ Fully implicit (omitempty default, no tags, no audit) was rejected — it remove ### D3 — Content version vs format version in the lock -Reusing `ComponentLock.Version` for the algorithm would force a format-version bump (and the strict `Parse` gate would reject old locks outright). A separate `FingerprintVersion` keeps the format stable and old locks readable, enabling lazy migration instead of hard rejection. +Reusing `ComponentLock.Version` for the algorithm would force a format-version bump (and the strict `Parse` gate would reject old locks outright). A separate `LockContentVersion` keeps the format stable and old locks readable, enabling lazy migration instead of hard rejection. It is named for the *general* case — it versions every content hash the lock stores (`InputFingerprint` now, `ResolutionInputHash` when its replay is wired) — because the persisted TOML key is the one thing that is expensive to rename after ship. ### D4 — Method-on-type hashing Adopt the **hybrid seam**: pure `ConfigHash()` on the config type, combiner in `fingerprint`. A full move was rejected (layering regression: I/O + crypto + algorithm versioning do not belong on a data type). See [Research](#where-the-hashing-logic-should-live). +Two constraints keep the seam from eroding back into the rejected methods-on-type design: **`ConfigHash()` must stay version-frozen** (it computes exactly one algorithm; it does *not* dispatch over versions — a single method "can't replay its own past"), and **the combiner is the sole version authority.** Version dispatch lives entirely in `fingerprint`'s registry; `ConfigHash()` is just the current pure-config step it calls. Keep `ConfigHash()` unexported-or-narrow if practical, so callers cannot route around the registry to get a raw, version-agnostic hash. + ## Alternatives considered - **Global `IgnoreZeroValue`** — see D1. 
Same default behavior but no per-field escape hatch for meaningful zeros and no point-of-definition audit. Rejected. - **Implicit omitempty (no mandatory tags, no audit)** — see D2. Removes the only guard against the unsafe false-negative direction. Rejected in favor of mandatory 3-way tags. -- **Content-hash the rendered config** (Go-modules style) instead of struct-hashing — would sidestep field-shape sensitivity, but we deliberately exclude many fields (`paths`, `publish`, snapshots) from the fingerprint, so a blanket content hash over-captures. Rejected. +- **Content-hash the rendered config** (Go-modules style) instead of struct-hashing. The naive version of this — "hash all the bytes" — over-captures, since we deliberately exclude many fields (`paths`, `publish`, snapshots) from the fingerprint. The *stronger* form is a **canonical-projection hash**: serialize only the included fields, keys sorted, and hash those bytes — immune to field-shape drift without per-field reflection tags. We still stay with `hashstructure` + `Includable` because our inclusion policy is **conditional** (omitempty = include-if-non-zero, evaluated on the resolved value), which a static byte serializer would have to re-implement anyway — so the projection hash buys field-shape immunity at the cost of reimplementing the very predicate `Includable` already gives us, plus a second serialization format to keep stable forever. Rejected on that basis, but recorded as the principled alternative; it is the one foundational choice that would be expensive to reverse post-adoption. - **Parallel versioned structs with per-struct `Hash()`** — couples locks to Go type identity and duplicates hashing logic per version. Rejected in favor of Layer 2's integer-versioned combiner + Layer 3 canonical migration. - **Bump lock format `Version` and migrate eagerly** — eager migration rewrites every lock at once, the exact mass-churn we are trying to avoid. Rejected in favor of lazy per-component re-stamp. 
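For comparison, the canonical-projection alternative recorded above can be sketched in a few lines. This is an illustrative toy, not a proposal: `projectionHash` is a hypothetical name, the config is flattened to a `map[string]string`, and the skip-when-empty check stands in for the conditional predicate that a byte serializer would have to re-implement.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// projectionHash (hypothetical) serializes only the included, non-empty
// fields with keys sorted, then hashes those bytes.
func projectionHash(fields map[string]string, excluded map[string]bool) string {
	keys := make([]string, 0, len(fields))
	for k, v := range fields {
		if excluded[k] || v == "" { // the re-implemented omitempty predicate
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		// Length prefixes keep key/value boundaries unambiguous.
		fmt.Fprintf(&b, "%d:%s=%d:%s\n", len(k), k, len(fields[k]), fields[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	excluded := map[string]bool{"paths": true}
	base := map[string]string{"name": "pkg", "spec": "pkg.spec"}
	grown := map[string]string{"name": "pkg", "spec": "pkg.spec", "newfield": "", "paths": "/tmp"}
	fmt.Println(projectionHash(base, excluded) == projectionHash(grown, excluded))
}
```

The sketch shows the field-shape immunity the alternative buys (an unset new field or an excluded field leaves the hash unchanged) and also its cost: the inclusion predicate and the serialization format both live in hand-written code that would have to stay stable.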
## Incremental delivery -1. **PR A (Layer 1)**: shared `includeFingerprintField` helper + `HashInclude` on `ComponentConfig` and `PackageConfig`; tag every fingerprinted field with one of `-`/`omitempty`/`always`; rewrite the field-decision audit to assert valid-tag presence and drop the `expectedExclusions` registry. **Note:** flipping the default moves every hash, so PR A must land *with or after* PR B's version machinery — it registers as `computeV2`, not as a standalone change. Unit test: an unset `omitempty` field is hash-invisible; setting it drifts; an `always` field drifts even at zero. -2. **PR B (Layer 2)**: `FingerprintVersion` on `ComponentLock`; version-dispatched `ComputeIdentity`; replay + re-stamp in `checkFingerprintFreshness` and `update.go`. Unit test: old-version lock with unchanged inputs → `Current`; changed inputs → `Stale`; re-stamp on update. +1. **PR A (Layer 1)**: shared `includeFingerprintField` helper + a delegating value-receiver `HashInclude` on **every** fingerprinted struct (all ~10 registered in `fingerprint_test.go`, not just `ComponentConfig`/`PackageConfig` — see the per-struct resolution note in Layer 1); tag every fingerprinted field with one of `-`/`omitempty`/`always`; rewrite the field-decision audit to (a) assert valid-tag presence and (b) assert every registered struct implements `Includable`, then drop the `expectedExclusions` registry. **Note:** flipping the default moves every hash, so PR A must land *with or after* PR B's version machinery — it registers as the `computeFP2` algorithm, not a standalone change. Unit tests: a zeroed `omitempty` field hashes **equal to its absence-equivalent** (not merely "setting it drifts" — that positive-direction test passes even if `HashInclude` is a no-op, so it must be paired with the zero-equals-absent assertion that actually exercises omission); an `always` field drifts even at zero. +2. 
**PR B (Layer 2)**: `LockContentVersion` on `ComponentLock` (+ `ComponentLockData` and `populateFromLock`, so the replay site can read the version); a paired version registry (fingerprint + resolution compute fns) with a `minSupportedLockContentVersion` floor; fingerprint replay-before-`Changed` in `update.go`; fingerprint replay in `checkFingerprintFreshness` **and `BuildDirtyChange`** (same replay logic, two call sites). Resolution-hash replay is *reserved* — the registry slot reuses `computeRes1`; not wired until `ComputeResolutionHash` first changes. Unit tests: old-version lock with unchanged inputs → `Current` and **not** `Changed`; changed inputs → `Stale`; re-stamp only on an already-dirty write. 3. **PR C (validation)**: scenario test (in the style of `scenario/component_changed_test.go`) — set a new `omitempty` field on a single component and assert only that lock drifts. 4. **PR D (Layer 3, later)**: `schema-version` field, load-time canonical migration, `ComponentConfig.ConfigHash()` seam. Gated on the first real non-additive TOML change. -Each PR is independently revertible. Because the Layer 1 default flip is a hash-moving change, PRs A and B ship together (or B first); the `fingerprint-version` omitempty stamp and churn policies ensure existing locks see zero diff until independently touched. Layer 3 migrates lazily on next write. +Each PR is independently revertible. Because the Layer 1 default flip is a hash-moving change, PRs A and B ship together (or B first); the `lock-content-version` omitempty stamp and churn policies ensure existing locks see zero diff until independently touched. Layer 3 migrates lazily on next write. ## Open questions -1. How many historical fingerprint versions should the registry retain before dropping the oldest? (Trade-off: replay coverage vs. dead code.) -2. Should a lazy re-stamp during a *read-only* command (`render`, `build` freshness check) write the lock back, or defer all writes to `component update`? 
Writing on read is surprising; deferring means freshness checks stay slightly slower until the next update. -3. For Layer 3, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration. -4. Should `omitempty` semantics use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? Pointers would make "set to empty" expressible but complicate the structs. -5. Do we want a `component update --rehash` escape hatch that force-advances `FingerprintVersion` across the whole project (for when a change *is* intended to be global)? -6. Can the audit go further than tag-presence and *statically* flag fields whose zero value is likely meaningful (e.g. a `bool` defaulting true) and nudge toward `always`? Or is the point-of-definition tag plus code review sufficient? +1. Should a lazy re-stamp during a *read-only* command (`render`, `build` freshness check) write the lock back, or defer all writes to `component update`? Writing on read is surprising; deferring means freshness checks stay slightly slower until the next update. (Leaning: defer all writes to `update`, keeping reads side-effect-free.) +2. For Layer 3, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration. +3. Should `omitempty` semantics use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? Pointers would make "set to empty" expressible but complicate the structs. +4. Can the audit go further than tag-presence and *statically* flag fields whose zero value is likely meaningful (e.g. a `bool` defaulting true) and nudge toward `always`? Or is the point-of-definition tag plus code review sufficient? +5. 
Should the mixed-toolchain hazard get a hard write-time guard (refuse to write a lock whose on-disk version exceeds the binary's `currentLockContentVersion`), or is the CI version-pin invariant enough? + +*Resolved in-text (recorded here so they aren't re-litigated):* registry retention is a **floor**, not "last N" (M8 / Registry floor); `--rehash` is the sanctioned forced-migration pass (promoted from a question to a requirement); absent `LockContentVersion` reads as `1`; one shared `lock-content-version` covers both stored hashes, with resolution-hash replay reserved (slot present, fn reused) until `ComputeResolutionHash` first changes. From 3ed59c0f9c4eef3c657f2c6e7ae4b81e915be0d0 Mon Sep 17 00:00:00 2001 From: Daniel McIlvaney Date: Fri, 5 Jun 2026 16:08:35 -0700 Subject: [PATCH 3/4] update 2 --- docs/developer/rfc/lazy-schema-migration.md | 480 ++++++++++++-------- 1 file changed, 288 insertions(+), 192 deletions(-) diff --git a/docs/developer/rfc/lazy-schema-migration.md b/docs/developer/rfc/lazy-schema-migration.md index af24f6da..8e9a56c6 100644 --- a/docs/developer/rfc/lazy-schema-migration.md +++ b/docs/developer/rfc/lazy-schema-migration.md @@ -1,15 +1,17 @@ -# RFC 002: Lazy Schema Migration for Lock-File Fingerprints +# RFC 002: Lock-File Fingerprint Reset and Lazy Schema Migration - **Status**: Draft - **Author**: @damcilva - **Created**: 2026-06-04 - **Related code**: - - [`internal/fingerprint/fingerprint.go`](../../../internal/fingerprint/fingerprint.go) — `ComputeIdentity`, `combineInputs` - - [`internal/lockfile/lockfile.go`](../../../internal/lockfile/lockfile.go) — `ComponentLock`, version gate + - [`internal/fingerprint/fingerprint.go`](../../../internal/fingerprint/fingerprint.go) — `ComputeIdentity`, `ComputeResolutionHash`, `combineInputs` + - [`internal/lockfile/lockfile.go`](../../../internal/lockfile/lockfile.go) — `ComponentLock`, `Parse` format-version gate - 
[`internal/projectconfig/fingerprint_test.go`](../../../internal/projectconfig/fingerprint_test.go) — field-inclusion audit - - [`internal/app/azldev/core/components/resolver.go`](../../../internal/app/azldev/core/components/resolver.go) — `computeFreshnessStatus`, `BuildDirtyChange` + - [`internal/app/azldev/core/components/resolver.go`](../../../internal/app/azldev/core/components/resolver.go) — `computeFreshnessStatus`, `checkFingerprintFreshness` - [`internal/app/azldev/cmds/component/update.go`](../../../internal/app/azldev/cmds/component/update.go) — `Changed` decision, re-stamp write - - [`internal/app/azldev/core/sources/synthistory.go`](../../../internal/app/azldev/core/sources/synthistory.go) — `FindFingerprintChanges` (synthetic changelog/release) + - [`internal/app/azldev/cmds/component/changed.go`](../../../internal/app/azldev/cmds/component/changed.go) — `classifyComponent`, `haveMatchingFingerprints` (CI classification) + - [`internal/app/azldev/core/sources/synthistory.go`](../../../internal/app/azldev/core/sources/synthistory.go) — `FindFingerprintChanges`, `BuildDirtyChange` (synthetic changelog/release) + - [`internal/app/azldev/core/sources/sourceprep.go`](../../../internal/app/azldev/core/sources/sourceprep.go) — `computeCurrentFingerprint` ## Background @@ -46,11 +48,11 @@ Drift is detected in [`resolver.go`](../../../internal/app/azldev/core/component As the tool matures, three *independent* notions of "version" are emerging. Conflating them is the source of the problems in this RFC: -| Axis | Versions what | Lives where | Exists today? 
| -| ---- | ------------- | ----------- | ------------- | -| **Config schema version** | on-disk TOML field shape | load / migration layer | No | -| **Lock content-hash version** | how inputs fold into the lock's stored hashes (`InputFingerprint` *and* `ResolutionInputHash`) | `fingerprint` combiner | No (implicitly v1) | -| **Lock file format version** | lock file serialization | `lockfile` | Yes (`Version = 1`) | +| Axis | Versions what | Lives where | Exists today? | Forced-migration verb | +| ---- | ------------- | ----------- | ------------- | --------------------- | +| **Config schema version** | on-disk TOML field shape | load / migration layer | No | `config migrate` (future) | +| **Lock content-hash version** | how inputs fold into the lock's stored hashes (`InputFingerprint` *and* `ResolutionInputHash`) | `fingerprint` combiner | No (implicitly v1) | `component migrate` | +| **Lock file format version** | lock file serialization | `lockfile` | Yes (`Version = 1`) | — (frozen at `1`) | ### The problem @@ -62,13 +64,33 @@ Concretely: we add field `foo` and set `foo = "baz"` on package `bar`. The desir There is a harder variant lurking behind the additive case: **non-additive** schema changes — renaming a field, removing one, changing a baked-in default, or fixing a bug in the hashing logic itself. These legitimately change the *meaning* of the config without changing user intent, and we will eventually need to absorb them without forcing every consumer to rebuild. +### The substrate problem: replay only works if old algorithms stay frozen + +The natural fix for non-additive change is **versioned replay**: stamp an algorithm version into the lock, keep the old algorithm around, and when a lock is behind, recompute with *its* algorithm to ask "were the inputs actually unchanged, or did only the encoding move?" If unchanged, accept the lock without a rebuild. 
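The versioned-replay idea in the paragraph above can be sketched in a few lines of Go. This is a minimal illustration, not the RFC's implementation: `Inputs`, `Lock`, `registry`, `checkFreshness`, and the two `computeV*` functions are hypothetical names invented here, the "bug" fixed in v2 (unsorted overlays) is only an example, and the sketch assumes each registered function stays byte-for-byte frozen once shipped.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Inputs is a hypothetical stand-in for one component's build-effective inputs.
type Inputs struct {
	UpstreamCommit string
	Overlays       []string
}

// Lock is a simplified stand-in for a lock entry: the stored digest plus the
// algorithm version that produced it. How the version is persisted is left open.
type Lock struct {
	AlgoVersion int
	Digest      string
}

// digest hashes length-prefixed parts so that part boundaries are unambiguous.
func digest(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s", len(p), p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// computeV1 folds overlays in the order given (imagine this is the old bug).
func computeV1(in Inputs) string {
	return digest(append([]string{"v1", in.UpstreamCommit}, in.Overlays...)...)
}

// computeV2 sorts overlays before folding them in (the imagined fix).
func computeV2(in Inputs) string {
	sorted := append([]string(nil), in.Overlays...)
	sort.Strings(sorted)
	return digest(append([]string{"v2", in.UpstreamCommit}, sorted...)...)
}

// registry maps each algorithm version to the function that replays it.
var registry = map[int]func(Inputs) string{1: computeV1, 2: computeV2}

type Freshness string

const (
	Current Freshness = "current"
	Stale   Freshness = "stale"
)

// checkFreshness recomputes the digest with the lock's own algorithm version.
// A match means the inputs are unchanged, even if a newer algorithm exists.
func checkFreshness(lock Lock, in Inputs) (Freshness, error) {
	compute, ok := registry[lock.AlgoVersion]
	if !ok {
		return Stale, fmt.Errorf("no algorithm registered for version %d", lock.AlgoVersion)
	}
	if compute(in) == lock.Digest {
		return Current, nil
	}
	return Stale, nil
}

func main() {
	in := Inputs{UpstreamCommit: "abc123", Overlays: []string{"b.patch", "a.patch"}}
	old := Lock{AlgoVersion: 1, Digest: computeV1(in)}

	status, _ := checkFreshness(old, in)
	fmt.Println(status) // current: unchanged inputs replay clean under v1

	in.UpstreamCommit = "def456"
	status, _ = checkFreshness(old, in)
	fmt.Println(status) // stale: a real input change still drifts the lock
}
```

In this sketch a lock that replays as current under its own version could be accepted without a rebuild and re-stamped to the newest version on its next write. The whole scheme rests on the assumption stated in the lead-in, that the old function reproduces exactly what it produced when the lock was written.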
+ +Replay only works if an old algorithm function can faithfully reproduce the hash it produced when the lock was written. **On the current `hashstructure` substrate, it cannot** — a "frozen" algorithm function is not actually frozen: + +- Its body is `hashstructure.Hash(component, …)`, which **reflects over the live Go struct**. Add a field later and the old function now sees that field (at zero value, included) → its output moves → it can no longer reproduce the historical hash. So *adding* a field breaks *replay of older versions*, which is exactly the additive case we are trying to make free. +- It also resolves the live **method set**: once `ComponentConfig` implements `Includable`, the same `hashstructure.Hash` call silently switches inclusion behavior, with no per-call opt-out (the interface is resolved automatically). + +The consequence is sharp: an incremental "flip the default to omitempty, lazily migrate" plan **cannot keep its central promise.** "Additive fields are drift-neutral by construction" holds only for locks already at the new version; for the older locks that lazy migration deliberately leaves alone, the next field addition forces a hash change anyway. You do not avoid the mass rebuild — you defer it to the first field addition, and you build the whole replay apparatus on a substrate that makes replay unsound. + +### The opportunity: a coordinated cutover is already scheduled + +The project has a **dev→prod environment cutover** coming that forces a full rebuild regardless. This is a *coordinated cutover* — a one-time, distro-wide switch with no mixed-version window, the sanctioned moment to make changes that cannot be made lazily. That changes the calculus completely. The entire "lazy" framing exists to *avoid* a mass update; if exactly one sanctioned mass update is already on the calendar, the strategy inverts: + +> **Lazy migration is for the cheap and additive. 
The one free rebuild is a budget — spend it exclusively on the one-way doors that are cheap now and a coordinated-cutover-only change later.** + +This RFC therefore has two parts: **(1)** a one-time **reset** at the dev→prod cutover that replaces the hashing substrate with one whose old algorithms are *genuinely* frozen, and **(2)** a **post-reset lazy migration** mechanism (versioned registry + replay) that rides that clean substrate for the rare genuine algorithm change thereafter. Part 2 is what the original "lazy" design was reaching for; part 1 is what makes it sound. + ### Goals -- **G1 (primary, non-functional): no spurious lock-file diffs.** Landing a config-schema or hashing change must not rewrite `*.lock` files for components whose effective inputs are unchanged — not even to bump a version field. Soft requirement (strongly preferred, not a hard gate), but it shapes which solutions are acceptable: it rules out any eager "migrate everything" pass. -- **G2: only real changes drift.** A lock changes iff that component's build-effective inputs changed. -- **G3: piecemeal, lazy migration.** Schema/algorithm evolution rolls out per-component, riding along with independent changes, never as a big-bang. -- **G4: additive fields are drift-neutral by construction.** Adding an unset field should be invisible to every existing lock with no author effort beyond declaring intent. -- **G5: correctness backstop preserved.** Never silently under-rebuild: a genuine input change must always drift its lock. +- **G1 (primary, non-functional): no spurious lock-file diffs *after the reset*.** Once prod locks exist, landing a config-schema or hashing change must not rewrite `*.lock` files for components whose effective inputs are unchanged. The reset itself is the *one* sanctioned exception, absorbed by the already-scheduled rebuild. +- **G2: only real changes drift.** Post-reset, a lock changes iff that component's build-effective inputs changed. 
+- **G3: piecemeal, lazy migration post-reset.** Genuine algorithm evolution after the reset rolls out per-component, riding independent changes, never as a big-bang. +- **G4: additive fields are drift-neutral by construction — *truly*, not just for new locks.** On the projection substrate (below) an unset additive field is invisible to *every* lock including old ones, because old algorithms pin an explicit field list and never reflect over the live struct. +- **G5: correctness backstop preserved.** Never silently under-rebuild: a genuine input change must always drift its lock. Replay may accept encoding/over-capture changes; it must never mask a behavior-changing one. +- **G6 (new, hard): back-compatible reads for synthetic history.** The new binary must still **read** pre-reset locks across git history (synthetic changelog/release walks them), even though it **writes** only the new format. Reading never recomputes a historical hash — it compares stored strings only. ## Problem inventory @@ -79,8 +101,9 @@ There is a harder variant lurking behind the additive case: **non-additive** sch | 3 | No way to evolve the hashing algorithm (bugfix, input reorder) without drift | `combineInputs` has no version; old and new outputs are incomparable | Mass rebuild + lock churn | | 4 | No on-disk config schema version | `ConfigFile` has a `$schema` URL but no version field | Blocks managed migration | | 5 | Migration is all-or-nothing | Freshness check is binary match/no-match against one stored hash | No piecemeal rollout | +| 6 | Versioned replay is unsound on the current substrate | "Frozen" algorithm = `hashstructure.Hash` over the **live** struct/method-set; adding a field moves the old function's output | Replay cannot reproduce historical hashes | -Problems 1–3 share a shape: a change that *should* be invisible to most components is forced to be visible to all of them, because the fingerprint cannot distinguish "input changed" from "encoding changed." 
Problem 4 is the missing primitive for managed config evolution. Problem 5 is the property we actually want from any solution — **per-component, lazy** migration, where a lock upgrades only when something independently touches it. +Problems 1–5 share a shape: a change that *should* be invisible to most components is forced to be visible to all of them, because the fingerprint cannot distinguish "input changed" from "encoding changed." Problem 4 is the missing primitive for managed config evolution. Problem 5 is the property we want from any post-reset solution — **per-component, lazy** migration. Problem 6 is the one that kills the *incremental* path outright: the very mechanism that would make problems 1–3 free (versioned replay) is unsound while the substrate reflects the live struct. Fixing 6 is what the reset buys. ## How fingerprinting works today (detail) @@ -98,9 +121,15 @@ Two properties of `hashstructure` v2.0.2 are load-bearing for this RFC: 1. **No per-field `omitempty`.** The only field tags it recognizes are `-`/`ignore` (skip) and `set`/`string` (encoding). A zero-value field is hashed; it is not skipped. 2. **It honors the `Includable` interface.** If the value (or a pointer to it) implements `HashInclude(field string, v interface{}) (bool, error)`, the walker calls it per field and omits the field when it returns `false`. **An omitted field hashes identically to a field that was never declared.** There is also a global `IgnoreZeroValue` option that skips *all* zero-value fields. -The struct's type name *is* part of the hash (`hashstructure` mixes in `reflect.Type.Name()`), but that name does not change when fields are added, so it is irrelevant to drift. +The struct's type name *is* part of the hash (`hashstructure` mixes in `reflect.Type.Name()`), so a rename of the Go type moves every hash even when content is byte-identical. 
+ +**Why this substrate cannot host frozen replay.** Every property above is resolved *at hash time against the live program*, not against a pinned description of the v1 encoding: -One constraint: the top-level value passed to `hashstructure.Hash` is not addressable, so an `Includable` implementation must use a **value receiver** to be seen for the root struct. +- The set of fields walked is whatever the struct has *now* — add a field, and last year's `computeFP1` (whose body is still just `hashstructure.Hash(component)`) now includes it. +- Whether `Includable` is consulted depends on whether the type implements it *now* — not on what was true when v1 locks were written. +- A `value` vs `pointer` receiver subtlety even decides whether the root struct's `HashInclude` is seen at all (the top-level value is not addressable). + +A function meant to be "the v1 algorithm, forever" therefore changes meaning every time the struct or its method set changes. That is the disqualifier for the incremental plan (Problem 6) and the motivation for the projection substrate below, whose v1 function pins an explicit field list and is immune to all three. ## Change taxonomy @@ -108,21 +137,28 @@ Not every config change should be treated the same way. The right mechanism depe | Class | Example | Should unaffected locks drift? 
| Mechanism | | ----- | ------- | ------------------------------ | --------- | -| **Additive field** | new `foo` field, unset on most components | No — only setters drift | Default omitempty (Layer 1); no version bump | -| **Additive with non-zero default** | new field defaulted to `"auto"` via defaults merge | No | Algorithm version + replay (Layer 2) | -| **Rename / move** | `foo` → `bar`, same semantics | No | Schema migration → canonical hash (Layer 3) + Layer 2 | -| **Semantic change** | meaning of `foo` changes; output differs | Yes — that's correct | None; drift is intended | -| **Hashing bugfix** | overlay ordering bug in `combineInputs` | No | Algorithm version + replay (Layer 2) | -| **Field removal** | drop deprecated `foo` | No, if nobody set it | Migration drops field; Layer 2 for setters | +| **Additive field** | new `foo` field, unset on most components | No — only setters drift | **Free, no bump.** Add `foo` to the current `projectVN` as omit-if-zero; a component that leaves it unset emits identical bytes, so no shipped hash moves. Setters drift (correct). | +| **Additive with non-zero default** | new field defaulted to `"auto"` via defaults merge | No | **Bump + replay.** The default resolves non-zero on *every* component, so it is emitted everywhere and would move every hash — omit-if-zero can't save it. Ship `projectV(N+1)` that emits it; old locks **replay at their version** (which didn't emit it), match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. | +| **Rename / move** | `foo` → `bar`, same semantics | No | **Schema migration + bump + replay.** Migrate old TOML → canonical struct (the rename lands in the struct), then ship `projectV(N+1)` that emits the renamed field. Old locks replay at their version and are recognized unchanged → lazy re-stamp, no rebuild. 
| +| **Semantic change** | meaning of `foo` changes; output differs | Yes — that's correct | **None.** The build output genuinely differs, so the lock *should* drift. Replay at the old version would (correctly) mismatch → `Stale` → rebuild. Nothing to suppress. | +| **Hashing bugfix** | overlay ordering bug in the combiner | No | **Bump + replay.** Ship the fixed combiner as the version-`N+1` half of `computeFP(N+1)`; old locks replay at the old (buggy) version. If their inputs are unchanged the buggy digest still matches → recognized unchanged → lazy re-stamp to the fixed version, no rebuild. | +| **Newly measured input** | start folding in a new overlay source or identity element | No | **Bump + replay.** A non-config input is added in the combiner half of `computeFP(N+1)` (a config field would go in `projectV(N+1)`). Old locks replay at their version, which didn't fold it in, match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. **Caveat:** until a lock migrates, replay is *blind* to the new input, so a change to it reads as fresh (false-fresh) — if it is build-critical, force a `component migrate` pass instead of riding lazy adoption (see [churn-avoidance](#churn-avoidance-policies-g1)). | +| **Field removal** | drop deprecated `foo` | No, if nobody set it | **Deprecate-then-delete (+ bump for setters).** Bump to a `projectV(N+1)` that stops emitting `foo` but **keep the field on the struct** so the old `projectVN` can still read it for replay. Only after the floor passes that version (ideally after a `component migrate`) physically delete the field. Setters drift on the bump; non-setters replay clean. 
| + +The recurring requirement across the "No" rows is the same: **distinguish a change in user intent from a change in encoding, and only drift on the former.** Note the first row: on the projection substrate, a new field is added to `projectVN` as *omit-if-zero*, so a component that does not set it emits identical bytes and stays hash-neutral — *for every lock, old or new*, because old configs never set the brand-new field. Adding it does not move any existing hash (no shipped lock set it), so it needs no version bump. Part 2 then carries only the genuinely hard cases (rows 2, 5, and post-reset renames/removals). The shared move in every "Bump + replay" row is the same primitive — **increment the content version, keep the old `projectVN` as a frozen replay function, and let unchanged locks re-stamp lazily** — detailed in [Part 2](#part-2--post-reset-lazy-migration). -The recurring requirement across the "No" rows is the same: **distinguish a change in user intent from a change in encoding, and only drift on the former.** Note the first row: with omitempty as the *default* (Layer 1), additive fields need no version bump and no replay at all — they are hash-neutral by construction. Layer 2 then carries only the genuinely hard cases (rows 2, 5). +> **`projectVN`** is shorthand used throughout this RFC for the hand-written *projection function* introduced by this design (defined in [Substrate options](#substrate-options) and [The projection substrate](#the-projection-substrate)). The `N` is the lock content version: `projectV1` is the function that names and serializes exactly the fields content-version 1 measures, `projectV2` the next algorithm, and so on. Each `projectVN` is frozen once shipped — that is the whole point. ## Research -### `hashstructure` options +### Substrate options -- **`Includable` (per-field callback)** keeps existing hashes byte-identical: fields that don't opt into omission hash exactly as they do today. 
This is the only option that solves Problem 1 *without* itself triggering a mass rebuild. -- **`IgnoreZeroValue` (global)** is simpler to wire but flips the hash of *every* struct that has any zero-value field — i.e. it is itself a mass-rebuild event, and it removes our ability to say "this empty field is meaningful." Rejected for the default path. +Two substrates can produce a content fingerprint of the resolved config. The difference that matters here is **whether an old algorithm function can be frozen.** + +- **`hashstructure` + `Includable` (rejected as the substrate).** Keeps existing hashes byte-identical and gives per-field omission via `HashInclude`. But, as established above (Problem 6), a function built on `hashstructure.Hash` reflects over the live struct and method set, so it cannot be a frozen historical algorithm. It also requires a value-receiver `HashInclude` on *every* nested fingerprinted struct and a subtle `v.(reflect.Value)` type-assert to work at all — brittle plumbing in service of a substrate that still can't host sound replay. +- **Canonical projection + stdlib hash (chosen).** Split the two jobs `hashstructure` fuses — *field selection* and *hashing* — into explicit steps. A `projectVN` function names the exact fields version N measures, emits them in a canonical, sorted, self-delimiting byte form, and an stdlib `sha256` hashes those bytes. Because `projectV1` references an **explicit, pinned field list**, it does not see fields added later, does not depend on the type's method set, and does not depend on receiver subtleties. It is a genuinely frozen pure function — the property replay requires. The cost is owning a small projection encoder plus **golden hash vectors** per version (a checked-in `(config, version) → hash` table) so "frozen" is CI-enforced, not merely intended. + +The projection substrate is what makes G4 true for old locks and what makes Part 2's replay sound. It is adopted at the reset (below), not incrementally. 
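A canonical projection of the kind described above might look like the following sketch. It is illustrative only: the `Config` type, its fields, the `field` and `encode` helpers, and the 4-byte big-endian length prefix are assumptions made for this example, not the RFC's actual encoder. The one property it tries to show is that the projection function names an explicit field list, so anything it does not name cannot move the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// Config is a hypothetical stand-in for the resolved component config.
type Config struct {
	Name      string
	Release   string
	Foo       string // additive field, unset on most components
	LocalPath string // deliberately not fingerprinted
}

// field is one named entry in a projection. always marks a field whose zero
// value is build-meaningful and must be emitted even when empty.
type field struct {
	key, value string
	always     bool
}

// appendLenPrefixed writes a 4-byte big-endian length followed by the bytes,
// which makes the stream self-delimiting.
func appendLenPrefixed(dst []byte, s string) []byte {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(s)))
	dst = append(dst, n[:]...)
	return append(dst, s...)
}

// encode sorts fields by key, skips zero-valued fields unless marked always,
// and emits each remaining field as length-prefixed key then value.
func encode(fields []field) []byte {
	sort.Slice(fields, func(i, j int) bool { return fields[i].key < fields[j].key })
	var out []byte
	for _, f := range fields {
		if f.value == "" && !f.always {
			continue
		}
		out = appendLenPrefixed(out, f.key)
		out = appendLenPrefixed(out, f.value)
	}
	return out
}

// projectV1 pins the exact fields this version measures. LocalPath is never
// named here, so it cannot influence the result.
func projectV1(c Config) []byte {
	return encode([]field{
		{key: "name", value: c.Name},
		{key: "release", value: c.Release, always: true},
		{key: "foo", value: c.Foo}, // omit-if-zero
	})
}

// configHash hashes the projection bytes with the standard library's SHA-256.
func configHash(projection []byte) string {
	sum := sha256.Sum256(projection)
	return hex.EncodeToString(sum[:])
}

func main() {
	base := Config{Name: "bar", Release: "3.0"}

	moved := base
	moved.LocalPath = "/tmp/elsewhere"
	fmt.Println(configHash(projectV1(base)) == configHash(projectV1(moved))) // true

	set := base
	set.Foo = "baz"
	fmt.Println(configHash(projectV1(base)) == configHash(projectV1(set))) // false
}
```

A golden-vector test in the sense used above would pin the hex output of `configHash(projectV1(cfg))` for a few checked-in configs, so that any accidental change to the encoder or the field list fails CI. No concrete digests are shown here because they would have to be generated by running the real encoder.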
### How other tools version lock state @@ -131,207 +167,259 @@ The recurring requirement across the "No" rows is the same: **distinguish a chan - **Terraform state** stores a `version` and a `terraform_version`; state is upgraded forward on use, never downgraded. - **Go modules** avoid the problem entirely by hashing *content* (`h1:` dirhashes) rather than a struct shape, so adding metadata fields never perturbs existing sums. -The common pattern: an **integer version stamped into the persisted artifact**, plus the ability to **read and replay older versions**, plus **lazy forward-migration on write**. Our `ComponentLock.Version` already provides the slot; today we only ever reject mismatches instead of migrating. +The common pattern: an **integer version stamped into the persisted artifact**, plus the ability to **read and replay older versions**, plus **lazy forward-migration on write**. We keep `ComponentLock.Version` (the lock *format* slot) fixed at `1` and carry the *content* version **inside the `InputFingerprint` token** (`v:sha256:…`) rather than in a separate struct field — one atomic value, no version/digest desync, no new TOML field for an old binary to mishandle. The Go-modules lesson is the deepest one: hashing *content* rather than struct shape is what makes additive metadata free — the canonical-projection substrate is our version of that lesson. ### Where the hashing logic should live -A natural question (raised during design) is whether to move hashing onto the config types as a method. The hashing logic decomposes into two separable jobs: +With the projection substrate the fingerprint algorithm decomposes into two steps. **Both are versioned together** by the single lock content version — the version pins the *entire* fingerprint computation, not just the field list: + +1. **Projection** — `projectVN(config)` names and serializes the config fields version N measures. 
This is *about the config type*, but it is data extraction, not hashing: it returns canonical **bytes**, not a hash. +2. **Combiner / orchestration** — reads overlay file contents (needs `opctx.FS`), folds in source identity / releasever / bump, applies domain separation, and runs `sha256` over the projection bytes plus those non-config inputs. None of these are config fields, but the combiner equally decides *what is measured*: starting to fold in a new overlay source, adding an identity input, or reordering the fold all change the digest exactly as a projection change does. -1. **Pure config hash** — `hashstructure.Hash(component, …)` plus field-inclusion policy. This is genuinely *about the config type*; `HashInclude` is already a method on it. -2. **Combiner / orchestration** — reads overlay file contents (needs `opctx.FS`), folds in source identity / releasever / bump, applies domain separation, and (Layer 2) selects an algorithm version. None of these are config fields. +So the per-version compute function in the registry is the **whole algorithm** — `computeFPN` = `projectVN` + the combiner step frozen at version N. "Watching another field" splits cleanly: if it is a *config* field, it goes in `projectV(N+1)`; if it is a *non-config* input (a new overlay source, a new identity element), it goes in the combiner half of `computeFP(N+1)`. Either way it is a content-version bump absorbed by replay, never a silent hash move. The combiner is the **sole version authority**: it owns the registry and the dispatch, and `projectVN` is just the frozen config-extraction step it calls. -Moving (1) onto the type improves cohesion and version-locality. Moving (2) onto the type would drag I/O and cross-cutting algorithm versioning into `projectconfig` (a pure data package that `lockfile` imports), and would scatter the centralized field-inclusion audit. The combiner must own algorithm versioning because "I changed how overlays fold in" is not a per-type concern. 
**Recommendation: a hybrid seam** — expose `ComponentConfig.ConfigHash()` on the type; keep the combiner in `fingerprint`. +Expose the projection on (or beside) the config type and keep the combiner in `fingerprint`. **Do not** expose a `ConfigHash()` method on the type: a method that returns a finished hash both drags a hashing concern onto a data type *and* tempts callers to route around the version registry to get a raw, version-agnostic hash. Returning bytes from `projectVN` keeps the type ignorant of versioning and crypto. ## Proposed approach -The design is **layered**, not a single switch. Each layer is independently shippable and addresses a distinct row of the taxonomy. Layers 1 and 2 cover the immediate need (Problems 1–3); Layer 3 is the forward-looking config-schema-version axis (Problem 4) and can follow later. +The design has **two parts** with very different cost profiles: -### Layer 1 — Omitempty as the default inclusion policy +1. **Part 1 — the reset (one coordinated cutover).** At the dev→prod cutover, swap the hashing substrate to canonical projection, declare the post-cutover projection as content-version **v1**, and spend the already-scheduled rebuild on every change that is *cheap now and a one-way door later* (the irreversible changes). Pre-reset locks already committed to **git history** stay readable and are never recomputed (the back-compat invariant below); a pre-reset lock in the **working tree** is force-rehashed to the `v1:` token on its first post-reset `update`. +2. **Part 2 — post-reset lazy migration (below).** A versioned registry + replay, now riding the *frozen* projection functions, absorbs the rare genuine algorithm change after the cutover, lazily and per-component, with no second coordinated cutover. -Today the safe default is *include-always*: a new field contributes to the hash even at zero value. 
We **flip the default to omitempty** (include only when non-zero) and make the inclusion policy an explicit, exhaustive, CI-enforced choice per field. +The original "lazy" instinct was right for Part 2 and wrong for Part 1: there is no way to make a substrate swap or a batch of one-way-door normalizations free, so they must ride the one rebuild we are already paying for. Everything that *can* be lazy (additive fields) is pushed into Part 2 and costs nothing. -Every fingerprinted field must carry one of three `fingerprint` tag values: +## Part 1 — The reset -| Tag | Meaning | When to use | -| --- | ------- | ----------- | -| `fingerprint:"omitempty"` | included **only when non-zero** (the new default) | almost all fields | -| `fingerprint:"always"` | included even at zero value | fields whose **zero value is build-meaningful** (e.g. a `bool` that defaults true, where `false` must rebuild) | -| `fingerprint:"-"` | excluded from the hash entirely | paths, publish routing, runtime state | +### The projection substrate -There is no untagged state. `TestAllFingerprintedFieldsHaveDecision` is rewritten to assert that **every** field of every fingerprinted struct carries a valid tag value — failing CI on any bare field. This is *simpler* than today's audit: it no longer maintains an `expectedExclusions` registry, it just checks for tag presence and validity. The conscious decision moves to the point of field definition, where the author has the context to judge whether zero is meaningful. +Replace `hashstructure.Hash(component, …)` with an explicit two-step pipeline: -**`Includable` is resolved per-struct — every fingerprinted struct needs the method.** `hashstructure` looks up `Includable` on each struct it walks (and the whole tree is non-addressable, since the root is passed by value), so a `HashInclude` on `ComponentConfig` alone governs only `ComponentConfig`'s own fields. 
On any nested struct that lacks its own value-receiver `HashInclude`, the `omitempty`/`always` tags are **decorative** — `hashstructure` natively understands only `-`/`ignore`/`set`/`string`, so the tag passes the CI audit while the field is still hashed at zero, and G4 silently holds only at the top level. The audit (`fingerprint_test.go` registers ~10 fingerprinted structs: `ComponentConfig`, `ComponentBuildConfig`, `CheckConfig`, `PackageConfig`, `ComponentOverlay`, `SpecSource`, `DistroReference`, `SourceFileReference`, `ReleaseConfig`, `ComponentRenderConfig`) must therefore **also assert that every registered struct implements `Includable`** — so a new fingerprinted struct cannot ship with inert tags. All registered structs get the one-line delegating method. +```text +ComponentConfig ──projectV1(cfg)──► canonical bytes ──sha256──► configHash + (explicit field list, (stdlib) + sorted keys, emit-if-nonzero) +``` -Implement `Includable` on each fingerprinted struct, delegating to one shared helper: +`projectV1` is hand-written and names exactly the fields v1 measures. It emits a canonical, sorted, self-delimiting byte stream (length-prefixed keys + values) so distinct field sets cannot collide, and it omits a field when its **resolved value is zero** (the omitempty behavior, now a property of the encoder, not a struct tag). A field whose zero value is build-meaningful is simply listed as *always-emit* in `projectV1`. -```go -// includeFingerprintField reports whether a field participates in the hash. -// "-" fields never reach here (hashstructure skips them first). "always" fields -// are included unconditionally; "omitempty" (the default) is included only when -// the resolved value is non-zero. 
-func includeFingerprintField(t reflect.Type, field string, val reflect.Value) (bool, error) { - sf, ok := t.FieldByName(field) - if !ok { - return true, nil - } - switch sf.Tag.Get("fingerprint") { - case "always": - return true, nil - default: // "omitempty" - return !val.IsZero(), nil - } -} +Three things this buys that `hashstructure` could not: + +- **Frozen by construction.** `projectV1` references a pinned field list, so adding `Foo` to the struct later is invisible to it — `projectV1`'s output for an old config is unchanged. This is what makes Part 2's replay sound (Problem 6) and G4 true for *old* locks, not just new ones. +- **No method-set / receiver magic.** No `Includable`, no per-nested-struct method, no `v.(reflect.Value)` type-assert footgun. Field selection is ordinary code. +- **Golden-vector enforced.** A checked-in table of `(config, version) → hash` vectors is asserted in CI, so any accidental change to a historical `projectVN` fails the build. "Frozen" stops being a promise and becomes a test. + +The cost is owning the projection encoder and the golden vectors. That cost is paid once, at the reset, against a rebuild we are already doing. + +### Baseline v1 — omit-if-zero, no include-always legacy + +Because the reset rebuilds everything, there is **no pre-existing population to stay byte-compatible with.** That removes the single biggest constraint of the incremental plan: we do **not** need an `include-always` compatibility mode to preserve today's hashes. `projectV1` is the omit-if-zero projection from day one. There is no `computeFP1 = legacy include-always` entry to carry forever — the registry's floor *starts* at the clean projection. -// Value receiver: the root struct passed to hashstructure.Hash is not addressable. -// -// CRITICAL: hashstructure calls HashInclude(field, innerV) where innerV is -// ALREADY a reflect.Value (the field's value), boxed into the interface{}. -// So we must TYPE-ASSERT it, not reflect.ValueOf it. 
reflect.ValueOf(v) would -// describe the reflect.Value struct itself (always non-zero) → !IsZero() always -// true → omitempty silently never fires and Layer 1 no-ops. Verified against -// hashstructure v2.0.2 hashstructure.go:346 (`include.HashInclude(name, innerV)`). -func (c ComponentConfig) HashInclude(field string, v interface{}) (bool, error) { - return includeFingerprintField(reflect.TypeOf(c), field, v.(reflect.Value)) +```go +// projectV1 emits the canonical byte form of the fields v1 measures. +// Field selection is explicit code, not reflection — this is what freezes it. +// emit() length-prefixes key+value so distinct field sets cannot collide; +// it skips a field when the resolved value is zero (the omit-if-zero default). +func projectV1(c *ComponentConfig) []byte { + var b canonicalBuf + b.emit("upstream", c.Upstream) // omit-if-zero + b.emit("patches", c.Patches) // omit-if-zero (nil and [] both → absent) + b.emitAlways("strip_debug", c.StripDebug) // always: zero (false) is build-meaningful + // … one line per measured field, in a fixed order … + return b.Bytes() } ``` -**Why flipping the default is safe — fingerprints see the resolved config.** The usual objection to blanket omitempty is the false-negative footgun: a field whose zero is meaningful gets omitted and collides with "unset," so two semantically different configs hash the same and a rebuild is missed. That objection assumes we hash *raw user input*. We do not. `ComputeIdentity` runs on the **resolved, post-merge** config (`*result.config`, after defaults are applied). The omit predicate is therefore "the *resolved value* equals Go-zero," not "the user didn't type it." Consequences: +**Why omit-if-zero is safe — fingerprints see the resolved config.** The usual objection to blanket omit-if-zero is the false-negative footgun: a field whose zero is meaningful gets omitted and collides with "unset," so two semantically different configs hash the same and a rebuild is missed. 
That objection assumes we hash *raw user input*. We do not. `ComputeIdentity` runs on the **resolved, post-merge** config (`*result.config`, after defaults are applied). The omit predicate is therefore "the *resolved value* equals Go-zero," not "the user didn't type it." Consequences: - Two configs that both resolve a field to zero build identically → hashing them the same is **correct**, not a collision. -- "Unset" never reaches the hasher — it has already been resolved to its default. If the default is non-zero, the field is non-zero and is included anyway. If the default *is* zero, then unset and explicit-zero resolve identically → same build → same hash → correct. +- "Unset" never reaches the hasher — it has already been resolved to its default. If the default is non-zero, the field is non-zero and is emitted anyway. If the default *is* zero, then unset and explicit-zero resolve identically → same build → same hash → correct. + +So the classic false-negative requires absence ≠ zero-default *at the point of hashing*, and post-merge resolution closes that gap. The load-bearing invariant is **G5's guarantee restated structurally: the fingerprint must see exactly the build-effective resolved config.** That invariant must already hold, or fingerprinting is broken independently of this change. `emitAlways` is the escape hatch for the rare field whose zero value is build-meaningful. + +**Result:** additive fields are drift-neutral **by construction** (G4) — a newly added field, listed omit-if-zero in `projectVN`, emits nothing for any component that does not set it, so it is invisible to every lock that leaves it unset, old or new. Adding it moves no existing hash (no shipped lock could have set a field that did not yet exist), so it needs no version bump. Only setters drift (G2). + +#### Edge cases under omit-if-zero + +- **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). 
Post-merge: unset → `4` (emitted), explicit `0` → omitted. These build differently *and* hash differently, so there is no collision — they are consistent. Use `emitAlways` only if a zero value must be distinguishable from a future change of default. +- **nil vs empty slice.** A missing TOML key → nil → omitted; `key = []` → non-nil empty → emitted. For any slice/map field where an explicit-empty value is reachable and build-meaningful, use `emitAlways` so nil and empty both hash. + +### The reset load-out — what to spend the free rebuild on + +The reset rebuild is a budget. Spend it on the irreversible / cutover-only changes; **do not** spend it on anything Part 2 can do lazily for free. Priority order: -So the classic false-negative requires absence ≠ zero-default *at the point of hashing*, and post-merge resolution closes that gap. The load-bearing invariant is **G5's guarantee restated structurally: the fingerprint must see exactly the build-effective resolved config.** That invariant must already hold, or fingerprinting is broken independently of this change. The `fingerprint:"always"` escape hatch (plus the mandatory-tag audit) is cheap insurance against the invariant silently drifting later — e.g. if someone applies a default *after* fingerprinting. +1. **Switch the substrate to canonical projection.** Foundational, one-way, enables everything else. (Above.) +2. **Establish `projectV1` as omit-if-zero with no include-always legacy.** The compatibility mode never enters the registry, so it never has to age out. +3. **Keep the lock *format* `Version` at `1` — the content-version token carries the reset.** The reset adds **no new TOML field** (the atomic token in item 4 reuses `InputFingerprint`) and touches **no** pinning field (`upstream-commit`, `import-commit`, `manual-bump`), so an old binary still parses a reset lock and reads everything it needs to *queue a build*. 
The substrate swap rides entirely on the content-version machinery (Part 2): pre-reset locks carry a legacy (prefix-less) token below the registry floor, and the reset is simply the **first forced upgrade** of the fleet to the `v1:` token. This also makes the one real mixed-toolchain risk self-correcting: if an old binary ever rewrites a reset lock with its legacy-substrate hash, the next new-binary run sees a sub-floor token and **force-rehashes** it back to `v1` — a clean forced upgrade, never silent corruption (next subsection). +4. **Adopt an atomic, self-describing `v1:sha256:…` token** for the stored hash, so the version and the digest can never desync (closes the re-stamp/desync class of bug where the version field and the hash field are written independently). +5. **Unify on `sha256` everywhere**, retiring the `uint64`→decimal-string wart from the `hashstructure` era. One hash format, one encoding. +6. **Do every pending rename / default-normalization now.** Renaming a field, moving content between structs, or changing a baked-in default is a one-way door under Part 2 (it needs a version bump + replay); at the reset it is free because everything rebuilds anyway. This is where the schema-axis "hardest cases" get absorbed cheaply. -**Result:** additive fields are drift-neutral **by construction** (G4) — an unset field omits identically to a field that never existed, with no version bump and no replay. Only setters drift (G2). The cost is one tag per field (verbose but mechanical) and two genuine edge cases (see below). +**Anti-goal:** do *not* burn reset budget on additive fields — Part 2 handles those for free, forever. The single success criterion for the load-out is that **no second coordinated cutover is ever needed**: after the reset, every future change must be expressible as either a free additive field or a lazy Part 2 version bump. 
-#### Edge cases under default omitempty +### The lock changes at the reset — atomic token + forced upgrade -- **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). Post-merge: unset → `4` (included), explicit `0` → `0` (omitted-by-omitempty). These build differently *and* hash differently, so there is no collision — they are consistent. Such fields rarely trigger omission at all because the default keeps them non-zero. Tag them `always` only if a zero value must be distinguishable from a future change of default. -- **nil vs empty slice.** `reflect.Value.IsZero` on a slice is `IsNil`. A missing TOML key → nil → omitted; `key = []` → non-nil empty → included. Default omitempty thus makes nil-vs-empty a hash distinction that include-always collapses. Almost never observable — but a TOML formatter that strips empty arrays (or any round-trip that maps `[]`→absent) would flip hashes. **Tag rule: for any slice/map field where an explicit-empty value is reachable and build-meaningful, prefer `fingerprint:"always"`** so nil and empty both hash and the distinction can't silently move a fingerprint. +The stored hash becomes a single self-describing token: -**Adopting this flip is itself a fingerprint-algorithm change** (every config's hash moves), so it does not land for free — it is absorbed by Layer 2's versioned replay rather than by rewriting locks. See Layer 2. +```text +input-fingerprint = "v1:sha256:9f86d0…" # version:algorithm:hex-digest +``` + +One field carries both the content version and the digest, so they cannot be written out of step (a class of desync bug the prior split-field design was exposed to). Parsing splits on `:`; an absent prefix on a pre-reset lock reads as the legacy format. + +The lock **format** `Version` stays at `1`.
The on-disk *schema* is unchanged — same fields, same TOML shape — so an old binary still parses a reset lock and reads its pins (`upstream-commit`, `import-commit`, `manual-bump`), which is all it needs to queue a build. What changes is the *value* of `InputFingerprint`: the substrate swap is expressed purely as a content-version step, and the reset is the **first forced upgrade** to the `v1:` token. The existing singleton `Parse` gate (`Version == 1`) is left untouched; all substrate/version reconciliation routes through the content-version registry instead of a format gate. + +Recovery from a sub-`v1` token is the **same mechanism** as the reset itself: a token with no `vN:` prefix (or a version below `minSupportedLockContentVersion`) cannot be replayed, so it is treated as `Stale` and **force-rehashed** to the current version on the next `update`. One code path unifies three cases: + +- **Pre-reset locks** carry a legacy decimal hash with no prefix → force-rehashed to `v1` at the reset. +- **An old binary that rewrites a reset lock** stamps its legacy-substrate hash (no prefix) → the next new-binary run force-rehashes it back to `v1`. The mischief is self-correcting, never silent corruption. +- **A future floor raise** (after a deliberate `component migrate`) retires an old `vN` the same way. + +This is the one place back-compatibility is load-bearing, and it is satisfied without a format bump: old binaries read pins and build; the fingerprint value reconciles by version. See the next section for why reading *historical* locks never needs to recompute their hash at all.
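The sub-floor classification described above can be sketched as follows. This is a minimal sketch, assuming the token is exactly `vN:sha256:hex`; the two-value signature of `parseTokenVersion` and the helper `needsForceRehash` are hypothetical and chosen only for illustration, while `minSupportedLockContentVersion` is the constant named in the RFC.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minSupportedLockContentVersion mirrors the registry floor named in the RFC.
const minSupportedLockContentVersion = 1

// parseTokenVersion extracts N from a token shaped like "vN:sha256:<hex>".
// ok is false for a legacy (prefix-less) or otherwise unrecognized token.
// The signature is an assumption made for this sketch.
func parseTokenVersion(token string) (version int, ok bool) {
	parts := strings.SplitN(token, ":", 3)
	if len(parts) != 3 || parts[1] != "sha256" || !strings.HasPrefix(parts[0], "v") {
		return 0, false
	}
	// ParseUint rejects signs and empty input, so "v:..." and "v+1:..." fail.
	n, err := strconv.ParseUint(strings.TrimPrefix(parts[0], "v"), 10, 31)
	if err != nil {
		return 0, false
	}
	return int(n), true
}

// needsForceRehash reports whether a stored token cannot be replayed:
// either it carries no version prefix or its version is below the floor.
func needsForceRehash(token string) bool {
	v, ok := parseTokenVersion(token)
	return !ok || v < minSupportedLockContentVersion
}

func main() {
	for _, tok := range []string{"v1:sha256:9f86d0", "1234567890", "v0:sha256:9f86d0"} {
		fmt.Printf("%-20s force-rehash=%v\n", tok, needsForceRehash(tok))
	}
}
```

Funnelling all three cases (pre-reset lock, old-binary rewrite, raised floor) through one predicate is what keeps the recovery path single.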
+ +### Back-compat invariant — synthetic history reads stored strings, never recomputes + +The reset is only safe because of a property of the codebase verified against the source: **nothing that reads a *historical* lock ever recomputes a fingerprint for it.** Every historical reader compares the *stored* hash strings; the only code that recomputes a fingerprint does so for the **current working tree against HEAD**, never against an arbitrary past commit. Concretely: + +| Reader | What it does with a historical lock | Recomputes? | +| ------ | ----------------------------------- | ----------- | +| `synthistory.FindFingerprintChanges` | walks `lockfile.ShowAtCommit`→`Parse`, compares `InputFingerprint` *strings* between adjacent commits | No | +| `synthistory.BuildDirtyChange` | compares the precomputed current fingerprint to HEAD's stored string | No (HEAD only) | +| `sourceprep.computeCurrentFingerprint` | the *only* `ComputeIdentity` call on this surface — computes for the **current tree**, compares to HEAD's stored hash | Current tree only | + +The consequence: **swapping the substrate is invisible to synthetic history.** A pre-reset (legacy-token) lock and a post-reset `v1:` lock are just two different opaque strings at two different commits; the walker reports "changed" across the reset commit (correct — it *is* a notable, deliberate, fleet-wide event, the coordinated cutover) and never tries to recompute either side. Applying historic overlays likewise reads stored lock fields and needs no hash recomputation. + +> **Invariant (must hold forever):** synthetic history and historic-overlay application operate on **stored lock fields only.** No reader recomputes a fingerprint for a historical commit. This is precisely what lets a frozen `projectVN` be *forward-only*: it never has to reproduce a hash from a different substrate generation, only hashes the lock that the *current* binary writes. 
A future change that recomputes a historical fingerprint would break this and must be rejected in review. + +This invariant — no reader recomputes a historical fingerprint — is the complete back-compatibility story: **new-reads-old by string, never-recompute-old by algorithm.** The lock *format* never bumps, so old and new binaries parse every lock identically; only the *interpretation* of the fingerprint value evolves, and that rides the content-version registry. + +## Part 2 — Post-reset lazy migration + +The reset gives us a clean, frozen substrate. Part 2 is the machinery that rides it for the rare genuine algorithm change *after* the cutover — lazily, per-component, with no second coordinated cutover. This is the original "lazy" design, now sound because `projectVN` is genuinely frozen. -### Layer 2 — Versioned lock content with lazy replay (algorithm and default changes) +### Versioned lock content with lazy replay (algorithm changes) -Stamp one **lock content-hash version** into the lock and teach the freshness check to **replay** older versions. The version governs *both* stored hashes (`InputFingerprint` and `ResolutionInputHash`) — they live in one lock, share one write event, and a single integer is the natural fit (see [scope note](#both-hashes-share-one-version) for why one version, not two): +Stamp one **lock content-hash version** into the lock (the `v1:` prefix of the atomic token) and teach the freshness check to **replay** older versions. The version governs *both* stored hashes (`InputFingerprint` and `ResolutionInputHash`) — they live in one lock, share one write event, and a single integer is the natural fit (see [scope note](#both-hashes-share-one-version) for why one version, not two): -1. Add `LockContentVersion int` (`toml:"lock-content-version,omitempty"`) to `ComponentLock`. **An absent field reads as `1`** — the current, pre-RFC algorithms — *not* `0`. 
(`0` is the Go zero value but no `v0` exists; map the zero to the baseline at read time: `ver := lock.LockContentVersion; if ver == 0 { ver = 1 }`.) The lock **format** `Version` stays `1`; this is a *content* version and is fully backward compatible. +1. The content version lives in the atomic `vN:sha256:…` token (it is **not** the lock *format* `Version`, which stays at `1`). The registry floor *starts* at `1` = the projection baseline; there is no legacy pre-projection algorithm in the registry, because pre-reset locks are never replayed (they are read-only history, per the invariant above). A pre-reset lock's prefix-less token is therefore *below* the floor and reconciled by force-rehash, not replay. 2. Turn the combiner into a thin dispatcher over a small registry of historical algorithms, keyed by version. Each entry pairs the two compute functions; when only one algorithm changes, the other slot **reuses** the prior function (no version-neutral hash moves for the untouched one). Keep versions back to a declared floor (see [Registry floor](#registry-floor-and-forced-migration)): ```go type lockAlgo struct { - fingerprint computeFn // produces InputFingerprint - resolution resolveFn // produces ResolutionInputHash + fingerprint computeFn // produces the InputFingerprint digest + resolution resolveFn // produces the ResolutionInputHash digest } var lockAlgos = map[int]lockAlgo{ - 1: {computeFP1, computeRes1}, // current (pre-RFC) algorithms — the implicit baseline - 2: {computeFP2, computeRes1}, // omitempty default (Layer 1); resolution UNCHANGED → reuse v1 fn + 1: {computeFP1, computeRes1}, // projection + combiner baseline, established at the reset + // a future GENUINE algorithm change appends: 2: {computeFP2, computeRes1} } - const currentLockContentVersion = 2 - const minSupportedLockContentVersion = 1 + const currentLockContentVersion = 1 // == the reset baseline; bumps only on a real algo change + const minSupportedLockContentVersion = 1 // floor; raise
only after a deliberate `component migrate` ``` -3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if the lock's recorded version `< current`, recompute at the lock's recorded version. If *that* matches the stored hash, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`. (Phase 1 wires this for the fingerprint hash; the resolution hash reuses `computeRes1` until its algorithm first changes — see scope note.) -4. `component update` stamps `LockContentVersion = current` **only when it is already writing for an independent reason** (see the churn policy below). Migration is therefore **lazy and per-component**: a lock upgrades only when something independently touches it. +3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if the lock's token version `< current`, recompute at the lock's token version. If *that* matches the stored digest, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`. (The resolution hash reuses `computeRes1` until its algorithm first changes — see scope note.) +4. `component update` re-stamps the token to the **current** version **only when it is already writing for an independent reason** (see the churn policy below). Migration is therefore **lazy and per-component**: a lock upgrades only when something independently touches it. This resolves Problems 2 (for default changes), 3 (hashing bugfixes), and 5 (piecemeal rollout). It is the same lazy-forward-migration pattern Cargo/npm use, specialized to a content hash. 
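Steps 3 and 4 can be sketched as a three-way classification. This is an illustrative sketch only: the registry here holds a single compute function per version (the RFC pairs two), the `inputs` type, the `digest` helper, the hypothetical v2 "reordered fold", and the `CurrentNeedsRestamp` state are all assumptions introduced for the example, and `currentLockContentVersion` is set to 2 purely to have an older version to replay.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// inputs is a toy stand-in for the resolved component inputs.
type inputs struct {
	Upstream string
	Patches  string
}

type computeFn func(inputs) string

// digest hashes length-prefixed parts so part boundaries are unambiguous.
func digest(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s|", len(p), p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// lockAlgos is a simplified registry: one frozen function per content version.
var lockAlgos = map[int]computeFn{
	1: func(in inputs) string { return digest("v1", in.Upstream, in.Patches) },
	// Hypothetical algorithm change: v2 folds the same inputs in another order.
	2: func(in inputs) string { return digest("v2", in.Patches, in.Upstream) },
}

const currentLockContentVersion = 2

type freshness int

const (
	Stale               freshness = iota // inputs really changed, or replay impossible
	Current                              // matches at the current version
	CurrentNeedsRestamp                  // matches only under the lock's own, older version
)

// checkFreshness computes at the current version first and replays the lock's
// recorded version only on mismatch.
func checkFreshness(lockVer int, stored string, in inputs) freshness {
	if lockAlgos[currentLockContentVersion](in) == stored {
		return Current
	}
	old, ok := lockAlgos[lockVer]
	if ok && lockVer < currentLockContentVersion && old(in) == stored {
		return CurrentNeedsRestamp
	}
	return Stale
}

func main() {
	in := inputs{Upstream: "abc123", Patches: "fix-build.patch"}
	storedV1 := lockAlgos[1](in)
	fmt.Println(checkFreshness(1, storedV1, in) == CurrentNeedsRestamp)
	in.Patches = "other.patch"
	fmt.Println(checkFreshness(1, storedV1, in) == Stale)
}
```

In this sketch a lock whose version is missing from the registry falls through to `Stale`, which matches the floor behaviour the RFC describes.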
#### Both hashes share one version -`ComponentLock` carries two persisted content hashes: `InputFingerprint` (render inputs, via `hashstructure` + `Includable`) and `ResolutionInputHash` (upstream-resolution inputs — a flat SHA256 over seven explicit fields in `ComputeResolutionHash`, *not* a struct walk, so the omitempty/`Includable` story does not apply to it). Both have the **same evolution problem**: appending an input or reordering the fold moves every lock's hash → G1 churn. +`ComponentLock` carries two persisted content hashes: `InputFingerprint` (render inputs, via `projectVN` + `sha256`) and `ResolutionInputHash` (upstream-resolution inputs — a flat SHA256 over seven explicit fields in `ComputeResolutionHash`). Both have the **same evolution problem**: appending an input or reordering the fold moves every lock's hash → G1 churn. -We version them with **one shared integer**, not two axes, because: they co-locate in a single lock, they are written in the same `update` pass, and a paired registry lets either evolve independently while the other reuses its prior function. Two separate version fields would double the floor/replay/`--rehash` machinery for an input set (`ResolutionInputHash`) that changes rarely — YAGNI. +We version them with **one shared integer** (the token's `v` prefix), not two axes, because: they co-locate in a single lock, they are written in the same `update` pass, and a paired registry lets either evolve independently while the other reuses its prior function. Two separate version fields would double the floor/replay/migrate machinery for an input set (`ResolutionInputHash`) that changes rarely — YAGNI. -**Phasing.** Naming the field `lock-content-version` *now* is the one expensive-to-reverse decision (it is baked into the on-disk TOML schema the moment Layer 2 ships; renaming a persisted key is itself a migration). The fingerprint replay is wired in the first Layer 2 PR. 
**Resolution-hash replay is reserved, not yet wired** — the registry slot exists and `computeRes1` is reused, so the day `ComputeResolutionHash` first changes we add `computeRes2` and extend replay to its one comparison site (`checkResolutionFreshness` + the `resHashChanged` silent-write guard in `update.go`), with no schema change. Critically, `ResolutionInputHash` does **not** feed the synthetic changelog path, so its churn is a one-line lock rewrite + a wasted re-resolution, never a phantom release (unlike `InputFingerprint`; see [Downstream consumers](#downstream-fingerprint-consumers-blast-radius)). +**Phasing.** The atomic token format (`vN:sha256:…`) is fixed at the reset, so there is no expensive-to-reverse key-naming decision left for Part 2. Fingerprint replay is wired in Part 2's first PR. **Resolution-hash replay is reserved, not yet wired** — the registry slot exists and `computeRes1` is reused, so the day `ComputeResolutionHash` first changes we add `computeRes2` and extend replay to its one comparison site (`checkResolutionFreshness` + the `resHashChanged` silent-write guard in `update.go`), with no schema change. Critically, `ResolutionInputHash` does **not** feed the synthetic changelog path, so its churn is a one-line lock rewrite + a wasted re-resolution, never a phantom release (unlike `InputFingerprint`; see [Downstream consumers](#downstream-fingerprint-consumers-blast-radius)). #### Churn-avoidance policies (G1) -The version stamp is itself a potential source of spurious diffs — the exact thing G1 forbids. Two policies keep it invisible until a real change forces a write: - -- **`lock-content-version` is `omitempty` in TOML.** A baseline (absent / version `1`) lock that is never otherwise touched never materializes the field, so its bytes stay identical. The field only appears in a lock that was *already* being rewritten for an independent reason. Existing checked-in locks therefore produce **zero diff** on the day this lands.
-- **The `Changed` decision must replay *before* it compares — this is the subtle seam.** The naive read of the existing guard `if !result.Changed && !resHashChanged { return false, nil }` suggests the re-stamp harmlessly "rides the `Changed` path." **It does not.** In [`update.go`](../../../internal/app/azldev/cmds/component/update.go), `result.Changed` is set to `true` the instant `lock.InputFingerprint != identity.Fingerprint` — and `identity` is computed at the *current* version. That comparison sits **upstream** of the write guard. So after the v1→v2 switchover, the current-version hash differs from every stored v1 hash, `Changed` flips for ~every component, and we get exactly the mass auto-release-bump + mass lock rewrite G1 forbids. The fix is mandatory, not incidental: - - ```go - // Replay at the lock's recorded version BEFORE deciding Changed. - lockVer := lock.LockContentVersion - if lockVer == 0 { - lockVer = 1 - } - replayed, _ := fingerprint.ComputeIdentityAt(lockVer, *result.config, releaseVer, opts) - if lock.InputFingerprint != replayed.Fingerprint { - result.Changed = true // a REAL input change under the lock's own algorithm - } - // else: hashes match under the old algorithm → inputs unchanged, only the - // algorithm moved → NOT Changed. Advance the version only if some other real - // change is already dirtying this lock. - lock.InputFingerprint = identity.Fingerprint // current-version hash - if result.Changed { // re-stamp piggybacks a real write; never its own trigger - lock.LockContentVersion = currentLockContentVersion - } - ``` - - The principle: **"changed?" is judged under the lock's own algorithm version; the stored hash is only upgraded to the current version when the lock is already dirty for a real reason.** (When resolution replay is wired, the same replay-before-compare applies to the `resHashChanged` silent-write guard.) 
- -Together these make migration strictly opportunistic: a lock advances its version the next time its component changes for real, and not one commit sooner. +The version stamp is itself a potential source of spurious diffs — the exact thing G1 forbids. The rule that prevents it is one idea: **judge "changed?" by replaying the lock's *own* version, not the current one.** Everything below follows from that. + +**Why the obvious approach is wrong.** Today `update.go` sets `result.Changed = true` the instant `lock.InputFingerprint != identity.Fingerprint`, where `identity` is computed at the **current** version. That comparison sits *upstream* of the write guard `if !result.Changed && !resHashChanged { return false, nil }`. So the moment you ship a v1→v2 *algorithm* change, the current-version hash differs from every stored v1 token, `Changed` flips for **~every component at once**, and you get the mass auto-release-bump + mass lock rewrite G1 exists to prevent. The version stamp cannot "harmlessly ride the `Changed` path" — it *triggers* it. + +**The fix: replay before you compare.** Recompute at the lock's recorded version first, and only call it changed if *that* disagrees: + +```go +// Replay at the lock token's recorded version BEFORE deciding Changed. +lockVer := parseTokenVersion(lock.InputFingerprint) // "v1:sha256:…" → 1 +replayed := fingerprint.ComputeIdentityAt(lockVer, *result.config, releaseVer, opts) +if lock.InputFingerprint != replayed.Token() { + result.Changed = true // a REAL input change under the lock's own algorithm +} +// else: tokens match under the old algorithm → inputs unchanged, only the +// algorithm moved → NOT Changed. + +// Re-stamp to the current version ONLY when the lock is already dirty for a +// real reason — the version upgrade piggybacks a real write, never triggers one. 
+if result.Changed { + lock.InputFingerprint = identity.Token() // current version + digest, written together +} +``` + +This makes migration strictly **opportunistic**: a lock advances its version the next time its component changes for real, and not one commit sooner. Because the version lives *inside* the atomic token, a lock at `v1` with unchanged inputs keeps its exact `v1:sha256:…` bytes — there is no separate version field to materialize and no zero-diff bookkeeping. (When resolution replay is wired, the same replay-before-compare guards the `resHashChanged` write.) + +**The unavoidable flip side — false-fresh on a newly-measured input.** "Replay at the lock's own version" is what buys churn-avoidance, but it is the *same* property that creates a blind spot, because replaying `computeFP(old)` is **blind to any input that version did not measure.** Concretely, when v2 starts folding in an input v1 never touched (the [*Newly measured input*](#change-taxonomy) row): + +- A change to that **new** input on a still-`v1` lock replays at v1, which ignores it → digest still matches → **`Changed = false`** → the change is silently treated as fresh. +- The new input only takes effect on that lock when the lock migrates to v2 — i.e. the next time it is dirtied for an *independent* reason, or via `component migrate`. + +This is correct *by contract* (a v1 lock promises freshness under the v1 input set, which excludes the new input), and harmless for a cosmetic input. But for a **build-critical** new input it is a latent-stale hazard: artifacts can lag the new input by an unbounded number of commits. **Decision rule:** if a newly-measured input must take effect fleet-wide immediately, do **not** rely on lazy adoption — pair the version bump with a deliberate `component migrate` (see [Registry floor and forced migration](#registry-floor-and-forced-migration)). Lazy adoption is the default; `component migrate` is the opt-in for inputs that cannot wait. 
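The false-fresh blind spot can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not the RFC's code: `buildInputs`, the `Toolchain` field (standing in for any input first measured by v2), `computeAt`, and `changedUnderLockVersion` are all hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildInputs is a toy input set. Toolchain stands in for an input that
// only the hypothetical v2 algorithm measures.
type buildInputs struct {
	Upstream  string
	Toolchain string
}

// sum hashes length-prefixed parts.
func sum(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s|", len(p), p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// computeAt returns the atomic token produced by the given content version.
func computeAt(version int, in buildInputs) string {
	if version == 1 {
		return "v1:sha256:" + sum(in.Upstream) // v1 never looks at Toolchain
	}
	return "v2:sha256:" + sum(in.Upstream, in.Toolchain)
}

// changedUnderLockVersion applies replay-before-compare: "changed?" is judged
// by the algorithm version the lock was stamped with.
func changedUnderLockVersion(lockVer int, stored string, in buildInputs) bool {
	return computeAt(lockVer, in) != stored
}

func main() {
	before := buildInputs{Upstream: "abc123", Toolchain: "gcc-13"}
	after := buildInputs{Upstream: "abc123", Toolchain: "gcc-14"}

	v1Lock := computeAt(1, before)
	fmt.Println("still-v1 lock sees toolchain change:", changedUnderLockVersion(1, v1Lock, after))

	v2Lock := computeAt(2, before) // what a deliberate migration would stamp
	fmt.Println("migrated lock sees toolchain change:", changedUnderLockVersion(2, v2Lock, after))
}
```

The first check reports no change because the v1 replay never reads `Toolchain`; only the migrated lock notices it, which is the trade-off the decision rule above is about.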
#### Registry floor and forced migration Lazy migration means an untouched lock can sit at an old version **indefinitely** (G3 by design). That makes "keep the last *N* versions" a **correctness cliff, not a tuning knob**: if pruning drops the compute function a lock still depends on, replay becomes impossible → forced `FreshnessStale` → the mass rebuild/rewrite (and, via the downstream-consumer analysis below, mass changelog churn) the whole design exists to avoid. So the floor must be explicit and paired with an escape hatch, decided now: - **`minSupportedLockContentVersion`** is a hard floor. A lock below it cannot be replayed and is treated as `Stale`. Dropping a registry entry is therefore a deliberate, breaking, announced act — never incidental cleanup. -- **`component update --rehash`** (Open Q#5, promoted to a requirement) force-advances every lock to the current version in one deliberate pass. This is the *only* sanctioned way to retire an old version: rehash the fleet first (one intentional, reviewed, fleet-wide commit), then raise the floor. Note this pass is a deliberate G1 exception — it *is* the eager migration G1 normally forbids, made safe by being explicit and operator-driven rather than a silent side effect. +- **`component migrate`** (Open Q#5, promoted to a requirement) force-advances every lock to the current content version in one deliberate pass. This is the *only* sanctioned way to retire an old version: migrate the fleet first (one intentional, reviewed, fleet-wide commit), then raise the floor. Note this pass is a deliberate G1 exception — it *is* the eager migration G1 normally forbids, made safe by being explicit and operator-driven rather than a silent side effect. 
**Contract:** it is *offline* — it loads each lock, recomputes the fingerprint at `currentLockContentVersion`, and rewrites the token; it does **not** re-resolve upstream (`upstream-commit`/`import-commit` untouched, unlike `update --force-recalculate`) and does **not** flip the release signal (unlike `--bump`). A migration that re-resolved or bumped would no longer be a pure version advance. The on-disk *config* axis has its own verb, [`config migrate`](#config-schema-version-and-canonical-migration-future); the two are orthogonal — each lives with the artifact its command group already owns (`component` writes locks, `config` owns the TOML). -**Mixed-toolchain hazard.** `go-toml` silently drops unknown fields, so an *older* azldev binary that rewrites a lock a newer binary had stamped will strip `lock-content-version`, regressing it to the baseline. On the next new-binary run the stored (baseline-replayed) hash won't match the current algorithm → spurious `Changed` + bump. This is the classic down-migration trap. Mitigation is a documented invariant ("all writers of a given `locks/` tree must be ≥ the version that tree was last stamped at"), enforced in CI by pinning the azldev version; a hard guard (refuse to write a lock whose on-disk version exceeds the binary's `currentLockContentVersion`) is a possible belt-and-suspenders. +**Mixed-toolchain hazard — handled by force-rehash, not a format gate.** The classic trap is an older binary regressing a newer lock. Because the lock *format* never bumps, an old binary *can* write a reset lock — but the **atomic token** makes that harmless: it stamps a legacy (prefix-less) or lower-`v` hash, which the next new-binary run detects as sub-floor and **force-rehashes** to the current version. Self-correcting, never silent corruption. 
The symmetric residual is a binary that predates a content-version `v2` and meets a `v2` token it cannot replay: it must **error** (the token version exceeds its `currentLockContentVersion`), not silently restamp at `v1`. A one-line write guard (refuse to write a token whose version exceeds the binary's `currentLockContentVersion`) plus the CI version-pin closes that direction. #### Replaying across a changed input set — `{a,b,c}` → `{a,b,d}` -A lock stores **one opaque hash string** plus its `LockContentVersion`; it does *not* store the individual inputs. So when the measured set changes — say the fingerprint stops measuring `c` and starts measuring `d` — an existing lock (whose stored hash was computed over `{a,b,c}` at v1) is reconciled the only way an opaque hash allows: **recompute and compare, at the lock's own version.** +A lock stores **one atomic token** (`v:sha256:…`); it does *not* store the individual inputs. So when the measured set changes — say the fingerprint stops measuring `c` and starts measuring `d` — an existing lock is reconciled the only way an opaque digest allows: **recompute and compare, at the lock's own version.** Split the change into its two halves; they are handled independently: -- **Adding `d`** is the additive case — `d` is tagged `omitempty`, so for any component that doesn't set it the hash is byte-identical (G4). Free. No version bump. +- **Adding `d`** is the additive case — `projectV1` never listed `d`, so for any lock at v1 the digest is byte-identical whether or not the struct now has `d` (G4, *truly* — the property `hashstructure` could not give). Free. No version bump. - **Dropping `c`** is what forces the version bump, and it is reconciled by replay: - 1. `computeFP2` (measures `{a,b,d}`) ≠ stored hash → mismatch. - 2. lock version (1) < current (2) → **replay `computeFP1`** (still measures `{a,b,c}`). - 3. v1-replay == stored hash? 
**Yes** → `a,b,c` unchanged since the lock was written; only the *measurement* evolved → `FreshnessCurrent`, lazy re-stamp. **No** → a real input moved → `Stale`, rebuild. Both correct. + 1. `computeFP2` (measures `{a,b,d}`) ≠ stored digest → mismatch. + 2. token version (1) < current (2) → **replay `computeFP1`** (still measures `{a,b,c}`). + 3. v1-replay == stored digest? **Yes** → `a,b,c` unchanged since the lock was written; only the *measurement* evolved → `FreshnessCurrent`, lazy re-stamp. **No** → a real input moved → `Stale`, rebuild. Both correct. So the bump is **not breaking**: replay answers "were the *old* inputs unchanged?" without rebuilding. -**The load-bearing constraint the rest of Layer 2 assumes implicitly:** *a replay function reads the live config struct.* `computeFP1` is Go code in **today's** binary, reading fields off **today's** struct. That is fine when the struct shape is unchanged (the omitempty flip, a combiner bugfix, a changed default — all replay against the same fields). But **physically deleting field `c` from the struct breaks `computeFP1`** — it can no longer read `c`, cannot reproduce the `{a,b,c}` hash, and every lock that set `c` is forced `Stale`. Removal-from-the-struct is therefore the one edit that silently defeats replay. +**The one constraint replay still imposes: a retained `projectVN` must be able to read every field it lists.** Unlike the `hashstructure` substrate, `projectV1` is immune to field *additions* (it never reflects the live struct). It is *not* immune to field *removal*: `projectV1` names `c` explicitly, so physically deleting `c` from the struct stops `projectV1` from compiling. Removal is therefore the one edit still gated by a **deprecate-then-delete** two-step, both non-breaking: -The way around it is a **deprecate-then-delete** two-step, both non-breaking: +1. 
**Bump to v2 measuring `{a,b,d}` but keep field `c` on the struct** so `projectV1` can still read it for replay (`projectV2` simply does not list `c`). Every old lock replays clean at v1, is recognized as unchanged, lazy re-stamps to v2. Zero forced rebuilds. +2. **Only after the floor passes v1** (`minSupportedLockContentVersion = 2`, ideally after a deliberate `component migrate`) physically delete field `c` and `projectV1`. -1. **Bump to v2 measuring `{a,b,d}` but keep field `c` in the struct**, tagged `fingerprint:"-"` so `computeFP2` ignores it while `computeFP1` can still read it for replay. Every old lock replays clean at v1, is recognized as unchanged, lazy re-stamps to v2. Zero forced rebuilds. -2. **Only after the floor passes v1** (`minSupportedLockContentVersion = 2`, ideally after a deliberate `--rehash`) physically delete field `c`. `computeFP1` is already retired, so nothing reads `c` anymore. +> **Invariant:** a field may be physically removed from the config struct only after *every* retained `projectVN` that lists it has been retired below `minSupportedLockContentVersion`. Retained projection functions and the struct they read must stay in sync — you cannot delete a field a live version still names. -> **Invariant:** a field may be physically removed from the config struct only after *every* registry entry that measured it has been retired below `minSupportedLockContentVersion`. Equivalently: retained replay functions and the struct they read must stay in sync — you cannot delete a field a live version still needs. +This makes "drop an input" a lazy, per-component migration rather than a fleet-wide rebuild — at the cost of carrying a deprecated field on the struct until its projection function ages out. -This makes "drop an input" a lazy, per-component migration rather than a fleet-wide rebuild — at the cost of carrying a deprecated field on the struct until its replay function ages out. 
+#### First post-reset customer -#### First concrete use: the Layer 1 switchover +The reset establishes `projectV1` directly; it is *not* itself a Part 2 version event (it rides the rebuild, not replay). Part 2's machinery therefore sits idle until the **first genuine algorithm change after the cutover** — e.g. a `computeFP2` that fixes an overlay-folding bug, folds in a newly measured input, or changes a baked-in default. That change registers `computeFP2`, bumps `currentLockContentVersion` to 2, and is absorbed by replay with no second coordinated cutover. Because the projection substrate makes additive config changes hash-neutral by construction (G4), the *only* changes that ever need a Part 2 version event are genuine non-additive algorithm changes — a deliberately small set. -Flipping the inclusion default to omitempty (Layer 1) moves every config's hash, so it cannot ship as a free additive change — it is **Layer 2's first real customer.** It registers as the `computeFP2` algorithm (omitempty default) alongside `computeFP1` (include-always), bumps `currentLockContentVersion` to 2, and is absorbed by replay: every existing lock recomputes clean at v1, is recognized as unchanged-inputs, and re-stamps to v2 *only when next written* per the churn policy above. (The resolution slot is unchanged across this bump — v2 reuses `computeRes1`.) No mass regen, no flag day. And because omitempty makes all future additive changes hash-neutral by construction (G4), it permanently **shrinks** the set of changes that need a Layer 2 version event at all — Layer 1 is both the first user of Layer 2 and the thing that reduces Layer 2's future workload. +## Config schema version and canonical migration (future) -### Layer 3 — Config schema version and canonical migration (future) - -This is the on-disk TOML axis. It is **independent** of the fingerprint axis and only needed once we make *non-additive* TOML changes (rename/move/remove fields in the file format itself). 
+This is the on-disk TOML axis. It is **independent** of the fingerprint axis and only needed once we make *non-additive* TOML changes (rename/move/remove fields in the file format itself) that were *not* already absorbed by the reset's normalization pass. Most of the hardest cases are spent at the reset (load-out item 6); this axis covers whatever non-additive TOML change arises *after*. 1. Add an explicit `schema-version` to the config file (distinct from the existing `$schema` URL, which is for editor validation). -2. At **load time**, migrate older config shapes forward into the single latest canonical struct *before* anything hashes them. Fingerprinting stays blissfully unaware of file-format history. -3. Pair with the **hybrid seam**: expose `ComponentConfig.ConfigHash()` on the type (pure struct hash + inclusion policy); keep the combiner in `fingerprint`. +2. At **load time**, migrate older config shapes forward into the single latest canonical struct *before* anything hashes them. Fingerprinting stays blissfully unaware of file-format history. A `config migrate` command (sibling to today's `config schema` / `config dump`) makes this an explicit, reviewable pass that rewrites stale TOML files in place to the current `schema-version`. +3. The projection substrate already provides the clean seam: `projectVN` reads the post-migration canonical struct; the combiner stays in `fingerprint`. No `ConfigHash()` method is added (see [the seam note](#where-the-hashing-logic-should-live)). -The critical invariant: **migrate old TOML → latest canonical struct, then hash once.** A semantically no-op migration (rename `foo`→`bar`) must produce the *same* canonical struct, hence the same hash, hence no drift — handled by Layer 2's replay only if the *encoding* changed, and by Layer 3's normalization for the *file shape*. 
Do **not** keep parallel `V1.Hash()`/`V2.Hash()` methods on versioned structs: that couples the lock to a Go type identity instead of a simple integer, and forces two independent code paths to agree on a hash forever. +The critical invariant: **migrate old TOML → latest canonical struct, then project once.** A semantically no-op migration (rename `foo`→`bar`) must produce the *same* canonical struct, hence the same projection bytes, hence no drift. This is what keeps the schema axis **orthogonal** to the lock axis: a faithful `config migrate` is a pure re-encoding that moves *no* fingerprint, so it never triggers a `component migrate`. If a TOML change genuinely alters build meaning, that is a content-version bump (Part 2), not a `config migrate`. -**Caveat — `hashstructure` hashes the struct type name.** It mixes `reflect.Type.Name()` into the hash, so a Layer-3 migration that moves content into a *renamed* Go struct changes the fingerprint even when the content is byte-identical. "Rename is drift-neutral" therefore holds only if the canonical struct **keeps the original type name**, or the rename is shipped as a Layer-2 version bump that absorbs it. Prefer keeping the type name; reserve the version bump for when the type genuinely must be renamed. +**Resolved by projection:** the old `hashstructure` caveat — that it mixed `reflect.Type.Name()` into the hash, so renaming a Go struct moved every fingerprint even with identical content — **no longer applies.** `projectVN` hashes only the explicit field bytes it emits, never the type name. A struct rename is now genuinely drift-neutral. 
-### Layer interaction +## Pipeline ```text -TOML on disk ──Layer 3: migrate to canonical struct──► ComponentConfig +TOML on disk ──migrate to canonical struct (schema axis)──► ComponentConfig │ - Layer 1: HashInclude omits zero fields (default omitempty) + projectVN: emit explicit fields, omit-if-zero ▼ - Layer 2: ComputeIdentity[version] ──► InputFingerprint + combiner: sha256 over projection + overlays + identity │ lazy replay + re-stamp on update ▼ @@ -340,94 +428,102 @@ TOML on disk ──Layer 3: migrate to canonical struct──► ComponentConfig ## Downstream fingerprint consumers (blast radius) -The versioned-replay story in Layer 2 must hold for **every** reader of `InputFingerprint`, not just the two paths it grew up around. This is the migration blast-radius map; each consumer's behavior under a v1→v2 switchover is stated explicitly. +The versioned-replay story in Part 2 must hold for **every** reader of `InputFingerprint`, not just the two paths it grew up around. This is the post-reset migration blast-radius map; each consumer's behavior under a Part 2 v1→v2 algorithm switchover is stated explicitly. (The *reset itself* is invisible to these consumers as analyzed under [Back-compat invariant](#back-compat-invariant--synthetic-history-reads-stored-strings-never-recomputes): they compare stored strings, and pre-reset locks are never recomputed.) 
| Consumer | Reads | Compares | Migration behavior required | | -------- | ----- | -------- | --------------------------- | -| `checkFingerprintFreshness` (resolver) | recomputed identity | vs stored hash | Replay at lock version (Layer 2 core) | -| `component update` `Changed` decision | recomputed identity | vs stored hash | **Replay before `Changed`** (see churn policy / M2 seam) | -| `synthistory.FindFingerprintChanges` | stored hash strings across git history | adjacent commits | **No change needed — if migration stays lazy** | -| `synthistory.BuildDirtyChange` | recomputed (current ver) | vs stored `headLock` hash | **Replay at headLock version** before declaring dirty | +| `checkFingerprintFreshness` (resolver) | recomputed identity | vs stored token | Replay at token version (Part 2 core) | +| `component update` `Changed` decision | recomputed identity | vs stored token | **Replay before `Changed`** (see churn policy seam) | +| `changed.go` `classifyComponent` / `haveMatchingFingerprints` (CI classifier) | stored token strings | version-blind compare | **Replay-aware compare** — a v1 token must match its v2 re-stamp as "same" | +| `synthistory.FindFingerprintChanges` | stored token strings across git history | adjacent commits | **No change needed — if migration stays lazy** | +| `synthistory.BuildDirtyChange` | recomputed (current ver) | vs stored `headLock` token | **Replay at headLock version** before declaring dirty | | `ResolutionInputHash` staleness/write | recomputed resolution hash | vs stored | **Shares the version; replay reserved, not yet wired** | +The `changed.go` classifier is the easily-missed fifth consumer: [`classifyComponent`](../../../internal/app/azldev/cmds/component/changed.go) and `haveMatchingFingerprints` do raw, version-blind token compares to decide CI classification. Post-switchover a v1 token and its semantically-identical v2 re-stamp are different strings, so a naive compare would misclassify the component as changed. 
It needs the same replay-aware comparison as the freshness check (compare at the older token's version), not a raw string equality. + ### The synthetic changelog/release path is the real hazard [`synthistory.go`](../../../internal/app/azldev/core/sources/synthistory.go) turns fingerprint movement into **user-visible, shipped** package state — `%autochangelog` entries and `%autorelease` increments. There are two distinct comparators, and the design resolves them asymmetrically. - **`FindFingerprintChanges` (historical walker)** does a raw, version-blind string compare of `InputFingerprint` across the lock's git history and emits a synthetic changelog/release entry on every change. Making it genuinely version-aware is hard-to-infeasible — it only has committed *strings*, no inputs to replay. **It does not need to be**, *provided migration stays strictly lazy.* Under the churn policy, a version bump only ever rides a commit where a real input also changed, so there is never a version-only commit in history for the walker to misread. The migration folds honestly into that real change's entry. **This is a design decision, not a code fix:** the v1→v2 conversion is an *accepted, per-component, notable* changelog event that piggybacks a real change. - - **Trap:** this only holds while migration is lazy. A fleet-wide `--rehash` (or the M2 bug where `Changed` flips for everyone) converts *phantom* → *honest-but-fleet-wide* — a truthful but fleet-wide release bump, i.e. **G1 is dead.** "Accept as notable" is therefore conditional on **migration never riding a version-only or fleet-wide write** (the `--rehash` floor pass excepted, because it is deliberate and operator-driven). -- **`BuildDirtyChange` (live dirty check)** compares a *recomputed* current-version (v2) hash against the *stored* (possibly v1) `headLock.InputFingerprint` and declares dirty on inequality. 
"Accept as notable" does **not** save this path: post-switchover an *unchanged* component would read **dirty on every `render`/`build`** until re-stamped — a persistent, recurring spurious signal, worse than a one-time entry. The fix is **free**: it is the *same replay Layer 2 already owes the freshness check* — replay at `headLock`'s recorded version before declaring dirty. One additional call site for logic already being written, no new mechanism. + - **Trap:** this only holds while migration is lazy. A fleet-wide `component migrate` (or a regression where `Changed` flips for everyone) converts *phantom* → *honest-but-fleet-wide* — a truthful but fleet-wide release bump, i.e. **G1 is dead.** "Accept as notable" is therefore conditional on **migration never riding a version-only or fleet-wide write** (the `component migrate` floor pass and the one-time reset excepted, because they are deliberate and operator-driven). +- **`BuildDirtyChange` (live dirty check)** compares a *recomputed* current-version (v2) hash against the *stored* (possibly v1) `headLock.InputFingerprint` and declares dirty on inequality. "Accept as notable" does **not** save this path: post-switchover an *unchanged* component would read **dirty on every `render`/`build`** until re-stamped — a persistent, recurring spurious signal, worse than a one-time entry. The fix is **free**: it is the *same replay Part 2 already owes the freshness check* — replay at `headLock`'s recorded version before declaring dirty. One additional call site for logic already being written, no new mechanism. -**Net:** M1 is not "make the changelog walker version-aware" (hard, maybe infeasible). It is two things already on the books — (1) the strict lazy churn policy, so the walker never sees a version-only commit; and (2) extend the freshness replay to `BuildDirtyChange`, one extra call site. +**Net:** the changelog-walker concern is not "make the walker version-aware" (hard, maybe infeasible). 
It is two things already on the books — (1) the strict lazy churn policy, so the walker never sees a version-only commit; and (2) extend the freshness replay to `BuildDirtyChange` and the `changed.go` classifier, a few extra call sites for logic already being written. The reset commit is the single deliberate exception: it *is* a fleet-wide notable event, the coordinated cutover, intentionally visible. ### `ResolutionInputHash` — shares the version, replay deferred `ComponentLock` carries a *second* persisted content hash, `ResolutionInputHash`, with its own staleness logic and its own silent-write path (it writes when only `resHashChanged`, never flipping `Changed`). It has the **identical** evolution problem as `InputFingerprint`: any future change to `ComputeResolutionHash`'s algorithm moves every lock's hash — exactly the mass-churn this RFC exists to prevent. -The single `lock-content-version` covers it (see [Both hashes share one version](#both-hashes-share-one-version)). What differs is **blast radius**, which is why we wire its replay later, not now: +The single shared content version (the token's `v` prefix) covers it (see [Both hashes share one version](#both-hashes-share-one-version)). What differs is **blast radius**, which is why we wire its replay later, not now: -- `ResolutionInputHash` does **not** feed `synthistory` — so an algorithm change can never mint a phantom changelog/release (the M1 hazard is fingerprint-only). Worst case is a one-line `resolution-input-hash` rewrite per lock plus a wasted re-resolution that usually yields the same commit. Churn, not corruption. -- It is a flat seven-field SHA256, not a struct walk, so the Layer 1 omitempty flip leaves it untouched — it has no pending v1→v2 event. Its registry slot stays `computeRes1` until its inputs genuinely change. +- `ResolutionInputHash` does **not** feed `synthistory` — so an algorithm change can never mint a phantom changelog/release (that hazard is fingerprint-only). 
Worst case is a one-line `resolution-input-hash` rewrite per lock plus a wasted re-resolution that usually yields the same commit. Churn, not corruption. +- It is a flat seven-field SHA256, not a struct walk, so the projection substrate leaves it untouched — it has no pending version event. Its registry slot stays `computeRes1` until its inputs genuinely change. -**Decision:** name the field for the general case now (`lock-content-version`); wire fingerprint replay in Layer 2's first PR; reserve resolution replay (slot present, prior fn reused) and wire it the day `ComputeResolutionHash` first changes — a localized follow-up with no schema change. This fixes the one irreversible thing (the persisted key name) without speculative code (KISS/YAGNI on the second replay). +**Decision:** the atomic token format is fixed at the reset, so there is no irreversible key-naming decision left; wire fingerprint replay in Part 2's first PR; reserve resolution replay (slot present, prior fn reused) and wire it the day `ComputeResolutionHash` first changes — a localized follow-up with no schema change. KISS/YAGNI on the second replay. ## Design decisions -### D1 — `Includable` vs `IgnoreZeroValue` +### D1 — Canonical projection vs `hashstructure` + `Includable` -Both omit zero values; the difference is **control granularity and escape hatches.** +Both can omit zero values; the decisive difference is **whether an old algorithm can be frozen**, which `Includable` cannot deliver (Problem 6). 
-| | `Includable` per-field (chosen) | `IgnoreZeroValue` global | +| | Canonical projection (chosen) | `hashstructure` + `Includable` | | --- | --- | --- | -| Meaningful empties | Preserved via `fingerprint:"always"` | Lost — no opt-out | -| Per-field intent | Explicit, CI-audited | Invisible | -| Wiring | One helper + value-receiver method per struct | One option flag | +| Old algorithm frozen | Yes — explicit pinned field list | No — reflects the live struct/method-set | +| Sound replay (Part 2) | Yes | No (the disqualifier) | +| Meaningful empties | `emitAlways` per field | `fingerprint:"always"` per field | +| Type-name in hash | No (rename is drift-neutral) | Yes (rename moves every hash) | +| Plumbing | Projection encoder + golden vectors | Value-receiver `HashInclude` on every nested struct + `v.(reflect.Value)` assert | -`IgnoreZeroValue` is a blunt global switch with no way to keep a build-meaningful zero. `Includable` gives the same default behavior **plus** the `always` escape hatch and a point-of-definition audit. Both move every hash once on adoption — that cost is absorbed by Layer 2 either way (see the switchover note), so it is not a differentiator. +`Includable` keeps today's hashes byte-identical, which mattered for an *incremental* rollout — but that property is worthless once the reset rebuilds everything anyway, and it comes attached to a substrate that makes replay unsound. Projection trades byte-compatibility (which we are spending on the coordinated cutover regardless) for frozen replay (which we need forever). Adopted at the reset. -### D2 — Mandatory explicit tags, default omitempty +### D2 — Explicit field lists + golden vectors over reflection tags -Every fingerprinted field must carry `fingerprint:"-"`, `"omitempty"`, or `"always"` — there is no untagged state. Rationale: +Field selection lives in `projectVN` as ordinary, explicit Go code (one `emit`/`emitAlways` line per measured field), not in struct tags read by a reflective walker. 
Rationale: -- The *unsafe* failure direction is the false-negative (a meaningful field omitted → missed rebuild). Defaulting to omitempty tilts toward that direction, so the safety check must be loud, not implicit. -- A mandatory tag forces the "is this field's zero value build-meaningful?" decision **at the point of definition**, where the author has the context — better locality than a far-away exclusions registry. -- It *simplifies* the audit: assert every field has a valid tag value; delete the `expectedExclusions` map entirely. +- The *unsafe* failure direction is the false-negative (a meaningful field silently omitted → missed rebuild). An explicit list makes "what does v1 measure?" greppable in one function, and the **golden-vector test** turns any accidental change to a historical projection into a CI failure — a far stronger guard than a tag-presence audit. +- It forces the "is this field's zero value build-meaningful?" decision at the call site (`emit` vs `emitAlways`), with full context. +- It removes the `Includable` nested-struct trap entirely: there is no per-struct method to forget, no decorative tag that passes the audit while silently hashing a zero. -Fully implicit (omitempty default, no tags, no audit) was rejected — it removes the only guard against the unsafe direction. `fingerprint:"omitempty"` mirrors Go's own `json:",omitempty"`; `"always"` and `"-"` read unambiguously alongside it. +The cost is writing `projectVN` by hand instead of leaning on reflection. That is the point: hand-written selection is what makes the function frozen and auditable. -### D3 — Content version vs format version in the lock +### D3 — Atomic self-describing token; no format bump, reconcile via force-rehash -Reusing `ComponentLock.Version` for the algorithm would force a format-version bump (and the strict `Parse` gate would reject old locks outright). 
A separate `LockContentVersion` keeps the format stable and old locks readable, enabling lazy migration instead of hard rejection. It is named for the *general* case — it versions every content hash the lock stores (`InputFingerprint` now, `ResolutionInputHash` when its replay is wired) — because the persisted TOML key is the one thing that is expensive to rename after ship. +The stored hash is a single `v:sha256:` token, not separate version and digest fields. One field, written atomically, so the version and the digest can never desync (the class of bug a split-field design invites when one is written and the other is not). -### D4 — Method-on-type hashing +The lock **format** `Version` stays at `1`. An earlier draft bumped it (1→2) as a poison pill to stop old binaries touching reset locks, but that also stops them reading pins to *queue a build* — too blunt. Instead, back-compat rests on two cheaper properties: the format is unchanged so every binary parses every lock, and the content-version registry **force-rehashes** any sub-floor token (legacy, or downgraded by an old binary) up to the current version. Old binaries stay useful (read pins, build); their only possible mischief — writing a legacy-substrate hash — is self-correcting on the next new-binary run, not silent corruption. Back-compat is therefore: **same format forever, reconcile fingerprints by version, never recompute history.** -Adopt the **hybrid seam**: pure `ConfigHash()` on the config type, combiner in `fingerprint`. A full move was rejected (layering regression: I/O + crypto + algorithm versioning do not belong on a data type). See [Research](#where-the-hashing-logic-should-live). 
+### D4 — Project to bytes, not a `ConfigHash()` method on the type -Two constraints keep the seam from eroding back into the rejected methods-on-type design: **`ConfigHash()` must stay version-frozen** (it computes exactly one algorithm; it does *not* dispatch over versions — a single method "can't replay its own past"), and **the combiner is the sole version authority.** Version dispatch lives entirely in `fingerprint`'s registry; `ConfigHash()` is just the current pure-config step it calls. Keep `ConfigHash()` unexported-or-narrow if practical, so callers cannot route around the registry to get a raw, version-agnostic hash. +`projectVN(config) []byte` returns canonical bytes; the combiner in `fingerprint` owns the `sha256` and the version dispatch. A `ConfigHash()` method that returns a finished hash was rejected: it drags crypto + versioning onto a data type, and it tempts callers to route around the version registry to get a raw, version-agnostic hash. Returning bytes keeps the config type ignorant of versioning, and keeps the combiner the **sole version authority**. See [the seam note](#where-the-hashing-logic-should-live). ## Alternatives considered -- **Global `IgnoreZeroValue`** — see D1. Same default behavior but no per-field escape hatch for meaningful zeros and no point-of-definition audit. Rejected. -- **Implicit omitempty (no mandatory tags, no audit)** — see D2. Removes the only guard against the unsafe false-negative direction. Rejected in favor of mandatory 3-way tags. -- **Content-hash the rendered config** (Go-modules style) instead of struct-hashing. The naive version of this — "hash all the bytes" — over-captures, since we deliberately exclude many fields (`paths`, `publish`, snapshots) from the fingerprint. The *stronger* form is a **canonical-projection hash**: serialize only the included fields, keys sorted, and hash those bytes — immune to field-shape drift without per-field reflection tags. 
We still stay with `hashstructure` + `Includable` because our inclusion policy is **conditional** (omitempty = include-if-non-zero, evaluated on the resolved value), which a static byte serializer would have to re-implement anyway — so the projection hash buys field-shape immunity at the cost of reimplementing the very predicate `Includable` already gives us, plus a second serialization format to keep stable forever. Rejected on that basis, but recorded as the principled alternative; it is the one foundational choice that would be expensive to reverse post-adoption. -- **Parallel versioned structs with per-struct `Hash()`** — couples locks to Go type identity and duplicates hashing logic per version. Rejected in favor of Layer 2's integer-versioned combiner + Layer 3 canonical migration. -- **Bump lock format `Version` and migrate eagerly** — eager migration rewrites every lock at once, the exact mass-churn we are trying to avoid. Rejected in favor of lazy per-component re-stamp. +- **Incremental lazy migration on the `hashstructure` substrate** (the original plan): flip the inclusion default to omitempty via `Includable`, version the lock content, and migrate lazily — *without* a reset. Rejected: Problem 6 makes its central promise unkeepable. A "frozen" replay function built on `hashstructure.Hash` reflects the live struct, so the first field addition after the switchover moves the old algorithm's output and forces a rehash anyway. The incremental path therefore does not actually avoid a coordinated cutover — it defers one to the first field addition, on a substrate that makes replay unsound. With a coordinated cutover already scheduled (the dev→prod cutover), spending it once on a clean projection substrate strictly dominates. +- **Global `IgnoreZeroValue`** — a blunt switch that omits *all* zero fields with no escape hatch for build-meaningful zeros, and still on the non-frozen `hashstructure` substrate. Rejected. 
+- **Parallel versioned structs with per-struct `Hash()`** — couples locks to Go type identity and duplicates hashing logic per version. Rejected in favor of Part 2's integer-versioned combiner over frozen projections. +- **Bump the lock format `Version` 1→2 as a poison pill** (an earlier draft's choice) — makes old binaries hard-reject reset locks. Rejected: it also blocks old binaries from reading pins to queue a build, and it is unnecessary, since the content-version registry already force-rehashes any sub-floor or downgraded token (D3). Same-format + force-rehash keeps old binaries useful without risking silent corruption. +- **Eager fleet-wide migration as the steady-state mechanism** — rewriting every lock on every algorithm change is the mass-churn the design exists to prevent. Rejected for the steady state. The *reset* is a deliberate, one-time, operator-driven eager pass riding an already-scheduled rebuild — the sanctioned exception, not the rule; `component migrate` is its post-reset equivalent for retiring an old version. ## Incremental delivery -1. **PR A (Layer 1)**: shared `includeFingerprintField` helper + a delegating value-receiver `HashInclude` on **every** fingerprinted struct (all ~10 registered in `fingerprint_test.go`, not just `ComponentConfig`/`PackageConfig` — see the per-struct resolution note in Layer 1); tag every fingerprinted field with one of `-`/`omitempty`/`always`; rewrite the field-decision audit to (a) assert valid-tag presence and (b) assert every registered struct implements `Includable`, then drop the `expectedExclusions` registry. **Note:** flipping the default moves every hash, so PR A must land *with or after* PR B's version machinery — it registers as the `computeFP2` algorithm, not a standalone change. 
Unit tests: a zeroed `omitempty` field hashes **equal to its absence-equivalent** (not merely "setting it drifts" — that positive-direction test passes even if `HashInclude` is a no-op, so it must be paired with the zero-equals-absent assertion that actually exercises omission); an `always` field drifts even at zero. -2. **PR B (Layer 2)**: `LockContentVersion` on `ComponentLock` (+ `ComponentLockData` and `populateFromLock`, so the replay site can read the version); a paired version registry (fingerprint + resolution compute fns) with a `minSupportedLockContentVersion` floor; fingerprint replay-before-`Changed` in `update.go`; fingerprint replay in `checkFingerprintFreshness` **and `BuildDirtyChange`** (same replay logic, two call sites). Resolution-hash replay is *reserved* — the registry slot reuses `computeRes1`; not wired until `ComputeResolutionHash` first changes. Unit tests: old-version lock with unchanged inputs → `Current` and **not** `Changed`; changed inputs → `Stale`; re-stamp only on an already-dirty write. -3. **PR C (validation)**: scenario test (in the style of `scenario/component_changed_test.go`) — set a new `omitempty` field on a single component and assert only that lock drifts. -4. **PR D (Layer 3, later)**: `schema-version` field, load-time canonical migration, `ComponentConfig.ConfigHash()` seam. Gated on the first real non-additive TOML change. +The reset (Part 1) must land as one coherent change at the dev→prod cutover; its pieces are independently reviewable but ship together because they all move the hash. + +1. **PR A (substrate)**: `projectVN` encoder (`canonicalBuf`, `emit`/`emitAlways`), `projectV1` with the explicit field list, `sha256` combiner, and the golden-vector test. Pure addition alongside the existing path; not yet wired into `ComputeIdentity`. Unit tests: a field absent from `projectV1` is invisible to the digest; `emitAlways` fields hash even at zero; golden vectors pin the v1 output. +2. 
**PR B (reset cutover)**: switch `ComputeIdentity` to `projectV1`; adopt the atomic `v1:sha256:` token; unify on sha256. Lock format `Version` stays `1`. Ships at the cutover; absorbed by the scheduled rebuild. Unit tests: a legacy prefix-less token is read as sub-floor and force-rehashed to `v1`; a `v1:` token round-trips; an old binary (format `1`) still parses pins from a reset lock. +3. **PR C (Part 2 machinery)**: the version registry (`lockAlgos`, `currentLockContentVersion`, `minSupportedLockContentVersion`), `ComputeIdentityAt`, replay-before-`Changed` in `update.go`, and replay in `checkFingerprintFreshness`, `BuildDirtyChange`, **and the `changed.go` classifier**. Resolution replay reserved (slot reuses `computeRes1`). With only `v1` registered this is inert but proven. Unit tests: a synthetic `v1`/`v2` pair with unchanged inputs → `Current` and **not** `Changed`; changed inputs → `Stale`; re-stamp only on an already-dirty write. +4. **PR D (validation)**: scenario test (in the style of `scenario/component_changed_test.go`) — add a field absent from `projectV1` and set it on one component; assert only that lock drifts and every other lock is byte-identical. +5. **PR E (config schema axis, later)**: `schema-version` field + load-time canonical migration + the `config migrate` command. Gated on the first post-reset non-additive TOML change not already absorbed by the reset's normalization pass. -Each PR is independently revertible. Because the Layer 1 default flip is a hash-moving change, PRs A and B ship together (or B first); the `lock-content-version` omitempty stamp and churn policies ensure existing locks see zero diff until independently touched. Layer 3 migrates lazily on next write. +Each PR is independently revertible up to the cutover. PRs A–B land together at the dev→prod cutover (they move every hash and are absorbed by the scheduled rebuild); PR C is inert until the first post-reset algorithm change; PRs D–E follow. ## Open questions 1. 
Should a lazy re-stamp during a *read-only* command (`render`, `build` freshness check) write the lock back, or defer all writes to `component update`? Writing on read is surprising; deferring means freshness checks stay slightly slower until the next update. (Leaning: defer all writes to `update`, keeping reads side-effect-free.) -2. For Layer 3, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration. -3. Should `omitempty` semantics use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? Pointers would make "set to empty" expressible but complicate the structs. -4. Can the audit go further than tag-presence and *statically* flag fields whose zero value is likely meaningful (e.g. a `bool` defaulting true) and nudge toward `always`? Or is the point-of-definition tag plus code review sufficient? -5. Should the mixed-toolchain hazard get a hard write-time guard (refuse to write a lock whose on-disk version exceeds the binary's `currentLockContentVersion`), or is the CI version-pin invariant enough? +2. For the config schema axis, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration. +3. Should the omit-if-zero predicate use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? `projectVN` makes this a per-field choice in code, so it can differ field to field — but a default convention is still worth settling. +4. What is the canonical byte encoding for `projectVN` (length-prefixed key+value? a stable subset of TOML/CBOR?), and how are golden vectors stored and regenerated? This is the one substrate detail that is expensive to change after the reset. +5. 
Should the residual mixed-toolchain case get a hard write-time guard (refuse to write a token whose version exceeds `currentLockContentVersion`), or is force-rehash on read + the CI version-pin enough? (The operator escape hatch is `component migrate`; this question is only about the *automatic* guard.) -*Resolved in-text (recorded here so they aren't re-litigated):* registry retention is a **floor**, not "last N" (M8 / Registry floor); `--rehash` is the sanctioned forced-migration pass (promoted from a question to a requirement); absent `LockContentVersion` reads as `1`; one shared `lock-content-version` covers both stored hashes, with resolution-hash replay reserved (slot present, fn reused) until `ComputeResolutionHash` first changes. +*Resolved in-text (recorded here so they aren't re-litigated):* the reset rides the already-scheduled dev→prod rebuild as the one sanctioned coordinated cutover; the substrate is canonical projection (frozen `projectVN` + golden vectors), not `hashstructure`; baseline `v1` is omit-if-zero with **no** include-always legacy in the registry; the lock format `Version` stays at `1` (old binaries keep reading pins to build); the substrate swap and any old-binary downgrade are reconciled by **force-rehashing** sub-floor tokens, not a format gate; the stored hash is an **atomic** `v:sha256:` token; back-compat rests on the verified invariant that **no reader recomputes a historical fingerprint** (synthetic history and historic-overlay application read stored strings only); registry retention is a **floor**, not "last N"; `component migrate` is the post-reset forced-migration pass (lock axis; `config migrate` is its schema-axis sibling); one shared content version covers both stored hashes, with resolution-hash replay reserved (slot present, fn reused) until `ComputeResolutionHash` first changes. 
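
The atomic token and the force-rehash rule recorded above can be sketched in Go. This is a minimal sketch under stated assumptions, not the shipped implementation: the layout `v<N>:sha256:<hex>`, the treatment of any prefix-less string as a legacy token, the constant values, and the names `token`, `parseToken`, `formatToken`, and `classify` are all hypothetical. The `actionTooNew` branch only marks the case that open question 5 leaves undecided.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Hypothetical values: the RFC names these concepts, the numbers are illustrative.
const (
	currentLockContentVersion      = 1
	minSupportedLockContentVersion = 1
)

// token is the parsed form of a stored fingerprint string.
type token struct {
	Version int    // 0 marks a legacy token with no version prefix
	Digest  string // hex digest, or the raw stored string for legacy tokens
}

// parseToken splits "v<N>:sha256:<hex>". Anything without that shape is
// treated as a legacy token (an assumption of this sketch).
func parseToken(s string) (token, error) {
	parts := strings.SplitN(s, ":", 3)
	if len(parts) == 3 && strings.HasPrefix(parts[0], "v") && parts[1] == "sha256" {
		n, err := strconv.ParseUint(parts[0][1:], 10, 31)
		if err != nil || n == 0 {
			return token{}, fmt.Errorf("bad content version in %q", s)
		}
		return token{Version: int(n), Digest: parts[2]}, nil
	}
	return token{Version: 0, Digest: s}, nil
}

// formatToken writes version and digest as one value, so they cannot desync.
func formatToken(version int, digest string) string {
	return fmt.Sprintf("v%d:sha256:%s", version, digest)
}

type action int

const (
	actionReplay      action = iota // within [floor, current]: replay at the stored version
	actionForceRehash               // legacy or below the floor: recompute at the current version
	actionTooNew                    // written by a newer binary: policy is an open question
)

// classify decides how a reader treats a stored token.
func classify(t token) action {
	switch {
	case t.Version < minSupportedLockContentVersion:
		return actionForceRehash
	case t.Version > currentLockContentVersion:
		return actionTooNew
	default:
		return actionReplay
	}
}

func main() {
	for _, s := range []string{"v1:sha256:abc123", "sha256:deadbeef", "v2:sha256:abc123"} {
		t, err := parseToken(s)
		fmt.Println(s, t.Version, classify(t), err)
	}
}
```

Keeping the version inside the same string as the digest means a reader either sees both or neither, which is the property the single-token design is meant to provide.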
From eff913d438538c4899a6f1da87f9dcd2901ee141 Mon Sep 17 00:00:00 2001 From: Daniel McIlvaney Date: Fri, 5 Jun 2026 17:47:56 -0700 Subject: [PATCH 4/4] update 3 --- docs/developer/rfc/lazy-schema-migration.md | 190 ++++++++++++++------ 1 file changed, 135 insertions(+), 55 deletions(-) diff --git a/docs/developer/rfc/lazy-schema-migration.md b/docs/developer/rfc/lazy-schema-migration.md index 8e9a56c6..65f403c9 100644 --- a/docs/developer/rfc/lazy-schema-migration.md +++ b/docs/developer/rfc/lazy-schema-migration.md @@ -88,7 +88,7 @@ This RFC therefore has two parts: **(1)** a one-time **reset** at the dev→prod - **G1 (primary, non-functional): no spurious lock-file diffs *after the reset*.** Once prod locks exist, landing a config-schema or hashing change must not rewrite `*.lock` files for components whose effective inputs are unchanged. The reset itself is the *one* sanctioned exception, absorbed by the already-scheduled rebuild. - **G2: only real changes drift.** Post-reset, a lock changes iff that component's build-effective inputs changed. - **G3: piecemeal, lazy migration post-reset.** Genuine algorithm evolution after the reset rolls out per-component, riding independent changes, never as a big-bang. -- **G4: additive fields are drift-neutral by construction — *truly*, not just for new locks.** On the projection substrate (below) an unset additive field is invisible to *every* lock including old ones, because old algorithms pin an explicit field list and never reflect over the live struct. +- **G4: additive fields are drift-neutral by construction — *truly*, not just for new locks.** On the projection substrate (below) an unset additive field is invisible to *every* lock including old ones, because old versions emit only the fields their tags include — a field added later is not in any shipped version's tag set, so it cannot move an existing hash. 
- **G5: correctness backstop preserved.** Never silently under-rebuild: a genuine input change must always drift its lock. Replay may accept encoding/over-capture changes; it must never mask a behavior-changing one. - **G6 (new, hard): back-compatible reads for synthetic history.** The new binary must still **read** pre-reset locks across git history (synthetic changelog/release walks them), even though it **writes** only the new format. Reading never recomputes a historical hash — it compares stored strings only. @@ -129,7 +129,7 @@ The struct's type name *is* part of the hash (`hashstructure` mixes in `reflect. - Whether `Includable` is consulted depends on whether the type implements it *now* — not on what was true when v1 locks were written. - A `value` vs `pointer` receiver subtlety even decides whether the root struct's `HashInclude` is seen at all (the top-level value is not addressable). -A function meant to be "the v1 algorithm, forever" therefore changes meaning every time the struct or its method set changes. That is the disqualifier for the incremental plan (Problem 6) and the motivation for the projection substrate below, whose v1 function pins an explicit field list and is immune to all three. +A function meant to be "the v1 algorithm, forever" therefore changes meaning every time the struct or its method set changes. That is the disqualifier for the incremental plan (Problem 6) and the motivation for the projection substrate below, whose v1 projection emits only its version-tagged fields and reads neither the method set nor the type name — immune to all three. ## Change taxonomy @@ -137,17 +137,19 @@ Not every config change should be treated the same way. The right mechanism depe | Class | Example | Should unaffected locks drift? 
| Mechanism | | ----- | ------- | ------------------------------ | --------- | -| **Additive field** | new `foo` field, unset on most components | No — only setters drift | **Free, no bump.** Add `foo` to the current `projectVN` as omit-if-zero; a component that leaves it unset emits identical bytes, so no shipped hash moves. Setters drift (correct). | -| **Additive with non-zero default** | new field defaulted to `"auto"` via defaults merge | No | **Bump + replay.** The default resolves non-zero on *every* component, so it is emitted everywhere and would move every hash — omit-if-zero can't save it. Ship `projectV(N+1)` that emits it; old locks **replay at their version** (which didn't emit it), match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. | -| **Rename / move** | `foo` → `bar`, same semantics | No | **Schema migration + bump + replay.** Migrate old TOML → canonical struct (the rename lands in the struct), then ship `projectV(N+1)` that emits the renamed field. Old locks replay at their version and are recognized unchanged → lazy re-stamp, no rebuild. | +| **Additive field** | new `foo` field, unset on most components | No — only setters drift | **Free, no bump.** Tag the new field `vN..*` (current version, omit-if-zero); a component that leaves it unset emits identical bytes, so no shipped hash moves — adding an omit-if-zero field to the live version is the one output-preserving no-bump edit. Setters drift (correct). | +| **Additive with non-zero default** | new field defaulted to `"auto"` via defaults merge | No | **Bump + replay.** The default resolves non-zero on *every* component, so it is emitted everywhere and would move every hash — omit-if-zero can't save it. Bump and tag the field `v(N+1)..*`; old locks **replay at their version** (whose set excludes it), match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. 
| +| **Default change on an *existing* field** | bump `jobs` default `4`→`8` | Yes — every component's effective input moved | **Not lazy-maskable.** Replay recomputes the *current* config (now resolving to `8`) under the old algorithm → `jobs=8` ≠ stored `jobs=4` → honest fleet-wide drift; replay cannot suppress it because the resolved value genuinely changed for everyone. Escape hatch: `config migrate` writes the *old* resolved value explicitly (`jobs=4`) into each config **before** moving the default — existing components then pin the old value (no drift) and only new components pick up `8`. Without that pre-pass it is a legitimate (if large) fleet rebuild, not a bug. | +| **Rename / move** | `foo` → `bar`, same semantics | No | **Schema migration + bump + replay.** Migrate old TOML → canonical struct (the rename lands in the struct), then tag the renamed field `v(N+1)..*`. Old locks replay at their version and are recognized unchanged → lazy re-stamp, no rebuild. | | **Semantic change** | meaning of `foo` changes; output differs | Yes — that's correct | **None.** The build output genuinely differs, so the lock *should* drift. Replay at the old version would (correctly) mismatch → `Stale` → rebuild. Nothing to suppress. | | **Hashing bugfix** | overlay ordering bug in the combiner | No | **Bump + replay.** Ship the fixed combiner as the version-`N+1` half of `computeFP(N+1)`; old locks replay at the old (buggy) version. If their inputs are unchanged the buggy digest still matches → recognized unchanged → lazy re-stamp to the fixed version, no rebuild. | -| **Newly measured input** | start folding in a new overlay source or identity element | No | **Bump + replay.** A non-config input is added in the combiner half of `computeFP(N+1)` (a config field would go in `projectV(N+1)`). Old locks replay at their version, which didn't fold it in, match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. 
**Caveat:** until a lock migrates, replay is *blind* to the new input, so a change to it reads as fresh (false-fresh) — if it is build-critical, force a `component migrate` pass instead of riding lazy adoption (see [churn-avoidance](#churn-avoidance-policies-g1)). | -| **Field removal** | drop deprecated `foo` | No, if nobody set it | **Deprecate-then-delete (+ bump for setters).** Bump to a `projectV(N+1)` that stops emitting `foo` but **keep the field on the struct** so the old `projectVN` can still read it for replay. Only after the floor passes that version (ideally after a `component migrate`) physically delete the field. Setters drift on the bump; non-setters replay clean. | +| **Newly measured input** | start folding in a new overlay source or identity element | No | **Bump + replay.** A non-config input is added in the combiner half of `computeFP(N+1)` (a config field would be tagged `v(N+1)..*` instead). Old locks replay at their version, which didn't fold it in, match their stored digest → recognized unchanged → lazy re-stamp, no rebuild. **Caveat:** until a lock migrates, replay is *blind* to the new input, so a change to it reads as fresh (false-fresh) — if it is build-critical, force a `component migrate` pass instead of riding lazy adoption (see [churn-avoidance](#churn-avoidance-policies-g1)). | +| **Field removal** | drop deprecated `foo` | No, if nobody set it | **Deprecate-then-delete (+ bump for setters).** Close the field's range at the prior version (`vK..*` → `vK..vN`, so v(N+1) stops measuring it) but **keep the field on the struct** so older versions can still read it for replay. Only after the floor passes vN (ideally after a `component migrate`) physically delete the field. Setters drift on the bump; non-setters replay clean. 
| +| **Resurrected field** | re-measure a previously-dropped `foo` | Depends — only if its value moved | **Tag edit (+ bump).** Append a new range to the field's set (`v1..v3,v8..*`) so v8+ measures it again while v1–v7 stay byte-identical (golden-vector-enforced). If the field was already physically deleted, bring it back as a fresh additive field tagged `v8..*`. The earlier life and the revival never collide because each version's output is pinned independently. | -The recurring requirement across the "No" rows is the same: **distinguish a change in user intent from a change in encoding, and only drift on the former.** Note the first row: on the projection substrate, a new field is added to `projectVN` as *omit-if-zero*, so a component that does not set it emits identical bytes and stays hash-neutral — *for every lock, old or new*, because old configs never set the brand-new field. Adding it does not move any existing hash (no shipped lock set it), so it needs no version bump. Part 2 then carries only the genuinely hard cases (rows 2, 5, and post-reset renames/removals). The shared move in every "Bump + replay" row is the same primitive — **increment the content version, keep the old `projectVN` as a frozen replay function, and let unchanged locks re-stamp lazily** — detailed in [Part 2](#part-2--post-reset-lazy-migration). +The recurring requirement across the "No" rows is the same: **distinguish a change in user intent from a change in encoding, and only drift on the former.** Note the first row: on the projection substrate, a new field is added to `projectVN` as *omit-if-zero*, so a component that does not set it emits identical bytes and stays hash-neutral — *for every lock, old or new*, because old configs never set the brand-new field. Adding it does not move any existing hash (no shipped lock set it), so it needs no version bump. Part 2 then carries only the genuinely hard cases (rows 2, 6, and post-reset renames/removals).
The shared move in every "Bump + replay" row is the same primitive — **increment the content version, keep the old `projectVN` as a frozen replay projection, and let unchanged locks re-stamp lazily** — detailed in [Part 2](#part-2--post-reset-lazy-migration). -> **`projectVN`** is shorthand used throughout this RFC for the hand-written *projection function* introduced by this design (defined in [Substrate options](#substrate-options) and [The projection substrate](#the-projection-substrate)). The `N` is the lock content version: `projectV1` is the function that names and serializes exactly the fields content-version 1 measures, `projectV2` the next algorithm, and so on. Each `projectVN` is frozen once shipped — that is the whole point. +> **`projectVN`** is shorthand used throughout this RFC for the canonical *projection at content-version N* introduced by this design (defined in [Substrate options](#substrate-options) and [The projection substrate](#the-projection-substrate)). It is **not** N hand-written functions: it is a single generic walker, `project(cfg, N)`, whose per-field membership is declared in version-set tags on the struct fields (see [Version-tagged field selection](#version-tagged-field-selection)). `projectV1` means `project(cfg, 1)` — the fields whose tag set includes v1; `projectV2` the next version, and so on. Each version's projection is frozen once shipped (its tags never move; golden vectors enforce it) — that is the whole point. ## Research @@ -156,7 +158,7 @@ The recurring requirement across the "No" rows is the same: **distinguish a chan Two substrates can produce a content fingerprint of the resolved config. The difference that matters here is **whether an old algorithm function can be frozen.** - **`hashstructure` + `Includable` (rejected as the substrate).** Keeps existing hashes byte-identical and gives per-field omission via `HashInclude`. 
But, as established above (Problem 6), a function built on `hashstructure.Hash` reflects over the live struct and method set, so it cannot be a frozen historical algorithm. It also requires a value-receiver `HashInclude` on *every* nested fingerprinted struct and a subtle `v.(reflect.Value)` type-assert to work at all — brittle plumbing in service of a substrate that still can't host sound replay. -- **Canonical projection + stdlib hash (chosen).** Split the two jobs `hashstructure` fuses — *field selection* and *hashing* — into explicit steps. A `projectVN` function names the exact fields version N measures, emits them in a canonical, sorted, self-delimiting byte form, and an stdlib `sha256` hashes those bytes. Because `projectV1` references an **explicit, pinned field list**, it does not see fields added later, does not depend on the type's method set, and does not depend on receiver subtleties. It is a genuinely frozen pure function — the property replay requires. The cost is owning a small projection encoder plus **golden hash vectors** per version (a checked-in `(config, version) → hash` table) so "frozen" is CI-enforced, not merely intended. +- **Canonical projection + stdlib hash (chosen).** Split the two jobs `hashstructure` fuses — *field selection* and *hashing* — into explicit steps. Field selection is **declared per field** as a version-set in the `fingerprint` tag (`fingerprint:"v1..*"`); a single generic walker, `project(cfg, N)`, emits the fields whose set includes version N in a canonical, sorted, self-delimiting byte form, and an stdlib `sha256` hashes those bytes. Because a shipped version's tag membership is **fixed and golden-vector-pinned**, `project(cfg, 1)` does not see fields added later, does not depend on the type's method set, and does not depend on receiver subtleties. It is a genuinely frozen pure function of `(cfg, version)` — the property replay requires. 
The cost is owning a small projection encoder, the version-set tags, and **golden hash vectors** per version (a checked-in `(config, version) → hash` table) so "frozen" is CI-enforced, not merely intended. The projection substrate is what makes G4 true for old locks and what makes Part 2's replay sound. It is adopted at the reset (below), not incrementally. @@ -169,6 +171,8 @@ The projection substrate is what makes G4 true for old locks and what makes Part The common pattern: an **integer version stamped into the persisted artifact**, plus the ability to **read and replay older versions**, plus **lazy forward-migration on write**. We keep `ComponentLock.Version` (the lock *format* slot) fixed at `1` and carry the *content* version **inside the `InputFingerprint` token** (`v:sha256:…`) rather than in a separate struct field — one atomic value, no version/digest desync, no new TOML field for an old binary to mishandle. The Go-modules lesson is the deepest one: hashing *content* rather than struct shape is what makes additive metadata free — the canonical-projection substrate is our version of that lesson. +**Where we go *beyond* the precedent (stated honestly).** All four tools above keep exactly **one** active algorithm: Cargo/npm/Terraform rewrite the *whole* artifact to the current version on next touch (eager-on-write), and Go modules sidestep replay entirely by never re-migrating semantics. **None of them keeps N historical hashing algorithms alive simultaneously across an indefinitely-unmigrated fleet** — which is exactly Part 2's behavior. The citations support "version stamp + lazy forward-migrate on write"; they do *not* cover "frozen algorithms coexisting forever." That coexistence is justified here on its own terms (it is what avoids a fleet rebuild on every algorithm change), and its one real cost — append-only registry growth — is bounded by the [floor-advance cadence](#registry-floor-and-forced-migration), not by precedent. 
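
The coexistence described above, several frozen algorithms kept alive until the floor retires them, can be sketched as a version-keyed registry with replay. This is an illustrative sketch only: the toy `inputs` type, the two compute functions, the `verdict` values, and the `replay` helper are assumptions made for the example, and only the names `lockAlgos`, `currentLockContentVersion`, and `minSupportedLockContentVersion` come from the RFC.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// inputs is a toy stand-in for the resolved component config.
type inputs struct {
	Name string
	Jobs int
}

type computeFn func(inputs) string

func digest(s string) string { return fmt.Sprintf("%x", sha256.Sum256([]byte(s))) }

// lockAlgos holds one frozen function per content version. In this toy
// example v1 measures only Name and v2 additionally measures Jobs.
var lockAlgos = map[int]computeFn{
	1: func(in inputs) string { return digest("name=" + in.Name) },
	2: func(in inputs) string { return digest(fmt.Sprintf("name=%s;jobs=%d", in.Name, in.Jobs)) },
}

const (
	currentLockContentVersion      = 2
	minSupportedLockContentVersion = 1 // advancing this lets old entries be deleted
)

type verdict int

const (
	verdictCurrent verdict = iota // inputs unchanged under the stamped algorithm
	verdictStale                  // inputs changed: a real drift
	verdictRehash                 // no replay function (below the floor or unknown)
)

// replay recomputes the fingerprint with the algorithm the lock was stamped
// with. The bool reports that the lock is behind the current version and can
// be re-stamped lazily on its next write.
func replay(storedVersion int, storedDigest string, in inputs) (verdict, bool) {
	fn, ok := lockAlgos[storedVersion]
	if !ok || storedVersion < minSupportedLockContentVersion {
		return verdictRehash, false
	}
	if fn(in) != storedDigest {
		return verdictStale, false
	}
	return verdictCurrent, storedVersion < currentLockContentVersion
}

func main() {
	stored := lockAlgos[1](inputs{Name: "curl"}) // lock written by a v1 binary
	v, behind := replay(1, stored, inputs{Name: "curl", Jobs: 8})
	fmt.Println(v == verdictCurrent, behind)
	v, _ = replay(1, stored, inputs{Name: "wget"})
	fmt.Println(v == verdictStale)
}
```

The registry only grows until the floor moves: once `minSupportedLockContentVersion` passes a version, its entry can be removed, which is the bound on the append-only cost noted above.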
+ ### Where the hashing logic should live With the projection substrate the fingerprint algorithm decomposes into two steps. **Both are versioned together** by the single lock content version — the version pins the *entire* fingerprint computation, not just the field list: @@ -196,36 +200,92 @@ The original "lazy" instinct was right for Part 2 and wrong for Part 1: there is Replace `hashstructure.Hash(component, …)` with an explicit two-step pipeline: ```text -ComponentConfig ──projectV1(cfg)──► canonical bytes ──sha256──► configHash - (explicit field list, (stdlib) - sorted keys, emit-if-nonzero) +ComponentConfig ──project(cfg,1)──► canonical bytes ──sha256──► configHash + (version-tagged fields, (stdlib) + sorted keys, emit-if-nonzero) ``` -`projectV1` is hand-written and names exactly the fields v1 measures. It emits a canonical, sorted, self-delimiting byte stream (length-prefixed keys + values) so distinct field sets cannot collide, and it omits a field when its **resolved value is zero** (the omitempty behavior, now a property of the encoder, not a struct tag). A field whose zero value is build-meaningful is simply listed as *always-emit* in `projectV1`. +`projectV1` is the projection at version 1 — `project(cfg, 1)`. Field membership is declared **on each struct field** as a version-set in the `fingerprint` tag (`fingerprint:"v1..*"`); a single generic walker emits, in stable key order, every field whose set includes the target version, length-prefixing key+value so distinct field sets cannot collide. It omits a field when its **resolved value is zero** (omit-if-zero, an encoder property now, not a struct-tag toggle); a range prefixed with `!` (e.g. `!v1..*`) always-emits, for fields whose zero is build-meaningful. There is no per-version function — only the generic walker parametrized by version. (Grammar and recovery semantics: [Version-tagged field selection](#version-tagged-field-selection) below.) 
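
The walker described above might look roughly like the following. It is a minimal sketch under stated assumptions rather than the RFC's encoder: `demoConfig` is a hypothetical flat struct (no nesting), `fmt.Sprint` is a placeholder value encoding, the 8-byte big-endian length prefix is one arbitrary choice, and `parseVersion`, `member`, and `openEnded` are illustrative names. The canonical byte encoding itself is still an open question in this RFC.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"reflect"
	"sort"
	"strconv"
	"strings"
)

const openEnded = 1<<31 - 1 // stands in for "*"

// parseVersion reads "v<digits>" and rejects anything else, including v0.
func parseVersion(s string) (int, error) {
	n, err := strconv.ParseUint(strings.TrimPrefix(s, "v"), 10, 31)
	if err != nil || n == 0 || !strings.HasPrefix(s, "v") {
		return 0, fmt.Errorf("bad version %q", s)
	}
	return int(n), nil
}

// member reports whether tag covers version n and whether the covering
// range is always-emit ("!"). An absent tag is an error; "-" is never measured.
func member(tag string, n int) (bool, bool, error) {
	if tag == "" {
		return false, false, fmt.Errorf("missing fingerprint tag")
	}
	if tag == "-" {
		return false, false, nil
	}
	for _, m := range strings.Split(tag, ",") {
		lo, hi, ranged := strings.Cut(strings.TrimPrefix(m, "!"), "..")
		from, err := parseVersion(lo)
		if err != nil {
			return false, false, err
		}
		to := from // a bare "v3" covers only v3
		if ranged && hi == "*" {
			to = openEnded
		} else if ranged {
			if to, err = parseVersion(hi); err != nil {
				return false, false, err
			}
		}
		if from <= n && n <= to {
			return true, strings.HasPrefix(m, "!"), nil
		}
	}
	return false, false, nil
}

// demoConfig is a hypothetical flat config, not the real ComponentConfig.
type demoConfig struct {
	Name  string `fingerprint:"v1..*"`
	Strip bool   `fingerprint:"!v1..*"` // zero value still hashes
	Jobs  int    `fingerprint:"v2..*"`  // introduced at v2
	Notes string `fingerprint:"-"`      // never measured
}

// project emits length-prefixed key+value pairs, sorted by key, for n.
func project(cfg any, n int) ([]byte, error) {
	v := reflect.ValueOf(cfg)
	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("project: want struct, got %s", v.Kind())
	}
	var pairs [][2]string
	for i := 0; i < v.NumField(); i++ {
		f := v.Type().Field(i)
		in, always, err := member(f.Tag.Get("fingerprint"), n)
		if err != nil {
			return nil, fmt.Errorf("field %s: %w", f.Name, err)
		}
		if in && (always || !v.Field(i).IsZero()) {
			pairs = append(pairs, [2]string{f.Name, fmt.Sprint(v.Field(i).Interface())})
		}
	}
	sort.Slice(pairs, func(a, b int) bool { return pairs[a][0] < pairs[b][0] })
	var buf bytes.Buffer
	for _, p := range pairs {
		for _, s := range p {
			var n8 [8]byte
			binary.BigEndian.PutUint64(n8[:], uint64(len(s)))
			buf.Write(n8[:])
			buf.WriteString(s)
		}
	}
	return buf.Bytes(), nil
}

func main() {
	for _, n := range []int{1, 2} {
		b, _ := project(demoConfig{Name: "curl", Jobs: 8}, n)
		fmt.Printf("v%d %x\n", n, sha256.Sum256(b))
	}
}
```

In this sketch a config that leaves `Jobs` unset produces the same bytes at version 1 and version 2, while setting `Jobs` changes only the version 2 output, which is the additive-field behavior the taxonomy relies on.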
Three things this buys that `hashstructure` could not: -- **Frozen by construction.** `projectV1` references a pinned field list, so adding `Foo` to the struct later is invisible to it — `projectV1`'s output for an old config is unchanged. This is what makes Part 2's replay sound (Problem 6) and G4 true for *old* locks, not just new ones. -- **No method-set / receiver magic.** No `Includable`, no per-nested-struct method, no `v.(reflect.Value)` type-assert footgun. Field selection is ordinary code. -- **Golden-vector enforced.** A checked-in table of `(config, version) → hash` vectors is asserted in CI, so any accidental change to a historical `projectVN` fails the build. "Frozen" stops being a promise and becomes a test. +- **Frozen by construction.** A version's field set is fixed by tags that never change for a shipped version (golden vectors enforce it), so adding `Foo` to the struct later is invisible to `project(cfg, 1)` — its output for an old config is unchanged. This is what makes Part 2's replay sound (Problem 6) and G4 true for *old* locks, not just new ones. +- **No method-set / receiver magic.** No `Includable`, no per-nested-struct method, no `v.(reflect.Value)` type-assert footgun. Selection is a declarative tag the walker reads. +- **Golden-vector enforced.** A checked-in table of `(config, version) → hash` vectors is asserted in CI, so any accidental change to a historical projection — a tag edit that moves a shipped version's membership — fails the build. "Frozen" stops being a promise and becomes a test. The cost is owning the projection encoder and the golden vectors. That cost is paid once, at the reset, against a rebuild we are already doing. +### Version-tagged field selection + +Field membership in each version's projection is declared **on the struct field**, as a version-set in the existing `fingerprint` tag — not in a hand-written per-version function. One generic walker, `project(cfg, N)`, emits every field whose set includes `N`. 
This is the chosen mechanism; hand-written `projectVN` functions are the [Option B alternative](#alternatives-considered). + +**Grammar** (deliberately small): + +```ebnf +tag = "-" | member, { ",", member } ; +member = [ "!" ], range ; (* leading "!" ⇒ always-emit for this range *) +range = version, [ "..", ( version | "*" ) ] ; +version = "v", digit, { digit } ; +``` + +| Tag | Meaning | +| --- | --- | +| *(absent)* | **build failure** — every fingerprinted field must carry an explicit decision | +| `-` | never measured (unchanged from today) | +| `v1..*` | measured from v1 onward, omit-if-zero — the common "active field" case | +| `v1..v4` | measured v1–v4, then dropped | +| `v3..*` | introduced at v3 | +| `v1..v4,v6..*` | measured v1–v4, **dropped at v5, brought back at v6** | +| `!v1..*` | measured v1 onward, **always-emit** (zero value still hashes) | +| `v1..v4,!v5..*` | omit-if-zero v1–v4, then **always-emit from v5** (the temporal toggle) | + +`*` resolves to "this version and every later one," so an *active* field never needs a tag edit across a version bump — only a field that is *dropped* at the bump gets its range closed (`v1..*` → `v1..vN`). + +**Recovery is the property that justifies the range syntax.** The hard requirement: if we drop a field, then versions later realize we need it again, we must be able to bring it back *without* disturbing any frozen historical hash. The rule that guarantees it: **you only ever *add* a range for the *new* version; you never edit a shipped version's membership.** Walk it: + +- `Foo` tagged `v1..*`. At the v2 bump we drop it → edit to `v1..v1`. v1 still emits `Foo`; v2+ does not. +- At v5 we need it back → edit to `v1..v1,v5..*`. **v1's membership is unchanged (still in the set), v2–v4 unchanged (still out), only v5+ is added.** + +Every frozen output is byte-preserved, and the **golden vectors prove it**: the edit `v1..v1` → `v1..v1,v5..*` must leave the v1–v4 vectors identical or CI fails. 
The grammar lets you *express* the non-contiguous set; the golden vectors *forbid* rewriting history while doing so. Two recovery flavors, both covered: (a) field still on the struct (lingering for replay) → reopen its range; (b) field already physically deleted (floor passed it) → bring-back is just a fresh additive field tagged `vN..*`. Same outcome, no special case. + +**Always-emit is per-range, for the same reason.** Whether a field's *zero value emits* can change over time just as its membership can — so `!` flags an individual range, not the whole field. `v1..v4,!v5..*` means omit-if-zero through v4, then always-emit from v5. Toggling it is an *output-changing* edit (a zero-valued field starts or stops emitting), so it lands as a new range at a new version exactly like a drop/re-add — same output-preservation rule, same golden-vector enforcement. The walker just asks "which range holds N, and is it `!`?" This is why the earlier whole-field `always` flag was wrong: it could not be toggled temporally without abandoning the generic walker. + +**What tags version, and what they don't.** Tags version *membership* — which fields a version measures. They do **not** version *encoding* — how a field's bytes are formed, or how the combiner folds non-field inputs. So the generic walker absorbs additive / removal / bring-back changes as pure tag edits (zero code), while a genuine encoding or combiner change still ships as versioned code in `computeFP(N+1)` (the walker output + the combiner step frozen at N). The taxonomy's non-additive rows are exactly that small set. + +**Enforcement**, three layers: + +1. **No tag → build failure** restores the safe default the projection otherwise gives up (G-1): a forgotten field fails loudly instead of silently dropping out of the hash. The tag *is* the per-field completeness ledger — no separate audit, no field→key bridge. +2. 
**Well-formedness:** ranges parse, are sorted and non-overlapping, and name no version above `currentLockContentVersion` (open `*` excepted); a `!` prefix is per-range and orthogonal to those checks. A malformed or future-referencing set fails the build. +3. **Golden vectors** pin `(config, version) → hash`, so any edit that changes a *shipped* version's output for an existing config fails CI. The precise rule is **output-preservation**, not literal membership-freezing: adding an *omit-if-zero* field to the current version (`vN..*`) is the one no-bump edit, because no existing config set it so no existing vector moves; every *output-changing* edit instead defines a **new** version (closing or opening a range at the bump), leaving all earlier versions byte-identical. You can introduce membership for the new version freely, but you can never silently rewrite a shipped hash. + +This retires the `expectedExclusions` map in `fingerprint_test.go` outright: "no tag → fail" makes every decision explicit and local, and golden vectors catch accidental include→exclude edits — the map's two jobs, both subsumed. + ### Baseline v1 — omit-if-zero, no include-always legacy Because the reset rebuilds everything, there is **no pre-existing population to stay byte-compatible with.** That removes the single biggest constraint of the incremental plan: we do **not** need an `include-always` compatibility mode to preserve today's hashes. `projectV1` is the omit-if-zero projection from day one. There is no `computeFP1 = legacy include-always` entry to carry forever — the registry's floor *starts* at the clean projection. ```go -// projectV1 emits the canonical byte form of the fields v1 measures. -// Field selection is explicit code, not reflection — this is what freezes it. -// emit() length-prefixes key+value so distinct field sets cannot collide; -// it skips a field when the resolved value is zero (the omit-if-zero default). 
-func projectV1(c *ComponentConfig) []byte { +// Membership is declared per field as a version-set tag; one generic walker +// emits the fields whose set includes the target version, in stable key order. +// What freezes a version is that its tags never change once shipped (golden +// vectors enforce it) — not that the walker is bespoke. +type ComponentConfig struct { + Upstream string `fingerprint:"v1..*"` // measured v1+, omit-if-zero + Patches []string `fingerprint:"v1..*"` // omit-if-zero (nil and [] both → absent) + StripDebug bool `fingerprint:"!v1..*"` // always-emit: zero (false) is build-meaningful + Internal string `fingerprint:"-"` // never measured + // … every fingerprinted field carries an explicit tag; absent ⇒ build failure … +} + +func project(c *ComponentConfig, version int) []byte { var b canonicalBuf - b.emit("upstream", c.Upstream) // omit-if-zero - b.emit("patches", c.Patches) // omit-if-zero (nil and [] both → absent) - b.emitAlways("strip_debug", c.StripDebug) // always: zero (false) is build-meaningful - // … one line per measured field, in a fixed order … + for _, f := range fingerprintFields(c) { // reflection, cached, sorted by key + r := f.set.rangeContaining(version) + if r == nil { + continue // field not measured at this version + } + b.emit(f.key, f.value, r.always) // r.always (the range's '!') ⇒ emit even when zero + } return b.Bytes() } ``` @@ -235,14 +295,14 @@ func projectV1(c *ComponentConfig) []byte { - Two configs that both resolve a field to zero build identically → hashing them the same is **correct**, not a collision. - "Unset" never reaches the hasher — it has already been resolved to its default. If the default is non-zero, the field is non-zero and is emitted anyway. If the default *is* zero, then unset and explicit-zero resolve identically → same build → same hash → correct. -So the classic false-negative requires absence ≠ zero-default *at the point of hashing*, and post-merge resolution closes that gap. 
The load-bearing invariant is **G5's guarantee restated structurally: the fingerprint must see exactly the build-effective resolved config.** That invariant must already hold, or fingerprinting is broken independently of this change. `emitAlways` is the escape hatch for the rare field whose zero value is build-meaningful. +So the classic false-negative requires absence ≠ zero-default *at the point of hashing*, and post-merge resolution closes that gap. The load-bearing invariant is **G5's guarantee restated structurally: the fingerprint must see exactly the build-effective resolved config.** That invariant must already hold, or fingerprinting is broken independently of this change. A `!`-prefixed range is the escape hatch for the rare field whose zero value is build-meaningful. **Result:** additive fields are drift-neutral **by construction** (G4) — a newly added field, listed omit-if-zero in `projectVN`, emits nothing for any component that does not set it, so it is invisible to every lock that leaves it unset, old or new. Adding it moves no existing hash (no shipped lock could have set a field that did not yet exist), so it needs no version bump. Only setters drift (G2). #### Edge cases under omit-if-zero -- **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). Post-merge: unset → `4` (emitted), explicit `0` → omitted. These build differently *and* hash differently, so there is no collision — they are consistent. Use `emitAlways` only if a zero value must be distinguishable from a future change of default. -- **nil vs empty slice.** A missing TOML key → nil → omitted; `key = []` → non-nil empty → emitted. For any slice/map field where an explicit-empty value is reachable and build-meaningful, use `emitAlways` so nil and empty both hash. +- **Meaningful zero with a non-zero default** (e.g. `int Jobs` defaulting to `4`, where `0` means serial). Post-merge: unset → `4` (emitted), explicit `0` → omitted. 
These build differently *and* hash differently, so there is no collision — they are consistent. Use a `!` range only if a zero value must be distinguishable from a future change of default. +- **nil vs empty slice.** A missing TOML key → nil → omitted; `key = []` → non-nil empty → emitted. For any slice/map field where an explicit-empty value is reachable and build-meaningful, use a `!` range so nil and empty both hash. ### The reset load-out — what to spend the free rebuild on @@ -255,7 +315,7 @@ The reset rebuild is a budget. Spend it on the irreversible / cutover-only chang 5. **Unify on `sha256` everywhere**, retiring the `uint64`→decimal-string wart from the `hashstructure` era. One hash format, one encoding. 6. **Do every pending rename / default-normalization now.** Renaming a field, moving content between structs, or changing a baked-in default is a one-way door under Part 2 (it needs a version bump + replay); at the reset it is free because everything rebuilds anyway. This is where the schema-axis "hardest cases" get absorbed cheaply. -**Anti-goal:** do *not* burn reset budget on additive fields — Part 2 handles those for free, forever. The single success criterion for the load-out is that **no second coordinated cutover is ever needed**: after the reset, every future change must be expressible as either a free additive field or a lazy Part 2 version bump. +**Anti-goal:** do *not* burn reset budget on additive fields — Part 2 handles those for free, forever. The success criterion for the load-out is that **no *routine* change ever forces a second coordinated cutover**: after the reset, every ordinary change must be expressible as either a free additive field or a lazy Part 2 version bump. Retiring an *old* content version is the one sanctioned exception — a fleet-wide `component migrate` is itself a deliberate, planned, reset-grade event (see [Registry floor](#registry-floor-and-forced-migration)); the goal is that nothing *unplanned* ever forces one. 
### The lock changes at the reset — atomic token + forced upgrade @@ -315,6 +375,19 @@ Stamp one **lock content-hash version** into the lock (the `v1:` prefix of the a } const currentLockContentVersion = 1 // == the reset baseline; bumps only on a real algo change const minSupportedLockContentVersion = 1 // floor; raise only after a deliberate `component migrate` + + // init enforces the registry/floor contract: every version in + // [minSupported, current] MUST have an entry, or replay panics at + // runtime instead of failing the build. The map and the two consts are + // edited independently, so this assertion is load-bearing, not decorative. + func init() { + for v := minSupportedLockContentVersion; v <= currentLockContentVersion; v++ { + if _, ok := lockAlgos[v]; !ok { + panic(fmt.Sprintf("lockAlgos missing version %d in [%d,%d]", + v, minSupportedLockContentVersion, currentLockContentVersion)) + } + } + } ``` 3. In `checkFingerprintFreshness`, compute at the **current** version. On mismatch, if the lock's token version `< current`, recompute at the lock's token version. If *that* matches the stored digest, the inputs are unchanged and only the algorithm evolved → treat as `FreshnessCurrent` and flag for silent re-stamp. Otherwise → `FreshnessStale`. (The resolution hash reuses `computeRes1` until its algorithm first changes — see scope note.) @@ -369,9 +442,10 @@ This is correct *by contract* (a v1 lock promises freshness under the v1 input s Lazy migration means an untouched lock can sit at an old version **indefinitely** (G3 by design). That makes "keep the last *N* versions" a **correctness cliff, not a tuning knob**: if pruning drops the compute function a lock still depends on, replay becomes impossible → forced `FreshnessStale` → the mass rebuild/rewrite (and, via the downstream-consumer analysis below, mass changelog churn) the whole design exists to avoid. 
So the floor must be explicit and paired with an escape hatch, decided now: - **`minSupportedLockContentVersion`** is a hard floor. A lock below it cannot be replayed and is treated as `Stale`. Dropping a registry entry is therefore a deliberate, breaking, announced act — never incidental cleanup. -- **`component migrate`** (Open Q#5, promoted to a requirement) force-advances every lock to the current content version in one deliberate pass. This is the *only* sanctioned way to retire an old version: migrate the fleet first (one intentional, reviewed, fleet-wide commit), then raise the floor. Note this pass is a deliberate G1 exception — it *is* the eager migration G1 normally forbids, made safe by being explicit and operator-driven rather than a silent side effect. **Contract:** it is *offline* — it loads each lock, recomputes the fingerprint at `currentLockContentVersion`, and rewrites the token; it does **not** re-resolve upstream (`upstream-commit`/`import-commit` untouched, unlike `update --force-recalculate`) and does **not** flip the release signal (unlike `--bump`). A migration that re-resolved or bumped would no longer be a pure version advance. The on-disk *config* axis has its own verb, [`config migrate`](#config-schema-version-and-canonical-migration-future); the two are orthogonal — each lives with the artifact its command group already owns (`component` writes locks, `config` owns the TOML). +- **`component migrate`** (Open Q#5, promoted to a requirement) force-advances every lock to the current content version in one deliberate pass. This is the *only* sanctioned way to retire an old version: migrate the fleet first (one intentional, reviewed, fleet-wide commit), then raise the floor. Note this pass is a deliberate G1 exception — it *is* the eager migration G1 normally forbids, made safe by being explicit and operator-driven rather than a silent side effect. 
**Contract:** it is *offline* — it loads each lock, recomputes the fingerprint at `currentLockContentVersion`, and rewrites the token; it does **not** re-resolve upstream (`upstream-commit`/`import-commit` untouched, unlike `update --force-recalculate`) and does **not** touch the manual-bump counter (unlike `--bump`). It *does*, however, move every token's digest — advancing the algorithm version is the whole point — so a fleet-wide migrate **is a fleet-wide, release-grade event**: `FindFingerprintChanges` reads each moved token as notable, exactly as [the synthetic-history trap](#the-synthetic-changelogrelease-path-is-the-real-hazard) warns. That is *why* migrate is reset-grade and rare, not a free background sweep — the release churn is the deliberate cost of retiring a version. The on-disk *config* axis has its own verb, [`config migrate`](#config-schema-version-and-canonical-migration-future); the two are orthogonal — each lives with the artifact its command group already owns (`component` writes locks, `config` owns the TOML). +- **Floor-advance cadence.** Because raising the floor requires a release-grade `component migrate`, pruning cannot be routine — left alone, the registry, golden vectors, and deprecated tombstone fields grow **append-only** (a real cost the opaque-token model accepts; see the manifest alternative). Policy: piggyback floor-raises onto *already-planned* mass rebuilds (the next environment cutover or a major release), and enforce a CI ceiling on the `currentLockContentVersion − minSupportedLockContentVersion` *spread* so the backlog cannot grow unbounded between those planned events. The spread, not the absolute version number, is the quantity kept small. -**Mixed-toolchain hazard — handled by force-rehash, not a format gate.** The classic trap is an older binary regressing a newer lock. 
Because the lock *format* never bumps, an old binary *can* write a reset lock — but the **atomic token** makes that harmless: it stamps a legacy (prefix-less) or lower-`v` hash, which the next new-binary run detects as sub-floor and **force-rehashes** to the current version. Self-correcting, never silent corruption. The symmetric residual is a binary that predates a content-version `v2` and meets a `v2` token it cannot replay: it must **error** (the token version exceeds its `currentLockContentVersion`), not silently restamp at `v1`. A one-line write guard (refuse to write a token whose version exceeds the binary's `currentLockContentVersion`) plus the CI version-pin closes that direction. +**Mixed-toolchain hazard — bounded by the version-pin, not auto-repair.** The classic trap is an older binary regressing a newer lock. Because the lock *format* never bumps, an old binary *can* write a reset lock, stamping a legacy (prefix-less) or lower-`v` hash. In the **working tree** this is self-correcting: the next new-binary run detects the sub-floor token and force-rehashes it to the current version. But "self-correcting" stops at the working tree — if a downgraded lock is **committed**, `FindFingerprintChanges` reads `v1 → legacy → v1` as two real release events, and a published `%autorelease` increment cannot be withdrawn. So the load-bearing guard against *committed* phantom releases is the **CI version-pin**: post-cutover, no old binary may run the `update`-and-commit step. (The force-rehash only cleans the working tree; it does not undo history.) The *symmetric* residual — a binary that predates content-version `v2` meeting a `v2` token it cannot replay — is closed by a **required** write-time guard (Open Q#5, now a requirement): refuse to write a token whose version exceeds the binary's `currentLockContentVersion`, erroring rather than silently restamping at `v1`. 
Note this guard lives in the binary doing the write, so it constrains *newer-but-not-newest* binaries; it does **not** retroactively constrain a genuinely *old* binary — that direction is the version-pin's job. #### Replaying across a changed input set — `{a,b,c}` → `{a,b,d}` @@ -387,14 +461,14 @@ Split the change into its two halves; they are handled independently: So the bump is **not breaking**: replay answers "were the *old* inputs unchanged?" without rebuilding. -**The one constraint replay still imposes: a retained `projectVN` must be able to read every field it lists.** Unlike the `hashstructure` substrate, `projectV1` is immune to field *additions* (it never reflects the live struct). It is *not* immune to field *removal*: `projectV1` names `c` explicitly, so physically deleting `c` from the struct stops `projectV1` from compiling. Removal is therefore the one edit still gated by a **deprecate-then-delete** two-step, both non-breaking: +**The one constraint replay still imposes: a field a retained version still measures must stay on the struct.** The projection is immune to field *additions* (the walker only emits fields whose tag set includes the target version, so a new field is invisible to old versions). It is *not* immune to field *removal*: v1 still measures `c` (its tag set includes v1) and the retained v1 golden vector sets `c`, so physically deleting `c` from the struct makes that vector's config unconstructable → the golden-vector test fails to build. (Hand-written `projectVN` functions would make this a *compile* error instead — a marginally stronger guard the tag walker trades for an equally-blocking CI one; see [D2](#d2--version-tagged-field-selection--golden-vectors).) Removal is therefore the one edit still gated by a **deprecate-then-delete** two-step, both non-breaking: -1. **Bump to v2 measuring `{a,b,d}` but keep field `c` on the struct** so `projectV1` can still read it for replay (`projectV2` simply does not list `c`). 
Every old lock replays clean at v1, is recognized as unchanged, lazy re-stamps to v2. Zero forced rebuilds. +1. **Bump to v2 measuring `{a,b,d}` but keep field `c` on the struct** so the v1 projection can still read it for replay (close `c`'s tag to `v1..v1`, so v2 does not measure it). Every old lock replays clean at v1, is recognized as unchanged, lazy re-stamps to v2. Zero forced rebuilds. 2. **Only after the floor passes v1** (`minSupportedLockContentVersion = 2`, ideally after a deliberate `component migrate`) physically delete field `c` and `projectV1`. -> **Invariant:** a field may be physically removed from the config struct only after *every* retained `projectVN` that lists it has been retired below `minSupportedLockContentVersion`. Retained projection functions and the struct they read must stay in sync — you cannot delete a field a live version still names. +> **Invariant:** a field may be physically removed from the config struct only after *every* retained version whose tag set includes it has been retired below `minSupportedLockContentVersion`. Retained versions and the struct they read must stay in sync — you cannot delete a field a live version's golden vector still sets. -This makes "drop an input" a lazy, per-component migration rather than a fleet-wide rebuild — at the cost of carrying a deprecated field on the struct until its projection function ages out. +This makes "drop an input" a lazy, per-component migration rather than a fleet-wide rebuild — at the cost of carrying a deprecated field on the struct until the last version measuring it ages out. #### First post-reset customer @@ -410,7 +484,7 @@ This is the on-disk TOML axis. It is **independent** of the fingerprint axis and The critical invariant: **migrate old TOML → latest canonical struct, then project once.** A semantically no-op migration (rename `foo`→`bar`) must produce the *same* canonical struct, hence the same projection bytes, hence no drift. 
This is what keeps the schema axis **orthogonal** to the lock axis: a faithful `config migrate` is a pure re-encoding that moves *no* fingerprint, so it never triggers a `component migrate`. If a TOML change genuinely alters build meaning, that is a content-version bump (Part 2), not a `config migrate`. -**Resolved by projection:** the old `hashstructure` caveat — that it mixed `reflect.Type.Name()` into the hash, so renaming a Go struct moved every fingerprint even with identical content — **no longer applies.** `projectVN` hashes only the explicit field bytes it emits, never the type name. A struct rename is now genuinely drift-neutral. +**Resolved by projection:** the old `hashstructure` caveat — that it mixed `reflect.Type.Name()` into the hash, so renaming a Go struct moved every fingerprint even with identical content — **no longer applies.** `projectVN` hashes only the explicit field bytes it emits, never the type name. A struct rename is now genuinely drift-neutral — **pinned by a golden test** (rename a fingerprinted struct while keeping its fields identical → byte-identical digest), so the property is CI-enforced, not just asserted here. 
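The struct-rename golden test mentioned above could look roughly like the following. It is a simplified sketch, not the RFC's walker: `RenamedConfig` and `digestFields` are hypothetical, keys are assumed to come from Go field names, values are stringified with `fmt.Sprint`, and version ranges are ignored (any tag other than `-` counts as measured).

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"reflect"
	"sort"
)

// Two struct types with identical fields and tags but different type names.
type ComponentConfig struct {
	Upstream string `fingerprint:"v1..*"`
	Internal string `fingerprint:"-"`
}

type RenamedConfig struct {
	Upstream string `fingerprint:"v1..*"`
	Internal string `fingerprint:"-"`
}

// digestFields hashes only measured field keys and values.
// The type name is never written into the hash.
func digestFields(cfg any) string {
	v := reflect.ValueOf(cfg)
	t := v.Type()
	type kv struct{ key, val string }
	var fields []kv
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Skip excluded fields and zero values (omit-if-zero).
		if f.Tag.Get("fingerprint") == "-" || v.Field(i).IsZero() {
			continue
		}
		fields = append(fields, kv{f.Name, fmt.Sprint(v.Field(i).Interface())})
	}
	// Stable key order, independent of declaration order.
	sort.Slice(fields, func(i, j int) bool { return fields[i].key < fields[j].key })

	h := sha256.New()
	for _, f := range fields {
		for _, s := range []string{f.key, f.val} {
			var n [4]byte
			binary.BigEndian.PutUint32(n[:], uint32(len(s)))
			h.Write(n[:])
			h.Write([]byte(s))
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := ComponentConfig{Upstream: "https://example.invalid/repo.git", Internal: "x"}
	b := RenamedConfig{Upstream: "https://example.invalid/repo.git", Internal: "y"}
	// Same measured content under a different type name: identical digest.
	fmt.Println(digestFields(a) == digestFields(b)) // true

	// A change to a measured field still moves the digest.
	b.Upstream = "https://example.invalid/other.git"
	fmt.Println(digestFields(a) == digestFields(b)) // false
}
```

A real golden test would presumably compare against a checked-in digest rather than a second live struct, so that the check also catches encoder changes.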
## Pipeline @@ -434,12 +508,17 @@ The versioned-replay story in Part 2 must hold for **every** reader of `InputFin | -------- | ----- | -------- | --------------------------- | | `checkFingerprintFreshness` (resolver) | recomputed identity | vs stored token | Replay at token version (Part 2 core) | | `component update` `Changed` decision | recomputed identity | vs stored token | **Replay before `Changed`** (see churn policy seam) | -| `changed.go` `classifyComponent` / `haveMatchingFingerprints` (CI classifier) | stored token strings | version-blind compare | **Replay-aware compare** — a v1 token must match its v2 re-stamp as "same" | +| `changed.go` `classifyComponent` / `haveMatchingFingerprints` (CI classifier) | stored token strings (two historical git refs) | string compare | **String-only — must NOT replay** (no inputs available, and replaying historical configs would violate the no-recompute invariant); kept honest by strict-lazy churn, exactly like `FindFingerprintChanges` | | `synthistory.FindFingerprintChanges` | stored token strings across git history | adjacent commits | **No change needed — if migration stays lazy** | | `synthistory.BuildDirtyChange` | recomputed (current ver) | vs stored `headLock` token | **Replay at headLock version** before declaring dirty | | `ResolutionInputHash` staleness/write | recomputed resolution hash | vs stored | **Shares the version; replay reserved, not yet wired** | -The `changed.go` classifier is the easily-missed fifth consumer: [`classifyComponent`](../../../internal/app/azldev/cmds/component/changed.go) and `haveMatchingFingerprints` do raw, version-blind token compares to decide CI classification. Post-switchover a v1 token and its semantically-identical v2 re-stamp are different strings, so a naive compare would misclassify the component as changed. It needs the same replay-aware comparison as the freshness check (compare at the older token's version), not a raw string equality. 
+**Two comparator classes, not one — and only one of them can replay.** The consumers split cleanly by *what they hold*: + +- **Current-tree comparators** (`checkFingerprintFreshness`, `update`'s `Changed`, `BuildDirtyChange`) recompute against *live inputs*, so they **can and must** replay at the stored token's version. Feasible and invariant-safe. +- **Stored-vs-stored historical comparators** (`FindFingerprintChanges`, `changed.go`'s `classifyComponent`/`haveMatchingFingerprints`) hold only committed token *strings* from two git refs — no config, no FS, no inputs. They **cannot** replay, and replaying would require recomputing a historical fingerprint, which the [forever-invariant](#back-compat-invariant--synthetic-history-reads-stored-strings-never-recomputes) forbids outright. Both stay **string-only**, kept honest by the *same* strict-lazy churn policy: under lazy migration a v1→v2 re-stamp only ever rides a commit whose inputs genuinely changed, so a raw string compare never sees a version-only delta. + +The `changed.go` classifier was the easily-missed member of the *second* class. The fix is **not** to make it replay-aware (impossible, and invariant-violating) — it is to confirm it lives under the strict-lazy guarantee, exactly as `FindFingerprintChanges` does. An earlier draft of this table wrongly demanded replay here; that obligation is removed. ### The synthetic changelog/release path is the real hazard @@ -460,7 +539,7 @@ The single shared content version (the token's `v` prefix) covers it (see [Bo - `ResolutionInputHash` does **not** feed `synthistory` — so an algorithm change can never mint a phantom changelog/release (that hazard is fingerprint-only). Worst case is a one-line `resolution-input-hash` rewrite per lock plus a wasted re-resolution that usually yields the same commit. Churn, not corruption. - It is a flat seven-field SHA256, not a struct walk, so the projection substrate leaves it untouched — it has no pending version event. 
Its registry slot stays `computeRes1` until its inputs genuinely change. -**Decision:** the atomic token format is fixed at the reset, so there is no irreversible key-naming decision left; wire fingerprint replay in Part 2's first PR; reserve resolution replay (slot present, prior fn reused) and wire it the day `ComputeResolutionHash` first changes — a localized follow-up with no schema change. KISS/YAGNI on the second replay. +**Decision:** the atomic token format is fixed at the reset, so there is no irreversible key-naming decision left; wire fingerprint replay in Part 2's first PR; reserve resolution replay (slot present, prior fn reused) and wire it the day `ComputeResolutionHash` first changes — a localized follow-up with no schema change. KISS/YAGNI on the second replay. **One constraint for that day:** give `ResolutionInputHash` its *own* version prefix (decoupled from the `InputFingerprint` token) when replay is wired. Sharing one integer is fine *now* because resolution has no pending version event; but a *resolution-only* algorithm change must not be forced to advance the `InputFingerprint` token — that field feeds `FindFingerprintChanges`, so bumping it for a resolution change would mint a phantom release. Independent prefixes keep a resolution bump off the release-bearing field. 
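To make the independent-prefix constraint and the replay rule concrete, here is a hedged sketch. The `v1:` prefix is taken from the RFC; everything else (`parseToken`, `checkFreshness`, `toyAlgo`, the status names, and treating a prefix-less token as version 0) is hypothetical and only illustrates the shape of the logic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

type freshness int

const (
	freshnessStale freshness = iota
	freshnessCurrent
	freshnessCurrentRestamp // inputs unchanged, only the algorithm evolved
)

var errFutureToken = errors.New("token version is newer than this binary supports")

// computeFn stands in for one frozen per-version fingerprint computation.
type computeFn func(inputs string) string

// toyAlgo builds a hypothetical versioned digest; the real algorithms differ.
func toyAlgo(version int) computeFn {
	return func(inputs string) string {
		sum := sha256.Sum256([]byte(strconv.Itoa(version) + "\x00" + inputs))
		return hex.EncodeToString(sum[:])
	}
}

// parseToken splits "v<N>:<digest>". Assumption: a legacy prefix-less
// token is reported as version 0, which is always below the floor.
func parseToken(token string) (int, string) {
	head, rest, ok := strings.Cut(token, ":")
	if ok && strings.HasPrefix(head, "v") {
		if n, err := strconv.Atoi(head[1:]); err == nil && n > 0 {
			return n, rest
		}
	}
	return 0, token
}

// checkFreshness replays at the token's own version. This yields the same
// outcome as "compute at current, then replay on mismatch" in fewer steps.
func checkFreshness(token, inputs string, algos map[int]computeFn, minVer, curVer int) (freshness, error) {
	ver, stored := parseToken(token)
	if ver > curVer {
		return freshnessStale, errFutureToken // never restamp downward
	}
	fn, ok := algos[ver]
	if ver < minVer || !ok {
		return freshnessStale, nil // below the floor: cannot be replayed
	}
	if fn(inputs) != stored {
		return freshnessStale, nil // inputs genuinely changed
	}
	if ver < curVer {
		return freshnessCurrentRestamp, nil
	}
	return freshnessCurrent, nil
}

func main() {
	algos := map[int]computeFn{1: toyAlgo(1), 2: toyAlgo(2)}
	fpToken := "v1:" + algos[1]("inputs-A")

	st, err := checkFreshness(fpToken, "inputs-A", algos, 1, 2)
	fmt.Println(st == freshnessCurrentRestamp, err) // true <nil>

	st, _ = checkFreshness(fpToken, "inputs-B", algos, 1, 2)
	fmt.Println(st == freshnessStale) // true

	// Each field carries its own prefix, so a resolution-only bump
	// leaves the release-bearing fingerprint token untouched.
	fpVer, _ := parseToken(fpToken)
	resVer, _ := parseToken("v2:" + algos[2]("resolution-inputs"))
	fmt.Println(fpVer, resVer) // 1 2
}
```

Keeping the version inside each token, rather than in one shared integer, is what lets the two fields advance independently in this sketch.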
## Design decisions @@ -470,23 +549,24 @@ Both can omit zero values; the decisive difference is **whether an old algorithm | | Canonical projection (chosen) | `hashstructure` + `Includable` | | --- | --- | --- | -| Old algorithm frozen | Yes — explicit pinned field list | No — reflects the live struct/method-set | +| Old algorithm frozen | Yes — version-tagged fields, golden-vector pinned | No — reflects the live struct/method-set | | Sound replay (Part 2) | Yes | No (the disqualifier) | -| Meaningful empties | `emitAlways` per field | `fingerprint:"always"` per field | +| Meaningful empties | `!`-prefixed range per field | `fingerprint:"always"` per field | | Type-name in hash | No (rename is drift-neutral) | Yes (rename moves every hash) | -| Plumbing | Projection encoder + golden vectors | Value-receiver `HashInclude` on every nested struct + `v.(reflect.Value)` assert | +| Plumbing | Generic walker + version tags + golden vectors | Value-receiver `HashInclude` on every nested struct + `v.(reflect.Value)` assert | `Includable` keeps today's hashes byte-identical, which mattered for an *incremental* rollout — but that property is worthless once the reset rebuilds everything anyway, and it comes attached to a substrate that makes replay unsound. Projection trades byte-compatibility (which we are spending on the coordinated cutover regardless) for frozen replay (which we need forever). Adopted at the reset. -### D2 — Explicit field lists + golden vectors over reflection tags +### D2 — Version-tagged field selection + golden vectors -Field selection lives in `projectVN` as ordinary, explicit Go code (one `emit`/`emitAlways` line per measured field), not in struct tags read by a reflective walker. Rationale: +Field membership lives in a per-field version-set tag (`fingerprint:"v1..*"`) read by one generic walker — not in N hand-written functions, and not in the binary include/exclude tag of today's reflective audit. 
Rationale: -- The *unsafe* failure direction is the false-negative (a meaningful field silently omitted → missed rebuild). An explicit list makes "what does v1 measure?" greppable in one function, and the **golden-vector test** turns any accidental change to a historical projection into a CI failure — a far stronger guard than a tag-presence audit. -- It forces the "is this field's zero value build-meaningful?" decision at the call site (`emit` vs `emitAlways`), with full context. -- It removes the `Includable` nested-struct trap entirely: there is no per-struct method to forget, no decorative tag that passes the audit while silently hashing a zero. +- **The unsafe direction is the false-negative** (a meaningful field silently omitted → missed rebuild → stale artifact). A *mandatory* tag — absent → build failure — makes the include/exclude decision impossible to forget, restoring the safe default a bare hand-written list quietly gives up (G-1). The tag *is* the per-field completeness ledger: no separate audit, no field→emit-key bridge. +- **Version-awareness is declarative.** A field's whole lifecycle — introduced at v3, dropped at v5, revived at v8 — is one greppable string on the field (`v3..v4,v8..*`), not a diff smeared across three function bodies. Recovery (bring-back) is *expressible* precisely because the set is non-contiguous. +- **Golden vectors freeze it.** Editing a shipped version's output changes a checked-in `(config, version) → hash` vector → CI failure — the same backstop that would protect hand-written functions, minus the per-version boilerplate. +- **It retires `expectedExclusions`.** The map in `fingerprint_test.go` exists to (a) force a decision on every field and (b) catch accidental exclusion-tag removal; no-tag-fail and golden vectors subsume both. -The cost is writing `projectVN` by hand instead of leaning on reflection. That is the point: hand-written selection is what makes the function frozen and auditable. 
+The one thing hand-written functions do better: field *removal* is **compile-enforced** there (deleting a field a retained `projectVN` still names won't compile), where the tag walker downgrades it to an equally-blocking *golden-vector* build failure. That single marginal loss buys declarative lifecycles, native completeness, and first-class recovery — a good trade. The hand-written variant is kept as [Option B](#alternatives-considered). ### D3 — Atomic self-describing token; no format bump, reconcile via force-rehash @@ -496,7 +576,7 @@ The lock **format** `Version` stays at `1`. An earlier draft bumped it (1→2) a ### D4 — Project to bytes, not a `ConfigHash()` method on the type -`projectVN(config) []byte` returns canonical bytes; the combiner in `fingerprint` owns the `sha256` and the version dispatch. A `ConfigHash()` method that returns a finished hash was rejected: it drags crypto + versioning onto a data type, and it tempts callers to route around the version registry to get a raw, version-agnostic hash. Returning bytes keeps the config type ignorant of versioning, and keeps the combiner the **sole version authority**. See [the seam note](#where-the-hashing-logic-should-live). +`project(config, version) []byte` returns canonical bytes; the combiner in `fingerprint` owns the `sha256` and the version dispatch. A `ConfigHash()` method that returns a finished hash was rejected: it drags crypto + versioning onto a data type, and it tempts callers to route around the version registry to get a raw, version-agnostic hash. Returning bytes keeps the config type ignorant of versioning, and keeps the combiner the **sole version authority**. See [the seam note](#where-the-hashing-logic-should-live). ## Alternatives considered @@ -505,14 +585,16 @@ The lock **format** `Version` stays at `1`. An earlier draft bumped it (1→2) a - **Parallel versioned structs with per-struct `Hash()`** — couples locks to Go type identity and duplicates hashing logic per version. 
Rejected in favor of Part 2's integer-versioned combiner over frozen projections. - **Bump the lock format `Version` 1→2 as a poison pill** (an earlier draft's choice) — makes old binaries hard-reject reset locks. Rejected: it also blocks old binaries from reading pins to queue a build, and it is unnecessary, since the content-version registry already force-rehashes any sub-floor or downgraded token (D3). Same-format + force-rehash keeps old binaries useful without risking silent corruption. - **Eager fleet-wide migration as the steady-state mechanism** — rewriting every lock on every algorithm change is the mass-churn the design exists to prevent. Rejected for the steady state. The *reset* is a deliberate, one-time, operator-driven eager pass riding an already-scheduled rebuild — the sanctioned exception, not the rule; `component migrate` is its post-reset equivalent for retiring an old version. +- **Hand-written per-version `projectVN` selection functions (instead of version tags).** Each version gets a bespoke `func projectVN(c) []byte` with one explicit `emit`/`emitAlways` line per measured field. *Win:* field removal is **compile-enforced** — deleting a struct field a retained `projectVN` still names won't compile (the tag walker downgrades this to a CI-time golden-vector failure). *Losses:* membership is smeared across N function bodies instead of one declarative tag per field; "bring a field back a few versions later" has no first-class expression (you re-add an `emit` line, with nothing tying it to the field's earlier life); and the mandatory-decision property (G-1) needs a *separate* completeness test with an awkward field→emit-key bridge, where the tag simply *is* the ledger. Rejected in favor of version tags: the declarative lifecycle, native completeness, and expressible recovery outweigh trading one compile-time guard for an equally-blocking CI guard. 
+- **Per-field hash manifest in the lock (instead of one opaque token).** Store `{field → hash}` (à la `go.sum`) rather than a single `v:sha256:…` digest. *Genuine wins:* dropping a field becomes ignoring its manifest line — no projection kept alive for replay, so the **deprecate-then-delete two-step and the registry-retirement deadlock** (the append-only growth above) both vanish; and the stored-vs-stored historical comparators become structural set-diffs rather than version-blind string compares. *Why the opaque token still wins for azldev:* (1) the projection substrate **already** delivers additive immunity (G4) — the manifest's headline draw — so that advantage is moot, not additive; (2) the manifest does **not** kill the false-fresh hazard, contrary to first impression — an old lock has *no line* for a newly-measured input, so there is still no baseline to detect a change to it (the blind spot is relocated, not removed); (3) it makes *algorithm evolution* — the entire point of Part 2 — **harder**, needing per-field versioning where the token needs one integer for the whole algorithm; and (4) it bloats every lock to O(fields × components) (the well-known `go.sum` size cost). The manifest is the better tool for a *static* input set that mainly grows and shrinks; the opaque token + single version is the better tool for an *evolving hashing algorithm*, which is azldev's actual problem. Recorded explicitly because the reset bakes the storage model in — token-vs-manifest is irreversible after PR B — and the retirement deadlock the manifest would have dissolved is instead answered by the floor-advance cadence above. ## Incremental delivery The reset (Part 1) must land as one coherent change at the dev→prod cutover; its pieces are independently reviewable but ship together because they all move the hash. -1. 
**PR A (substrate)**: `projectVN` encoder (`canonicalBuf`, `emit`/`emitAlways`), `projectV1` with the explicit field list, `sha256` combiner, and the golden-vector test. Pure addition alongside the existing path; not yet wired into `ComputeIdentity`. Unit tests: a field absent from `projectV1` is invisible to the digest; `emitAlways` fields hash even at zero; golden vectors pin the v1 output. +1. **PR A (substrate)**: the canonical encoder (`canonicalBuf`, `emit` with a per-range always flag), the generic tag-driven `project(cfg, N)` walker + version-set tag parser, **version tags on every fingerprinted field** (absent → build failure), the `sha256` combiner, and golden vectors. The mandatory-tag test replaces both the retired `TestAllFingerprintedFieldsHaveDecision` audit and its `expectedExclusions` map — the G-1 fix is now native to the tag, no field→key bridge needed. Pure addition alongside the existing path; not yet wired into `ComputeIdentity`. Unit tests: a field tagged `v2..*` is invisible to a v1 projection; a `!`-prefixed range hashes even at zero; a field with **no** `fingerprint` tag fails the build; golden vectors pin v1; editing a shipped version's tag membership so an existing config's output moves fails a golden vector; a non-contiguous set (`v1..v1,v3..*`) round-trips through the parser. 2. **PR B (reset cutover)**: switch `ComputeIdentity` to `projectV1`; adopt the atomic `v1:sha256:` token; unify on sha256. Lock format `Version` stays `1`. Ships at the cutover; absorbed by the scheduled rebuild. Unit tests: a legacy prefix-less token is read as sub-floor and force-rehashed to `v1`; a `v1:` token round-trips; an old binary (format `1`) still parses pins from a reset lock. -3. 
**PR C (Part 2 machinery)**: the version registry (`lockAlgos`, `currentLockContentVersion`, `minSupportedLockContentVersion`), `ComputeIdentityAt`, replay-before-`Changed` in `update.go`, and replay in `checkFingerprintFreshness`, `BuildDirtyChange`, **and the `changed.go` classifier**. Resolution replay reserved (slot reuses `computeRes1`). With only `v1` registered this is inert but proven. Unit tests: a synthetic `v1`/`v2` pair with unchanged inputs → `Current` and **not** `Changed`; changed inputs → `Stale`; re-stamp only on an already-dirty write. +3. **PR C (Part 2 machinery)**: the version registry (`lockAlgos`, `currentLockContentVersion`, `minSupportedLockContentVersion`), `ComputeIdentityAt`, and replay at the three *current-tree* sites — replay-before-`Changed` in `update.go`, `checkFingerprintFreshness`, and `BuildDirtyChange`. The two *historical* comparators (`FindFingerprintChanges`, `changed.go`'s `classifyComponent`) stay **string-only** and rely only on the strict-lazy guarantee, not replay. Resolution replay reserved (slot reuses `computeRes1`). **Not fully inert:** this PR switches the live current-tree compares from raw-string to replay-aware *on merge* — only the *registry dispatch* is dormant while just `v1` exists, and `BuildDirtyChange`'s replay is a hard prerequisite for any later PR that registers `v2`. Unit tests: a synthetic `v1`/`v2` pair with unchanged inputs → `Current` and **not** `Changed`; changed inputs → `Stale`; re-stamp only on an already-dirty write. 4. **PR D (validation)**: scenario test (in the style of `scenario/component_changed_test.go`) — add a field absent from `projectV1` and set it on one component; assert only that lock drifts and every other lock is byte-identical. 5. **PR E (config schema axis, later)**: `schema-version` field + load-time canonical migration + the `config migrate` command. Gated on the first post-reset non-additive TOML change not already absorbed by the reset's normalization pass. 
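The token handling that PR B's unit tests describe (a legacy prefix-less token read as sub-floor, a `v1:` token round-tripping) might look roughly like the sketch below. It assumes the stored token has the shape `v1:sha256:` followed by a digest. `parseToken`, `needsForceRehash`, and `mayWrite` are hypothetical names used only for illustration. The two constants borrow their names from the registry in PR C, with placeholder values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Registry bounds named in the RFC; the values here are placeholders.
const (
	minSupportedLockContentVersion = 1
	currentLockContentVersion      = 1
)

// parseToken splits a stored fingerprint into its content version and the
// remainder. A token without a "vN:" prefix is treated as legacy and
// reported as version 0. (Hypothetical helper, not the RFC's API.)
func parseToken(token string) (version int, rest string) {
	head, tail, ok := strings.Cut(token, ":")
	if ok && strings.HasPrefix(head, "v") {
		if n, err := strconv.Atoi(head[1:]); err == nil && n > 0 {
			return n, tail
		}
	}
	return 0, token
}

// needsForceRehash reports whether a stored token sits below the registry
// floor and therefore has to be recomputed instead of replayed.
func needsForceRehash(token string) bool {
	version, _ := parseToken(token)
	return version < minSupportedLockContentVersion
}

// mayWrite mirrors the write guard: a binary refuses to write a token whose
// version exceeds its own currentLockContentVersion.
func mayWrite(version int) bool {
	return version <= currentLockContentVersion
}

func main() {
	for _, tok := range []string{"sha256:abc123", "v1:sha256:abc123"} {
		version, rest := parseToken(tok)
		fmt.Printf("%-18s version=%d rest=%s rehash=%t\n",
			tok, version, rest, needsForceRehash(tok))
	}
	fmt.Println("may write v2:", mayWrite(2))
}
```

Mapping a prefix-less token to version 0 keeps the sub-floor check a single integer comparison, so a legacy lock and a retired version take the same force-rehash path.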
@@ -523,7 +605,5 @@ Each PR is independently revertible up to the cutover. PRs A–B land together a 1. Should a lazy re-stamp during a *read-only* command (`render`, `build` freshness check) write the lock back, or defer all writes to `component update`? Writing on read is surprising; deferring means freshness checks stay slightly slower until the next update. (Leaning: defer all writes to `update`, keeping reads side-effect-free.) 2. For the config schema axis, does `schema-version` live per-config-file or per-component? Per-file is simpler; per-component allows mixed-version projects during migration. 3. Should the omit-if-zero predicate use `reflect.Value.IsZero()` (Go's notion) or a config-aware notion of "unset" (e.g. nil pointer vs empty string)? `projectVN` makes this a per-field choice in code, so it can differ field to field — but a default convention is still worth settling. -4. What is the canonical byte encoding for `projectVN` (length-prefixed key+value? a stable subset of TOML/CBOR?), and how are golden vectors stored and regenerated? This is the one substrate detail that is expensive to change after the reset. -5. Should the residual mixed-toolchain case get a hard write-time guard (refuse to write a token whose version exceeds `currentLockContentVersion`), or is force-rehash on read + the CI version-pin enough? (The operator escape hatch is `component migrate`; this question is only about the *automatic* guard.) 
-*Resolved in-text (recorded here so they aren't re-litigated):* the reset rides the already-scheduled dev→prod rebuild as the one sanctioned coordinated cutover; the substrate is canonical projection (frozen `projectVN` + golden vectors), not `hashstructure`; baseline `v1` is omit-if-zero with **no** include-always legacy in the registry; the lock format `Version` stays at `1` (old binaries keep reading pins to build); the substrate swap and any old-binary downgrade are reconciled by **force-rehashing** sub-floor tokens, not a format gate; the stored hash is an **atomic** `v:sha256:` token; back-compat rests on the verified invariant that **no reader recomputes a historical fingerprint** (synthetic history and historic-overlay application read stored strings only); registry retention is a **floor**, not "last N"; `component migrate` is the post-reset forced-migration pass (lock axis; `config migrate` is its schema-axis sibling); one shared content version covers both stored hashes, with resolution-hash replay reserved (slot present, fn reused) until `ComputeResolutionHash` first changes. 
+*Resolved in-text (recorded here so they aren't re-litigated):* the reset rides the already-scheduled dev→prod rebuild as the one sanctioned coordinated cutover; the substrate is canonical projection (frozen `projectVN` + golden vectors), not `hashstructure`; the **canonical byte encoding is the existing length-prefixed `:=:` form** used by `combineInputs`, committed and pinned by golden vectors at the reset (former Open Q#4 — a precondition for PR A, not an open question, because the reset makes it irreversible); the **version write-guard is a requirement, not an option** (former Open Q#5): a binary refuses to write a token whose version exceeds its own `currentLockContentVersion`, and the CI version-pin prevents *old* binaries from committing downgrades; **field membership is declared in mandatory per-field version-set tags** (`fingerprint:"v1..*"`; absent → build failure, `!`-prefix for always-emit), read by one generic walker — this restores "forgotten field → loud build failure" (G-1) natively and retires the `expectedExclusions` map; baseline `v1` is omit-if-zero with **no** include-always legacy in the registry; the lock format `Version` stays at `1` (old binaries keep reading pins to build); the substrate swap and any old-binary downgrade are reconciled by **force-rehashing** sub-floor tokens, not a format gate; the stored hash is an **atomic** `v:sha256:` token; back-compat rests on the verified invariant that **no reader recomputes a historical fingerprint** (synthetic history and historic-overlay application read stored strings only); registry retention is a **floor**, not "last N"; `component migrate` is the post-reset forced-migration pass (lock axis; `config migrate` is its schema-axis sibling) and is itself a deliberate release-grade event; one shared content version covers both stored hashes now, with resolution-hash replay reserved (slot present, fn reused) and given its **own** prefix when `ComputeResolutionHash` first changes.
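As a rough illustration of the version-set tags recorded above (`fingerprint:"v1..*"`, non-contiguous sets such as `v3..v4,v8..*`, and the `!` prefix for always-emit), a parser might look like the sketch below. The grammar is inferred from the RFC's examples and is an assumption. `versionRange`, `parseVersion`, `parseVersionSet`, and `member` are hypothetical names, not identifiers from the codebase.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// versionRange is one inclusive span of content versions, e.g. v3..v4.
// always mirrors the "!" prefix: emit the field even at its zero value.
type versionRange struct {
	lo, hi int
	always bool
}

// parseVersion turns "v3" into 3.
func parseVersion(s string) (int, error) {
	if !strings.HasPrefix(s, "v") {
		return 0, fmt.Errorf("version %q must start with v", s)
	}
	n, err := strconv.Atoi(s[1:])
	if err != nil || n < 1 {
		return 0, fmt.Errorf("bad version number in %q", s)
	}
	return n, nil
}

// parseVersionSet parses a tag value such as "v3..v4,!v8..*".
// An empty tag is an error, matching the "absent tag fails the build" rule.
func parseVersionSet(tag string) ([]versionRange, error) {
	if strings.TrimSpace(tag) == "" {
		return nil, fmt.Errorf("missing fingerprint tag")
	}
	var set []versionRange
	for _, part := range strings.Split(tag, ",") {
		part = strings.TrimSpace(part)
		var r versionRange
		if strings.HasPrefix(part, "!") {
			r.always, part = true, part[1:]
		}
		loStr, hiStr, ok := strings.Cut(part, "..")
		if !ok {
			return nil, fmt.Errorf("range %q is not of the form vA..vB", part)
		}
		var err error
		if r.lo, err = parseVersion(loStr); err != nil {
			return nil, err
		}
		if hiStr == "*" {
			r.hi = math.MaxInt // "*" means no upper bound
		} else if r.hi, err = parseVersion(hiStr); err != nil {
			return nil, err
		}
		if r.lo > r.hi {
			return nil, fmt.Errorf("range %q is empty", part)
		}
		set = append(set, r)
	}
	return set, nil
}

// member reports whether version v is measured and, if so, whether the
// matching range is marked always-emit.
func member(set []versionRange, v int) (in, always bool) {
	for _, r := range set {
		if v >= r.lo && v <= r.hi {
			return true, r.always
		}
	}
	return false, false
}

func main() {
	set, err := parseVersionSet("v3..v4,!v8..*")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for v := 1; v <= 9; v++ {
		in, always := member(set, v)
		fmt.Printf("v%d: measured=%t always=%t\n", v, in, always)
	}
}
```

A generic walker could call `member` once per field for the requested version and skip any field that is not measured. The golden vectors would still be what freezes the result.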