Version storage restructure: problem and plan #252

@kptdobe

Problem

  • ~81% of bucket objects are version-related; ~67% are empty (one R2 object per edit, metadata only, no body).
  • .da-versions at org root ({Org}/.da-versions/{FileID}/) is a single huge prefix: slow to list and doesn't scale.
  • Two concepts mixed: (1) real version snapshots (contentLength > 0, explicit "Save version" or Restore Point), (2) audit-only entries (empty objects created on every PUT for "Collab Parse" and similar). The latter explode object count without adding real versions.

Plan (condensed)

1. Labelled versions only as R2 objects

  • Remove the "Collab Parse" version: stop creating the automatic first-save snapshot and the empty version objects on every PUT. Only create version objects for explicit labelled versions (Save version, future preview/publish) or a Restore Point.
  • New path: {Org}/{Repo}/.da-versions/{FileID}/{VersionUUID}.{ext} — move under repo so listing is per-repo, not org-wide.
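A minimal sketch of the key layout, to make the change of prefix concrete. The function names are hypothetical and not taken from the codebase; only the path templates come from the plan above.

```python
def new_version_key(org: str, repo: str, file_id: str, version_uuid: str, ext: str) -> str:
    """Build a snapshot key under the new, repo-scoped layout."""
    return f"{org}/{repo}/.da-versions/{file_id}/{version_uuid}.{ext}"


def legacy_version_prefix(org: str, file_id: str) -> str:
    """Build the legacy, org-wide prefix that the plan moves away from."""
    return f"{org}/.da-versions/{file_id}/"


# Example values are made up for illustration.
print(new_version_key("acme", "site", "file-1", "1234-abcd", "html"))
print(legacy_version_prefix("acme", "file-1"))
```

With this layout, listing versions for one repo only has to scan `{Org}/{Repo}/.da-versions/` rather than the whole org prefix.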

2. Single audit file per file (read-before-write dedupe)

  • Path: {Org}/{Repo}/.da-versions/{FileID}/audit.txt
  • Format: One line per entry (tab-separated): timestamp \t users \t path \t versionLabel \t versionId
    • path: stored without repo prefix (e.g. /surf-copy.html) so the file is readable.
    • versionLabel: human-readable name when entry is a labelled version (e.g. "v1", "Restore Point"); empty for edits.
    • versionId: snapshot id without extension when entry is a version (e.g. UUID); empty for edits.
    • Backward compat: 3-column (path only) and 4-column (path + versionId) lines are still parsed.
  • Write: On every versionable PUT, append to or update audit.txt. No empty version objects are created.
    • Read-before-write with a 30 min window: if the last line is from the same user, within 30 min, and both the last and the new entries are edits (no version), overwrite that line with the new timestamp; otherwise append.
    • Labelled version entries always append and are never replaced; they "interrupt" the dedup window (e.g. edit at 12:23, version at 12:25, edit at 12:40 → three entries).
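The line format and the dedup rule above could be sketched roughly as follows. This is an illustration only, not the actual implementation: `AuditEntry`, `parse_line`, `format_line` and `record` are hypothetical names, and the timestamp is assumed to be epoch milliseconds because the plan does not state its format.

```python
from dataclasses import dataclass

DEDUP_WINDOW_MS = 30 * 60 * 1000  # 30 min window from the plan


@dataclass
class AuditEntry:
    timestamp: int            # assumed: epoch milliseconds
    users: str
    path: str                 # stored without repo prefix, e.g. /surf-copy.html
    version_label: str = ""   # empty for plain edits
    version_id: str = ""      # empty for plain edits

    @property
    def is_version(self) -> bool:
        return bool(self.version_label or self.version_id)


def parse_line(line: str) -> AuditEntry:
    """Parse a 5-column line, or a legacy 3- or 4-column line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 3:        # legacy: path only
        ts, users, path = cols
        return AuditEntry(int(ts), users, path)
    if len(cols) == 4:        # legacy: path + versionId
        ts, users, path, version_id = cols
        return AuditEntry(int(ts), users, path, "", version_id)
    if len(cols) == 5:
        ts, users, path, label, version_id = cols
        return AuditEntry(int(ts), users, path, label, version_id)
    raise ValueError(f"unexpected column count: {len(cols)}")


def format_line(entry: AuditEntry) -> str:
    """Serialize an entry in the 5-column tab-separated format."""
    return "\t".join([str(entry.timestamp), entry.users, entry.path,
                      entry.version_label, entry.version_id])


def record(entries: list, new: AuditEntry) -> list:
    """Apply the read-before-write rule and return the updated entry list."""
    if entries and not new.is_version:
        last = entries[-1]
        same_window = 0 <= new.timestamp - last.timestamp <= DEDUP_WINDOW_MS
        if not last.is_version and last.users == new.users and same_window:
            return entries[:-1] + [new]   # overwrite the last edit line
    return entries + [new]                # versions and all other cases append
```

Replaying the example from the plan (edit at 12:23, version at 12:25, edit at 12:40, same user) through `record` yields three entries, because the version entry sits between the two edits and prevents them from collapsing.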

3. API behaviour during migration

  • List: Prefer new path (list repo/.da-versions/{id}/ + read audit.txt). Always merge with legacy (list org/.da-versions/{id}/) so old versions and audit entries show up until migration is complete. Response adds repo prefix back to path and extension to versionId so the API contract matches the previous implementation.
  • GET: Try new key first, then legacy key.
  • PUT/POST: New writes only to new structure (snapshots + audit.txt). No new writes under org/.da-versions.
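A rough sketch of the dual-read behaviour, using a plain dict as a stand-in for the bucket. The function names are hypothetical. The legacy object name is assumed to match the new one, and the response shape ("/{repo}" prepended to the stored path, ".{ext}" appended to the stored id) is an assumption about what "matches the previous implementation" means.

```python
def get_version(store: dict, org: str, repo: str, file_id: str, name: str):
    """Try the new repo-scoped key first, then fall back to the legacy key."""
    new_key = f"{org}/{repo}/.da-versions/{file_id}/{name}"
    legacy_key = f"{org}/.da-versions/{file_id}/{name}"
    for key in (new_key, legacy_key):
        if key in store:
            return store[key]
    return None


def to_api_entry(repo: str, path: str, version_id: str, ext: str) -> dict:
    """Restore the repo prefix and the extension for the API response (assumed shape)."""
    return {
        "path": f"/{repo}{path}",
        "versionId": f"{version_id}.{ext}" if version_id else "",
    }


# Stand-in bucket with one legacy object and one migrated object.
store = {
    "acme/.da-versions/file-1/old.html": b"legacy body",
    "acme/site/.da-versions/file-1/new.html": b"new body",
}
print(get_version(store, "acme", "site", "file-1", "old.html"))
print(to_api_entry("site", "/surf-copy.html", "1234-abcd", "html"))
```

Once migration is complete, the legacy key can be dropped from the lookup tuple without touching callers.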

4. Migration

  • Scripts (in scripts/):
    • (1) Analyse: list version folders and count empty vs non-empty objects.
    • (2) Migrate: copy snapshots to org/repo/.da-versions/fileId/ and build audit.txt from empty-object metadata, using the same 5-column format (path without repo, versionId without extension) and the same dedup rule (30 min window; version entries do not collapse). Merge with any existing audit.txt in the new path (hybrid case).
    • (3) Validate: compare list/GET results between old and new for a sample path.
  • Dual-read: Keep supporting both old and new paths until migration is complete; then remove legacy fallback.
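The Analyse step could look something like the sketch below. It assumes the bucket listing has already been fetched into (key, size) pairs; the `analyse` function and the sample keys are hypothetical and do not reflect the actual scripts.

```python
def analyse(objects: list) -> dict:
    """Count empty (audit-only) vs non-empty (real snapshot) version objects."""
    empty = sum(1 for _key, size in objects if size == 0)
    return {
        "total": len(objects),
        "empty": empty,
        "non_empty": len(objects) - empty,
    }


# Made-up listing: two audit-only objects and one real snapshot.
listing = [
    ("acme/.da-versions/file-1/a.html", 0),
    ("acme/.da-versions/file-1/b.html", 0),
    ("acme/.da-versions/file-1/c.html", 2048),
]
print(analyse(listing))
```

The empty objects are the ones that Migrate turns into audit.txt lines; the non-empty ones are the snapshots to copy.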

5. Benefits

  • Far fewer objects: no per-edit empty version files; one audit.txt per file with collapsed entries.
  • Faster listing: .da-versions scoped per repo, not one giant org prefix.
  • Clear separation: real versions (snapshots) vs audit log (single file, deduped, human-readable labels in file).
