Lazy walkable-v8 migration: old buckets never migrated + unguarded-upload-to-damaged-bucket data-loss footgun #27

@ehsan6sha

Summary

Two related concerns surfaced while designing GC recovery. Neither blocks recovery (the client-select + server-repoint plan works on pre-v8), but both deserve a deep-dive afterward.

1. Old buckets stuck on pre-v8 despite lazy walkable-v8 migration

  • walkable_v8_writer_enabled defaults to true (crates/fula-client/src/config.rs:332), and a per-bucket migration marker exists (crates/fula-client/src/wal.rs:537; see issue #10, "Walkable-v8: force-rewrite v7 buckets to v8 on first master-up load (closes lazy-migration gap)"). So any flush/upload to a bucket should emit v8 CID hints / migrate it.
  • Yet the big old buckets documents (6139 objects) and audio (11886) are still pre-v8 (manifest page refs have cid = None) — which is exactly why they could not self-heal a gc'd page offline.
  • Open question: is this simply "no upload/flush to those buckets since v8 shipped," or is the lazy migration not firing for them (stuck marker, gated path)? If buckets migrate only on WRITE, large read-mostly buckets stay pre-v8 indefinitely and remain non-self-healing — consider a proactive/forced migration path.
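To make the open question above concrete, here is a minimal sketch of how a bucket could be classified as pre-v8 from its manifest page refs, and when a proactive migration might be scheduled without waiting for a write. Everything here is hypothetical: `PageRef`, `MigrationState`, `classify`, and `needs_forced_migration` are illustrative names, not the actual fula-client types. The only facts taken from this issue are that pre-v8 page refs have cid = None and that the writer flag exists.

```rust
/// Hypothetical, simplified view of a manifest page reference.
/// Per this issue, pre-v8 manifests carry page refs with `cid = None`.
#[derive(Debug, Clone)]
pub struct PageRef {
    pub page_key: String,
    pub cid: Option<String>,
}

/// Illustrative classification of a bucket's manifest (not an SDK type).
#[derive(Debug, PartialEq, Eq)]
pub enum MigrationState {
    /// No page refs at all (for example a genuinely new bucket).
    Empty,
    /// Every page ref carries a CID, so the manifest is walkable.
    Walkable,
    /// Some page refs carry a CID and some do not.
    Partial { missing: usize },
    /// No page ref carries a CID.
    PreV8,
}

/// Count the page refs that lack a CID and classify the manifest.
pub fn classify(pages: &[PageRef]) -> MigrationState {
    if pages.is_empty() {
        return MigrationState::Empty;
    }
    let missing = pages.iter().filter(|p| p.cid.is_none()).count();
    if missing == 0 {
        MigrationState::Walkable
    } else if missing == pages.len() {
        MigrationState::PreV8
    } else {
        MigrationState::Partial { missing }
    }
}

/// Assumed policy: schedule a forced migration on load whenever the v8
/// writer is enabled and at least one page ref still lacks a CID.
pub fn needs_forced_migration(pages: &[PageRef], writer_enabled: bool) -> bool {
    writer_enabled
        && matches!(
            classify(pages),
            MigrationState::PreV8 | MigrationState::Partial { .. }
        )
}
```

A check like this could run on read as well as on write, which is what would let large read-mostly buckets such as documents and audio leave the pre-v8 state. Whether the real marker in wal.rs can be driven this way is something the deep-dive would need to confirm.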

2. Data-loss footgun: unguarded upload to a damaged bucket

  • A 404 (NoSuchKey) at index_key makes the SDK create a fresh EMPTY v7 forest (crates/fula-client/src/encryption.rs ~:3376; intentional for genuinely-new buckets, tests at :12092). 5xx / connection-refused now correctly propagate (:12133, ~:12168).
  • Risk: if a DAMAGED bucket (object_count > 0) ever returns 404 at index_key, the empty-forest path engages → a subsequent upload+flush writes a ~1-file manifest at index_key → orphans the prior 6139/11886 object references (listing data-loss; underlying blocks stay pinned).
  • The gc-damaged buckets appear to return 410 (which propagates → upload fails, no overwrite), so this likely has not triggered — but it is a latent footgun.
  • Hardening: refuse to flush an overwrite when the loaded forest is EMPTY but the server reports object_count > 0 (a damaged-bucket guard), independent of the recovery feature's upload-gate.
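The behaviour and the proposed hardening above can be sketched as follows. This is an illustration under stated assumptions, not the code in encryption.rs: `IndexLoad`, `classify_index_status`, `FlushError`, and `guard_flush` are hypothetical names, and how the SDK would obtain the server-side object_count is assumed rather than known.

```rust
/// Illustrative outcome of fetching `index_key` (not an SDK type).
#[derive(Debug, PartialEq, Eq)]
pub enum IndexLoad {
    /// The index was fetched successfully.
    Loaded,
    /// 404 (NoSuchKey): treated as a genuinely new bucket, so the
    /// client starts from a fresh empty forest.
    EmptyForest,
    /// Any other status (410, 5xx, ...) propagates as an error, so the
    /// upload fails and nothing is overwritten.
    Propagate(u16),
}

/// Map an HTTP status for `index_key` to the behaviour described in
/// this issue. Simplified: the real client also handles transport
/// errors such as connection-refused.
pub fn classify_index_status(status: u16) -> IndexLoad {
    match status {
        200 => IndexLoad::Loaded,
        404 => IndexLoad::EmptyForest,
        other => IndexLoad::Propagate(other),
    }
}

/// Hypothetical error returned by the damaged-bucket guard.
#[derive(Debug, PartialEq, Eq)]
pub enum FlushError {
    DamagedBucket { server_object_count: u64 },
}

/// Proposed guard: refuse to flush when the forest loaded from
/// `index_key` is empty but the server reports existing objects.
/// `loaded_entries` is the entry count of the forest as loaded,
/// before any pending uploads are added to it.
pub fn guard_flush(loaded_entries: usize, server_object_count: u64) -> Result<(), FlushError> {
    if loaded_entries == 0 && server_object_count > 0 {
        return Err(FlushError::DamagedBucket { server_object_count });
    }
    Ok(())
}
```

With a guard of this shape, a 404 on a bucket that still reports 6139 objects would fail the flush instead of writing a ~1-file manifest over index_key, while a genuinely new bucket (object_count of 0) would flush as it does today.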

Does NOT block recovery

The recovery plan (detect -> client walks the newest consistent manifest -> server rebuilds the index from its pins + repoints index_key/forest_manifest_cid) works on pre-v8 and gates uploads on damaged buckets (closing concern 2 for the damaged case). Concern 1 (migration) and the SDK-level guard in concern 2 are separate hardening items. Priority: after recovery.
