fula-client: online forest walk aborts on gc-orphaned nodes (recover via manifest CID-hint on 404) #24

@ehsan6sha

Summary

Online, an encrypted bucket fails to list when its forest references a HAMT node (or manifest page) that was orphaned by a server-side ipfs repo gc. The node's storage-key -> CID mapping is gone from the gateway index (the gateway returns 404 NoSuchKey for the storage-key), even though the node's block still exists in IPFS and is addressable by CID. The forest walk fetches that node by storage-key, gets a 404, and the entire walk aborts, so the bucket won't list.

The data is not lost: the manifest carries each node's content CID (walkable-v8 LinkV2 pointer), and the block is fetchable by CID. Offline mode already recovers (the walker's fetch falls back to a gateway-race-by-CID), which is why forcing the app offline lists the affected buckets. The gap is that the online path doesn't engage that recovery on a 404.

Root cause

FulaClient::get_object_with_offline_fallback_known_cid (fula-client/src/client.rs) engages the verified gateway-race-by-CID only on is_master_unreachable_error, never on a 404:

match self.get_object_with_metadata(bucket, key).await {
    Ok(result) => { /* cache + return */ }
    Err(e) if is_master_unreachable_error(&e) => {   // only master-DOWN races the CID
        self.try_offline_fallback_with_cid_hint(bucket, key, cid_hint, e).await
    }
    Err(e) => Err(e),                                 // a 404 dies here -> walk aborts
}

is_master_unreachable_error deliberately excludes legitimate S3 errors (NoSuchKey/404), so an orphaned-node 404 never triggers the CID-race when the master is up.

Fix

Add a forest-scoped wrapper, get_forest_object_known_cid, that also engages the verified CID-race on a 404 (e.is_not_found()), and route the two forest-infrastructure callers through it:

  • EncryptedClient::load_manifest_pages (manifest pages)
  • S3BlobBackend::get_with_cid_hint (HAMT nodes)

The generic get_object_with_offline_fallback_known_cid keeps its existing propagate-404 invariant, and its test test_cid_hint_master_4xx_propagates_without_fallback is left unchanged. Only forest infrastructure opts into 404-recovery (design per advisor review: "the invariant is the asset" -- don't let a future non-forest caller silently inherit hide-404 behavior).
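The difference between the two paths can be modeled in a few lines. This is only a sketch under stated assumptions, not the fula-client code: FetchError, ToyStore, generic_get, forest_get and race_by_cid are hypothetical names, the store is an in-memory map, and the "race" is a plain lookup by CID with no verification step.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified error type (not the real fula-client error).
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum FetchError {
    MasterUnreachable,
    NotFound,
    AccessDenied,
}

#[allow(dead_code)]
impl FetchError {
    fn is_not_found(&self) -> bool {
        matches!(self, FetchError::NotFound)
    }
    fn is_master_unreachable(&self) -> bool {
        matches!(self, FetchError::MasterUnreachable)
    }
}

/// Toy model: the master resolves storage-key -> bytes, gateways resolve CID -> bytes.
#[allow(dead_code)]
struct ToyStore {
    /// When set, every master request fails with this error.
    master_error: Option<FetchError>,
    by_storage_key: HashMap<String, Vec<u8>>,
    by_cid: HashMap<String, Vec<u8>>,
}

#[allow(dead_code)]
impl ToyStore {
    fn get_by_key(&self, key: &str) -> Result<Vec<u8>, FetchError> {
        if let Some(e) = &self.master_error {
            return Err(e.clone());
        }
        // A gc-orphaned node has no index entry, so it surfaces as NotFound.
        self.by_storage_key.get(key).cloned().ok_or(FetchError::NotFound)
    }

    /// Stand-in for the gateway race: returns the original error if no gateway has the block.
    fn race_by_cid(&self, cid: &str, original: FetchError) -> Result<Vec<u8>, FetchError> {
        self.by_cid.get(cid).cloned().ok_or(original)
    }

    /// Models the generic path: only a master-down error races the CID; a 404 propagates.
    fn generic_get(&self, key: &str, cid: &str) -> Result<Vec<u8>, FetchError> {
        match self.get_by_key(key) {
            Ok(bytes) => Ok(bytes),
            Err(e) if e.is_master_unreachable() => self.race_by_cid(cid, e),
            Err(e) => Err(e),
        }
    }

    /// Models the forest-scoped wrapper: a 404 also races the CID; other errors still propagate.
    fn forest_get(&self, key: &str, cid: &str) -> Result<Vec<u8>, FetchError> {
        match self.get_by_key(key) {
            Ok(bytes) => Ok(bytes),
            Err(e) if e.is_master_unreachable() || e.is_not_found() => self.race_by_cid(cid, e),
            Err(e) => Err(e),
        }
    }
}
```

With an empty by_storage_key map and the block present in by_cid, generic_get returns the NotFound error while forest_get returns the block, which mirrors the invariant split described above.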

Why this is safe (the CID is the capability)

  • fetch_verified content-checks the fetched bytes against the manifest-supplied CID, so a gateway cannot serve different/forged bytes.
  • The node store re-decrypts (AEAD) and recomputes the storage-key + page-id/seq, binding the bytes to this walk.
  • The CID comes from the freshly-decrypted, authoritative manifest, so it is the current node; stale-root reads are already rejected by the seq/version guards.
  • Worst case is a benign consistency window, never unauthorized or forged data.
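The first point (a gateway cannot substitute bytes) can be illustrated with a small sketch. Everything here is an assumption for illustration: toy_digest is a non-cryptographic FNV-1a hash standing in for the real CID multihash check, and first_verified is a hypothetical helper, not the actual fetch_verified.

```rust
/// Toy 64-bit FNV-1a digest. NOT cryptographic: it only stands in for the
/// real content hash embedded in a CID.
fn toy_digest(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
    }
    h
}

/// Hypothetical helper: return the first gateway response whose digest matches
/// the digest taken from the manifest; mismatching responses are skipped.
fn first_verified(expected: u64, responses: &[Vec<u8>]) -> Option<&[u8]> {
    responses
        .iter()
        .map(|r| r.as_slice())
        .find(|r| toy_digest(r) == expected)
}
```

Because the expected digest comes from the decrypted manifest rather than from the gateway, a response is accepted only if it hashes to the value the manifest already committed to; with a real cryptographic hash, producing different bytes with the same digest is infeasible.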

Scope

This is the native path (cfg(not(target_arch = "wasm32"))). The wasm S3BlobBackend::get_with_cid_hint degrades to plain get() (no gateway pool on web), so the web client (pinning-webui) is NOT fixed by this and needs separate work -- it also currently lists via the HEAD-per-object path rather than the forest walk.

Self-heal (deliberately out of scope here)

The recovered node is not re-uploaded on the read path (this avoids PUT-on-read permission/latency/consistency issues, per advisor review). The gateway 404 therefore persists and is re-raced on each read until the user's next forest write (flush) re-pins the node and restores the index mapping. This is a deliberate tradeoff.

Verification

  • New unit/integration test: S3BlobBackend::get_with_cid_hint with master -> 404 NoSuchKey + a gateway serving the block by CID -> must return the block. Fails before the fix, passes after. The existing propagate-404 test is retained.
  • E2E (not committed; uses real credentials): native fula-client with gateway_fallback_enabled, a real walkable-v8 bucket with a known orphaned node, master up -> bucket lists end-to-end.

Review

Concept and design reviewed by independent advisors (Gemini + Copilot): approved. Design (scoped wrapper, keep the existing invariant + its test, add a new test) is per their recommendation. (Codex + Cursor were unavailable at review time.)
