fix(node): close under-withholding via full ref scope and full-history classification (#42)#84
fix(node): close under-withholding via full ref scope and full-history classification (#42)#84beardthelion wants to merge 8 commits into
Conversation
…nly pin set blob_paths walked only refs/heads/* and refs/tags/* and skipped silently on a failed git ls-tree, so a blob reachable only through another namespace, or a ref that failed to traverse, could fall out of the withheld set and ship in cleartext. Walk every ref and fail closed on traversal error. The pin enumerators (ipfs_pin, pinata) used git cat-file --batch-all-objects, which includes unreachable/dangling objects that have no path and cannot be classified for withholding. Switch them to git rev-list --objects --all so the pin set matches the reachable graph blob_paths evaluates.
…asymmetry blob_paths classified objects from for-each-ref only, but the pin set (git rev-list --objects --all) and git upload-pack both also expose objects reachable from HEAD. A detached HEAD (or any HEAD pointing outside the ref-covered set) therefore put a blob into the pin/serve path that the withholding walk never classified — a latent cleartext leak of a withheld blob. Walk HEAD too (guarded for unborn HEAD on an empty repo) so the withheld set is a superset of every exposure path. Adds a regression test: a blob reachable only via a detached HEAD must still be withheld (fails without this change).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
ChangesVisibility pack withholding traversal
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 40-43: The current implementation in the ref-walking loop only
uses git ls-tree to examine blobs at each ref tip, but this misses blobs in
historical commits that remain reachable via commit history. This creates a
security gap where the visibility pack fails to classify all reachable objects
that other flows like rev-list traversals would include. Replace the git ls-tree
command with git rev-list --objects (or an equivalent command that traverses
full commit history) to ensure all blobs reachable from each ref, including
those in older commits, are properly classified. This will align the visibility
classification with what the rev-list based flows use elsewhere in the codebase.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b7f77d48-b23e-4b59-a8ac-d96a6cd83b78
📒 Files selected for processing (3)
crates/gitlawb-node/src/git/visibility_pack.rscrates/gitlawb-node/src/ipfs_pin.rscrates/gitlawb-node/src/pinata.rs
blob_paths walked only ref-tip trees (git ls-tree -r <ref>), so a blob that existed in an older commit and was deleted or rotated at the tip was never classified, while the pin set (git rev-list --objects --all) and upload-pack still expose it from history. That under-withheld the blob into the IPFS/Pinata replication paths in cleartext. Enumerate all reachable commits (git rev-list --all + HEAD for the detached case) and walk each commit tree, deduping (oid, path) pairs. ls-tree per commit is kept over rev-list --objects so every path a blob lives at is still represented for the per-path visibility check. A ref that resolves to a non-commit now fails closed explicitly, since rev-list --all silently skips it where the old per-ref ls-tree errored. Regression test: a secret blob deleted at the tip but reachable in history is withheld.
|
@kevincodex1 flagging this one as higher priority. It closes the under-withholding hole (#42): full ref scope plus a reachable-only pin set so history blobs can't leak past a path-scoped withhold. It's also a gate for follow-on work, the reconciliation sweep can't land until this is in main. All green (tests, clippy, cargo audit, MSRV, CodeQL), mergeable, no open threads. Ready when you are. |
…he whole repo (#90) * feat(node): add push-delta object enumeration module Compute the per-push object delta (git rev-list --objects <new> --not <old>) for the pin path instead of enumerating the whole repo, with a fully-peeled ref-type guard (cat-file -t '<sha>^{}') that fails closed to a full scan on a non-commit tip, any git error, or the GITLAWB_PIN_FORCE_FULLSCAN kill-switch. The whole-repo lister moves here for the fallback and the reconciliation sweep. Pin-enumeration scope only (N16); the withheld filter is unchanged. * feat(node): pin the per-push object delta instead of the whole repo Wire the push-delta enumeration into the IPFS and Pinata pin paths so pin work scales with push size, not repo size (N16). - U2: both pin_new_objects twins take a candidate OID set instead of scanning the whole repo internally; repo_path is retained for reading object bytes. - U4: extract the anonymous replication withheld gate (caller=None, no caller param) into a shared helper returning (announce, withheld), so the push path and the future reconciliation sweep cannot drift. The per-caller read-serve withheld call in git_upload_pack is deliberately left separate. - U3: the post-receive handler resolves the push delta once (off-thread), falls back to a full scan on a non-commit tip / git error / kill-switch, and passes the candidate set to both pin paths. Logs delta size and fallbacks. The reconciliation sweep (U5) is deferred: it depends on the history-deep blob_paths fix (#84) landing on main first. * fix(review): log degraded pin-candidate paths, extract wiring to push_delta Code review found the push candidate computation collapsed both a failed full-scan fallback and a panicked spawn_blocking task to a silent empty set, diverging from the sibling withheld path that logs and fails closed. Extract resolve_candidates_for_push into git::push_delta (it owns the PinCandidates dispatch and full-scan fallback), logging every degraded path (fallback taken, fallback failed, task panicked) instead of failing silently. Pinning nothing on a hard failure is still safe (the withheld filter always runs; the reconciliation sweep backstops durability) but is now observable. Moving the dispatch out of the HTTP handler also makes it unit-testable. Add tests: tag-of-tag-of-commit peels to a Delta, and resolve_candidates_for_push delta + non-commit-tip full-scan-fallback paths. * fix(node): fail announce closed when withheld walk fails (#90) A failed or panicked withheld_blob_oids walk left announce=true while withheld=None, so a push we could not vet would still gossip ref updates, notify peers, and write a permanent Arweave anchor. Force announce false in that arm so the gate fails closed on both axes.
jatmn
left a comment
There was a problem hiding this comment.
Findings
1. Merge conflicts with main — BLOCKING
- Severity: High
- Evidence:
gh pr view 84reports"mergeable":"CONFLICTING". A localgit diff mainagainst the checked-out branch shows large unrelated deletions (.env.example,auth/mod.rs,config.rs,git/push_delta.rs,docs/RUN-A-NODE.md,api/repos.rs, etc.) that are not part of the intended PR diff (gh pr diff 84only touchesvisibility_pack.rs,ipfs_pin.rs, andpinata.rs). - Impact: Cannot be merged as-is; must be rebased/merged onto current
mainand conflicts resolved before review approval.
2. Detached HEAD not included in pin enumerators
- Severity: Medium
- Files:
crates/gitlawb-node/src/ipfs_pin.rs(lines 154–180),crates/gitlawb-node/src/pinata.rs(lines 137–159) - Details:
blob_pathsinvisibility_pack.rsexplicitly appendsHEADto the commit enumeration whenstore::head_commitresolves (visibility_pack.rs:81–85), ensuring a blob reachable only via a detachedHEADis still classified.- The two pin listers,
ipfs_pin::list_all_objectsandpinata::list_all_objects, rungit rev-list --objects --allwithoutHEAD. In a detached-HEADrepository,git rev-list --objects --alldoes not traverse the commit recorded inHEADbecause--allonly walks refs underrefs/. - This creates a mismatch: the withholding set is a superset of what the pin set actually replicates. Public objects reachable only through a detached
HEADwould be classified as replicable but then not pinned, producing an silent under-pinning/availability gap.
- Recommendation: Make the pin enumerators mirror
blob_paths: appendHEADto therev-listarguments wheneverstore::head_commit(repo_path).is_some(), or, better, share a single "reachable objects" helper betweenvisibility_pack,ipfs_pin, andpinataso the three sites cannot drift again.
3. blob_paths is now expensive for large/long histories
- Severity: Medium
- File:
crates/gitlawb-node/src/git/visibility_pack.rs(lines 34–132) - Details:
- The old implementation ran
git ls-tree -r <ref>once per ref tip. - The new implementation runs
git rev-list --allto enumerate every reachable commit, then runsgit ls-tree -r <commit>for every commit in history. - For repositories with deep history, this is O(history × tree-size) and may materially slow down every push/serve/pin path that calls
withheld_blob_oids.
- The old implementation ran
- Recommendation: This may be a necessary cost for per-path classification across history, but it should be benchmarked on a representative repo. If the cost is too high, consider caching the
(oid, path)set per repo or documenting the expected latency. Avoid further duplicating this work (see next finding).
4. withheld_blob_recipients recomputes the full blob_paths set
- Severity: Low
- File:
crates/gitlawb-node/src/git/visibility_pack.rs(lines 176–205) - Details:
withheld_blob_recipientscallswithheld_blob_oids(which computesblob_paths) and then callsblob_pathsagain to build the recipient map.- With the new O(history) implementation, this doubles the expensive enumeration.
- Recommendation: Refactor so the caller computes
blob_pathsonce and reuses it, or havewithheld_blob_oidsreturn both the withheld set and the path map used to compute it.
5. Minor: for-each-ref parsing could be more robust
- Severity: Low
- File:
crates/gitlawb-node/src/git/visibility_pack.rs(lines 41–75) - Details:
- The code formats each ref as
%(objecttype) %(*objecttype) %(refname)and splits on whitespace. - Refnames cannot contain whitespace, so this is safe today, but a NUL-delimited format (
--format=...%00...) plussplit('\0')would be more robust and avoid the ad-hoc token arithmetic for annotated tags.
- The code formats each ref as
- Recommendation: Optional cleanup; not a blocker.
Resolves conflicts in ipfs_pin.rs and pinata.rs against #90 (per-push delta pin). #90 centralized object enumeration in push_delta::list_all_objects (cat-file --batch-all-objects) and reworked the pin path to take a candidate OID set, so this branch's local reachable-only list_all_objects helpers and their tests are now dead duplicates — dropped both. The withholding fix (full ref scope + full-history classification + explicit detached-HEAD walk) in visibility_pack.rs is unaffected; updated its stale comments that still described the pin set as 'git rev-list --objects --all'.
…d classifier
Addresses two review findings on the under-withholding classifier:
- withheld_blob_recipients walked blob_paths twice (once via
withheld_blob_oids, once directly). Extract withheld_from_pairs so a
single history walk feeds both the withheld set and the recipient map.
- The ref-type fail-closed guard used for-each-ref %(*objecttype), which
peels only one tag level and so misclassified a nested annotated tag of
a commit as a non-commit, refusing the whole walk (over-withhold). List
refnames and peel each fully with <ref>^{} through one cat-file
--batch-check; reap the child even if the stdin write fails. Fail closed
on a short read or any non-commit/missing target.
Adds tests: annotated + nested-tag of a commit does not fail closed,
annotated tag of a blob still fails closed, and a blob at both an allowed
and a denied path is not withheld.
The ref-type guard peeled every <ref>^{} through git cat-file --batch-check,
writing all queries to stdin before draining stdout. On a repo with enough
output the child's stdout pipe fills, it stops reading stdin, and write_all
blocks forever — hanging every serve/pin that classifies the repo. Batching
the queries does NOT bound this: cat-file echoes the full query on a
'<query> missing' line, so output scales with refname length, and a few
hundred long-named dangling refs already exceed the pipe buffer.
Run the stdin write on a separate thread so the main thread drains stdout
concurrently via wait_with_output, which reaps the child unconditionally
before any error is surfaced; the writer is joined after the drain so it
cannot deadlock, and a vanished stdin pipe becomes a broken-pipe write error.
One pass, no batch bound, deadlock-free regardless of ref count or refname
length.
Extracted as assert_all_refs_are_commits with corrected doc attribution.
Detect the missing-object line by its trailing status word. Adds a
timeout-bounded regression test with 500 long-named dangling refs (>64 KiB of
cat-file output) that fails rather than hangs if the deadlock returns.
|
Pushed 3266eaf. Rebased onto main and worked your findings.
One adjacent thing I hit while tracing this: #90's pin candidate set uses |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/gitlawb-node/src/git/visibility_pack.rs (1)
180-203: 🔒 Security & Privacy | 🔴 CriticalUse
git ls-tree -rzand parse raw entries here.
git ls-tree -rC-quotes paths by default, so a path likesecret/café.txtis stored as/"secret/caf\303\251.txt"and visibility rules such as/secret/**can miss it. Parse NUL-delimited output and reject malformed entries instead of lossy-decoding them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/git/visibility_pack.rs` around lines 180 - 203, The path parsing in the git tree listing logic is using the quoted text output from Command::new("git") with ls-tree -r, which can mangle non-ASCII or special-character paths and break visibility matching. Update the listing flow in the visibility_pack.rs code that builds out from git ls-tree to use git ls-tree -rz, parse NUL-delimited raw entries, and reject malformed entries explicitly instead of relying on lossy UTF-8 decoding or split_once on quoted lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 180-203: The path parsing in the git tree listing logic is using
the quoted text output from Command::new("git") with ls-tree -r, which can
mangle non-ASCII or special-character paths and break visibility matching.
Update the listing flow in the visibility_pack.rs code that builds out from git
ls-tree to use git ls-tree -rz, parse NUL-delimited raw entries, and reject
malformed entries explicitly instead of relying on lossy UTF-8 decoding or
split_once on quoted lines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e0b64e30-fd64-49dd-a5ca-efef347fdba6
📒 Files selected for processing (1)
crates/gitlawb-node/src/git/visibility_pack.rs
jatmn
left a comment
There was a problem hiding this comment.
1. git ls-tree -r C-quotes non-ASCII paths, causing under-withholding (CodeRabbit finding, unaddressed)
Severity: High / blocking — privacy/security.
Location: crates/gitlawb-node/src/git/visibility_pack.rs, blob_paths (lines 180–205).
Details:
The code parses tree entries from plain git ls-tree -r <commit> output and extracts the path by splitting on \t:
let listing = std::process::Command::new("git")
.args(["ls-tree", "-r", commit])
...
let Some((meta, path)) = line.split_once('\t') else {
continue;
};Git's default text output C-quotes path names that contain non-ASCII bytes. I verified this directly: a blob at secret/café.txt is emitted by git ls-tree -r HEAD as:
100644 blob 33ea03ba1db1ac1fdfdc0c0a7cc57d3673ae3c1b "secret/caf\303\251.txt"
The path string the code stores is therefore "/\"secret/caf\\303\\251.txt\"" (quotes and escapes included). That literal string does not match a visibility rule such as /secret/**, so the blob is not withheld. A secret blob stored under a non-ASCII path inside a denied subtree would be served/pinned in cleartext.
The fix is to use git ls-tree -rz and parse NUL-delimited raw records. With -z, the same entry is emitted as:
100644 blob 33ea03ba1db1ac1fdfdc0c0a7cc57d3673ae3c1b secret/café.txt<NUL>
The parser would need to split records on \0 and split each record on the first ASCII space sequence, rather than on \t and newlines.
Recommendation:
- Switch
blob_pathstogit ls-tree -rz <commit>and parse NUL-delimited raw output. - Add a regression test with a non-ASCII (or otherwise C-quoted) path under a denied subtree to ensure it is withheld.
… stop under-withholding Plain `git ls-tree -r` C-quotes any path with non-ASCII or special bytes (café.txt becomes "secret/caf\303\251.txt"), so the stored path no longer matched a visibility rule like /secret/** and the blob was served/pinned in cleartext. Switch blob_paths to `git ls-tree -rz` and split the NUL-delimited records, which emit paths raw; the TAB field separator survives -z so the per-record parse is otherwise unchanged. Parse strictly with str::from_utf8 and fail closed on invalid UTF-8 rather than from_utf8_lossy: a lossy decode would rewrite an invalid byte in a denied directory name to U+FFFD, and the mangled path would again miss its deny rule (the same under-withholding class one layer down). Tests: a non-ASCII path stays withheld (and the public blob still replicates), and a non-UTF-8 path fails closed.
|
Fixed in
While there I closed the adjacent byte-vs-string hole: the parse used Regression tests cover all three: a non-ASCII path stays withheld (and the public blob still replicates), a TAB+newline path stays withheld (pins the Separate from this PR: |
jatmn
left a comment
There was a problem hiding this comment.
No blocking findings.
The implementation correctly:
- Covers non-standard ref namespaces and detached
HEAD. - Walks full reachable history rather than only ref-tip trees.
- Fails closed on unclassifiable refs, missing objects, and invalid UTF-8 paths.
- Avoids the
cat-filestdin/stdout deadlock via a concurrent writer thread. - Reuses the
blob_pathslisting between withheld-set and recipient-map computation. - Parses raw (
-z)ls-treeoutput so non-ASCII and special-character paths match visibility rules. - Uses strict
std::str::from_utf8rather than lossy decoding for path bytes.
Non-blocking observations
- Performance of per-commit
ls-tree: Walking every reachable commit and runninggit ls-tree -rzper commit isO(history × tree-size). The author documents this as a deliberate correctness-vs-thoroughness trade-off in the code comments and PR description; caching can be added later if profiling shows it is needed. Not a blocker for this security fix. - Unreachable/dangling objects (#99): The PR explicitly scopes out unreachable objects. The author already filed #99 to track the adjacent risk that
push_delta.rs::list_all_objectsusesgit cat-file --batch-all-objects(which includes unreachable/dangling objects) while the classifier is reachable-only. This is noted and out of scope here. - Windows-only test coverage: The two
cfg(unix)tests for TAB/newline and non-UTF-8 paths do not run on Windows. This is appropriate because those byte sequences are not legal in Windows filenames, so the fixtures cannot be built there. The non-ASCII path test runs on Windows and passed.
@kevincodex1 LGTM
Closes #42.
Re-homes the under-withholding fix onto current main as a focused same-repo PR. It previously lived in #46 (a fork PR from when this work predated a maintainer); #46's other commits are already covered elsewhere (blind recipients via #40, partial-recipient and mirror fail-closed via #67/#69), so #46 will be closed in favor of this.
Rebased onto main after #90 (per-push delta pin) landed. #90 reworked object enumeration for the pin path, so the "reachable-only pin set" half this work originally carried is no longer needed and has been dropped. The PR is now scoped to the withheld-blob classifier.
What it does:
refs/heads+refs/tagstips. A blob reachable only through a non-standard ref namespace, only in older history (a since-deleted file), or only via a detached HEAD is still withheld.upload-pack(serve) and feat(node): pin the per-push object delta instead of re-enumerating the whole repo #90's whole-repo pin fallback both expose the full reachable object graph, so classifying only ref-tip trees would under-withhold.<ref>^{}throughgit cat-file --batch-check, so a ref pointing at a blob or tree, or an annotated tag (including a nested one) of such an object, is refused rather than silently skipped.git rev-list --allskips those refs while the serve and pin paths still expose their target object.withheld_blob_recipientsnow shares theblob_pathsresult with the withheld-set computation instead of walking it twice.Implementation note: the ref peel writes its queries to
cat-fileon a separate thread while the main thread drains stdout, so a repo with many (or long-named) refs cannot deadlock the classifier.Tests: withholding via a non-standard ref, a detached HEAD, and a blob deleted at the tip but reachable in history; fail-closed on an untraversable ref, an annotated tag of a blob, and a ref pointing at a missing object; an annotated tag and a nested tag of a commit that must not fail closed; a blob shared across an allowed and a denied path; and a many-long-named-ref regression guard for the peel.
Known adjacent gap (not this PR): #90's pin candidate set (
git cat-file --batch-all-objects) can include unreachable/dangling objects the reachable-only classifier never sees, andreplicable_objectsreplicates anything not explicitly withheld, so an unreachable private blob could be pinned in cleartext. Tracked separately in #99.Summary by CodeRabbit