Skip to content

Deduplicate repeated chunk tree entries#18

Merged
flyingrobots merged 2 commits intomainfrom
fix/tree-entry-dedupe
Mar 15, 2026
Merged

Deduplicate repeated chunk tree entries#18
flyingrobots merged 2 commits intomainfrom
fix/tree-entry-dedupe

Conversation

@flyingrobots
Copy link
Member

@flyingrobots flyingrobots commented Mar 14, 2026

Summary

  • deduplicate chunk blob tree entries by digest in createTree() and _createMerkleTree() while preserving first-seen emission order
  • keep manifests unchanged so chunk order and multiplicity still come entirely from the manifest
  • add unit and integration regressions for repeated chunks, including git fsck --full --no-dangling coverage on a real Git repo

Validation

  • npx eslint .
  • npm test
  • GIT_STUNTS_DOCKER=1 npx vitest run test/integration
  • bunx vitest run test/unit
  • GIT_STUNTS_DOCKER=1 bunx vitest run test/integration
  • deno run -A npm:vitest run test/unit
  • GIT_STUNTS_DOCKER=1 deno run -A npm:vitest run test/integration

Notes

  • I also attempted the repo's Docker-based runtime commands, but the image build failed before tests started because apt-get update inside the test image rejected Debian repo signatures in this environment.

Summary by CodeRabbit

  • Bug Fixes

    • Tree emissions now deduplicate repeated chunk entries while preserving first-seen order, preventing duplicate chunk entries in emitted trees.
  • Tests

    • New unit and integration tests verify deduplication across tree types (including Merkle-style trees), restore correctness, and run repository integrity checks to ensure no duplicate entries.

@coderabbitai
Copy link

coderabbitai bot commented Mar 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 833900bc-61d5-4fce-a65b-c65328a2ed15

📥 Commits

Reviewing files that changed from the base of the PR and between 9ec123b and 94d7a34.

📒 Files selected for processing (1)
  • test/integration/round-trip.test.js

📝 Walkthrough

Walkthrough

Adds a helper _createChunkTreeEntries(chunks) to CasService that emits one tree entry per unique chunk digest in first-seen order; createTree() and _createMerkleTree() are refactored to use it. Unit and integration tests added to verify deduplication and git fsck behavior.

Changes

Cohort / File(s) Summary
CasService Implementation
src/domain/services/CasService.js
Adds _createChunkTreeEntries(chunks) to deduplicate chunk blob tree entries by digest (first-seen order). Refactors createTree() and _createMerkleTree() to use this helper and removes intermediate chunkBlobEntries accumulation.
Unit Tests
test/unit/domain/services/CasService.test.js, test/unit/domain/services/CasService.merkle.test.js
Adds tests verifying createTree and Merkle tree generation emit only unique chunk entries in first-seen order and that names/digests are unique.
Integration Tests
test/integration/round-trip.test.js
Adds helpers (chunkEntryNames, uniqueChunkDigests, runGitFsck) and integration tests for v1 and Merkle emission verifying deduplication, restore correctness, and that git fsck reports no duplicate entries.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I hopped through blobs, then hopped once more,
I kept the first, let repeats be lore.
One entry per nibble, tidy and bright,
First-seen forever, tucked in the tree at night. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Deduplicate repeated chunk tree entries' directly and accurately summarizes the main change: introducing deduplication logic for chunk blob tree entries by digest in the CasService class.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/tree-entry-dedupe
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/integration/round-trip.test.js (1)

317-317: These checks currently drop first-seen order validation.

Sorting both sides verifies set equality only; it won’t catch order regressions.

♻️ Suggested assertion tightening
-      expect([...emittedChunkNames].sort()).toEqual([...uniqueChunkDigests(manifest)].sort());
+      expect(emittedChunkNames).toEqual(uniqueChunkDigests(manifest));
...
-      expect([...emittedChunkNames].sort()).toEqual([...uniqueChunkDigests(manifest)].sort());
+      expect(emittedChunkNames).toEqual(uniqueChunkDigests(manifest));

Also applies to: 362-362

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/integration/round-trip.test.js` at line 317, The test currently sorts
both sides which only verifies set equality and loses first-seen order; change
the assertions to compare the sequences directly (preserve order) by removing
the .sort() calls and asserting that [...emittedChunkNames] ===/toEqual
uniqueChunkDigests(manifest) (i.e., compare the arrays in their natural order)
for both occurrences that use emittedChunkNames and
uniqueChunkDigests(manifest).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/integration/round-trip.test.js`:
- Around line 76-85: The runGitFsck function must detect spawn failures and
non-clean termination instead of masking them; change runGitFsck to first check
result.error and throw or return a failing status with the error message if
present, and if result.status is null inspect result.signal (treat as failure)
and set status to a non-zero default (e.g., 1) rather than 0; include
result.stderr/result.stdout in the returned output and ensure callers/tests that
expect fsck.status === 0 will only pass when result.status === 0 and not when
spawnSync failed.

---

Nitpick comments:
In `@test/integration/round-trip.test.js`:
- Line 317: The test currently sorts both sides which only verifies set equality
and loses first-seen order; change the assertions to compare the sequences
directly (preserve order) by removing the .sort() calls and asserting that
[...emittedChunkNames] ===/toEqual uniqueChunkDigests(manifest) (i.e., compare
the arrays in their natural order) for both occurrences that use
emittedChunkNames and uniqueChunkDigests(manifest).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e4551944-d355-4e01-9dd0-8bf65de9eaa3

📥 Commits

Reviewing files that changed from the base of the PR and between 6d4507d and 9ec123b.

📒 Files selected for processing (4)
  • src/domain/services/CasService.js
  • test/integration/round-trip.test.js
  • test/unit/domain/services/CasService.merkle.test.js
  • test/unit/domain/services/CasService.test.js

@flyingrobots
Copy link
Member Author

Pushed follow-up commit 94d7a34 to make runGitFsck() fail closed on spawn errors and signal termination.

I intentionally did not switch the integration assertions from sorted-set comparison to direct sequence comparison. readTree() reflects Git tree name ordering (git ls-tree / tree object filename sort), not the original mktree input order, so a direct sequence equality check would fail even when first-seen emission order is correct. First-seen emit order is already covered by the unit tests that assert the exact writeTree() input; the integration tests are checking uniqueness, reachability, and fsck/restore behavior against a real Git repo.

@flyingrobots flyingrobots merged commit 40f92e6 into main Mar 15, 2026
6 checks passed
@flyingrobots flyingrobots deleted the fix/tree-entry-dedupe branch March 15, 2026 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant