Skip to content

refactor(extractors/solana): in-memory CAR file parsing#1940

Open
sistemd wants to merge 5 commits intomainfrom
sistemd/in-memory-car-file-parsing
Open

refactor(extractors/solana): in-memory CAR file parsing#1940
sistemd wants to merge 5 commits intomainfrom
sistemd/in-memory-car-file-parsing

Conversation

@sistemd
Copy link
Contributor

@sistemd sistemd commented Mar 8, 2026

Parse CAR data in memory as it is downloaded, never write CAR files to disk. This reduces disk size requirements and completely removes the time wasted on writing to disk when the data could already be getting processed instead.

Closes #1630.
Obsoletes and therefore closes #1408.

@sistemd sistemd force-pushed the sistemd/in-memory-car-file-parsing branch 2 times, most recently from c9b28db to 3e42aec Compare March 8, 2026 16:58
Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: In-memory CAR file parsing refactor

Overall this is a well-structured refactoring that cleanly removes the file-based CAR download/mmap approach in favor of streaming HTTP reads. The removal of wait_for_cleanup from the BlockStreamer trait is a nice simplification across the codebase. The CarReader implementing AsyncRead with range-request resume and exponential backoff is a solid design.

Key findings (by priority):

Potential Bug:

  • HTTP 404 infinite retry — The CarReader retries all non-success HTTP statuses indefinitely, but 404 means the epoch doesn't exist and should not be retried. The CarReaderError::Http variant exists but is never constructed by the CarReader, making the downstream 404 handling in stream() dead code. See inline comment for suggested fix.

Robustness:

  • Fragile unreachable! — The catch-all match arm in the IO error downcasting relies on internal knowledge of what CarReader surfaces. If CarReader is later modified (e.g. to fix the 404 issue), this will panic in production.
  • No upper bound on reconnect attempts — Persistent non-transient errors (403, misconfigured URL) will retry at 30s intervals forever.
  • Metrics lost on reconnect — Partial byte counts from interrupted connections are not flushed before the MonitoredByteStream is replaced.

Cleanup:

  • Dead dependency memmap2 — Still in Cargo.toml but no longer used after the mmap removal.
  • fs-err may be demotable — Only used in the example, which can use the dev-dependency.
  • Typo — "indefinetly" in the unreachable message.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting stale pending review to clear state.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional finding from guideline review (logging).

@sistemd sistemd force-pushed the sistemd/in-memory-car-file-parsing branch from 3e42aec to 0fddf29 Compare March 9, 2026 12:14
Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: In-Memory CAR File Parsing

Overall this is a well-structured refactor that cleanly removes disk I/O from the CAR download pipeline. The CarReader state machine (Connect/Stream/Backoff) with AsyncRead implementation is a solid approach for streaming HTTP data with automatic retry. Good removal of the wait_for_cleanup trait method from BlockStreamer — it simplifies the interface.

Key findings (by priority):

Potential Bugs / Robustness:

  1. Unbounded retry in CarReader — The reader retries indefinitely on non-404 HTTP errors with no maximum attempt count. Persistent server issues could cause the stream to hang.
  2. Unbounded prev_blockhash lookup loop — No upper bound on slot skipping when searching for the previous blockhash via RPC.
  3. expect calls on external data — Multiple expect calls in read_next_slot operate on data from CAR files (entries, blockhash length, blocktime conversion). Malformed CAR data would cause panics instead of returning errors.
  4. expect on RPC-returned blockhash — Lines 96-100 use expect on RPC data that could be malformed.

Performance:
5. Overflow buffer growthVec<u8> overflow buffer can accumulate retained capacity over long downloads. Consider BytesMut or periodic shrink_to_fit.

Metrics accuracy:
6. Per-segment vs total download durationByteStreamMonitor is recreated on each reconnect, so record_car_download reflects segment time, not total download time.

Dead code:
7. fs-err dependency — Appears unused after disk I/O removal.
8. bytes dependency — May be unnecessary if only used via reqwest's re-export.

Minor:
9. compute_backoff u64 to u32 cast — Currently safe due to .min(30) clamp, but fragile to future edits.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: logging guideline violations found during automated pattern review.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: error handling guideline violations found during automated pattern review.

@sistemd sistemd force-pushed the sistemd/in-memory-car-file-parsing branch from 0fddf29 to bd00084 Compare March 10, 2026 16:34
@sistemd sistemd requested review from LNSD and leoyvens March 10, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Solana performance improvement Solana: use ObjectStore instead of the file system for Old Faithful files

2 participants