refactor(extractors/solana): in-memory CAR file parsing#1940
refactor(extractors/solana): in-memory CAR file parsing#1940
Conversation
c9b28db to
3e42aec
Compare
There was a problem hiding this comment.
Code Review: In-memory CAR file parsing refactor
Overall this is a well-structured refactoring that cleanly removes the file-based CAR download/mmap approach in favor of streaming HTTP reads. The removal of wait_for_cleanup from the BlockStreamer trait is a nice simplification across the codebase. The CarReader implementing AsyncRead with range-request resume and exponential backoff is a solid design.
Key findings (by priority):
Potential Bug:
- HTTP 404 infinite retry — The
CarReaderretries all non-success HTTP statuses indefinitely, but 404 means the epoch doesn't exist and should not be retried. TheCarReaderError::Httpvariant exists but is never constructed by theCarReader, making the downstream 404 handling instream()dead code. See inline comment for suggested fix.
Robustness:
- Fragile
unreachable!— The catch-all match arm in the IO error downcasting relies on internal knowledge of whatCarReadersurfaces. IfCarReaderis later modified (e.g. to fix the 404 issue), this will panic in production. - No upper bound on reconnect attempts — Persistent non-transient errors (403, misconfigured URL) will retry at 30s intervals forever.
- Metrics lost on reconnect — Partial byte counts from interrupted connections are not flushed before the
MonitoredByteStreamis replaced.
Cleanup:
- Dead dependency
memmap2— Still inCargo.tomlbut no longer used after the mmap removal. fs-errmay be demotable — Only used in the example, which can use the dev-dependency.- Typo — "indefinetly" in the unreachable message.
3e42aec to
0fddf29
Compare
There was a problem hiding this comment.
Code Review: In-Memory CAR File Parsing
Overall this is a well-structured refactor that cleanly removes disk I/O from the CAR download pipeline. The CarReader state machine (Connect/Stream/Backoff) with AsyncRead implementation is a solid approach for streaming HTTP data with automatic retry. Good removal of the wait_for_cleanup trait method from BlockStreamer — it simplifies the interface.
Key findings (by priority):
Potential Bugs / Robustness:
- Unbounded retry in
CarReader— The reader retries indefinitely on non-404 HTTP errors with no maximum attempt count. Persistent server issues could cause the stream to hang. - Unbounded
prev_blockhashlookup loop — No upper bound on slot skipping when searching for the previous blockhash via RPC. expectcalls on external data — Multipleexpectcalls inread_next_slotoperate on data from CAR files (entries, blockhash length, blocktime conversion). Malformed CAR data would cause panics instead of returning errors.expecton RPC-returned blockhash — Lines 96-100 useexpecton RPC data that could be malformed.
Performance:
5. Overflow buffer growth — Vec<u8> overflow buffer can accumulate retained capacity over long downloads. Consider BytesMut or periodic shrink_to_fit.
Metrics accuracy:
6. Per-segment vs total download duration — ByteStreamMonitor is recreated on each reconnect, so record_car_download reflects segment time, not total download time.
Dead code:
7. fs-err dependency — Appears unused after disk I/O removal.
8. bytes dependency — May be unnecessary if only used via reqwest's re-export.
Minor:
9. compute_backoff u64 to u32 cast — Currently safe due to .min(30) clamp, but fragile to future edits.
0fddf29 to
bd00084
Compare
Parse CAR data in memory as it is downloaded, never write CAR files to disk. This reduces disk size requirements and completely removes the time wasted on writing to disk when the data could already be getting processed instead.
Closes #1630.
Obsoletes and therefore closes #1408.