Follow-up review fixes for noodles BAM migration (#115)#117
Open
ewels wants to merge 2 commits into
Open
Conversation
Follow-up to the rust-htslib → noodles migration (#115). Three fixes surfaced during code review: - align_record: propagate CIGAR decode errors in `Record::from_bam` instead of silently falling back to an empty CIGAR via `unwrap_or_default()`. A swallowed decode error would have produced wrong coverage/indel statistics with no signal. - align_record: complete the ASCII→4-bit sequence packing table to the full htslib `seq_nt16_table` (all 16 IUPAC codes), so SAM/CRAM records containing ambiguity codes round-trip faithfully and produce samtools-identical CHK checksums. Previously every non-ACGT base was collapsed to N. - io: build the fetch `Region` directly from the reference name and a full interval instead of formatting and re-parsing `name:start-end`. The old approach mangled reference names containing `:` or `-` (e.g. HLA contigs) and lossily decoded non-UTF-8 names. Adds unit tests covering IUPAC packing round-trips, lowercase handling, and odd-length trailing-nibble layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The indexed BAM reader buffered every record of a fetched reference (and all unmapped reads) into a `Vec` before yielding any. dupRadar fetches one chromosome per worker thread concurrently, so peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. Replace the buffering with a lazy `BamCursor` that mirrors noodles' `csi::io::Query` chunk state machine: `fetch(tid)` resolves the reference's index chunks up front (small) and `read()` seeks/decodes one record at a time straight from the BGZF stream, filtering out the occasional neighbouring-reference record that shares a chunk boundary block. `fetch(Unmapped)` seeks to the index's unmapped offset (falling back to a reopen + first-record scan when the index lacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader. CRAM (sequential and indexed) still buffers — noodles' slice-based CRAM iterators don't expose an equivalent low-level chunk API — and is left as a documented follow-up. Output is unchanged: the dupRadar integration tests assert exact counts and matrices over a 2-reference fixture and drive both the per-tid and unmapped streaming paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Member
Author
Review items addressedThis PR's review of #115 found four issues, all fixed here:
Added unit tests for the IUPAC packing; the streaming path is covered by the existing dupRadar integration tests (exact counts/matrices over a 2-reference fixture, driving both the per-tid and unmapped paths). Left as documented follow-ups (not addressed)
Verification: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
PR #115 (Migrate BAM I/O from rust-htslib to noodles) was merged early by accident, so this is a follow-up review-and-fix pass on the merged code. It contains a code review of #115 plus the fixes that review surfaced.
Code review of #115
Overview
#115 replaces
rust-htslibwith the pure-Rust noodles crate for all BAM/SAM/CRAM I/O, behind a compatibility shim atsrc/rna/bam/(Reader/IndexedReader/Record/Header/ CIGAR types) that preserves the rust-htslib-shaped API the rest of the pipeline depends on. The consuming-side changes are almost entirely mechanical (rust_htslib::bam→crate::rna::bam, plususize↔i32casts now thatseq_len()returnsi32). The headline win is dropping the cmake/htslib system-library build dependency. The PR reports bit-identical samtools/dupRadar/preseq/RSeQC/Qualimap output on the test fixture.Strengths
Seq::encoded_bytes).set_threadsis honestly documented rather than silently dropped.Issues found
align_record.rs,Record::from_bam)..collect::<Result<…>>().unwrap_or_default()turned any decode error into an empty CIGAR, which would corrupt coverage/indel stats with no signal. [fixed]pack_ascii_sequence). The ASCII→4-bit table only mapped=ACGTN; every other IUPAC ambiguity code (M R S V W Y H K D B) collapsed toN. For SAM/CRAM inputs (which reconstruct packed bytes from ASCII) this breaksSeq::as_bytes()fidelity and CHK parity. [fixed]region_for_tid). Building a region byformat!("{name}:1-{len}")and re-parsing mangles reference names containing:or-(e.g. HLA contigs likeHLA-A*01:01) and lossily decodes non-UTF-8 names viafrom_utf8_lossy. [fixed]io.rs).fetch(tid)collected every record on the reference (andfetch(Unmapped)every unmapped read) into aVecbefore yielding any. Since dupRadar fetches one chromosome per worker thread concurrently, peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. [fixed]Risks / limitations noted (not changed here)
query/query_unmapped). noodles' slice-based CRAM iterators don't expose a low-level chunk API equivalent to the BAM/CSI one used for the streaming fix below, so streaming CRAM is a materially larger change. Left as a follow-up; BAM (the primary input) now streams.set_threadsis a no-op — BGZF decompression is single-threaded. Documented in the PR, but a potential throughput regression vs. htslib's multithreaded BGZF; wiringbgzf::MultithreadedReaderwould be a reasonable future enhancement.seq_len()returnsi32(rather than rust-htslib'susize), which forced the scatteredas usizecasts inaccumulators.rs. Minor API-ergonomics nit; left as-is to avoid churn.What this PR changes
All four [fixed] items above, with the no-op/limitation items left as documented follow-ups:
Correctness / robustness
Record::from_bam— propagate CIGAR decode errors with context instead ofunwrap_or_default().pack_ascii_sequence— complete the table to the full htslibseq_nt16_table(all 16 IUPAC codes; lowercase mapped to the same codes; unknown bytes still fall back toN).region_for_tid— construct theRegiondirectly from the raw reference-name bytes with a full interval (Region::new(name, ..)); the function is now infallible.Memory
BamCursorreplaces the per-referenceVecbuffering, mirroring noodles'csi::io::Querychunk state machine.fetch(tid)resolves the reference's index chunks up front andread()seeks/decodes one record at a time directly from the BGZF stream (filtering the occasional neighbouring-reference record that shares a chunk-boundary block).fetch(Unmapped)seeks to the index's unmapped offset (with a reopen + first-record-scan fallback when the.bailacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader instead of reference size × thread count. No new dependencies, nounsafe.Tests
Testing
🤖 Generated with Claude Code