Skip to content

Follow-up review fixes for noodles BAM migration (#115)#117

Open
ewels wants to merge 2 commits into
mainfrom
review/pr115-followup
Open

Follow-up review fixes for noodles BAM migration (#115)#117
ewels wants to merge 2 commits into
mainfrom
review/pr115-followup

Conversation

@ewels

@ewels ewels commented Jun 16, 2026

Copy link
Copy Markdown
Member

Context

PR #115 (Migrate BAM I/O from rust-htslib to noodles) was merged early by accident, so this is a follow-up review-and-fix pass on the merged code. It contains a code review of #115 plus the fixes that review surfaced.


Code review of #115

Overview

#115 replaces rust-htslib with the pure-Rust noodles crate for all BAM/SAM/CRAM I/O, behind a compatibility shim at src/rna/bam/ (Reader / IndexedReader / Record / Header / CIGAR types) that preserves the rust-htslib-shaped API the rest of the pipeline depends on. The consuming-side changes are almost entirely mechanical (rust_htslib::bamcrate::rna::bam, plus usizei32 casts now that seq_len() returns i32). The headline win is dropping the cmake/htslib system-library build dependency. The PR reports bit-identical samtools/dupRadar/preseq/RSeQC/Qualimap output on the test fixture.

Strengths

  • Clean separation: the shim isolates noodles behind a small surface, so consumers barely changed.
  • The "samtools-identical CHK checksum" path is preserved by keeping the packed 4-bit sequence bytes (Seq::encoded_bytes).
  • Good doc comments throughout and the no-op set_threads is honestly documented rather than silently dropped.

Issues found

  1. CIGAR decode errors were silently swallowed (align_record.rs, Record::from_bam). .collect::<Result<…>>().unwrap_or_default() turned any decode error into an empty CIGAR, which would corrupt coverage/indel stats with no signal. [fixed]
  2. Lossy sequence packing for SAM/CRAM (pack_ascii_sequence). The ASCII→4-bit table only mapped =ACGTN; every other IUPAC ambiguity code (M R S V W Y H K D B) collapsed to N. For SAM/CRAM inputs (which reconstruct packed bytes from ASCII) this breaks Seq::as_bytes() fidelity and CHK parity. [fixed]
  3. Fragile region construction (region_for_tid). Building a region by format!("{name}:1-{len}") and re-parsing mangles reference names containing : or - (e.g. HLA contigs like HLA-A*01:01) and lossily decodes non-UTF-8 names via from_utf8_lossy. [fixed]
  4. Indexed reads buffered whole references into memory (io.rs). fetch(tid) collected every record on the reference (and fetch(Unmapped) every unmapped read) into a Vec before yielding any. Since dupRadar fetches one chromosome per worker thread concurrently, peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. [fixed]

Risks / limitations noted (not changed here)

  • CRAM still buffers fully into memory (sequential whole-file read and indexed query/query_unmapped). noodles' slice-based CRAM iterators don't expose a low-level chunk API equivalent to the BAM/CSI one used for the streaming fix below, so streaming CRAM is a materially larger change. Left as a follow-up; BAM (the primary input) now streams.
  • set_threads is a no-op — BGZF decompression is single-threaded. Documented in the PR, but a potential throughput regression vs. htslib's multithreaded BGZF; wiring bgzf::MultithreadedReader would be a reasonable future enhancement.
  • seq_len() returns i32 (rather than rust-htslib's usize), which forced the scattered as usize casts in accumulators.rs. Minor API-ergonomics nit; left as-is to avoid churn.

What this PR changes

All four [fixed] items above, with the no-op/limitation items left as documented follow-ups:

Correctness / robustness

  • Record::from_bam — propagate CIGAR decode errors with context instead of unwrap_or_default().
  • pack_ascii_sequence — complete the table to the full htslib seq_nt16_table (all 16 IUPAC codes; lowercase mapped to the same codes; unknown bytes still fall back to N).
  • region_for_tid — construct the Region directly from the raw reference-name bytes with a full interval (Region::new(name, ..)); the function is now infallible.

Memory

  • Indexed BAM reads now stream — a lazy BamCursor replaces the per-reference Vec buffering, mirroring noodles' csi::io::Query chunk state machine. fetch(tid) resolves the reference's index chunks up front and read() seeks/decodes one record at a time directly from the BGZF stream (filtering the occasional neighbouring-reference record that shares a chunk-boundary block). fetch(Unmapped) seeks to the index's unmapped offset (with a reopen + first-record-scan fallback when the .bai lacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader instead of reference size × thread count. No new dependencies, no unsafe.

Tests

  • Added unit tests for IUPAC packing round-trips, lowercase handling, and odd-length trailing-nibble layout. The streaming path is covered by the existing dupRadar integration tests, which assert exact counts/matrices over a 2-reference fixture and drive both the per-tid and unmapped paths.

Testing

CXX=g++ cargo fmt --check                            # clean
CXX=g++ cargo clippy --all-targets -- -D warnings    # clean
CXX=g++ cargo test --release                         # 203 lib + 12 bin + 18 integration + 2 doc tests pass

🤖 Generated with Claude Code

ewels and others added 2 commits June 16, 2026 23:29
Follow-up to the rust-htslib → noodles migration (#115). Three fixes
surfaced during code review:

- align_record: propagate CIGAR decode errors in `Record::from_bam`
  instead of silently falling back to an empty CIGAR via
  `unwrap_or_default()`. A swallowed decode error would have produced
  wrong coverage/indel statistics with no signal.

- align_record: complete the ASCII→4-bit sequence packing table to the
  full htslib `seq_nt16_table` (all 16 IUPAC codes), so SAM/CRAM records
  containing ambiguity codes round-trip faithfully and produce
  samtools-identical CHK checksums. Previously every non-ACGT base was
  collapsed to N.

- io: build the fetch `Region` directly from the reference name and a
  full interval instead of formatting and re-parsing `name:start-end`.
  The old approach mangled reference names containing `:` or `-`
  (e.g. HLA contigs) and lossily decoded non-UTF-8 names.

Adds unit tests covering IUPAC packing round-trips, lowercase handling,
and odd-length trailing-nibble layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The indexed BAM reader buffered every record of a fetched reference (and
all unmapped reads) into a `Vec` before yielding any. dupRadar fetches
one chromosome per worker thread concurrently, so peak memory scaled as
reference size × thread count — gigabytes for a deep chromosome.

Replace the buffering with a lazy `BamCursor` that mirrors noodles'
`csi::io::Query` chunk state machine: `fetch(tid)` resolves the
reference's index chunks up front (small) and `read()` seeks/decodes one
record at a time straight from the BGZF stream, filtering out the
occasional neighbouring-reference record that shares a chunk boundary
block. `fetch(Unmapped)` seeks to the index's unmapped offset (falling
back to a reopen + first-record scan when the index lacks the metadata
pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader.

CRAM (sequential and indexed) still buffers — noodles' slice-based CRAM
iterators don't expose an equivalent low-level chunk API — and is left
as a documented follow-up.

Output is unchanged: the dupRadar integration tests assert exact counts
and matrices over a 2-reference fixture and drive both the per-tid and
unmapped streaming paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ewels

ewels commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Review items addressed

This PR's review of #115 found four issues, all fixed here:

  1. Swallowed CIGAR decode errors (Record::from_bam) — .unwrap_or_default() turned a decode failure into an empty CIGAR that would silently corrupt coverage/indel stats. Now propagated with context. (ea20e9b)
  2. Lossy SAM/CRAM sequence packing (pack_ascii_sequence) — the ASCII→4-bit table only handled =ACGTN; all other IUPAC codes (M R S V W Y H K D B) collapsed to N, breaking Seq::as_bytes() fidelity and CHK parity. Completed to the full htslib seq_nt16_table. (ea20e9b)
  3. Fragile region construction (region_for_tid) — formatting + re-parsing name:1-len mangled reference names containing :/- (e.g. HLA contigs) and lossily decoded non-UTF-8 names. Now builds Region::new(name, ..) directly. (ea20e9b)
  4. Indexed reads buffered whole references into memory (io.rs) — fetch(tid) collected every record on a reference (and fetch(Unmapped) every unmapped read) into a Vec, so peak memory scaled as reference size × thread count across dupRadar's per-chromosome workers. Replaced with a lazy BamCursor that streams one record at a time; peak memory is now O(1) per reader. (ddfe71a)

Added unit tests for the IUPAC packing; the streaming path is covered by the existing dupRadar integration tests (exact counts/matrices over a 2-reference fixture, driving both the per-tid and unmapped paths).

Left as documented follow-ups (not addressed)

  • CRAM still buffers fully (sequential whole-file and indexed) — noodles' slice-based CRAM iterators don't expose a low-level chunk API equivalent to the BAM/CSI one, so streaming CRAM is a materially larger change.
  • set_threads is a no-op — BGZF decompression is single-threaded; wiring bgzf::MultithreadedReader is a possible future enhancement.
  • seq_len() returns i32 rather than usize, forcing as usize casts in callers — minor ergonomics nit, left to avoid churn.

Verification: cargo fmt --check, cargo clippy --all-targets -- -D warnings, and the full suite (203 lib + 12 bin + 18 integration + 2 doc) all pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant