Follow-up review fixes for noodles BAM migration (#115) by ewels · Pull Request #117 · seqeralabs/RustQC

ewels · 2026-06-16T21:29:56Z

Context

PR #115 (Migrate BAM I/O from rust-htslib to noodles) was merged early by accident, so this is a follow-up review-and-fix pass on the merged code. It contains a code review of #115 plus the fixes that review surfaced.

Code review of #115

Overview

#115 replaces rust-htslib with the pure-Rust noodles crate for all BAM/SAM/CRAM I/O, behind a compatibility shim at src/rna/bam/ (Reader / IndexedReader / Record / Header / CIGAR types) that preserves the rust-htslib-shaped API the rest of the pipeline depends on. The consuming-side changes are almost entirely mechanical (rust_htslib::bam → crate::rna::bam, plus usize↔i32 casts now that seq_len() returns i32). The headline win is dropping the cmake/htslib system-library build dependency. The PR reports bit-identical samtools/dupRadar/preseq/RSeQC/Qualimap output on the test fixture.

Strengths

Clean separation: the shim isolates noodles behind a small surface, so consumers barely changed.
The "samtools-identical CHK checksum" path is preserved by keeping the packed 4-bit sequence bytes (Seq::encoded_bytes).
Good doc comments throughout and the no-op set_threads is honestly documented rather than silently dropped.

Issues found

CIGAR decode errors were silently swallowed (align_record.rs, Record::from_bam). .collect::<Result<…>>().unwrap_or_default() turned any decode error into an empty CIGAR, which would corrupt coverage/indel stats with no signal. [fixed]
Lossy sequence packing for SAM/CRAM (pack_ascii_sequence). The ASCII→4-bit table only mapped =ACGTN; every other IUPAC ambiguity code (M R S V W Y H K D B) collapsed to N. For SAM/CRAM inputs (which reconstruct packed bytes from ASCII) this breaks Seq::as_bytes() fidelity and CHK parity. [fixed]
Fragile region construction (region_for_tid). Building a region by format!("{name}:1-{len}") and re-parsing mangles reference names containing : or - (e.g. HLA contigs like HLA-A*01:01) and lossily decodes non-UTF-8 names via from_utf8_lossy. [fixed]
Indexed reads buffered whole references into memory (io.rs). fetch(tid) collected every record on the reference (and fetch(Unmapped) every unmapped read) into a Vec before yielding any. Since dupRadar fetches one chromosome per worker thread concurrently, peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. [fixed]

Risks / limitations noted (not changed here)

CRAM still buffers fully into memory (sequential whole-file read and indexed query/query_unmapped). noodles' slice-based CRAM iterators don't expose a low-level chunk API equivalent to the BAM/CSI one used for the streaming fix below, so streaming CRAM is a materially larger change. Left as a follow-up; BAM (the primary input) now streams.
set_threads is a no-op — BGZF decompression is single-threaded. Documented in the PR, but a potential throughput regression vs. htslib's multithreaded BGZF; wiring bgzf::MultithreadedReader would be a reasonable future enhancement.
seq_len() returns i32 (rather than rust-htslib's usize), which forced the scattered as usize casts in accumulators.rs. Minor API-ergonomics nit; left as-is to avoid churn.

What this PR changes

All four [fixed] items above, with the no-op/limitation items left as documented follow-ups:

Correctness / robustness

Record::from_bam — propagate CIGAR decode errors with context instead of unwrap_or_default().
pack_ascii_sequence — complete the table to the full htslib seq_nt16_table (all 16 IUPAC codes; lowercase mapped to the same codes; unknown bytes still fall back to N).
region_for_tid — construct the Region directly from the raw reference-name bytes with a full interval (Region::new(name, ..)); the function is now infallible.

Memory

Indexed BAM reads now stream — a lazy BamCursor replaces the per-reference Vec buffering, mirroring noodles' csi::io::Query chunk state machine. fetch(tid) resolves the reference's index chunks up front and read() seeks/decodes one record at a time directly from the BGZF stream (filtering the occasional neighbouring-reference record that shares a chunk-boundary block). fetch(Unmapped) seeks to the index's unmapped offset (with a reopen + first-record-scan fallback when the .bai lacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader instead of reference size × thread count. No new dependencies, no unsafe.

Tests

Added unit tests for IUPAC packing round-trips, lowercase handling, and odd-length trailing-nibble layout. The streaming path is covered by the existing dupRadar integration tests, which assert exact counts/matrices over a 2-reference fixture and drive both the per-tid and unmapped paths.

Testing

CXX=g++ cargo fmt --check                            # clean
CXX=g++ cargo clippy --all-targets -- -D warnings    # clean
CXX=g++ cargo test --release                         # 203 lib + 12 bin + 18 integration + 2 doc tests pass

🤖 Generated with Claude Code

Follow-up to the rust-htslib → noodles migration (#115). Three fixes surfaced during code review: - align_record: propagate CIGAR decode errors in `Record::from_bam` instead of silently falling back to an empty CIGAR via `unwrap_or_default()`. A swallowed decode error would have produced wrong coverage/indel statistics with no signal. - align_record: complete the ASCII→4-bit sequence packing table to the full htslib `seq_nt16_table` (all 16 IUPAC codes), so SAM/CRAM records containing ambiguity codes round-trip faithfully and produce samtools-identical CHK checksums. Previously every non-ACGT base was collapsed to N. - io: build the fetch `Region` directly from the reference name and a full interval instead of formatting and re-parsing `name:start-end`. The old approach mangled reference names containing `:` or `-` (e.g. HLA contigs) and lossily decoded non-UTF-8 names. Adds unit tests covering IUPAC packing round-trips, lowercase handling, and odd-length trailing-nibble layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The indexed BAM reader buffered every record of a fetched reference (and all unmapped reads) into a `Vec` before yielding any. dupRadar fetches one chromosome per worker thread concurrently, so peak memory scaled as reference size × thread count — gigabytes for a deep chromosome. Replace the buffering with a lazy `BamCursor` that mirrors noodles' `csi::io::Query` chunk state machine: `fetch(tid)` resolves the reference's index chunks up front (small) and `read()` seeks/decodes one record at a time straight from the BGZF stream, filtering out the occasional neighbouring-reference record that shares a chunk boundary block. `fetch(Unmapped)` seeks to the index's unmapped offset (falling back to a reopen + first-record scan when the index lacks the metadata pseudo-bin) and streams to EOF. Peak memory is now O(1) per reader. CRAM (sequential and indexed) still buffers — noodles' slice-based CRAM iterators don't expose an equivalent low-level chunk API — and is left as a documented follow-up. Output is unchanged: the dupRadar integration tests assert exact counts and matrices over a 2-reference fixture and drive both the per-tid and unmapped streaming paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ewels · 2026-06-16T21:51:32Z

Review items addressed

This PR's review of #115 found four issues, all fixed here:

✅ Swallowed CIGAR decode errors (Record::from_bam) — .unwrap_or_default() turned a decode failure into an empty CIGAR that would silently corrupt coverage/indel stats. Now propagated with context. (ea20e9b)
✅ Lossy SAM/CRAM sequence packing (pack_ascii_sequence) — the ASCII→4-bit table only handled =ACGTN; all other IUPAC codes (M R S V W Y H K D B) collapsed to N, breaking Seq::as_bytes() fidelity and CHK parity. Completed to the full htslib seq_nt16_table. (ea20e9b)
✅ Fragile region construction (region_for_tid) — formatting + re-parsing name:1-len mangled reference names containing :/- (e.g. HLA contigs) and lossily decoded non-UTF-8 names. Now builds Region::new(name, ..) directly. (ea20e9b)
✅ Indexed reads buffered whole references into memory (io.rs) — fetch(tid) collected every record on a reference (and fetch(Unmapped) every unmapped read) into a Vec, so peak memory scaled as reference size × thread count across dupRadar's per-chromosome workers. Replaced with a lazy BamCursor that streams one record at a time; peak memory is now O(1) per reader. (ddfe71a)

Added unit tests for the IUPAC packing; the streaming path is covered by the existing dupRadar integration tests (exact counts/matrices over a 2-reference fixture, driving both the per-tid and unmapped paths).

Left as documented follow-ups (not addressed)

CRAM still buffers fully (sequential whole-file and indexed) — noodles' slice-based CRAM iterators don't expose a low-level chunk API equivalent to the BAM/CSI one, so streaming CRAM is a materially larger change.
set_threads is a no-op — BGZF decompression is single-threaded; wiring bgzf::MultithreadedReader is a possible future enhancement.
seq_len() returns i32 rather than usize, forcing as usize casts in callers — minor ergonomics nit, left to avoid churn.

Verification: cargo fmt --check, cargo clippy --all-targets -- -D warnings, and the full suite (203 lib + 12 bin + 18 integration + 2 doc) all pass.

ewels and others added 2 commits June 16, 2026 23:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Follow-up review fixes for noodles BAM migration (#115)#117

Follow-up review fixes for noodles BAM migration (#115)#117
ewels wants to merge 2 commits into
mainfrom
review/pr115-followup

ewels commented Jun 16, 2026 •

edited

Loading

Uh oh!

ewels commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ewels commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Code review of #115

Overview

Strengths

Issues found

Risks / limitations noted (not changed here)

What this PR changes

Correctness / robustness

Memory

Tests

Testing

Uh oh!

ewels commented Jun 16, 2026

Review items addressed

Left as documented follow-ups (not addressed)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ewels commented Jun 16, 2026 •

edited

Loading