MR for all Elan work done since July 2025 #16

Open
elanfisher wants to merge 49 commits into main from dev

Conversation

@elanfisher
Collaborator

No description provided.

noahcape and others added 30 commits July 22, 2024 17:09
…from the docs. Currently these only test length correctness; in my next push I'll check content and some error-handling paths. Lastly, I added a benchmark so that we can compare runtime optimizations via criterion.
…nchor appears in the right place for sci-RNA-seq3 data. Added tests that check error handling for anchor mismatches, exact geometry for no-tolerance checks, and thresholded Hamming-distance checks. Lastly, added multi-thread consistency checks for 1 vs. 4 threads for 10x and sci-RNA-seq3.
…ding of the inner workings before I make my next optimization commit.
… recent successful ANTISEQUENCE batch optimization.
elanfisher and others added 19 commits December 3, 2025 18:20
- Add search_whitelist(interval, file, dist, max_pos?) function for whitelist-based barcode search
- Implement mini-backtracking in interpreter for anchor-based parsing
- Support position-bounded search with optional max_pos parameter
- Add ExactBoundedMatch and HammingBoundedMatch for bounded search operations
- Rename test files from .fgdl to .geom extension
- Add comprehensive test cases for search_whitelist, search_anchor, and search_hamming
- Add r_umi_bc_anchor test for UMI extraction before barcode anchor
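
For reviewers unfamiliar with position-bounded whitelist search, the idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `hamming` and `search_whitelist_sketch` are hypothetical names, the whitelist is passed in memory rather than loaded from a file, and treating `max_pos` as an inclusive upper bound on the match start is an assumption.

```rust
/// Hamming distance between two equal-length byte slices; `None` if lengths differ.
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Hypothetical helper (not the PR's API): scan `read` left to right and return
/// `(start offset, whitelist index)` for the first position where some whitelist
/// entry matches within `dist` mismatches.
/// Assumption: `max_pos`, when given, is an inclusive bound on the start offset.
fn search_whitelist_sketch(
    read: &[u8],
    whitelist: &[&[u8]],
    dist: usize,
    max_pos: Option<usize>,
) -> Option<(usize, usize)> {
    for start in 0..read.len() {
        if let Some(limit) = max_pos {
            if start > limit {
                break; // stop scanning once past the position bound
            }
        }
        for (idx, bc) in whitelist.iter().enumerate() {
            let end = start + bc.len();
            if end > read.len() {
                continue; // barcode would run off the end of the read
            }
            if let Some(d) = hamming(&read[start..end], bc) {
                if d <= dist {
                    return Some((start, idx));
                }
            }
        }
    }
    None
}
```

Bounding the start position keeps a barcode that happens to recur deep in the cDNA from being picked up as a false match.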
- Add anchor_relative() function for searching anchor and extracting
  preceding elements relative to found position
- Add search() function for forcing global search
- Add search_whitelist() function for barcode whitelist search
- Implement mini-backtracking in interpreter for UnboundedLen handling
- Remove experimental edit_distance feature (marginal improvement,
  significant complexity)
- Clean up code after removal of EditDistance/EditDistanceSearch
- Import fmt_expr for dynamic file path expressions
- Implement per-sample file routing based on sample attribute
- Create output directory automatically
- Skip output validation when demux is enabled
- Route unassigned reads to unassigned_{R1,R2}.fastq files

Tested on SPLiT-seq 2018 dataset (100 reads):
- 44 samples correctly demultiplexed
- Barcode-to-sample mapping validated
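
The per-sample routing described above can be pictured with a small sketch. It is only an illustration under stated assumptions: `route` is a hypothetical function, the `{sample}_R{mate}.fastq` naming for assigned reads is assumed, and only the `unassigned_{R1,R2}.fastq` names come from the commit message.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical helper: choose the output file for a read given its barcode.
/// Reads whose barcode maps to a sample go to a per-sample file (assumed naming);
/// all other reads go to `unassigned_R1.fastq` / `unassigned_R2.fastq`.
fn route(
    out_dir: &Path,
    barcode_to_sample: &HashMap<&str, &str>,
    barcode: &str,
    mate: u8,
) -> PathBuf {
    match barcode_to_sample.get(barcode) {
        Some(sample) => out_dir.join(format!("{}_R{}.fastq", sample, mate)),
        None => out_dir.join(format!("unassigned_R{}.fastq", mate)),
    }
}
```

In a real run the output directory would be created first, for example with `std::fs::create_dir_all`, before any of these paths are opened.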
- Comments start with # and extend to end of line
- Comments can appear on their own line or after tokens
- Added lexer tests for comment parsing
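
The comment rule above amounts to dropping everything from `#` to the end of the line. A standard-library sketch of that rule is below; the lexer in this PR is built on chumsky, so this is not its implementation, and `strip_comments` is a hypothetical name. It also assumes `#` never appears inside a quoted string such as a file path.

```rust
/// Hypothetical helper: remove `#` comments from geometry source text.
/// Everything from the first `#` on a line to the end of that line is dropped,
/// and trailing whitespace is trimmed. Line structure is preserved.
/// Assumption: `#` does not occur inside string literals.
fn strip_comments(src: &str) -> String {
    src.lines()
        .map(|line| match line.find('#') {
            Some(i) => line[..i].trim_end(),
            None => line.trim_end(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```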
…tion

- Update SearchWhitelist from tuple to struct for cleaner API
- Add followed_by parameter for chained barcode+linker validation
- Implement linker validation in interpreter after barcode match
- Add search_whitelist_followed_by test with expected output
- All existing tests continue to pass
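
One way to picture the `followed_by` check: once a barcode has matched and its end offset is known, the bases immediately after it are compared against the expected linker. The sketch below is an assumption about the behavior, not the interpreter code; `linker_follows` and the `max_mismatches` tolerance are hypothetical.

```rust
/// Hypothetical helper: return true if `linker` occurs directly after the barcode,
/// i.e. at `read[barcode_end..]`, with at most `max_mismatches` substitutions.
/// Returns false when the read is too short to contain the full linker.
fn linker_follows(read: &[u8], barcode_end: usize, linker: &[u8], max_mismatches: usize) -> bool {
    let window = match read.get(barcode_end..barcode_end + linker.len()) {
        Some(w) => w,
        None => return false, // linker would run past the end of the read
    };
    let mismatches = window.iter().zip(linker).filter(|(a, b)| a != b).count();
    mismatches <= max_mismatches
}
```

Chaining the check this way rejects barcodes that match the whitelist by chance but are not in the expected barcode-plus-linker context.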
The anchor_relative code was adding a SetOp to redirect seq2.* to seq2._r
for subsequent geometry processing. However, read.set() modifies the
underlying string buffer and adjusts all intersecting mappings, which
corrupted the umi/bc3 mappings that were created at correct positions.

The fix removes this SetOp since the label vector update (label.push(_r))
at the end of each iteration already handles redirection for subsequent
pieces without modifying the shared string buffer.

This fixes the anchor_relative extraction bug where UMI and BC3 were being
extracted from wrong positions (after-anchor region instead of before-anchor).

Results:
- Before: 110K reads matched
- After: 361K reads matched (matching splitcode)
…ntation

New Features:
- anchor_relative(): Search for anchor from current position and extract preceding
  elements relative to anchor position. Handles variable-length regions (indels).
- --unassigned1/--unassigned2 CLI flags: Output reads that fail geometry matching
  to separate files (uses TryOp routing).
- filter_within_dist(): Whitelist filtering with hamming distance tolerance.
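
To make the `anchor_relative()` idea concrete, here is a rough sketch of locating an anchor and cutting fixed-length elements from the region just before it. It is illustrative only: `find_anchor` and `anchor_relative_sketch` are hypothetical names, the anchor is matched exactly (the real function may tolerate mismatches), and the element lengths are passed as a plain slice.

```rust
/// Find the first exact occurrence of `anchor` in `read` at or after `from`.
fn find_anchor(read: &[u8], anchor: &[u8], from: usize) -> Option<usize> {
    if anchor.is_empty() || from > read.len() {
        return None;
    }
    read[from..]
        .windows(anchor.len())
        .position(|w| w == anchor)
        .map(|p| p + from)
}

/// Hypothetical sketch: split the read into (before, anchor, after), then cut the
/// elements in `lens` (in read order) so that the last one ends exactly where the
/// anchor starts. Returns the elements and the region after the anchor.
fn anchor_relative_sketch<'a>(
    read: &'a [u8],
    anchor: &[u8],
    from: usize,
    lens: &[usize],
) -> Option<(Vec<&'a [u8]>, &'a [u8])> {
    let anchor_pos = find_anchor(read, anchor, from)?;
    let total: usize = lens.iter().sum();
    if anchor_pos - from < total {
        return None; // not enough sequence before the anchor
    }
    let mut cursor = anchor_pos - total;
    let mut parts = Vec::with_capacity(lens.len());
    for &len in lens {
        parts.push(&read[cursor..cursor + len]);
        cursor += len;
    }
    let after = &read[anchor_pos + anchor.len()..];
    Some((parts, after))
}
```

Because positions are computed backwards from the anchor, an indel upstream of the extracted elements shifts the anchor and the elements together, so the extraction still lands on the right bases.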

Documentation:
- docs/anchor_relative.md: Full documentation for anchor_relative function
- docs/unassigned_output.md: Documentation for --unassigned CLI flags
- docs/GEOMETRY_EXTENSIONS.md: Updated with all new geometry functions

Tests:
- Added 3 tests for anchor_relative in compile_tests.rs
- All existing tests pass

Bug Fixes:
- Fixed SetOp corruption in anchor_relative that was modifying shared string buffer
- Proper 3-way split for anchor matching (before, anchor, after)

Performance validated on SPLiT-seq 2018 500k subset:
- 2.4x faster than splitcode (1.5s vs 1.7s)
- 3.7x less memory than splitcode (36MB vs 141MB)
- 100% barcode validity with whitelist filtering
- Updated chumsky from 0.10.1 to 0.12.0
- Updated lexer to use new map_with API while preserving comment support
- Updated parser with macro-based function definitions
- Preserved dev features: Search, Anchor, SearchWhitelist variants
- Fixed lexer_tests to use chumsky 0.12 API (parse().into_output_errors())
- All tests passing
- Made --file2 optional in CLI args
- Updated execution logic to dynamically construct input graphs for 1 or 2 files
- Relaxed geometry parser to accept single-read definitions
- Verified performance with new single-end vs paired-end benchmark (~2x speedup for SE)
- Added regression tests for single-end processing
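
The optional second input can be modeled as an `Option`, which lets the same code path serve single-end and paired-end runs. The sketch below is an assumption about the shape of that logic, not the CLI code in this PR; `input_files` is a hypothetical helper.

```rust
use std::path::PathBuf;

/// Hypothetical helper: collect the input files for a run.
/// `file2` is `None` for single-end data, so the result has one or two entries.
fn input_files(file1: PathBuf, file2: Option<PathBuf>) -> Vec<PathBuf> {
    // `Option` is iterable (zero or one item), so it chains directly.
    std::iter::once(file1).chain(file2).collect()
}
```

Downstream code can then build one input per entry instead of branching on single-end versus paired-end in several places.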
@elanfisher elanfisher requested a review from noahcape January 30, 2026 01:41
2 participants