
feat: CID-optimal consensus tree search (CIDConsensus) #213

Open
ms609 wants to merge 12 commits into cpp-search from feature/cid-consensus

ms609 commented Mar 23, 2026

Summary

CIDConsensus() finds a consensus tree that minimizes the mean Clustering Information Distance (CID) to a set of input trees, using the existing C++-driven search engine.

Architecture

Dual-layer scoring for performance:

  • Fast layer: MRP (Matrix Representation with Parsimony) binary characters, scored with Fitch parsimony, provide incremental screening during TBR/SPR. Implied weighting (IW) with k = 7 is the default; empirically, this maximizes rank correlation with CID (ρ ≈ 0.688).
  • Accurate layer: Full CID scoring via LAP (Linear Assignment Problem) matching for move acceptance decisions.

Two-baseline tracking in TBR, SPR, and drift maintains separate mrp_baseline (Fitch) and best_score (CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally and verifies the result with full CID after reinsertion.
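The two-baseline idea can be illustrated with a rough sketch. This is not the code in this PR: `Candidate`, `TwoBaseline`, `try_move`, and `screen_slack` are hypothetical names, the scores are plain numbers standing in for Fitch and CID computations, and the exact screening and update rules are assumptions made for illustration.

```cpp
// Hypothetical stand-in for one candidate rearrangement.
struct Candidate {
  double cheap;  // stand-in for the MRP/Fitch screening score
  double exact;  // stand-in for the full CID score
};

// Hypothetical search state: one baseline per scoring layer.
struct TwoBaseline {
  double mrp_baseline;   // screening score of the current tree
  double best_score;     // accurate score of the current tree
  int full_evaluations;  // how often the expensive layer was used
};

// Returns true if the move is accepted (lower scores are better).
bool try_move(TwoBaseline& state, const Candidate& cand, double screen_slack) {
  // Fast layer: discard candidates whose cheap score is clearly worse.
  if (cand.cheap > state.mrp_baseline + screen_slack) return false;
  // Accurate layer: only surviving candidates pay for the expensive score.
  ++state.full_evaluations;
  if (cand.exact >= state.best_score) return false;
  // Accept: both baselines follow the newly accepted tree (an assumption).
  state.best_score = cand.exact;
  state.mrp_baseline = cand.cheap;
  return true;
}
```

Keeping the two values separate means the cheap score is only ever compared with another cheap score, and acceptance is decided on the accurate score alone.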

New files

| File | Purpose |
|---|---|
| R/CIDConsensus.R | R-level API, MRP dataset construction, rogue taxon dropping |
| src/ts_cid.h / ts_cid.cpp | CID scoring engine: LAP solver, MCI computation, split hashing |
| tests/testthat/test-CIDConsensus.R | R-level tests (504 lines) |
| tests/testthat/test-ts-cid.R | C++ engine tests (267 lines) |

Modified files

CID mode wired into all search modules:

  • ts_tbr.cpp, ts_search.cpp, ts_drift.cpp: two-baseline tracking, score_budget early termination
  • ts_sector.cpp: CID mode dispatch for sectorial search
  • ts_ratchet.cpp: CID-aware ratchet
  • ts_parallel.cpp: per-thread CidData deep copy (fixes data race)
  • ts_rcpp.cpp: ts_cid_consensus() Rcpp bridge function

Performance optimizations

  1. Hash-based exact match lookup (O(n) vs O(n²))
  2. Precomputed per-split log2 values
  3. Persistent candidate split buffers (zero-alloc reuse)
  4. Presized LAP scratch buffers
  5. Bounded cid_score() with early termination (wired into TBR/SPR)
  6. Batch candidate log2 reuse
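As an illustration of item 1, the sketch below counts splits shared exactly between two trees using a hash set, so that each candidate split costs one expected constant-time lookup instead of a scan over all reference splits. It assumes at most 64 tips, so that a split fits in one 64-bit mask. `Split`, `canonical`, and `count_exact_matches` are hypothetical names, and the split hashing in ts_cid.cpp may work differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Assumption: one bit per tip, at most 64 tips.
using Split = std::uint64_t;

// Canonical form: tip 0 is always on the "0" side, so a split and its
// complement hash to the same value.
Split canonical(Split s, int n_tips) {
  const Split all = (n_tips == 64) ? ~Split{0} : ((Split{1} << n_tips) - 1);
  return (s & 1) ? (~s & all) : s;
}

// Counts candidate splits that also occur in the reference tree.
// Expected O(n) for n splits, versus O(n^2) for pairwise comparison.
std::size_t count_exact_matches(const std::vector<Split>& cand,
                                const std::vector<Split>& ref, int n_tips) {
  std::unordered_set<Split> seen;
  seen.reserve(ref.size());
  for (Split s : ref) seen.insert(canonical(s, n_tips));
  std::size_t hits = 0;
  for (Split s : cand) hits += seen.count(canonical(s, n_tips));
  return hits;
}
```

For example, with six tips the masks 0x03 and 0x3C describe the same bipartition and are counted as a match.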

Multi-threaded benchmarks (coalescent-derived trees):

  • 40 tips / 50 trees: 1.43× speedup (2 threads)
  • 80 tips / 100 trees: 1.73× speedup (2 threads)

Tests

All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.

ms609 added 12 commits March 20, 2026 13:12
New exported function CIDConsensus() finds consensus trees that minimize
mean Clustering Information Distance to a set of input trees.

Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure
- Custom CID scorer (.CIDScorer) plugs into TreeScorer interface
- CID bootstrapper (.CIDBootstrap) resamples input trees with replacement
- cidData S3 class (environment-backed) for reference semantics
- Supports SPR, TBR, NNI, and ratchet search methods
- Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.)

Phase 2: Collapse/resolve refinement
- .CollapseRefine() greedily collapses edges that reduce CID
- Tries all pairwise polytomy resolutions to re-add splits
- Produces non-binary trees when optimal
- Enabled by default (collapse=TRUE parameter)

Test suite: 55 assertions (Tier 2, skip_on_cran)

For the default ClusteringInfoDistance metric, precompute input tree
splits and clustering entropies once in .MakeCIDData(). The scorer then
computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding
redundant as.Splits() conversion of the N input trees on each candidate
evaluation.
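The arithmetic behind this shortcut can be sketched as follows. The input-tree entropies are assumed to have been computed once up front, and the MCI values are taken as given rather than computed here. `split_entropy`, `mean_of`, and `mean_cid` are hypothetical helper names, not functions from this PR.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Entropy, in bits, of a split that places `a` of `n` tips on one side.
double split_entropy(int a, int n) {
  const double p = static_cast<double>(a) / n;
  if (p <= 0.0 || p >= 1.0) return 0.0;  // trivial split: no information
  return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Arithmetic mean; the caller must pass a non-empty vector.
double mean_of(const std::vector<double>& x) {
  return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
}

// Mean CID = CE(cand) + mean(CE(inputs)) - 2 * mean(MCI).
// `input_ce` is precomputed once; only `cand_ce` and `mci` change
// from one candidate to the next.
double mean_cid(double cand_ce, const std::vector<double>& input_ce,
                const std::vector<double>& mci) {
  return cand_ce + mean_of(input_ce) - 2.0 * mean_of(mci);
}
```

Because mean(CE(inputs)) is constant across candidates, only the candidate's own entropy and the N mutual-information terms need to be recomputed on each evaluation.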

Benchmark (20 tips, 50 trees, SPR 100 iter):
  Before: 1.55s  (6.8 ms/scorer call)
  After:  0.25s  (0.9 ms/scorer call)
  Speedup: ~6x

The CIDBootstrap also resamples the precomputed splits/CEs to maintain
consistency during ratchet perturbation.
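One way to keep the precomputed values consistent under resampling is to draw each index once and apply it to every per-tree vector. The sketch below assumes this approach; `Precomputed` and `resample` are hypothetical names, and an integer id stands in for each tree's precomputed splits.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container for per-tree precomputed values.
struct Precomputed {
  std::vector<int> split_id;    // stand-in for each tree's split set
  std::vector<double> entropy;  // each tree's clustering entropy
};

// Resamples trees with replacement. A single draw indexes both vectors,
// so every resampled split set stays paired with its own entropy.
Precomputed resample(const Precomputed& in, std::mt19937& rng) {
  Precomputed out;
  const std::size_t n = in.split_id.size();
  if (n == 0) return out;
  std::uniform_int_distribution<std::size_t> pick(0, n - 1);
  for (std::size_t k = 0; k < n; ++k) {
    const std::size_t i = pick(rng);
    out.split_id.push_back(in.split_id[i]);
    out.entropy.push_back(in.entropy[i]);
  }
  return out;
}
```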

Remaining bottleneck: N LAP solutions per candidate in
MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible
without a warm-start LAP solver.

Replace vapply over MutualClusteringInfoSplits with a for-loop over
unclass'd raw split matrices. This eliminates per-iteration function
call overhead and S3 dispatch.

Total speedup vs naive ClusteringInfoDistance: 8.1x
Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees)
SPR search: 1.55s → 0.19s

Add a LAP solver (Jonker-Volgenant), mutual clustering information,
clustering entropy, MRP dataset builder, and weight management
for CID consensus tree search.

Verified: C++ CID matches R ClusteringInfoDistance to 1e-12.
All 28 ts-cid + 121 CIDConsensus tests pass.
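For readers unfamiliar with the linear assignment problem, the sketch below solves a small square instance. It uses the classic O(n³) Hungarian (Kuhn-Munkres) method with dual potentials, not the Jonker-Volgenant solver added in this commit, and `solve_lap` is a hypothetical name. To maximize total mutual information instead of minimizing a cost, the pairwise scores would be negated first.

```cpp
#include <limits>
#include <vector>

// Minimizes sum_i cost[i][assignment[i]] over all one-to-one assignments
// of rows to columns of a square matrix. Returns the minimal total cost.
double solve_lap(const std::vector<std::vector<double>>& cost,
                 std::vector<int>& assignment) {
  const int n = static_cast<int>(cost.size());
  const double INF = std::numeric_limits<double>::infinity();
  std::vector<double> u(n + 1, 0.0), v(n + 1, 0.0);  // dual potentials
  std::vector<int> p(n + 1, 0);    // p[j]: row matched to column j (1-based)
  std::vector<int> way(n + 1, 0);  // back-pointers along the augmenting path
  for (int i = 1; i <= n; ++i) {
    p[0] = i;
    int j0 = 0;
    std::vector<double> minv(n + 1, INF);
    std::vector<char> used(n + 1, 0);
    do {  // grow the alternating tree until a free column is reached
      used[j0] = 1;
      const int i0 = p[j0];
      double delta = INF;
      int j1 = 0;
      for (int j = 1; j <= n; ++j) {
        if (used[j]) continue;
        const double cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
        if (cur < minv[j]) { minv[j] = cur; way[j] = j0; }
        if (minv[j] < delta) { delta = minv[j]; j1 = j; }
      }
      for (int j = 0; j <= n; ++j) {
        if (used[j]) { u[p[j]] += delta; v[j] -= delta; }
        else minv[j] -= delta;
      }
      j0 = j1;
    } while (p[j0] != 0);
    do {  // augment: flip matches along the path back to column 0
      const int j1 = way[j0];
      p[j0] = p[j1];
      j0 = j1;
    } while (j0 != 0);
  }
  assignment.assign(n, -1);
  double total = 0.0;
  for (int j = 1; j <= n; ++j) {
    assignment[p[j] - 1] = j - 1;
    total += cost[p[j] - 1][j - 1];
  }
  return total;
}
```

With the cost matrix {{4, 1, 3}, {2, 0, 5}, {3, 2, 2}}, the cheapest assignment pairs row 0 with column 1, row 1 with column 0, and row 2 with column 2, for a total cost of 5.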

cid_score() now returns -mean(MCI) (negated for minimization
infrastructure). The normalize parameter is removed from the R API.
User-facing score is positive MCI (higher = better), converted at
the CIDConsensus() boundary.

C++ changes:
- cid_score(): single branch computing -weighted_mean(MCI)
- Removed normalize/non-normalize branching
- Early termination uses cand_CE upper bound

R changes:
- Removed normalize param from CIDConsensus(), .MakeCIDData(),
  .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion()
- .CIDScoreFast() returns -mean(MCI) instead of mean(CID)
- CIDConsensus() negates internal score to positive MCI on output
- Verbosity messages show MCI instead of score

Test changes:
- Updated score direction checks (higher MCI = better)
- Removed normalized scoring tests
- Added mean(MutualClusteringInfo) verification test
- All 148 CID tests pass

- TBR/SPR: set score_budget = best_score + eps before candidate
  full_rescore(), reset to HUGE_VAL after. Enables cid_score()
  early termination when candidate can't beat current best.
- Drift: no budget (drift needs exact CID scores to compute the
  relative fit difference, RFD).
- ts_parallel.cpp: deep-copy CidData per worker thread so mutable
  scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget)
  don't race between threads.
- Benchmark verified: 2-thread CID produces identical scores to
  single-thread; 1.3-1.7x speedup across 40-80 tip scenarios.
- Mark T-150 complete.
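The budget mechanism can be sketched as below. The sketch makes simplifying assumptions: the score is a mean of non-negative per-tree terms, so a partial sum that already exceeds the budget can never recover, whereas the real cid_score() bounds MCI using the candidate's clustering entropy. `Scorer`, `bounded_mean`, and `BudgetGuard` are hypothetical names, and the RAII guard is a design choice for the sketch; the commit describes setting and resetting the field around full_rescore().

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scorer with a mutable budget, HUGE_VAL meaning "no bound".
struct Scorer {
  double score_budget = HUGE_VAL;
  std::size_t terms_evaluated = 0;

  // Mean of non-negative per-tree terms. Returns HUGE_VAL as soon as the
  // partial sum shows that the mean must exceed the budget.
  double bounded_mean(const std::vector<double>& per_tree) {
    const double limit = score_budget * per_tree.size();
    double sum = 0.0;
    for (double d : per_tree) {
      sum += d;
      ++terms_evaluated;
      if (sum > limit) return HUGE_VAL;  // cannot beat the current best
    }
    return per_tree.empty() ? 0.0 : sum / per_tree.size();
  }
};

// Sets the budget for one candidate and restores "no bound" afterwards,
// including on early exit from the enclosing scope.
class BudgetGuard {
 public:
  BudgetGuard(Scorer& s, double budget) : s_(s) { s_.score_budget = budget; }
  ~BudgetGuard() { s_.score_budget = HUGE_VAL; }

 private:
  Scorer& s_;
};
```

A score returned under a budget is exact only when it is within the budget, which is consistent with drift opting out: it needs the true value even for worse candidates.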

Incorporates 16 commits from cpp-search:
- Collapsed-edge clip skipping + regraft merging (polytomy-search feature)
- Cross-replicate consensus constraint tightening
- Adaptive search level (ratchet/drift scaling)
- Diversity-aware pool eviction + collapsed topology dedup
- Partial shuffle for tighter bound seeding (TBR/SPR)
- Conflict-guided RSS sector selection
- Consensus stability stopping criterion
- Ratchet tuning: 25% perturbation, 5 moves, 10 cycles
- Bug fixes (constraint boundary-edge, cancel checks in MPT enum)
- CI on all PRs

AGENTS.md: resolved by keeping both CID and Collapsed module rows,
and both CID optimizations and ratchet tuning benchmark sections.

- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function
- ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score)
- ts_sector.cpp: CID mode dispatch for sectorial search
- ts_ratchet.cpp: CID-aware ratchet scoring
- ts_fitch.cpp: CID support in score helpers
- ts_data.h: CID fields in DataSet
- RcppExports, TreeSearch-init.c, NAMESPACE: register new exports
- man/CIDConsensus.Rd: documentation
- inst/test-data/takazawa/: reference data for CID tests
- inst/analysis/: MRP weighting analysis

Resolve 9 file conflicts:
- ts_data.h: merge ScoringMode enums (add XPIWE from cpp-search, keep CID)
- ts_rcpp.cpp: keep both ts_cid_consensus and ts_wagner_bias_bench/
  ts_test_strategy_tracker; add ts_strategy.h include alongside ts_cid.h
- RcppExports.R/cpp, TreeSearch-init.c: keep both new function registrations
- WORDLIST: add IQ, mutex from cpp-search
- settings.json, to-do.md, completed-tasks.md: prefer cpp-search (more
  current), re-add CID entry to completed-tasks.md

Brings in 119 commits from cpp-search including:
- Collapsed-edge clip skipping and regraft merging
- NNI-perturbation escape mechanism
- Biased Wagner addition (Goloboff 2014)
- Outer search cycle loop
- Large-tree preset tuning
- Ratchet perturbation tuning (25%, 5 moves)
- HSJ/xform inapplicable scoring
- CharacterHierarchy + recode_hierarchy
- Shiny app modularization and bug fixes
- MaddisonSlatkin C++ native scorer

No conflicts. Verified: CID code preserved in ts_drift, ts_tbr,
ts_search, ts_parallel, ts_rcpp, TreeSearch-init.c. check_init.R
confirms all 46 Rcpp entries match.

- RcppExports.cpp: restore missing return/END_RCPP/} at end of
  ts_cid_consensus (auto-merge inserted subsequent functions inside
  the function body)
- ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks
  dual MRP/CID score tracking, causing segfault)
- Move inst/analysis/ to dev/analysis/ per project conventions

- Rename CIDConsensus() to InfoConsensus() with deprecated alias
- Remove 'metric' param (always MCI; was vestigial)
- Remove 'start' param (redundant with multi-replicate Wagner)
- Remove unused '...' from public API
- Default nThreads to getOption('mc.cores', 1L)
- Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents
- Remove ape imports: consensus, drop.tip, keep.tip, root
- Fix unused variable warning (min_val) in ts_cid.cpp LAP solver
- Add @seealso placeholder for TreeDist/Quartet consensus methods
- Rename source files: CIDConsensus.R -> InfoConsensus.R
- Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)