feat: CID-optimal consensus tree search (CIDConsensus) by ms609 · Pull Request #213 · ms609/TreeSearch

ms609 · 2026-03-23T08:10:10Z

Summary

CIDConsensus() finds a consensus tree that minimizes the mean Clustering Information Distance (CID) to a set of input trees, using the existing C++-driven search engine.

Architecture

Dual-layer scoring for performance:

Fast layer: MRP (Matrix Representation with Parsimony) binary characters scored via Fitch for incremental screening during TBR/SPR. IW with k=7 by default (empirically maximizes rank correlation with CID, ρ ≈ 0.688).
Accurate layer: Full CID scoring via LAP (Linear Assignment Problem) matching for move acceptance decisions.
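To make the fast layer concrete, here is a minimal sketch of MRP encoding: every split of every input tree becomes one binary character, with tips on one side of the split coded 1 and the rest 0. The 64-bit bitmask representation and the name `build_mrp_matrix` are assumptions made for illustration; they are not taken from `ts_cid.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical representation: a split is a bitmask over at most 64 tips,
// with bit t set when tip t lies on the "1" side of the split.
using Split = std::uint64_t;

// Build an MRP matrix with one row per tip and one column (binary character)
// per split, concatenated over all input trees in order.
std::vector<std::vector<int>> build_mrp_matrix(
    const std::vector<std::vector<Split>>& tree_splits, int n_tips) {
  assert(n_tips > 0 && n_tips <= 64);
  std::vector<std::vector<int>> matrix(n_tips);
  for (const auto& splits : tree_splits) {
    for (Split s : splits) {
      for (int tip = 0; tip < n_tips; ++tip) {
        matrix[tip].push_back(static_cast<int>((s >> tip) & 1u));
      }
    }
  }
  return matrix;
}
```

The resulting matrix can then be scored with any parsimony routine; the PR scores it with Fitch under implied weighting.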

Two-baseline tracking in TBR, SPR, and drift maintains separate mrp_baseline (Fitch) and best_score (CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally, then verifies with full CID after reinsertion.
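The interplay of the two baselines can be sketched roughly as follows. This is an assumed reading of the design, not the PR's code: a candidate is screened out when its cheap proxy score is worse than `mrp_baseline`, and only surviving candidates pay for the accurate score, which alone decides acceptance. The struct, the function name `screen_and_accept`, and the exact screening rule are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Hypothetical state for two-baseline tracking (lower scores are better).
struct TwoBaselineState {
  double mrp_baseline;   // cheap proxy score of the current tree
  double best_score;     // accurate score of the current tree
  int full_evaluations;  // how many candidates reached the accurate layer
};

// Screen candidates 0..n_candidates-1 with the cheap score, then confirm
// survivors with the accurate score. Returns the index of the last accepted
// candidate, or -1 if none improved on best_score.
int screen_and_accept(int n_candidates,
                      const std::function<double(int)>& cheap_score,
                      const std::function<double(int)>& full_score,
                      TwoBaselineState& state) {
  int accepted = -1;
  for (int i = 0; i < n_candidates; ++i) {
    const double proxy = cheap_score(i);
    if (proxy > state.mrp_baseline) continue;  // screened out cheaply
    ++state.full_evaluations;
    const double exact = full_score(i);
    if (exact < state.best_score) {
      // Accept the move and update both baselines together.
      state.best_score = exact;
      state.mrp_baseline = proxy;
      accepted = i;
    }
  }
  return accepted;
}
```

Keeping the two baselines separate means the proxy is only ever compared with other proxy scores, and the accurate score only with other accurate scores.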

New files

| File | Purpose |
| --- | --- |
| `R/CIDConsensus.R` | R-level API, MRP dataset construction, rogue taxon dropping |
| `src/ts_cid.h` / `ts_cid.cpp` | CID scoring engine: LAP solver, MCI computation, split hashing |
| `tests/testthat/test-CIDConsensus.R` | R-level tests (504 lines) |
| `tests/testthat/test-ts-cid.R` | C++ engine tests (267 lines) |

Modified files

CID mode wired into all search modules:

ts_tbr.cpp, ts_search.cpp, ts_drift.cpp: two-baseline tracking, score_budget early termination
ts_sector.cpp: CID mode dispatch for sectorial search
ts_ratchet.cpp: CID-aware ratchet
ts_parallel.cpp: per-thread CidData deep copy (fixes data race)
ts_rcpp.cpp: ts_cid_consensus() Rcpp bridge function

Performance optimizations

Hash-based exact match lookup (O(n) vs O(n²))
Precomputed per-split log2 values
Persistent candidate split buffers (zero-alloc reuse)
Presized LAP scratch buffers
Bounded cid_score() with early termination (wired into TBR/SPR)
Batch candidate log2 reuse
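As an illustration of the hash-based exact-match idea in the list above, the sketch below counts splits shared by two trees in linear expected time by storing one tree's splits in a hash set, instead of comparing every pair. The bitmask representation, the canonical form (tip 0 always on the 0 side), and the function names are assumptions; the PR's own hashing scheme may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical representation: a split is a bitmask over at most 64 tips.
using Split = std::uint64_t;

// Canonical form: flip the split if needed so that tip 0 is on the 0 side.
// A split and its complement then map to the same key.
Split canonicalize(Split s, int n_tips) {
  assert(n_tips > 0 && n_tips <= 64);
  const Split all = (n_tips == 64) ? ~Split{0} : ((Split{1} << n_tips) - 1);
  return (s & 1u) ? (~s & all) : (s & all);
}

// Count splits present in both trees. Assumes the splits within each tree
// are distinct. Expected cost is O(|a| + |b|) rather than O(|a| * |b|).
std::size_t count_shared_splits(const std::vector<Split>& a,
                                const std::vector<Split>& b, int n_tips) {
  std::unordered_set<Split> seen;
  seen.reserve(a.size());
  for (Split s : a) seen.insert(canonicalize(s, n_tips));
  std::size_t shared = 0;
  for (Split s : b) shared += seen.count(canonicalize(s, n_tips));
  return shared;
}
```

Exactly matched splits need no assignment step, so removing them first shrinks the matrix handed to the LAP solver.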

Multi-threaded benchmarks (coalescent-derived trees):

40 tips / 50 trees: 1.43× speedup (2 threads)
80 tips / 100 trees: 1.73× speedup (2 threads)

Tests

All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.

New exported function CIDConsensus() finds consensus trees that minimize mean Clustering Information Distance to a set of input trees.

Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure
- Custom CID scorer (.CIDScorer) plugs into TreeScorer interface
- CID bootstrapper (.CIDBootstrap) resamples input trees with replacement
- cidData S3 class (environment-backed) for reference semantics
- Supports SPR, TBR, NNI, and ratchet search methods
- Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.)

Phase 2: Collapse/resolve refinement
- .CollapseRefine() greedily collapses edges that reduce CID
- Tries all pairwise polytomy resolutions to re-add splits
- Produces non-binary trees when optimal
- Enabled by default (collapse=TRUE parameter)

Test suite: 55 assertions (Tier 2, skip_on_cran)

For the default ClusteringInfoDistance metric, precompute input tree splits and clustering entropies once in .MakeCIDData(). The scorer then computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding redundant as.Splits() conversion of the N input trees on each candidate evaluation.

Benchmark (20 tips, 50 trees, SPR 100 iter):
- Before: 1.55s (6.8 ms/scorer call)
- After: 0.25s (0.9 ms/scorer call)
- Speedup: ~6x

The CIDBootstrap also resamples the precomputed splits/CEs to maintain consistency during ratchet perturbation.

Remaining bottleneck: N LAP solutions per candidate in MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible without a warm-start LAP solver.
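The identity behind this optimization can be checked with a short sketch. It assumes the usual definitions: the entropy of a split dividing n tips into groups of a and n - a, a tree's clustering entropy as the sum of its split entropies, and per-tree CID = CE(cand) + CE(input) - 2*MCI. The function names are illustrative and the MCI values are passed in rather than computed.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Entropy (in bits) of a split dividing n tips into groups of a and n - a.
double split_entropy(int a, int n) {
  assert(a > 0 && a < n);
  const double p = static_cast<double>(a) / n;
  return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

// Clustering entropy of a tree, given the smaller-side size of each split.
double clustering_entropy(const std::vector<int>& split_sizes, int n) {
  double total = 0.0;
  for (int a : split_sizes) total += split_entropy(a, n);
  return total;
}

double mean(const std::vector<double>& x) {
  assert(!x.empty());
  return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
}

// Mean CID from precomputed pieces: the input entropies are computed once,
// so only the candidate entropy and the MCI values change per candidate.
double mean_cid(double cand_ce, const std::vector<double>& input_ce,
                const std::vector<double>& mci) {
  assert(input_ce.size() == mci.size());
  return cand_ce + mean(input_ce) - 2.0 * mean(mci);
}
```

Because the mean is linear, this equals the average of the per-tree distances, so mean(CE(inputs)) can be cached for the lifetime of the search.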

Replace vapply over MutualClusteringInfoSplits with a for-loop over unclass'd raw split matrices. This eliminates per-iteration function call overhead and S3 dispatch.

- Total speedup vs naive ClusteringInfoDistance: 8.1x
- Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees)
- SPR search: 1.55s → 0.19s

LAP solver (Jonker-Volgenant), mutual clustering information, clustering entropy, MRP dataset builder, and weight management for CID consensus tree search. Verified: C++ CID matches R ClusteringInfoDistance to 1e-12. All 28 ts-cid + 121 CIDConsensus tests pass.
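For readers unfamiliar with the linear assignment problem that underlies MCI, the sketch below solves a small instance by exhaustive permutation search. This is deliberately not the Jonker-Volgenant algorithm used in the PR: it runs in factorial time and is only meant to show what is being optimized, namely the one-to-one pairing of rows and columns with the largest total score. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Exhaustive solver for a square assignment problem (maximization).
// score[i][j] is the payoff for pairing row i with column j.
// Factorial time: suitable for tiny matrices only.
double best_assignment(const std::vector<std::vector<double>>& score) {
  const std::size_t n = score.size();
  for (const auto& row : score) {
    assert(row.size() == n);  // matrix must be square
  }
  std::vector<std::size_t> perm(n);
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  double best = -std::numeric_limits<double>::infinity();
  do {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += score[i][perm[i]];
    best = std::max(best, total);
  } while (std::next_permutation(perm.begin(), perm.end()));
  return best;
}
```

In the MCI setting the rows and columns would be the splits of the two trees and each entry the mutual information of a split pair; a polynomial-time solver finds the same optimum far more efficiently.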

cid_score() now returns -mean(MCI) (negated for minimization infrastructure). The normalize parameter is removed from the R API. User-facing score is positive MCI (higher = better), converted at the CIDConsensus() boundary.

C++ changes:
- cid_score(): single branch computing -weighted_mean(MCI)
- Removed normalize/non-normalize branching
- Early termination uses cand_CE upper bound

R changes:
- Removed normalize param from CIDConsensus(), .MakeCIDData(), .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion()
- .CIDScoreFast() returns -mean(MCI) instead of mean(CID)
- CIDConsensus() negates internal score to positive MCI on output
- Verbosity messages show MCI instead of score

Test changes:
- Updated score direction checks (higher MCI = better)
- Removed normalized scoring tests
- Added mean(MutualClusteringInfo) verification test
- All 148 CID tests pass

- TBR/SPR: set score_budget = best_score + eps before candidate full_rescore(), reset to HUGE_VAL after. Enables cid_score() early termination when candidate can't beat current best.
- Drift: no budget (needs exact CID scores for RFD computation).
- ts_parallel.cpp: deep-copy CidData per worker thread so mutable scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget) don't race between threads.
- Benchmark verified: 2-thread CID produces identical scores to single-thread; 1.3-1.7x speedup across 40-80 tip scenarios.
- Mark T-150 complete.
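The budgeted early termination can be sketched as follows, assuming the score is -mean(MCI) and that no input tree's MCI can exceed the candidate's clustering entropy (cand_CE), which is the upper bound the commits mention. Precomputed per-tree MCI values stand in for the LAP solves, and the function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Score a candidate as -mean(MCI), abandoning the evaluation as soon as even
// the most optimistic completion cannot come in at or below score_budget.
// mci_per_tree[i] stands in for one expensive LAP solve against input tree i.
// Returns HUGE_VAL when terminated early; trees_scored reports the work done.
double bounded_score(const std::vector<double>& mci_per_tree, double cand_ce,
                     double score_budget, std::size_t& trees_scored) {
  const std::size_t n = mci_per_tree.size();
  assert(n > 0);
  double sum = 0.0;
  trees_scored = 0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += mci_per_tree[i];
    ++trees_scored;
    // Optimistic bound: every remaining tree reaches the maximum MCI, cand_ce.
    const double best_possible = -(sum + (n - i - 1) * cand_ce) / n;
    if (best_possible > score_budget) return HUGE_VAL;
  }
  return -sum / n;
}
```

Passing HUGE_VAL as the budget disables the cutoff, which corresponds to the drift case where exact scores are required.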

Incorporates 16 commits from cpp-search:
- Collapsed-edge clip skipping + regraft merging (polytomy-search feature)
- Cross-replicate consensus constraint tightening
- Adaptive search level (ratchet/drift scaling)
- Diversity-aware pool eviction + collapsed topology dedup
- Partial shuffle for tighter bound seeding (TBR/SPR)
- Conflict-guided RSS sector selection
- Consensus stability stopping criterion
- Ratchet tuning: 25% perturbation, 5 moves, 10 cycles
- Bug fixes (constraint boundary-edge, cancel checks in MPT enum)
- CI on all PRs

AGENTS.md: resolved by keeping both CID and Collapsed module rows, and both CID optimizations and ratchet tuning benchmark sections.

- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function
- ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score)
- ts_sector.cpp: CID mode dispatch for sectorial search
- ts_ratchet.cpp: CID-aware ratchet scoring
- ts_fitch.cpp: CID support in score helpers
- ts_data.h: CID fields in DataSet
- RcppExports, TreeSearch-init.c, NAMESPACE: register new exports
- man/CIDConsensus.Rd: documentation
- inst/test-data/takazawa/: reference data for CID tests
- inst/analysis/: MRP weighting analysis

Resolve 9 file conflicts:
- ts_data.h: merge ScoringMode enums (add XPIWE from cpp-search, keep CID)
- ts_rcpp.cpp: keep both ts_cid_consensus and ts_wagner_bias_bench/ts_test_strategy_tracker; add ts_strategy.h include alongside ts_cid.h
- RcppExports.R/cpp, TreeSearch-init.c: keep both new function registrations
- WORDLIST: add IQ, mutex from cpp-search
- settings.json, to-do.md, completed-tasks.md: prefer cpp-search (more current), re-add CID entry to completed-tasks.md

Brings in 119 commits from cpp-search including:
- Collapsed-edge clip skipping and regraft merging
- NNI-perturbation escape mechanism
- Biased Wagner addition (Goloboff 2014)
- Outer search cycle loop
- Large-tree preset tuning
- Ratchet perturbation tuning (25%, 5 moves)
- HSJ/xform inapplicable scoring
- CharacterHierarchy + recode_hierarchy
- Shiny app modularization and bug fixes
- MaddisonSlatkin C++ native scorer

No conflicts. Verified: CID code preserved in ts_drift, ts_tbr, ts_search, ts_parallel, ts_rcpp, TreeSearch-init.c. check_init.R confirms all 46 Rcpp entries match.

- RcppExports.cpp: restore missing return/END_RCPP/} at end of ts_cid_consensus (auto-merge inserted subsequent functions inside the function body)
- ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks dual MRP/CID score tracking, causing segfault)
- Move inst/analysis/ to dev/analysis/ per project conventions

@Seealso

- Rename CIDConsensus() to InfoConsensus() with deprecated alias
- Remove 'metric' param (always MCI; was vestigial)
- Remove 'start' param (redundant with multi-replicate Wagner)
- Remove unused '...' from public API
- Default nThreads to getOption('mc.cores', 1L)
- Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents
- Remove ape imports: consensus, drop.tip, keep.tip, root
- Fix unused variable warning (min_val) in ts_cid.cpp LAP solver
- Add @Seealso placeholder for TreeDist/Quartet consensus methods
- Rename source files: CIDConsensus.R -> InfoConsensus.R
- Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)

ms609 added 12 commits March 20, 2026 13:12

ms609 wants to merge 12 commits into cpp-search from feature/cid-consensus
