feat: CID-optimal consensus tree search (CIDConsensus) #213
Open
ms609 wants to merge 12 commits into cpp-search
New exported function CIDConsensus() finds consensus trees that minimize mean Clustering Information Distance to a set of input trees.

Phase 1: Binary search via existing Ratchet/TreeSearch infrastructure
- Custom CID scorer (.CIDScorer) plugs into TreeScorer interface
- CID bootstrapper (.CIDBootstrap) resamples input trees with replacement
- cidData S3 class (environment-backed) for reference semantics
- Supports SPR, TBR, NNI, and ratchet search methods
- Any TreeDist metric can serve as objective (CID, MCI, SPI, etc.)

Phase 2: Collapse/resolve refinement
- .CollapseRefine() greedily collapses edges that reduce CID
- Tries all pairwise polytomy resolutions to re-add splits
- Produces non-binary trees when optimal
- Enabled by default (collapse=TRUE parameter)

Test suite: 55 assertions (Tier 2, skip_on_cran)
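The resampling step performed by the bootstrapper can be pictured as drawing tree indices with replacement and applying the same draw to every per-tree array. The following is a minimal C++ sketch of that idea only; `bootstrap_indices()` and `take()` are hypothetical names for illustration and are not functions from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw n_trees indices uniformly, with replacement (hypothetical helper).
std::vector<std::size_t> bootstrap_indices(std::size_t n_trees,
                                           std::mt19937& rng) {
  assert(n_trees > 0);
  std::uniform_int_distribution<std::size_t> pick(0, n_trees - 1);
  std::vector<std::size_t> idx(n_trees);
  for (std::size_t& i : idx) i = pick(rng);
  return idx;
}

// Apply one index draw to any per-tree array, so that trees and their
// cached per-tree quantities stay aligned after resampling.
template <typename T>
std::vector<T> take(const std::vector<T>& items,
                    const std::vector<std::size_t>& idx) {
  std::vector<T> out;
  out.reserve(idx.size());
  for (std::size_t i : idx) {
    assert(i < items.size());
    out.push_back(items[i]);
  }
  return out;
}
```

Reusing a single index vector for every per-tree array is what keeps a resampled tree paired with its own cached values.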
For the default ClusteringInfoDistance metric, precompute input tree splits and clustering entropies once in .MakeCIDData(). The scorer then computes CID as CE(cand) + mean(CE(inputs)) - 2*mean(MCI), avoiding redundant as.Splits() conversion of the N input trees on each candidate evaluation.

Benchmark (20 tips, 50 trees, SPR 100 iter):
  Before: 1.55s (6.8 ms/scorer call)
  After: 0.25s (0.9 ms/scorer call)
  Speedup: ~6x

The CIDBootstrap also resamples the precomputed splits/CEs to maintain consistency during ratchet perturbation.

Remaining bottleneck: N LAP solutions per candidate in MutualClusteringInfoSplits (C++ in TreeDist). This is irreducible without a warm-start LAP solver.
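The identity used by the scorer can be sketched in C++ as below. This is an illustration of the arithmetic only, assuming the per-tree MCI values have already been computed; `PrecomputedInputs`, `precompute()` and `mean_cid()` are hypothetical names, not the R or C++ functions in this PR.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical cache of quantities that do not depend on the candidate.
struct PrecomputedInputs {
  std::vector<double> clustering_entropy;  // CE of each input tree
  double mean_entropy;                     // mean of the values above
};

// Computed once per search, not once per candidate.
PrecomputedInputs precompute(const std::vector<double>& input_ce) {
  assert(!input_ce.empty());
  double total = std::accumulate(input_ce.begin(), input_ce.end(), 0.0);
  return {input_ce, total / static_cast<double>(input_ce.size())};
}

// mean CID = CE(cand) + mean(CE(inputs)) - 2 * mean(MCI)
double mean_cid(double cand_ce, const PrecomputedInputs& pre,
                const std::vector<double>& mci) {
  assert(mci.size() == pre.clustering_entropy.size());
  double mean_mci = std::accumulate(mci.begin(), mci.end(), 0.0) /
                    static_cast<double>(mci.size());
  return cand_ce + pre.mean_entropy - 2.0 * mean_mci;
}
```

For input entropies {4, 6}, a candidate entropy of 5 and MCI values {3, 4}, this gives 5 + 5 - 7 = 3, the same as averaging the two per-tree distances (3 and 3) directly.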
Replace vapply over MutualClusteringInfoSplits with a for-loop over unclass'd raw split matrices. This eliminates per-iteration function call overhead and S3 dispatch.

Total speedup vs naive ClusteringInfoDistance: 8.1x
  Per-scorer-call: 6.8 ms → 0.76 ms (20 tips, 50 trees)
  SPR search: 1.55s → 0.19s
LAP solver (Jonker-Volgenant), mutual clustering information, clustering entropy, MRP dataset builder, and weight management for CID consensus tree search.

Verified: C++ CID matches R ClusteringInfoDistance to 1e-12.
All 28 ts-cid + 121 CIDConsensus tests pass.
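For orientation, clustering entropy can be sketched as follows, assuming the definition documented for TreeDist, in which the entropy of a tree is the sum of the binary entropies of its splits. The function names are hypothetical and the code in ts_cid.cpp may be organized differently.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Entropy in bits of a split that puts in_split of n_tips leaves on one side.
double split_entropy(int in_split, int n_tips) {
  assert(n_tips > 0 && in_split >= 0 && in_split <= n_tips);
  if (in_split == 0 || in_split == n_tips) return 0.0;  // trivial split
  double p = static_cast<double>(in_split) / n_tips;
  return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

// Clustering entropy of a tree, given the size of one side of each split.
double clustering_entropy(const std::vector<int>& split_sizes, int n_tips) {
  double total = 0.0;
  for (int size : split_sizes) total += split_entropy(size, n_tips);
  return total;
}
```

An even split of 8 tips (4 versus 4) carries exactly 1 bit; more uneven splits carry less.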
cid_score() now returns -mean(MCI) (negated for minimization infrastructure). The normalize parameter is removed from the R API. User-facing score is positive MCI (higher = better), converted at the CIDConsensus() boundary.

C++ changes:
- cid_score(): single branch computing -weighted_mean(MCI)
- Removed normalize/non-normalize branching
- Early termination uses cand_CE upper bound

R changes:
- Removed normalize param from CIDConsensus(), .MakeCIDData(), .CIDDrivenSearch(), .PrescreenMarginalNID(), .BestInsertion()
- .CIDScoreFast() returns -mean(MCI) instead of mean(CID)
- CIDConsensus() negates internal score to positive MCI on output
- Verbosity messages show MCI instead of score

Test changes:
- Updated score direction checks (higher MCI = better)
- Removed normalized scoring tests
- Added mean(MutualClusteringInfo) verification test
- All 148 CID tests pass
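The sign convention can be illustrated with a small C++ sketch: the search minimizes a negated weighted mean, and the value is flipped back before it is reported. `internal_score()` and `user_facing_mci()` are hypothetical names chosen for this example; they are not the signatures of cid_score() or CIDConsensus().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Internal score: negated weighted mean MCI, so that lower is better and
// the value fits search code written for minimization.
double internal_score(const std::vector<double>& mci,
                      const std::vector<double>& weights) {
  assert(!mci.empty() && mci.size() == weights.size());
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < mci.size(); ++i) {
    num += weights[i] * mci[i];
    den += weights[i];
  }
  assert(den > 0.0);
  return -(num / den);
}

// Conversion at the user-facing boundary: positive MCI, higher is better.
double user_facing_mci(double internal) { return -internal; }
```

With MCI values {2, 4} and weights {1, 3} the internal score is -3.5 and the reported MCI is 3.5.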
- TBR/SPR: set score_budget = best_score + eps before candidate full_rescore(), reset to HUGE_VAL after. Enables cid_score() early termination when candidate can't beat current best.
- Drift: no budget (needs exact CID scores for RFD computation).
- ts_parallel.cpp: deep-copy CidData per worker thread so mutable scratch buffers (lap_scratch, cand_tip_bits, cand_buf, score_budget) don't race between threads.
- Benchmark verified: 2-thread CID produces identical scores to single-thread; 1.3-1.7x speedup across 40-80 tip scenarios.
- Mark T-150 complete.
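The budget mechanism can be sketched as follows. This is one plausible reading, not the code in ts_cid.cpp: it assumes an unweighted mean and that no tree's MCI can exceed the candidate's clustering entropy, so the best score still reachable can be bounded after each tree. `score_with_budget()` and its callback are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Returns -mean(MCI), or HUGE_VAL as soon as the candidate provably cannot
// score below `budget`. `evaluated` optionally reports how many per-tree
// MCI computations were actually performed.
double score_with_budget(std::size_t n_trees, double cand_ce, double budget,
                         const std::function<double(std::size_t)>& mci_for_tree,
                         std::size_t* evaluated = nullptr) {
  assert(n_trees > 0);
  double sum = 0.0;
  std::size_t done = 0;
  while (done < n_trees) {
    sum += mci_for_tree(done);  // the expensive step (one LAP solve per tree)
    ++done;
    // Optimistic bound: every remaining tree contributes its maximum, cand_ce.
    double remaining = static_cast<double>(n_trees - done);
    double best_possible =
        -(sum + remaining * cand_ce) / static_cast<double>(n_trees);
    if (best_possible > budget) {
      if (evaluated) *evaluated = done;
      return HUGE_VAL;  // cannot beat the current best: stop early
    }
  }
  if (evaluated) *evaluated = done;
  return -sum / static_cast<double>(n_trees);
}
```

Passing HUGE_VAL as the budget disables the cutoff, which corresponds to the drift case above where exact scores are needed.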
Incorporates 16 commits from cpp-search:
- Collapsed-edge clip skipping + regraft merging (polytomy-search feature)
- Cross-replicate consensus constraint tightening
- Adaptive search level (ratchet/drift scaling)
- Diversity-aware pool eviction + collapsed topology dedup
- Partial shuffle for tighter bound seeding (TBR/SPR)
- Conflict-guided RSS sector selection
- Consensus stability stopping criterion
- Ratchet tuning: 25% perturbation, 5 moves, 10 cycles
- Bug fixes (constraint boundary-edge, cancel checks in MPT enum)
- CI on all PRs

AGENTS.md: resolved by keeping both CID and Collapsed module rows, and both CID optimizations and ratchet tuning benchmark sections.
- ts_rcpp.cpp: add ts_cid_consensus() Rcpp bridge function
- ts_drift.cpp: CID two-baseline tracking (mrp_baseline + CID score)
- ts_sector.cpp: CID mode dispatch for sectorial search
- ts_ratchet.cpp: CID-aware ratchet scoring
- ts_fitch.cpp: CID support in score helpers
- ts_data.h: CID fields in DataSet
- RcppExports, TreeSearch-init.c, NAMESPACE: register new exports
- man/CIDConsensus.Rd: documentation
- inst/test-data/takazawa/: reference data for CID tests
- inst/analysis/: MRP weighting analysis
Resolve 9 file conflicts:
- ts_data.h: merge ScoringMode enums (add XPIWE from cpp-search, keep CID)
- ts_rcpp.cpp: keep both ts_cid_consensus and ts_wagner_bias_bench/ts_test_strategy_tracker; add ts_strategy.h include alongside ts_cid.h
- RcppExports.R/cpp, TreeSearch-init.c: keep both new function registrations
- WORDLIST: add IQ, mutex from cpp-search
- settings.json, to-do.md, completed-tasks.md: prefer cpp-search (more current), re-add CID entry to completed-tasks.md
Brings in 119 commits from cpp-search including:
- Collapsed-edge clip skipping and regraft merging
- NNI-perturbation escape mechanism
- Biased Wagner addition (Goloboff 2014)
- Outer search cycle loop
- Large-tree preset tuning
- Ratchet perturbation tuning (25%, 5 moves)
- HSJ/xform inapplicable scoring
- CharacterHierarchy + recode_hierarchy
- Shiny app modularization and bug fixes
- MaddisonSlatkin C++ native scorer

No conflicts. Verified: CID code preserved in ts_drift, ts_tbr, ts_search, ts_parallel, ts_rcpp, TreeSearch-init.c. check_init.R confirms all 46 Rcpp entries match.
- RcppExports.cpp: restore missing return/END_RCPP/} at end of ts_cid_consensus (auto-merge inserted subsequent functions inside the function body)
- ts_driven.cpp: skip NNI warmup for CID mode (nni_search lacks dual MRP/CID score tracking, causing segfault)
- Move inst/analysis/ to dev/analysis/ per project conventions
- Rename CIDConsensus() to InfoConsensus() with deprecated alias
- Remove 'metric' param (always MCI; was vestigial)
- Remove 'start' param (redundant with multi-replicate Wagner)
- Remove unused '...' from public API
- Default nThreads to getOption('mc.cores', 1L)
- Replace ape::drop.tip/keep.tip/root/consensus with TreeTools equivalents
- Remove ape imports: consensus, drop.tip, keep.tip, root
- Fix unused variable warning (min_val) in ts_cid.cpp LAP solver
- Add @seealso placeholder for TreeDist/Quartet consensus methods
- Rename source files: CIDConsensus.R -> InfoConsensus.R
- Simplify .MakeCIDData, .CIDScorer, .CIDBootstrap (remove isCID branches)
Summary
CIDConsensus() finds a consensus tree that minimizes mean Clustering Information Distance (CID) to a set of input trees, using the existing C++ driven search engine.
Architecture
Dual-layer scoring for performance:
Two-baseline tracking in TBR, SPR, and drift maintains separate mrp_baseline (Fitch) and best_score (CID) values. Sectorial search (RSS/XSS) uses Fitch on MRP internally, then verifies with full CID after reinsertion.
New files
- R/CIDConsensus.R
- src/ts_cid.h / ts_cid.cpp
- tests/testthat/test-CIDConsensus.R
- tests/testthat/test-ts-cid.R
Modified files
CID mode wired into all search modules:
- ts_tbr.cpp, ts_search.cpp, ts_drift.cpp: two-baseline tracking, score_budget early termination
- ts_sector.cpp: CID mode dispatch for sectorial search
- ts_ratchet.cpp: CID-aware ratchet
- ts_parallel.cpp: per-thread CidData deep copy (fixes data race)
- ts_rcpp.cpp: ts_cid_consensus() Rcpp bridge function
Performance optimizations
- cid_score() with early termination (wired into TBR/SPR)
Multi-threaded benchmarks (coalescent-derived trees):
Tests
All 9928 tests pass (0 failures, 0 warnings). Includes 771 new CID-specific assertions across R-level and C++ test files.