Skip to content

C++ search#210

Draft
ms609 wants to merge 330 commits intomainfrom
cpp-search
Draft

C++ search#210
ms609 wants to merge 330 commits intomainfrom
cpp-search

Conversation

@ms609
Copy link
Owner

@ms609 ms609 commented Mar 19, 2026

  • other optimizations + features

Manual testing underway; shiny app in particular has some usability issues.

ms609 added 30 commits March 20, 2026 12:55
Replace blunt (k, n) feasibility thresholds with a partition-shape-aware
metric: the split_count (coefficient of x^floor(n/2) in the generating
polynomial). This allows skewed multi-state characters to be computed
exactly even at large n, while still blocking balanced partitions that
would blow up.

- Add .MSSplitCount() and .MS_SC_THRESHOLD in data_manipulation.R
- Replace maxTips gate in PrepareDataProfile and pp_info_extra_step
- Shrink 5-state test from n=11 (blowup) to n=10 (0.6s)
- Add 18 split_count boundary + base-case tests (55 total, 5.6s)
…ocus 9

wagner_edge_violates_constraint (ts_wagner.cpp) and
regraft_violates_constraint (ts_constraint.cpp) both rejected the edge
directly above the constraint clade (below == cn) for MUST_OUTSIDE
tips/clades. is_ancestor_or_equal(cn, below) returns true when
below == cn, but inserting an outside element just above the constraint
clade makes it a sibling — monophyly is preserved.

Fix: add '&& below != cn' guard to the MUST_OUTSIDE rejection branch
in both functions. Search quality improvement; no correctness impact.

Tests: +2 constrained Wagner regression tests (43 total).
43 Wagner + 18 constraint + 152 driven-search tests pass.
The infeasible test cases used partitions from the old maxTips gate
(n > 25 for k=3). The split_count gate (f738091) replaced maxTips but
the test partitions weren't updated — (13,12,10) has sc=118, below
the k=3 threshold of 136, causing the exact solver to run indefinitely.

Changed infeasible k=3 partition from (13,12,10) n=35 to (13,13,12)
n=38 (sc=140 > 136). Also fixed stale comments and suppressed expected
fallback warnings.
Add compute_collapsed_flags() (ts_collapsed.h/.cpp) which identifies
zero-length edges where clipping provably cannot improve the score.
Integrated into TBR, SPR, and drift search. Disabled during MPT
enumeration to preserve equal-score topology discovery.

Benchmark (4 standard morphological datasets, 3 seeds each):
- Skip rate: 0% on all datasets — near-optimal trees have negligible
  zero-length edges in typical morphological data
- Score equivalence: confirmed (enabled vs disabled produce identical
  best scores)
- Overhead: negligible

Add test-ts-collapsed.R (Tier 2, 15 assertions): flag structure, score
equivalence, sparse-data skip counts, inapplicable characters, implied
weighting, driven search integration.

Add bench_collapsed.R benchmark script.
Update NEWS.md with search optimization entries.
Update AGENTS.md: module map, benchmark results, test count.
…r reuse

Binary (2-token) characters: O(1) Carter et al. (1990) closed-form
shortcut when both states are pure (no ambiguous tips). Eliminates
the exponential recursive algorithm entirely for this common case.
Benchmark: k=2 n=50 drops from 0.22s to <0.001s (>100x).

Multi-state (3-5 token) characters:
- FastLSETables: 4096-entry precomputed lookup tables for exp(-d)
  and log1p(exp(-d)) with linear interpolation. Used in
  LSEAccumulator::add() to replace std::exp() calls.
- lse_update(): shared inline helper for log-sum-exp accumulation,
  replacing ~6 copies of the same pattern in LogPVec/runAll.
- logconv(): changed from allocating+returning std::vector to
  writing into caller-provided buffer. Eliminates allocation per
  convolution call.
- Reusable noStepVec/yesStepVec/conv_buf buffers in LogPVec,
  allocated once per call, std::fill'd per draw pair.
Multi-state speedup: ~5-19% for k=3 boundary cases (within noise
for k=4/k=5 due to recursion depth dominating).

Also adds logDoubleFact, logN_carter, logCarter1_cpp helpers for
the Carter formula (C++ implementation of the R LogCarter1).
Phase 1: CollapsedRegions struct groups connected collapsed edges into
regions via reverse-postorder propagation. Region IDs enable efficient
per-region tracking.

Phase 3-4: TBR, SPR, and drift regraft loops skip interior collapsed
edges (collapsed[below] == 1). Boundary edges are always evaluated.
Within a collapsed region, all edges have zero local cost and identical
state propagation, so interior positions are redundant.

Phase 5: TreePool::add_collapsed() uses compute_collapsed_splits() to
hash only non-collapsed edges for dedup. Trees differing only in
zero-length resolutions are treated as duplicates, improving pool
diversity (Goloboff & Farris 2001).

Driven search pipeline updated to compute collapsed flags per
replicate and use add_collapsed() for both replicate results and
fuse results.

All 1875 ts-* tests pass. No score regressions on Vinther2008 (79)
or Zhu2013 (661).

Literature notes from Goloboff (1996), Goloboff & Farris (2001),
and Goloboff & Morales (2023) added to plan file.
The original threshold calibration (136/100/100) used sequential state
indices (1,2,3,...) instead of bitmask positions (1,2,4,8,16). This
resulted in testing a different (easier) combinatorial problem, yielding
dangerously high thresholds that allowed the exact solver to run
indefinitely on balanced partitions within the supposed safe range.

Recalibrated with correct encoding (states at 2^(i-1) positions):
  k=3: 136 → 75  (n=27 balanced 0.97s safe, n=31 balanced 1.32s marginal)
  k=4: 100 → 50  (n=13 balanced 0.36s safe, n=15 balanced 0.94s marginal)
  k=5: 100 → 35  (n=9  balanced 0.22s safe, n=10 balanced 0.49s)

Updated boundary tests and docstrings to match.
Replace CollapsedRegions struct with plain vector<uint8_t> in all
search functions. The region_id, n_regions, and n_collapsed fields
were computed but never read by any consumer — all callers only
check the collapsed flag array. Saves an O(n_node) allocation +
preorder traversal per recomputation.
ThreadSafePool::add_collapsed() wraps pool_.add_collapsed() with
mutex. Worker threads and fuse_round() now compute collapsed flags
and use add_collapsed(), matching the serial driven_search path.
extract_into() still uses standard add() since it runs post-search
and doesn't need collapsed dedup.
After a minimum number of replicates (default 5), extract the strict
consensus splits from the pool and enforce them as topological
constraints for subsequent replicates. This focuses search on
uncertain parts of the tree. Constraints are cleared whenever a
new best score is found. Opt-in via consensusConstrain = TRUE in
SearchControl(); disabled by default.

Also includes strategy preset tuning:
- default preset: wagnerStarts=3, sprFirst=TRUE, adaptiveLevel=TRUE
- thorough preset: sprFirst=TRUE

New files:
- build_constraint_from_bitsets() in ts_constraint
- extract_consensus_splits() in ts_pool
- test-ts-consensus-constrain.R (8 tests)
… consensus constraints

- Collapsed-region regraft merging (Goloboff 1996): skip interior
  zero-length edges in TBR/SPR/drift regraft loops
- Collapsed-topology pool dedup: trees differing only in zero-length
  resolutions treated as duplicates (serial + parallel paths)
- Diversity-aware pool eviction: evict most-similar entry on ties
- Cross-replicate consensus constraint tightening (opt-in)
- Strategy preset tuning: wagnerStarts=3, sprFirst, adaptiveLevel
Regraft edges were collected in preorder (root-adjacent first), which
systematically evaluates high-cost positions first before the bounded
indirect scoring has a tight cutoff.  This wastes full evaluations.

Add partial Fisher-Yates shuffle (first 20 positions) after edge
collection in TBR, SPR, and drift search.  The shuffled prefix seeds
the evaluation with edges drawn from across the tree, establishing a
tight cutoff early.  Remaining edges are evaluated in collection order.

Overhead: ~200ns per clip (20 swaps), negligible vs per-clip evaluation
cost (~3us for 75-tip tree).  Benefit: tighter bound → more early
bailouts in subsequent evaluations.

All 1782 ts-* tests pass.  Score quality unchanged (778 on Agnarsson2004
confirmed).
Add a 2-second wall-clock budget to SolverT.  Every LogB and LogPVec
entry checks std::chrono::steady_clock; if the budget is exceeded,
budget_exceeded is set and all loops bail out immediately.  runAll()
returns NA_REAL with an Rcpp::warning so downstream code can detect
the fallback.

The guard prevents infinite hangs even if a partition slips past the
split_count feasibility gate (e.g. through future threshold changes).
Chrono overhead is ~20 ns per call — negligible versus per-call work.

Verified:
- k=3 n=35 balanced (known blowup): returns NA in 2.02 s
- 137 existing tests pass (55 MS + 82 pp-multistate, 9.2 s)
Systematic profiling across all 14 benchmark datasets showed the
previous 4% perturbation probability was too gentle to escape deep
local optima. Key changes:

  default preset:
  - ratchetPerturbProb: 0.04 → 0.25
  - ratchetPerturbMaxMoves: 0 (auto=20) → 5
  - ratchetCycles: 5 → 10
  - driftCycles: 2 → 4
  - driftAfdLimit: 3 → 5
  - driftRfdLimit: 0.1 → 0.15

  thorough preset:
  - ratchetPerturbProb: 0.04 → 0.25
  - ratchetPerturbMaxMoves: 0 (auto=20) → 5
  (keep 20 cycles)

Validated on all 14 datasets: 9 improved, 4 unchanged, 1 marginal at
10s budget (positive at 20s). Median improvement on hard datasets:
Zanol2014 +7, Giles2015 +4, Wortley2006 +3.

SearchControl() defaults updated to match the default preset.
Benchmark framework strategies updated to match.
…profiles

Replace binary reduction with a log-quadratic MC approximation that
preserves the full multi-state character for infeasible MaddisonSlatkin
computations.

Key changes:
- 'auto' and 'mc' approx modes now both use MC for infeasible chars
  (no more state reduction to binary)
- .ApproxStepInformation() uses log-quadratic interpolation anchored
  at the exact P(s_min) to bridge the gap between the analytic
  minimum-steps probability and the MC body
- Default n_mc increased from 5000 to 50000 for better tail resolution
- PrepareDataProfile() no longer reduces multi-state characters in
  the data matrix; infeasibility handled entirely in StepInformation()
- Log-linear fallback when quadratic sanity checks fail (monotonicity
  or concavity)

Validated: R² = 0.9997 for log-quadratic fit vs exact on (8,8,8) n=24.
MC-estimated IC(0) matches exact within 0.5 bits at the feasibility
boundary. All 105 tests pass.
Per-phase profiling showed drift contributes ~0 score improvement per
replicate (Zhu2013, 10 seeds). Shifting 2 drift cycles to ratchet:
  ratchetCycles: 10 → 12
  driftCycles: 4 → 2

Full-pipeline validation (20s, 5 seeds):
  Zhu2013: median 644/643 (was 645/644)
  Giles2015: median 713/712 (was 714/713)
  Dikow2009: median 1611/1611 (was 1612/1611)

More replicates completed in same time budget due to cheaper per-rep cost.
Add mc_fitch_scores() in src/ts_mc_fitch.cpp: generates random trees
via random_tree() (pure C, no Morphy dependency) and scores each with
an inline bitmask Fitch downpass.  No R object allocation per tree.

Performance: 357k trees/sec on 38-tip 3-state characters (vs ~100k/s
with the old Morphy path).  This makes 100k MC samples cost ~0.28s.

- Wire mc_fitch_scores into .ApproxStepInformation(), replacing the
  R-level RandomTree + TreeLength loop
- Bump default n_mc from 50k to 100k (now affordable)
- Update docstrings to reflect the new implementation
- All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*)
Replace exp(-K) with tighter binomial bound (1-K/R)^R in
SearchConfidenceText(). Falls back to exp(-K) when K==R to avoid
overconfident 0%.

New optional parameters (wired by T-164):
- nTopologies: pool topology count; flags limited independence when T==1
- lastImprovedRep: trajectory info shown in message

Diagnostic flags:
- Ruggedness: warns when K/R < 0.3 and R >= 5 (only 10-29% of runs
  found best score)
- Small sample: nudges user to increase target hits when K==R and R<=5

Updated tooltip formula text and 58 search module tests pass.
ms609 added 30 commits March 25, 2026 14:22
ThreadSafePool::extract_into() rebuilt the output pool entry-by-entry,
which reset hits_to_best to the number of distinct topologies (often 1)
instead of the actual number of independent replicate hits.

Root cause: TreePool::add() tracks hits via its own internal counter,
but extract_into() populated a fresh pool from entries, losing the
accumulated hit count from the shared pool.

Fix: after transferring entries, propagate the real hits_to_best from
the internal pool via set_hits_to_best().

Impact: with nThreads >= 2, the Shiny app's convergence confidence
display ('K of R runs hit best score') was severely underreported
(e.g. showing 1 hit instead of 11).  The search algorithm itself was
unaffected — only the reported statistics were wrong.

Also fixes the user-facing symptom: Agnarsson2004 with k=5.62 (XPIWE)
appeared to have 2% hit rate across 230 runs, when the actual rate is
~55-60%.  The score 50.1872 is correct (exact k = 10^0.75 = 5.623413).
…orer

perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer
Add perturbStopFactor parameter to SearchControl(): stop after
nTip * K consecutive replicates that fail to improve the best score.
Default 0 (disabled). Small values (1-3) are typical.

Inspired by IQ-TREE's unsuccessful-perturbation stopping rule
(Nguyen et al. 2015), adapted from per-perturbation to per-replicate
granularity.

Implemented in both serial (ts_driven.cpp) and parallel (ts_parallel.cpp)
paths. The serial path tracks consecutive unsuccessful replicates directly;
the parallel path monitors score improvement in the main thread's
polling loop.

Two new tests: verify the rule stops search early when enabled,
and that it's a no-op when disabled (factor=0).
T-230 (commit a37984d) gated the replicate-count warning behind
verbosity > 0, but the test file used verbosity = 0L in all cases.
Update the two expect_warning() tests to use verbosity = 1L so the
warning is actually emitted.
…check

ThreadSafePool::fuse_round() was missing three things after
impose_constraint() that the serial driven_search fuse path has:

1. build_postorder() — without this, score_tree() iterates nodes in
   stale order after topology modification, producing wrong scores.
2. map_constraint_nodes() verification — impose_constraint() is
   heuristic and can fail; serial path discards the tree on failure.
3. fused_ok guard — serial path only adds to pool and resets hits
   when the repair succeeded.

Impact: with nThreads >= 2 AND topological constraints, fused trees
that violate constraints could enter the pool with incorrect scores.
Low practical impact (constraints + parallel + fuse violation is rare),
but the missing build_postorder() is a real scoring bug.

Found by S-RED focus 4 (Parallelism & RNG).
The consensus stability check in the parallel main-thread poll loop
called update_consensus_stability() every 200 ms, even when no new
replicates had completed. This caused the unchanged counter to
increment on idle polls, leading to premature search termination
when replicates are slow (e.g. large trees where each replicate
takes 30+ seconds).

Fix: track replicates_done at last check and only call
update_consensus_stability() when new replicates have been added.

Found by S-RED focus 4 (Parallelism & RNG).
Found and fixed: parallel consensus stability check incremented unchanged
counter on idle 200ms polls instead of per-replicate, causing premature
termination with slow replicates (large trees).

Also noted: R_CheckUserInterrupt in try/catch is ABI-fragile (longjmp vs
C++ exceptions), Rf_error in ts_wagner.cpp reachable from workers (low
risk), dead code in poll loop, multiple threads can trigger fuse_round
on same done count (correct but wasteful).

Reset S-RED to OPEN.
* feat(T-207/T-210): Multi-cycle PCSA perturbation phase

Cherry-picked from feature/pt-eval (closed PR #222).

- Multi-cycle PCSA: SA -> TBR polish -> keep best, repeated
  annealCycles times (T-207)
- Best-tree tracking in anneal_search() across SA phases (T-210)
- Single-phase temperature fix: n_phases=1 runs at t_start (hot)
- New SearchControl parameter: annealCycles (default 0 = disabled)
- Large preset: annealCycles=3
- Backward compat: compat wrapper auto-sets annealCycles=1 when
  annealPhases > 0 but cycles not specified
- Vignette: PCSA section in search-algorithm.Rmd
- Tests updated for annealCycles

WIP: needs build verification and GHA check before PR.

* fix: restore mc_fitch_scores export + .ts_driven_search_raw naming (cherry-pick artifacts)
…ilder

Build random binary tree that satisfies topological constraints by:
1. Ordering constraint splits smallest-to-largest
2. Assigning each tip to its tightest enclosing split
3. Bottom-up: randomly resolve each split's children into binary subtree
4. Wire root-level items (unconstrained tips + top-level splits)

Replaces Wagner fallback (T-214 workaround) for RANDOM_TREE strategy,
restoring uniform random topology sampling diversity under constraints.

Tests: 916 constraint + 24 random-constrained + 373 simplify/driven pass.
- Clarify MaddisonSlatkin() @examples (show logp intermediate)
- Point FixedDraws overflow error to StepInformation(approx='mc')
- Note MC fallback in MaddisonSlatkin Rd documentation

From uncommitted work on feature/madslatkin-profiling (PR #211).
Cherry-picked from feature/parallel-temper (6dc28a2). Replaces
static extract_divided_steps() copies with shared extract_char_steps()
in TBR, SPR, and drift. NA blocks now use three-pass correction
formula instead of raw local_cost.

ts_temper.cpp already correct (via PR #227 PCSA merge).
- T-196: cherry-picked NA+IW fix from feature/parallel-temper
- T-198-201: closed (PT ruled out; PCSA landed via PR #227)
- Removed PT section from to-do.md
- Deleted feature/parallel-temper and feature/pt-eval (remote+local)
- PT findings preserved in .positai/expertise/pt-evaluation.md
Three issues in impose_one_pass / impose_constraint:

1. Bail-out threshold n_tip/4 was too aggressive. For n_tip=5 the
   threshold was 1, so any split requiring 2+ moves caused an immediate
   bail-out before making any repairs. Raised to n_tip.

2. impose_one_pass returned 0 both for 'no violations' and 'bailed out',
   so impose_constraint couldn't distinguish the two. Now returns -1 on
   bail-out, allowing the caller to know the repair may be incomplete.

3. Documented the root-child limitation: spr_clip() doesn't fully handle
   root children, so impose_one_pass skips them. The fuse path's
   post-repair verification guard (T-214) catches these cases.

Tests: 940 constraint + 308 simplify/driven/sector pass.
spr_clip() can't detach root children (root is its own parent, so the
bypass logic fails). All search callers skip root children, but
impose_constraint needs them for constraint repair.

New topology_spr() helper in ts_constraint.cpp handles the root-child
case by absorbing the sibling into root and repurposing it as the
insertion node. No changes to spr_clip or any search caller.

Also removes the build_postorder call that was missing between
individual moves within impose_one_pass — each topology_spr is now
followed by a postorder rebuild so edge enumeration stays valid.

Tests: 942 constraint (90 impose-constraint, incl 2 new root-child
tests) + 308 simplify/driven/sector pass.
T-208/T-211: random_constrained_tree() and impose_constraint() fixes
Benchmarked perturbStopFactor across 10 morphobank/inapplicable datasets
(23-213 tips). Key findings:

- PSF=2 gives 2.4-6.9x speedup on converged searches with zero score loss
- Complementary to targetHits: on hard landscapes where few replicates
  hit the best score, PSF fires first; on easy landscapes targetHits
  fires first (PSF is irrelevant)
- PSF=5 provides smaller speedups and is too conservative for large trees

Changed default from 0 (disabled) to 2 in SearchControl() and compat
wrapper. Updated docs and fixed timeout test (needs PSF=0 to test the
timeout path, not convergence).
T-187: Perturbation-count stopping rule
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant