B&B CPU determinism #798

Open

aliceb-nv wants to merge 429 commits into NVIDIA:release/26.02 from aliceb-nv:determinism

Conversation

@aliceb-nv (Contributor) commented Jan 26, 2026

This PR introduces a deterministic execution mode for the parallel branch-and-bound MIP solver.
When determinism_mode = CUOPT_MODE_DETERMINISTIC, the solver guarantees bitwise-identical results across runs regardless of thread scheduling variations, as long as the runs use the same platform and environment (same glibc, etc.).

The chosen approach is referred to in the literature as Bulk Synchronous Parallel (BSP): an execution model in which computation proceeds in discrete horizons of virtual time (work units) separated by synchronization barriers.
To cope with the inherent nondeterminism of wall-clock time (caching state, OS scheduling, CPU throttling, ...), progress is measured not in time but in "work units" that approximate an algorithm's execution time from known, deterministic quantities such as memory operations, problem features, and sparsity.
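For illustration only (hypothetical type, not the actual API), a work-unit counter driven by deterministic operation counts might look like this; the 1e8 scaling mirrors the record_work snippet discussed further down, everything else is illustrative:

```cpp
#include <cstddef>

// Hypothetical sketch: progress is measured in work units derived from
// deterministic operation counts instead of wall-clock time.
struct work_unit_counter_t {
  double work_units = 0.0;

  // Charge work proportional to the number of memory operations performed;
  // the 1e8 scaling factor is illustrative.
  void record_memory_ops(std::size_t loads, std::size_t stores)
  {
    work_units += static_cast<double>(loads + stores) / 1e8;
  }
};
```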

Workers explore the tree independently and locally within a horizon, collecting events (branching decisions, solutions, pseudo-cost updates) along the way without exchanging them yet.
At each horizon boundary, events are sorted by work-unit timestamp with deterministic tie-breaking, the history is replayed in that order to update global state, and, if needed, nodes are redistributed across workers to balance load (this replaces the ramp-up mechanism).
Workers operate on local snapshots of shared state (pseudo-costs, upper bound, LP iteration count) taken at horizon start to avoid read races. This trades some accuracy in decisions (pruning may rely on a higher lower bound than in nondeterministic mode) for determinism.
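A minimal sketch of the deterministic merge at a horizon boundary (type and field names are illustrative, not the actual event structures):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative only: events carry a virtual-time stamp plus a
// scheduling-independent tie-breaker, so the replay order is reproducible.
struct event_t {
  double work_unit_timestamp;  // virtual time at which the event occurred
  std::uint64_t packed_id;     // e.g. worker id + per-worker sequence number
};

inline void sort_events_for_replay(std::vector<event_t>& events)
{
  std::sort(events.begin(), events.end(), [](const event_t& a, const event_t& b) {
    if (a.work_unit_timestamp != b.work_unit_timestamp) {
      return a.work_unit_timestamp < b.work_unit_timestamp;
    }
    return a.packed_id < b.packed_id;  // deterministic tie-break
  });
  // The sorted events would then be applied to global state (incumbent,
  // pseudo-costs, node ownership) in this fixed order.
}
```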

Start CPU heuristics are supported via a "producer" model: CPUFJ starts immediately and begins producing solutions tagged with work-unit timestamps, which are consumed by the B&B thread once the corresponding virtual time is reached. A form of one-way synchronization prevents the CPUFJ thread from "falling behind" and producing solutions in the past from the B&B thread's perspective. CPUFJ is allowed to run ahead, and is biased to run faster than B&B to avoid unnecessary syncs.
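As a rough sketch of that hand-off (hypothetical types; the actual queued_heuristic_solution_t differs):

```cpp
#include <utility>
#include <vector>

// Illustrative only: heuristic solutions are queued with a work-unit
// timestamp and handed to B&B once its virtual clock passes that timestamp,
// independent of how far ahead the CPUFJ thread is in wall-clock time.
struct queued_solution_t {
  double work_unit_timestamp;
  double objective;
  std::vector<double> assignment;
};

inline std::vector<queued_solution_t> take_due_solutions(std::vector<queued_solution_t>& queue,
                                                         double current_horizon)
{
  std::vector<queued_solution_t> due, future;
  for (auto& sol : queue) {
    (sol.work_unit_timestamp < current_horizon ? due : future).push_back(std::move(sol));
  }
  queue = std::move(future);
  return due;  // sorted and applied deterministically by the B&B thread
}
```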

B&B BFS workers, diving workers, and a start CPUFJ thread are supported in deterministic mode. GPU Heuristics are disabled in this mode and will be incorporated into the deterministic framework in the future.

Note: files in the utilities/models/ folder are auto-generated from the LightGBM models trained to predict work units from algorithm features.

The implementation is described in more detail in this document: B&B Determinism

Benchmark results

B&B alone, 10min

  • Nondeterministic B&B alone (default): 174 feasible, 38.3% primal gap
  • Deterministic B&B alone: 167 feasible, 40.3% primal gap
  • Deterministic B&B+CPUFJ: 184 feasible, 35.9% primal gap

Regression testing against the main branch:
With changes:
Baseline:

Summary by CodeRabbit

  • New Features

    • Added work-limit control, a deterministic mode, and a configurable seed; new CLI flags to set work-limit and enable determinism.
  • Infrastructure

    • Expanded profiling, memory and timing instrumentation; added work-unit synchronization and scheduling for coordinated runs; deterministic execution scaffolding.
  • Tests

    • New deterministic reproducibility tests verifying consistent termination, objectives, and solutions across repeated runs (a rough sketch of the idea is shown below).
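For illustration, the general shape of such a test (solve_deterministic is a hypothetical stand-in for the actual deterministic-mode solve, not the real test fixture):

```cpp
#include <cassert>
#include <vector>

struct result_t {
  double objective;
  std::vector<double> solution;
};

// Hypothetical stand-in for a full deterministic-mode solve of one instance.
result_t solve_deterministic(int seed);

inline void check_bitwise_reproducibility()
{
  const result_t a = solve_deterministic(/*seed=*/42);
  const result_t b = solve_deterministic(/*seed=*/42);
  assert(a.objective == b.objective);  // exact equality, not a tolerance check
  assert(a.solution == b.solution);    // element-wise equality of the assignment
}
```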

PoolT& pool, const std::vector<f_t>& incumbent_snapshot, double horizon_start, double horizon_end)
{
for (auto& worker : pool) {
worker.set_snapshots(upper_bound_.load(),

What does a snapshot consist of? The incumbent and the pseudocosts?


std::sort(to_repair.begin(),
to_repair.end(),
[](const std::vector<f_t>& a, const std::vector<f_t>& b) { return a < b; });

What is this sorting by? The address?

if (success) {
// Queue repaired solution with work unit timestamp (...workstamp?)
mutex_heuristic_queue_.lock();
heuristic_solution_queue_.push_back(

This tries to repair the solution and, if it can be repaired, puts it into a heuristic queue? Since all these are tagged with the same determinism_current_horizon_, is there a reason why they can't be processed now? That is, why do they need to go into a queue?

{
std::vector<queued_heuristic_solution_t> future_solutions;
for (auto& sol : heuristic_solution_queue_) {
if (sol.wut < determinism_current_horizon_) {

I keep forgetting what a wut is. It might be better to write this out as a work_unit_timestamp.

[](const queued_heuristic_solution_t& a, const queued_heuristic_solution_t& b) {
if (a.wut != b.wut) { return a.wut < b.wut; }
if (a.objective != b.objective) { return a.objective < b.objective; }
return a.solution < b.solution; // edge-case - lexicographical comparison

What comparison is being used here? Address of the solution?

case bb_event_type_t::NODE_BRANCHED:
case bb_event_type_t::NODE_FATHOMED:
case bb_event_type_t::NODE_INFEASIBLE:
case bb_event_type_t::NODE_NUMERICAL: break;

So if you hit any of these events you break out of the loop? I'm not sure I understand what happens then. The idea is that you can't process further solutions until you handle one of these events?

}

// Merge integer solutions from BFS workers and update global incumbent
determinism_process_worker_solutions(*determinism_workers_,

I'm not sure I understand what's going on here. What is the lambda doing?

}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_prune_worker_nodes_vs_incumbent()

If I understand correctly, this is fathoming nodes based on whether their lower bound is greater than the upper bound?

Is it necessary to do this work? In the current branch and bound code, we don't go through and clean out the heap each time we find a better incumbent, as that would be quite expensive. Instead we just throw away nodes when we remove them from the heap.

Is this because you are trying to load balance the workers' queues? If so, is the size of each worker's queue significantly smaller than the global heap?
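A minimal sketch of the lazy pruning pattern referred to above (illustrative types, not the existing heap code):

```cpp
#include <queue>
#include <vector>

// Illustrative only: instead of sweeping the heap after every incumbent
// improvement, discard nodes lazily when they are popped.
struct node_t {
  double lower_bound;
};

struct worse_lower_bound {
  bool operator()(const node_t& a, const node_t& b) const
  {
    return a.lower_bound > b.lower_bound;  // makes the priority_queue a min-heap
  }
};

using node_heap_t = std::priority_queue<node_t, std::vector<node_t>, worse_lower_bound>;

inline void process_nodes(node_heap_t& heap, double upper_bound)
{
  while (!heap.empty()) {
    node_t node = heap.top();
    heap.pop();
    if (node.lower_bound >= upper_bound) { continue; }  // fathomed lazily at pop time
    // ... branch on the node ...
  }
}
```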


// Redistribute round-robin
std::vector<size_t> worker_order;
for (size_t w = 0; w < num_workers; ++w) {

Is this the same as:

std::vector<size_t> worker_order(num_workers);
std::iota(worker_order.begin(), worker_order.end(), 0);


// Distribute nodes
for (size_t i = 0; i < all_nodes.size(); ++i) {
size_t worker_idx = worker_order[i % num_workers];

I'm not sure I understand what is happening here. I think worker_order[w] = w. So is this the same as

size_t worker_idx = i % num_workers;

f_t branch_and_bound_t<i_t, f_t>::determinism_compute_lower_bound()
{
// Compute lower bound from BFS worker local structures only
const f_t inf = std::numeric_limits<f_t>::infinity();

Nit: inf is already defined in dual_simplex/types.hpp, so you can use it without recreating it from std::numeric_limits.

// Compute lower bound from BFS worker local structures only
const f_t inf = std::numeric_limits<f_t>::infinity();
f_t lower_bound = lower_bound_ceiling_.load();
if (!std::isfinite(lower_bound)) lower_bound = inf;

I'm not sure I understand what this is doing. If the lower bound is not finite, shouldn't it already be inf?

Are we ever setting the lower_bound_ceiling to -inf or NaN?

for (auto* node : worker.backlog.data()) {
if (node->lower_bound < upper_bound) {
f_t score = node->objective_estimate;
if (!std::isfinite(score)) { score = node->lower_bound; }

Nit: I'm struggling with the !isfinite logic; it's hard for me to read. Maybe you should define an isinf function? Or just use score >= inf to check for infinity.


// Skip workers that already have enough nodes
if ((int)worker.dive_queue_size() >= target_nodes_per_worker) {
worker_idx = (worker_idx + 1) % num_workers;

Nit: Maybe you could lift this out of the if statement and do it immediately after assigning worker. Since I think in all cases you increment the worker_idx?

break;
}
}
if (all_full) break;

You might add some comments here

if (all_full) break;  // break out of the assignment loop as all workers have more than their target nodes
continue;  // move on to the next worker, this worker has more than its target node. 

}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_collect_diving_solutions()

This also updates the pseudocosts. Would it be better to call it deterministic_collect_diving_solutions_and_update_pseudocosts? I know that is not a short name. :)

if (!feasible) {
worker.recompute_bounds_and_basis = true;
continue;
}

Do you need to add in the nnzs from bound strengthening to the work estimate?

worker.basic_list,
worker.nonbasic_list,
leaf_vstatus,
leaf_edge_norms);

Same as above: work not counted?


// Determinism heuristic solution queue - solutions received from GPU heuristics
// Stored with work unit timestamp for deterministic ordering
struct queued_heuristic_solution_t {

How is queued_heuristic_solution_t different than a queued_integer_solution_t?

namespace cuopt::linear_programming::dual_simplex {

template <typename i_t, typename f_t>
struct backlog_node_compare_t {

Ah ok. Here are the definitions of custom comparators.

double horizon_end{0.0};
work_limit_context_t work_context;

// Local snapshots of global state

Maybe it would be helpful to define a snapshot_t type that contains the pseudocosts and the incumbent.
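For instance, a hypothetical sketch of what such a type might bundle (field names illustrative):

```cpp
#include <vector>

// Hypothetical sketch of a per-horizon snapshot of shared state.
template <typename f_t>
struct snapshot_t {
  f_t upper_bound;                          // incumbent objective at horizon start
  std::vector<f_t> incumbent;               // incumbent assignment
  std::vector<f_t> pc_sum_up, pc_sum_down;  // pseudo-cost sums per variable
  std::vector<int> pc_num_up, pc_num_down;  // pseudo-cost update counts per variable
};
```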

{
const i_t n = this->leaf_problem.num_cols;
local_upper_bound = global_upper_bound;
pc_sum_up_snapshot.assign(pc_sum_up, pc_sum_up + n);

I'm not sure I understand what is happening here. What do pc_sum_up and pc_num_up contain? I thought these would be the pseudocost sums and counts for each variable. But you are assigning them based on a single value, so these must be something different.


Oh, these are pointers to arrays that you are passing in. Is there a reason to use raw pointers here?

static_cast<uint64_t>(static_cast<uint32_t>(creation_seq));
}

uint32_t compute_path_hash() const

Where is the path hash used?


if (work_unit_context) {
// TEMP;
work_unit_context->record_work((total_loads + total_stores) / 1e8);

Does record_work block and act as a sync point across the threads? If so, it's probably good to call that out in the name of the function. Maybe record_work_and_sync_threads?

#include <vector>

namespace cuopt {
struct work_limit_context_t;

Why an empty definition?

CONCURRENT_LIMIT = 6,
UNSET = 7
UNSET = 7,
WORK_LIMIT = 8

It would probably be better to put WORK_LIMIT next to the other limits, and make UNSET the last entry in the status_t

template <typename i_t, typename f_t, typename VectorI>
i_t initialize_degree_data(const csc_matrix_t<i_t, f_t>& A,
const std::vector<i_t>& column_list,
const VectorI& column_list,

Here and below, why only change the type of column_list and not the rest of the variables?

CONCURRENT_LIMIT = 7,
UNSET = 8
UNSET = 8,
WORK_LIMIT = 9

Nit: same here, better to place WORK_LIMIT near CONCURRENT_LIMIT and leave UNSET for last.

@chris-maes (Contributor)

There is quite a bit of vocabulary in this PR. I made a list of words/phrases as I went along. It might be useful to have some documentation that defines these words, preferably in the repository itself:

  • Policy
  • Callback
  • Worker Pool
  • Work Unit Scheduler
  • Scoped Context Registration
  • Broadcast Snapshots
  • Incumbent Snapshots
  • Producer
  • Determinism Sync Callback
  • Event / BB Event
  • Packed Id
  • Producer Sync
  • Instrumentation Aggregator
  • Worker (Limit) Context
  • Queued Integer Solutions
  • Worker Plunge Stack
  • Worker Backlog Heap
  • Creation Sequence
  • Pseudocost Update

@chris-maes (Contributor)

I made a first pass through the PR. Here are my initial comments. I'm sure as I spend more time with the code these will change.

First let me say, it is an impressive piece of work. It clearly took a lot of effort. And you had to rewrite large portions of the branch and bound and diving code. Figuring out how to synchronize branch and bound across multiple threads is no easy task. And thanks for incorporating the last minute changes from cuts and reliability branching.

From my perspective, because I didn't look at the CPUFJ changes, there are two main parts to the PR.

  1. The changes to branch and bound necessary to make it deterministic, given work estimates from dual simplex.
  2. The changes necessary to get work estimates from dual simplex.

For 1) I'm still not sure I understand the PR from a high level. For example, it seems like each BFS worker must sync with the other workers before pruning a node. But BFS workers also maintain their own heaps. Is there a global heap as well? If a sync is necessary before pruning, why maintain a separate heap? I'm also not clear why we want to prune nodes in the workers' heaps based on the incumbent. It seems like we could wait until the worker popped the node off the heap to prune it. But perhaps that isn't a concern because the worker heaps can't get large?

With any PR this big, I'd like to understand how it was verified and how the non-deterministic code changed, as well as how the deterministic path compares to the non-deterministic path. Do you have new results after merging cuts and reliability branching? In addition to num feasible and primal integral, MIP gap and num optimal would be good measures to understand the variations. Also, since it touched dual simplex, it would be good to do some runs on LPs to see if there are any changes (e.g. does the number of iterations change on any of the NETLIB problems after this PR).

For both 1 and 2, my biggest concern is that the increased complexity of the code will make it difficult to modify in the future.

For 1) I anticipate that we will make many changes to the branch and bound code in the next few months such as: branching on sums of variables, orbital branching/symmetry, restarts, reliability branching using batch PDLP etc. I'm worried that these types of changes will be difficult to incorporate into the current PR.

Some more concrete comments/suggestions on 1)

  • The pseudocost updating code is now spread out across many different files, functions, and classes. I understand that is because these updates need to be handled carefully in the deterministic code. But I worry that it will be difficult to change pseudocosts in the future with everything spread out. Perhaps, you could try to move all the pseudocost logic into pseudocosts.hpp/cpp and just reference these from other classes/structs.
  • The deterministic code path is separate from the non-deterministic code path. That makes sense now, but likely we should remove the non-deterministic code path in the future.
  • I found it difficult to follow the class hierarchies for bfs plunging / diving and deterministic versus non-deterministic. Again I think this is because they were separated across multiple classes/structs and files. Perhaps there is a way to consolidate them, so there is a single place a reader can look to understand the branch and bound logic.

For 2) I think your approach of an instrumented vector made it possible for you to get work estimates from dual simplex and get this code up and running. That is great. But I'm concerned about my ability to maintain the dual simplex code in this state in the long run. Right now, there is a mix of instrumented vector and std::vector in the code. I think it will be hard for me to modify functions or fix bugs in this state. Also, I know the compiler is smart and may be able to lift increments outside of a loop so they become O(1) work instead of O(n). But I worry about the overhead associated with logging every array access. I'd prefer to go through the dual simplex code and manually track work estimates and provide these to you. That would introduce minimal overhead and allow me to write new functions using the base types.
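A minimal sketch of that manual accounting, assuming the work model stays proportional to memory operations (the function and the 1e8 scaling are illustrative):

```cpp
// Illustrative only: charge a whole loop's worth of work once per call,
// using quantities the routine already knows, instead of instrumenting
// every element access.
template <typename i_t, typename f_t>
void sparse_axpy(i_t nnz, const i_t* idx, const f_t* vals, f_t alpha, f_t* y,
                 double& work_units)
{
  for (i_t k = 0; k < nnz; ++k) {
    y[idx[k]] += alpha * vals[k];
  }
  // One O(1) update per call rather than O(nnz) instrumented accesses;
  // the scaling factor mirrors the record_work example quoted above.
  work_units += static_cast<double>(nnz) / 1e8;
}
```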

Some of the dual simplex changes involved making variables persist across loop iterations. I've created many bugs with persistent state across iterations. So it would be good to double check that these changes do not result in any iteration differences across the NETLIB LPs or the MIPLIB root relaxations.

Anyway, impressive piece of work. We should discuss more in person on Monday. I will try to make another pass through the PR as well on Monday. But I wanted to give you my first pass thoughts.


Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
