B&B CPU determinism #798

Open

aliceb-nv wants to merge 429 commits into NVIDIA:release/26.02 from aliceb-nv:determinism

Conversation

@aliceb-nv (Contributor) commented Jan 26, 2026

This PR introduces a deterministic execution mode for the parallel branch-and-bound MIP solver.
When determinism_mode = CUOPT_MODE_DETERMINISTIC, the solver guarantees bitwise-identical results across runs regardless of thread scheduling variations, as long as the runs use the same platform and environment (same glibc, etc.).

The chosen approach is referred to in the literature as Bulk Synchronous Parallel (BSP): an execution model in which computation proceeds in discrete horizons of virtual time (work units) separated by synchronization barriers.
To cope with the inherent nondeterminism of wall-clock time (caching state, OS scheduling, CPU throttling, ...), progress is measured not in time but in "work units" that approximate an algorithm's execution time from known, deterministic quantities such as memory operations, problem features, and sparsity.
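For illustration only (hypothetical type, not the actual API), a work-unit counter driven by deterministic operation counts might look like this; the 1e8 scaling mirrors the record_work snippet discussed further down, everything else is illustrative:

```cpp
#include <cstddef>

// Hypothetical sketch: progress is measured in work units derived from
// deterministic operation counts instead of wall-clock time.
struct work_unit_counter_t {
  double work_units = 0.0;

  // Charge work proportional to the number of memory operations performed;
  // the 1e8 scaling factor is illustrative.
  void record_memory_ops(std::size_t loads, std::size_t stores)
  {
    work_units += static_cast<double>(loads + stores) / 1e8;
  }
};
```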

Workers explore the tree independently and locally within a horizon, collecting events (branching decisions, solutions, pseudo-cost updates) along the way without exchanging them yet.
At each horizon boundary, events are sorted by work-unit timestamp with deterministic tie-breaking, the history is replayed in that order to update global state, and, if needed, nodes are redistributed across workers to balance load (this replaces the ramp-up mechanism).
Workers operate on local snapshots of shared state (pseudo-costs, upper bound, LP iteration count) taken at horizon start to avoid read races. This trades some accuracy in decisions (pruning may rely on a higher lower bound than in nondeterministic mode) for determinism.
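A minimal sketch of the deterministic merge at a horizon boundary (type and field names are illustrative, not the actual event structures):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative only: events carry a virtual-time stamp plus a
// scheduling-independent tie-breaker, so the replay order is reproducible.
struct event_t {
  double work_unit_timestamp;  // virtual time at which the event occurred
  std::uint64_t packed_id;     // e.g. worker id + per-worker sequence number
};

inline void sort_events_for_replay(std::vector<event_t>& events)
{
  std::sort(events.begin(), events.end(), [](const event_t& a, const event_t& b) {
    if (a.work_unit_timestamp != b.work_unit_timestamp) {
      return a.work_unit_timestamp < b.work_unit_timestamp;
    }
    return a.packed_id < b.packed_id;  // deterministic tie-break
  });
  // The sorted events would then be applied to global state (incumbent,
  // pseudo-costs, node ownership) in this fixed order.
}
```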

Start CPU heuristics are supported via a "producer" model: CPUFJ starts immediately and begins producing solutions tagged with work-unit timestamps, which are consumed by the B&B thread once the corresponding virtual time is reached. A form of one-way synchronization prevents the CPUFJ thread from "falling behind" and producing solutions in the past from the B&B thread's perspective. CPUFJ is allowed to run ahead, and is biased to run faster than B&B to avoid unnecessary syncs.
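As a rough sketch of that hand-off (hypothetical types; the actual queued_heuristic_solution_t differs):

```cpp
#include <utility>
#include <vector>

// Illustrative only: heuristic solutions are queued with a work-unit
// timestamp and handed to B&B once its virtual clock passes that timestamp,
// independent of how far ahead the CPUFJ thread is in wall-clock time.
struct queued_solution_t {
  double work_unit_timestamp;
  double objective;
  std::vector<double> assignment;
};

inline std::vector<queued_solution_t> take_due_solutions(std::vector<queued_solution_t>& queue,
                                                         double current_horizon)
{
  std::vector<queued_solution_t> due, future;
  for (auto& sol : queue) {
    (sol.work_unit_timestamp < current_horizon ? due : future).push_back(std::move(sol));
  }
  queue = std::move(future);
  return due;  // sorted and applied deterministically by the B&B thread
}
```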

B&B BFS workers, diving workers, and a start CPUFJ thread are supported in deterministic mode. GPU Heuristics are disabled in this mode and will be incorporated into the deterministic framework in the future.

Note: files in the utilities/models/ folder are auto-generated from the LightGBM models trained to predict work units from algorithm features.

The implementation is described in more detail in this document: B&B Determinism

Benchmark results

B&B alone, 10min

  • Nondeterministic B&B alone (default): 174 feasible, 38.3% primal gap
  • Deterministic B&B alone: 167 feasible, 40.3% primal gap
  • Deterministic B&B+CPUFJ: 184 feasible, 35.9% primal gap

Regression testing against the main branch:
With changes:
Baseline:

Summary by CodeRabbit

  • New Features

    • Added work-limit control, a deterministic mode, and a configurable seed; new CLI flags to set work-limit and enable determinism.
  • Infrastructure

    • Expanded profiling, memory and timing instrumentation; added work-unit synchronization and scheduling for coordinated runs; deterministic execution scaffolding.
  • Tests

    • New deterministic reproducibility tests verifying consistent termination, objectives, and solutions across repeated runs (a rough sketch of the idea is shown below).
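For illustration, the general shape of such a test (solve_deterministic is a hypothetical stand-in for the actual deterministic-mode solve, not the real test fixture):

```cpp
#include <cassert>
#include <vector>

struct result_t {
  double objective;
  std::vector<double> solution;
};

// Hypothetical stand-in for a full deterministic-mode solve of one instance.
result_t solve_deterministic(int seed);

inline void check_bitwise_reproducibility()
{
  const result_t a = solve_deterministic(/*seed=*/42);
  const result_t b = solve_deterministic(/*seed=*/42);
  assert(a.objective == b.objective);  // exact equality, not a tolerance check
  assert(a.solution == b.solution);    // element-wise equality of the assignment
}
```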

PoolT& pool, const std::vector<f_t>& incumbent_snapshot, double horizon_start, double horizon_end)
{
for (auto& worker : pool) {
worker.set_snapshots(upper_bound_.load(),

What does a snapshot consist of? The incumbent and the pseudocosts?


std::sort(to_repair.begin(),
to_repair.end(),
[](const std::vector<f_t>& a, const std::vector<f_t>& b) { return a < b; });

What is this sorting by? The address?

if (success) {
// Queue repaired solution with work unit timestamp (...workstamp?)
mutex_heuristic_queue_.lock();
heuristic_solution_queue_.push_back(

This tries to repair the solution and, if it can be repaired, puts it into a heuristic queue? Since all these are tagged with the same determinism_current_horizon_, is there a reason why they can't be processed now? That is, why do they need to go into a queue?

{
std::vector<queued_heuristic_solution_t> future_solutions;
for (auto& sol : heuristic_solution_queue_) {
if (sol.wut < determinism_current_horizon_) {

I keep forgetting what a wut is. It might be better to write this out as a work_unit_timestamp.

[](const queued_heuristic_solution_t& a, const queued_heuristic_solution_t& b) {
if (a.wut != b.wut) { return a.wut < b.wut; }
if (a.objective != b.objective) { return a.objective < b.objective; }
return a.solution < b.solution; // edge-case - lexicographical comparison

What comparison is being used here? Address of the solution?

case bb_event_type_t::NODE_BRANCHED:
case bb_event_type_t::NODE_FATHOMED:
case bb_event_type_t::NODE_INFEASIBLE:
case bb_event_type_t::NODE_NUMERICAL: break;

So if you hit any of these events you break out of the loop? I'm not sure I understand what happens then. The idea is that you can't process further solutions until you handle one of these events?

}

// Merge integer solutions from BFS workers and update global incumbent
determinism_process_worker_solutions(*determinism_workers_,

I'm not sure I understand what's going on here. What is the lambda doing?

}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_prune_worker_nodes_vs_incumbent()

If I understand correctly, this is fathoming nodes based on whether their lower bound is greater than the upper bound?

Is it necessary to do this work? In the current branch and bound code, we don't go through and clean out the heap each time we find a better incumbent, as that would be quite expensive. Instead we just throw away nodes when we remove them from the heap.

Is this because you are trying to load balance the workers' queues? If so, is the size of each worker's queue significantly smaller than the global heap?
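A minimal sketch of the lazy pruning pattern referred to above (illustrative types, not the existing heap code):

```cpp
#include <queue>
#include <vector>

// Illustrative only: instead of sweeping the heap after every incumbent
// improvement, discard nodes lazily when they are popped.
struct node_t {
  double lower_bound;
};

struct worse_lower_bound {
  bool operator()(const node_t& a, const node_t& b) const
  {
    return a.lower_bound > b.lower_bound;  // makes the priority_queue a min-heap
  }
};

using node_heap_t = std::priority_queue<node_t, std::vector<node_t>, worse_lower_bound>;

inline void process_nodes(node_heap_t& heap, double upper_bound)
{
  while (!heap.empty()) {
    node_t node = heap.top();
    heap.pop();
    if (node.lower_bound >= upper_bound) { continue; }  // fathomed lazily at pop time
    // ... branch on the node ...
  }
}
```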


// Redistribute round-robin
std::vector<size_t> worker_order;
for (size_t w = 0; w < num_workers; ++w) {

Is this the same as:

std::vector<size_t> worker_order(num_workers);
std::iota(worker_order.begin(), worker_order.end(), 0);


// Distribute nodes
for (size_t i = 0; i < all_nodes.size(); ++i) {
size_t worker_idx = worker_order[i % num_workers];

I'm not sure I understand what is happening here. I think worker_order[w] = w. So is this the same as

size_t worker_idx = i % num_workers;

f_t branch_and_bound_t<i_t, f_t>::determinism_compute_lower_bound()
{
// Compute lower bound from BFS worker local structures only
const f_t inf = std::numeric_limits<f_t>::infinity();

Nit: inf is already defined in dual_simplex/types.hpp, so you can use it without recreating it from std::numeric_limits.

// Compute lower bound from BFS worker local structures only
const f_t inf = std::numeric_limits<f_t>::infinity();
f_t lower_bound = lower_bound_ceiling_.load();
if (!std::isfinite(lower_bound)) lower_bound = inf;

I'm not sure I understand what this is doing. If the lower bound is not finite, shouldn't it already be inf?

Are we ever setting the lower_bound_ceiling to -inf or NaN?

for (auto* node : worker.backlog.data()) {
if (node->lower_bound < upper_bound) {
f_t score = node->objective_estimate;
if (!std::isfinite(score)) { score = node->lower_bound; }

Nit: I'm struggling with the !isfinite logic; it's hard for me to read. Maybe you should define an isinf function? Or just use score >= inf to check for infinity.


// Skip workers that already have enough nodes
if ((int)worker.dive_queue_size() >= target_nodes_per_worker) {
worker_idx = (worker_idx + 1) % num_workers;

Nit: Maybe you could lift this out of the if statement and do it immediately after assigning worker. Since I think in all cases you increment the worker_idx?

break;
}
}
if (all_full) break;

You might add some comments here

if (all_full) break;  // break out of the assignment loop as all workers have more than their target nodes
continue;  // move on to the next worker, this worker has more than its target node. 

}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_collect_diving_solutions()

This also updates the pseudocosts. Would it be better to call it deterministic_collect_diving_solutions_and_update_pseudocosts? I know that is not a short name. :)

if (!feasible) {
worker.recompute_bounds_and_basis = true;
continue;
}

Do you need to add in the nnzs from bound strengthening to the work estimate?

worker.basic_list,
worker.nonbasic_list,
leaf_vstatus,
leaf_edge_norms);

Same as above: work not counted?


// Determinism heuristic solution queue - solutions received from GPU heuristics
// Stored with work unit timestamp for deterministic ordering
struct queued_heuristic_solution_t {

How is queued_heuristic_solution_t different than a queued_integer_solution_t?

namespace cuopt::linear_programming::dual_simplex {

template <typename i_t, typename f_t>
struct backlog_node_compare_t {

Ah ok. Here are the definitions of custom comparators.

double horizon_end{0.0};
work_limit_context_t work_context;

// Local snapshots of global state

Maybe it would be helpful to define a snapshot_t type that contains the pseudocosts and the incumbent.
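For instance, a hypothetical sketch of what such a type might bundle (field names illustrative):

```cpp
#include <vector>

// Hypothetical sketch of a per-horizon snapshot of shared state.
template <typename f_t>
struct snapshot_t {
  f_t upper_bound;                          // incumbent objective at horizon start
  std::vector<f_t> incumbent;               // incumbent assignment
  std::vector<f_t> pc_sum_up, pc_sum_down;  // pseudo-cost sums per variable
  std::vector<int> pc_num_up, pc_num_down;  // pseudo-cost update counts per variable
};
```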

{
const i_t n = this->leaf_problem.num_cols;
local_upper_bound = global_upper_bound;
pc_sum_up_snapshot.assign(pc_sum_up, pc_sum_up + n);

I'm not sure I understand what is happening here. What do pc_sum_up and pc_num_up contain? I thought these would be the pseudocost sums and counts for each variable. But you are assigning them based on a single value, so these must be something different.


Oh, these are pointers to arrays that you are passing in. Is there a reason to use raw pointers here?

static_cast<uint64_t>(static_cast<uint32_t>(creation_seq));
}

uint32_t compute_path_hash() const

Where is the path hash used?


if (work_unit_context) {
// TEMP;
work_unit_context->record_work((total_loads + total_stores) / 1e8);

Does record_work block and act as a sync point across the threads? If so, it's probably good to call that out in the name of the function. Maybe record_work_and_sync_threads?

#include <vector>

namespace cuopt {
struct work_limit_context_t;

Why an empty definition?

CONCURRENT_LIMIT = 6,
UNSET = 7
UNSET = 7,
WORK_LIMIT = 8

It would probably be better to put WORK_LIMIT next to the other limits, and make UNSET the last entry in the status_t

template <typename i_t, typename f_t, typename VectorI>
i_t initialize_degree_data(const csc_matrix_t<i_t, f_t>& A,
const std::vector<i_t>& column_list,
const VectorI& column_list,

Here and below, why only change the type of column_list and not the rest of the variables?

CONCURRENT_LIMIT = 7,
UNSET = 8
UNSET = 8,
WORK_LIMIT = 9

Nit: same here, better to place WORK_LIMIT near CONCURRENT_LIMIT and leave UNSET for last.

@chris-maes (Contributor)

There is quite a bit of vocabulary in this PR. I made a list of words/phrases as I went along. It might be useful to have some documentation that defines these words, preferably in the repository itself:

  • Policy
  • Callback
  • Worker Pool
  • Work Unit Scheduler
  • Scoped Context Registration
  • Broadcast Snapshots
  • Incumbent Snapshots
  • Producer
  • Determinism Sync Callback
  • Event / BB Event
  • Packed Id
  • Producer Sync
  • Instrumentation Aggregator
  • Worker (Limit) Context
  • Queued Integer Solutions
  • Worker Plunge Stack
  • Worker Backlog Heap
  • Creation Sequence
  • Pseudocost Update

@chris-maes (Contributor)

I made a first pass through the PR. Here are my initial comments. I'm sure as I spend more time with the code these will change.

First let me say, it is an impressive piece of work. It clearly took a lot of effort. And you had to rewrite large portions of the branch and bound and diving code. Figuring out how to synchronize branch and bound across multiple threads is no easy task. And thanks for incorporating the last minute changes from cuts and reliability branching.

From my perspective, because I didn't look at the CPUFJ changes, there are two main parts to the PR.

  1. The changes to branch and bound necessary to make it deterministic, given work estimates from dual simplex.
  2. The changes necessary to get work estimates from dual simplex.

For 1) I'm still not sure I understand the PR from a high level. For example, it seems like each BFS worker must sync with the other workers before pruning a node. But BFS workers also maintain their own heaps. Is there a global heap as well? If a sync is necessary before pruning, why maintain a separate heap? I'm also not clear why we want to prune nodes in the workers' heaps based on the incumbent. It seems like we could wait until the worker popped the node off the heap to prune it. But perhaps that isn't a concern because the worker heaps can't get large?

With any PR this big, I'd like to understand how it was verified and how the non-deterministic code changed, as well as how the deterministic path compares to the non-deterministic path. Do you have new results after merging cuts and reliability branching? In addition to num feasible and primal integral, MIP gap and num optimal would be good measures to understand the variations. Also, since it touched dual simplex, it would be good to do some runs on LPs to see if there are any changes (e.g. does the number of iterations change on any of the NETLIB problems after this PR).

For both 1 and 2, my biggest concern is that the increased complexity of the code will make it difficult to modify in the future.

For 1) I anticipate that we will make many changes to the branch and bound code in the next few months such as: branching on sums of variables, orbital branching/symmetry, restarts, reliability branching using batch PDLP etc. I'm worried that these types of changes will be difficult to incorporate into the current PR.

Some more concrete comments/suggestions on 1)

  • The pseudocost updating code is now spread out across many different files, functions, and classes. I understand that is because these updates need to be handled carefully in the deterministic code. But I worry that it will be difficult to change pseudocosts in the future with everything spread out. Perhaps, you could try to move all the pseudocost logic into pseudocosts.hpp/cpp and just reference these from other classes/structs.
  • The deterministic code path is separate from the non-deterministic code path. That makes sense now, but likely we should remove the non-deterministic code path in the future.
  • I found it difficult to follow the class hierarchies for bfs plunging / diving and deterministic versus non-deterministic. Again I think this is because they were separated across multiple classes/structs and files. Perhaps there is a way to consolidate them, so there is a single place a reader can look to understand the branch and bound logic.

For 2) I think your approach of an instrumented vector made it possible for you to get work estimates from dual simplex and get this code up and running. That is great. But I'm concerned about my ability to maintain the dual simplex code in this state in the long run. Right now, there is a mix of instrumented vector and std::vector in the code. I think it will be hard for me to modify functions or fix bugs in this state. Also, I know the compiler is smart and may be able to lift increments outside of a loop so they become O(1) work instead of O(n). But I worry about the overhead associated with logging every array access. I'd prefer to go through the dual simplex code and manually track work estimates and provide these to you. That would introduce minimal overhead and allow me to write new functions using the base types.
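A minimal sketch of that manual accounting, assuming the work model stays proportional to memory operations (the function and the 1e8 scaling are illustrative):

```cpp
// Illustrative only: charge a whole loop's worth of work once per call,
// using quantities the routine already knows, instead of instrumenting
// every element access.
template <typename i_t, typename f_t>
void sparse_axpy(i_t nnz, const i_t* idx, const f_t* vals, f_t alpha, f_t* y,
                 double& work_units)
{
  for (i_t k = 0; k < nnz; ++k) {
    y[idx[k]] += alpha * vals[k];
  }
  // One O(1) update per call rather than O(nnz) instrumented accesses;
  // the scaling factor mirrors the record_work example quoted above.
  work_units += static_cast<double>(nnz) / 1e8;
}
```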

Some of the dual simplex changes involved making variables persist across loop iterations. I've created many bugs with persistent state across iterations. So it would be good to double check that these changes do not result in any iteration differences across the NETLIB LPs or the MIPLIB root relaxations.

Anyway, impressive piece of work. We should discuss more in person on Monday. I will try to make another pass through the PR as well on Monday. But I wanted to give you my first pass thoughts.


Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
