B&B with Local Heaps by nguidotti · Pull Request #1149 · NVIDIA/cuopt

nguidotti · 2026-04-27T11:22:35Z

In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across workers. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which dive on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread for useful work.
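As an illustration of the local-heap idea described above, here is a minimal sketch, not the code from this PR. The names `node_t`, `bfs_worker_t` and `steal_one` are hypothetical, and the sketch ignores the lower-bound bookkeeping that the real solver needs when a node leaves a heap.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node: only the lower bound matters for ordering here.
struct node_t {
  double lower_bound;
};

// Orders the heap so that the node with the smallest lower bound is on top.
struct worse_bound {
  bool operator()(const node_t& a, const node_t& b) const
  {
    return a.lower_bound > b.lower_bound;
  }
};

// Hypothetical best-first worker that owns a private heap.
// The mutex is only contended when another worker steals from this heap.
struct bfs_worker_t {
  std::mutex mutex;
  std::priority_queue<node_t, std::vector<node_t>, worse_bound> heap;

  void push(node_t node)
  {
    std::lock_guard<std::mutex> guard(mutex);
    heap.push(node);
  }

  // Pops the best node into `out`; returns false if the heap is empty.
  bool pop(node_t& out)
  {
    std::lock_guard<std::mutex> guard(mutex);
    if (heap.empty()) { return false; }
    out = heap.top();
    heap.pop();
    return true;
  }
};

// Moves the best node of a randomly chosen victim into the heap of `self`.
// Returns true if a node was transferred.
inline bool steal_one(std::vector<bfs_worker_t>& workers, std::size_t self, std::mt19937& rng)
{
  if (workers.size() < 2) { return false; }
  std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 2);
  std::size_t victim = pick(rng);
  if (victim >= self) { ++victim; }  // never pick ourselves
  node_t node;
  if (!workers[victim].pop(node)) { return false; }
  workers[self].push(node);
  return true;
}
```

Only one heap is locked at a time in this sketch, so two workers stealing from each other cannot deadlock.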

This PR also implements a compression scheme for vstatus that uses only 2 bits per entry, which reduces its memory consumption by roughly 4x (previously, each entry was an int8_t). Last but not least, this PR replaces std::deque with a fixed-capacity circular_deque_t for the plunge/dive stacks and the idle-worker list.
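To make the 2-bit packing concrete, here is a small sketch under stated assumptions: the class name `packed_status_t` is hypothetical, and it assumes every status value fits in the range 0..3. The actual layout used in this PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed status vector: four 2-bit entries per byte.
class packed_status_t {
 public:
  explicit packed_status_t(std::size_t n) : size_(n), bytes_((n + 3) / 4, 0) {}

  std::size_t size() const { return size_; }
  std::size_t storage_bytes() const { return bytes_.size(); }

  // Reads entry i (a value in 0..3).
  std::uint8_t get(std::size_t i) const
  {
    assert(i < size_);
    const unsigned shift = static_cast<unsigned>(i % 4) * 2;
    return static_cast<std::uint8_t>((bytes_[i / 4] >> shift) & 0x3u);
  }

  // Writes entry i; only the two low bits of `value` are kept.
  void set(std::size_t i, std::uint8_t value)
  {
    assert(i < size_);
    const unsigned shift = static_cast<unsigned>(i % 4) * 2;
    std::uint8_t& byte   = bytes_[i / 4];
    byte = static_cast<std::uint8_t>((byte & ~(0x3u << shift)) | ((value & 0x3u) << shift));
  }

 private:
  std::size_t size_;
  std::vector<std::uint8_t> bytes_;
};
```

With one byte per entry, 10 entries take 10 bytes; packed this way they take 3 bytes, which is where the roughly 4x saving comes from.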

MIPLIB results (GH200, 10min):

================================================================================
main (1, #1099) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  228                   +1                 --- |
| Optimal                                                   75                   78                   +3                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.866e+06            1.436e+07           +9.496e+06                +195 |
| Nodes explored (shifted geomean)                        6772            1.205e+04                +5275               +77.9 |
| Relative MIP gap (mean)                               0.3264               0.3415             +0.01506               +4.62 |
| Relative MIP gap (shifted geomean)                    0.1156               0.1131              -0.0025               -2.16 |
| Solve time (mean)                                      444.6                441.5               -3.054              -0.687 |
| Solve time (shifted geomean)                           221.5                219.1               -2.327               -1.05 |
| Primal gap (mean)                                      11.57                11.15              -0.4201               -3.63 |
| Primal gap (shifted geomean)                          0.6324               0.5604             -0.07203               -11.4 |
| Primal integral (mean)                                 32.63                33.02              +0.3805               +1.17 |
| Primal integral (shifted geomean)                      6.346                6.405             +0.05989              +0.944 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored about 3x as many nodes on average (mean: 4.866e+06 vs. 1.436e+07) in the same time frame. The number of optimal solutions also increased by 3.
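For readers unfamiliar with the metrics in the table, the sketch below shows how a shifted geometric mean and the relative difference column are commonly computed. The shift value used by the benchmark script is not stated in this PR, so the `shift` parameter is an assumption, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shifted geometric mean: exp(mean(log(x_i + shift))) - shift.
// The shift dampens the influence of very small values.
inline double shifted_geomean(const std::vector<double>& values, double shift)
{
  assert(!values.empty());
  double log_sum = 0.0;
  for (double v : values) {
    log_sum += std::log(v + shift);
  }
  return std::exp(log_sum / static_cast<double>(values.size())) - shift;
}

// Relative difference of run 2 with respect to run 1, in percent.
inline double relative_diff_percent(double run1, double run2)
{
  return 100.0 * (run2 - run1) / run1;
}
```

For example, `relative_diff_percent(4.866e6, 1.436e7)` gives roughly +195, matching the "Nodes explored (mean)" row.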

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly. - Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage - Replace set_current_device_resource(ptr) with set_current_device_resource_ref - Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref - Remove make_owning_wrapper usage - Remove dynamic_cast on memory resources (no common base class) - Remove owning_wrapper.hpp and device_memory_resource.hpp includes - Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)


nguidotti · 2026-05-05T09:32:09Z

/ok to test 2adec89


nguidotti · 2026-05-05T11:25:58Z

/ok to test bc3d50b

# Conflicts: # cpp/src/branch_and_bound/branch_and_bound.cpp # cpp/src/branch_and_bound/constants.hpp # cpp/src/branch_and_bound/diving_heuristics.cpp # cpp/src/branch_and_bound/mip_node.hpp # cpp/src/branch_and_bound/pseudo_costs.cpp # cpp/src/branch_and_bound/worker.hpp # cpp/src/branch_and_bound/worker_pool.hpp


nguidotti · 2026-05-11T11:53:48Z

/ok to test e1ddd8e


nguidotti · 2026-05-11T15:52:54Z

/ok to test 60de4ad


nguidotti · 2026-05-12T14:03:36Z

/ok to test 67f539c


nguidotti · 2026-05-12T14:54:45Z

/ok to test ef450ba

akifcorduk · 2026-05-12T14:41:50Z

+      f_t node_lower_bound = node ? node->lower_bound : std::numeric_limits<f_t>::infinity();
+      worker->lower_bound  = std::min(worker->lower_bound.load(), node_lower_bound);
+
+      worker->node_queue.unlock();


We can unlock this right after popping the best-first node, right?

Unfortunately, no. The lower bound is calculated from the heap and from the node being explored by a worker. Once we take a node from the heap, its lower bound is no longer considered when retrieving the lower bound from the heap. So, we need to store it in the worker's lower bound; otherwise it would temporarily vanish from the solver and could cause the solver to terminate early due to the gap convergence rule.

These two operations (popping a node from the heap and updating the lower bound of the worker) need to happen "simultaneously", i.e., within the same critical section.
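The constraint above can be sketched as follows. This is a simplified illustration rather than the PR's code: `worker_t`, `pop_best` and `active_bound` are hypothetical names, and the heap stores bare lower bounds instead of nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <limits>
#include <mutex>
#include <queue>
#include <vector>

constexpr double kInf = std::numeric_limits<double>::infinity();

// Hypothetical worker: a min-heap of lower bounds plus the bound of the
// node currently being explored. Both are guarded by the same mutex.
struct worker_t {
  std::mutex mutex;
  std::priority_queue<double, std::vector<double>, std::greater<double>> heap;
  double active_bound = kInf;

  // Pops the best bound and publishes it as the active bound in ONE
  // critical section, so an observer never sees the bound disappear.
  bool pop_best(double& out)
  {
    std::lock_guard<std::mutex> guard(mutex);
    if (heap.empty()) {
      active_bound = kInf;
      return false;
    }
    out = heap.top();
    heap.pop();
    active_bound = out;
    return true;
  }

  // Lower bound seen by an observer: the minimum over the heap top and
  // the node being explored.
  double lower_bound()
  {
    std::lock_guard<std::mutex> guard(mutex);
    const double heap_bound = heap.empty() ? kInf : heap.top();
    return std::min(heap_bound, active_bound);
  }
};
```

If the lock were released between the pop and the update of `active_bound`, a concurrent call to `lower_bound()` could report the next-best bound, which overestimates the true lower bound.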

akifcorduk · 2026-05-12T14:56:33Z

-                        max_num_workers_per_type[strategy]);
+  if (!start_node || upper_bound_.load() < start_node->lower_bound ||
+      start_node->depth < settings_.diving_settings.min_node_depth) {
+    bfs_worker->node_queue.unlock();


Can't we unlock the queue as soon as we pop from the queue?

No, we must keep the mutex locked here until we create a detached copy of the node (i.e., a complete copy of the node with no connection to the main tree) and retrieve the bounds from the tree. This prevents another worker from stealing that node and modifying its content (or even fathoming it) while the first one is still copying the information, which would cause a race condition.
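A rough sketch of the "copy while locked" pattern follows. The types `tree_node_t` and `detached_node_t` and the function `detach` are hypothetical, and the branching path stands in for whatever information the solver actually retrieves from the tree.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical tree node that still points into the shared search tree.
struct tree_node_t {
  const tree_node_t* parent;
  int branch_var;  // variable branched on to reach this node (-1 for the root)
  double lower_bound;
  int depth;
};

// Hypothetical detached copy: owns all of its data, no pointers into the tree.
struct detached_node_t {
  double lower_bound = 0.0;
  int depth          = 0;
  std::vector<int> branch_path;  // branching variables, leaf to root
};

// Builds the detached copy while holding the queue mutex, so no other
// worker can steal, modify or fathom `node` during the copy.
inline detached_node_t detach(std::mutex& queue_mutex, const tree_node_t* node)
{
  std::lock_guard<std::mutex> guard(queue_mutex);
  detached_node_t copy;
  copy.lower_bound = node->lower_bound;
  copy.depth       = node->depth;
  for (const tree_node_t* it = node; it != nullptr && it->parent != nullptr; it = it->parent) {
    copy.branch_path.push_back(it->branch_var);
  }
  return copy;
}
```

After `detach` returns, the diving worker can use the copy freely, because nothing in it refers back to the shared tree.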

akifcorduk · 2026-05-12T14:59:53Z

-      run_scheduler();
+      const i_t num_workers        = settings_.num_threads;
+      const i_t num_bfs_workers    = std::max(settings_.num_threads / 2, 1);
+      const i_t num_diving_workers = num_workers - num_bfs_workers;


Why don't we leave some threads for other tasks like CPUFJ and RINS?

OpenMP will do this automatically. More specifically, the best-first workers are long-lived and occupy half of the CPU cores, while the diving, reliability branching, RINS and CPUFJ tasks share the remaining cores. These tasks have a short (or medium) duration, which allows the runtime to switch between them and keep the cores busy.

I prioritize the RINS task based on our discussion in Slack
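The thread split from the diff above can be checked in isolation. The helper below reuses the arithmetic shown in the diff; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for the worker split.
struct worker_split_t {
  int num_bfs_workers;
  int num_diving_workers;
};

// Half of the threads (at least one) become long-lived best-first workers;
// the rest are left for diving and other short-lived tasks.
inline worker_split_t split_workers(int num_threads)
{
  assert(num_threads >= 1);
  const int num_bfs_workers = std::max(num_threads / 2, 1);
  return {num_bfs_workers, num_threads - num_bfs_workers};
}
```

Note the edge case: with a single thread there is one best-first worker and no dedicated diving worker.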

akifcorduk · 2026-05-12T15:07:05Z

 class branch_and_bound_worker_t {
 public:
-  const i_t worker_id;
+  using float_type = f_t;


Do we use float_type and int_type?

It is mostly used in the worker_pool_t so it can retrieve the i_t and f_t from the underlying worker.

This is due to the generic template parameter WorkerType in the worker_pool_t.
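The pattern can be illustrated with a minimal sketch. `example_worker_t` and `example_pool_t` are hypothetical stand-ins for the worker and pool classes in this PR; only the alias names `int_type` and `float_type` come from the discussion above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical worker that exposes its template parameters as member aliases.
template <typename i_t, typename f_t>
struct example_worker_t {
  using int_type   = i_t;
  using float_type = f_t;
  i_t id{};
  f_t lower_bound = std::numeric_limits<f_t>::infinity();
};

// Hypothetical pool that is generic over the worker type only and recovers
// i_t and f_t from the worker's aliases.
template <typename WorkerType>
class example_pool_t {
 public:
  using i_t = typename WorkerType::int_type;
  using f_t = typename WorkerType::float_type;

  explicit example_pool_t(i_t num_workers) : workers_(static_cast<std::size_t>(num_workers))
  {
    for (std::size_t k = 0; k < workers_.size(); ++k) {
      workers_[k].id = static_cast<i_t>(k);
    }
  }

  WorkerType& operator[](i_t k) { return workers_[static_cast<std::size_t>(k)]; }

  // Smallest lower bound over all workers in the pool.
  f_t min_lower_bound() const
  {
    f_t best = std::numeric_limits<f_t>::infinity();
    for (const WorkerType& w : workers_) {
      best = std::min(best, w.lower_bound);
    }
    return best;
  }

 private:
  std::vector<WorkerType> workers_;
};
```

Without the aliases, the pool would need `i_t` and `f_t` as extra template parameters that must be kept in sync with the worker type by hand.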

akifcorduk · 2026-05-12T15:12:03Z

+      return steal;
+    }
+
+    while (num_nodes > 0) {


To reduce the number of lock and unlock calls and minimize contention, can we do something like this?

std::vector<mip_node_t<i_t, f_t>*> stolen;
stolen.reserve(num_nodes);

other->node_queue.lock();
while (num_nodes > 0 && other->node_queue.best_first_queue_size() > num_nodes) {
  stolen.push_back(other->node_queue.pop_best_first());
  --num_nodes;
}
other->node_queue.unlock();

if (!stolen.empty()) {
  this->node_queue.lock();
  for (auto* node : stolen) {
    this->node_queue.push(node);
  }
  this->node_queue.unlock();
  steal = true;
}

Unfortunately, no. In terms of contention, I think the original version is better, since it allows two or more workers to pipeline the operations (one may be popping a node while another is pushing) and each operation is quite short. We also need to be careful about how the lower bound is handled here.


nguidotti · 2026-05-19T12:13:22Z

/ok to test d1f4c12

bdice and others added 30 commits April 3, 2026 13:51

split worker and worker pool in separated file. code cleanup.

e77dbc2

simplified logic for pseudo cost (and its snapshot) for the regular a…

62d0452

…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>

fixed compilation

a517f13

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

added missing header

f31599c

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

fixed guard against no incumbent when calling guided diving

202738f

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

addressing code rabbit comments. replaced AT in pseudo_costs_t with a…

4aed76c

… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

missing dereference

a5c111d

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

919e445

split best-first and diving worker into separated objects

76ce1bb

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increase the wheel size limit

c433e41

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed rng offset

52db538

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increasing wheel size limit for CUDA 12

3676432

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

first version of the B&B workers with local heaps

d2f6eb7

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

implemented a lock-free stack to track the idle workers. fix potentia…

6a39187

…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed lower bound calculation at end of the B&B. reverted to locking …

dec671c

…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

correctly handles the node in the stack when the solver stops if they…

1b3a282

… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added atomic in node queue to track size and lower bound without a lock.

e108a54

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

replaced std::deque with a circular buffer.

315aca6

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

536a692

# Conflicts: # cpp/src/utilities/cuda_helpers.cuh

Inline upstream memory resource variable in test fixture MR composition

31a6eab

Replace deprecated rmm::mr set_*_resource_ref calls with set_*_resource

f889d28

renamed method

3469026

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

8e8c794

# Conflicts: # ci/validate_wheel.sh

Merge branch 'main' into simplify-pseudocost

3e6aa83

merging with main branch

e0444c2

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed compilation

f3e863f

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

76c9ece

fixed small bugs

56bf9ed

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added cleanup routine for the diving heap

18e1e83

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 9 commits April 28, 2026 18:35

Merge branch 'main' into simplify-pseudocost

488ed8a

addresses reviewer's comments

14abf17

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

fce0415

benchmarking the impact of atomics and mutexes in the deterministic mode

8b27e1b

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

revert changes to calculate_pseudocost

8a088f7

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

break down long expression for batch pdlp

5e21d9a

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

eliminated bnb mode in pseudocost. replace types into runtime checks

ec4514f

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

simplified code to only contain the path with atomics

34b6a40

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'simplify-pseudocost' into bnb-local-heap

2adec89

# Conflicts: # cpp/src/branch_and_bound/branch_and_bound.cpp # cpp/src/branch_and_bound/constants.hpp # cpp/src/branch_and_bound/pseudo_costs.hpp # cpp/src/utilities/omp_helpers.hpp

fixed compilation

bc3d50b

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 2 commits May 11, 2026 13:30

merged main branch

e1ddd8e

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 2 commits May 11, 2026 17:24

fixed missing unexplored counter

e1b81c7

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed determinism mode using just a single thread.

60de4ad

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added taskwait before returning a bfs worker to the pool (so all divi…

67f539c

…ng tasks from that worker finishes when this happens). correctly handle time limit and unexplored counter after B&B. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

removed taskwait as it is not necessary

ef450ba

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

akifcorduk reviewed May 12, 2026


nguidotti added 6 commits May 18, 2026 10:49

fix lower bound calculation

4bf24f2

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fix type for min

aa2b7da

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added function for index range calculation. fixed max stack size for …

4460b3d

…plunge/dive. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

merged main

24a1b78

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into bnb-local-heap

8b3ecb6

set omp task priorities based on a fixed set of values. fix stack siz…

d1f4c12

…e of diving Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>


4 participants
