B&B with Local Heaps #1149

Open
nguidotti wants to merge 83 commits into NVIDIA:main from nguidotti:bnb-local-heap

@nguidotti nguidotti commented Apr 27, 2026

In this PR, each best-first worker has its own local node heap, so that it can push and pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across them. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which dive on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread to do useful work.
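As a rough illustration of the idea above (not the code in this PR), the sketch below gives each worker a mutex-guarded heap and lets a worker steal one node from a randomly chosen victim. The names `node_t`, `local_heap_t` and `steal_one` are hypothetical, and the sketch ignores the lower-bound bookkeeping that the real implementation has to do when a node leaves a heap.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node: only the lower bound matters for this sketch.
struct node_t {
  double lower_bound;
};

// Comparator that puts the node with the smallest lower bound on top (best-first).
struct worse_bound {
  bool operator()(const node_t& a, const node_t& b) const
  {
    return a.lower_bound > b.lower_bound;
  }
};

// One heap per best-first worker, guarded by its own mutex.
class local_heap_t {
 public:
  void push(node_t n)
  {
    std::lock_guard<std::mutex> guard(mutex_);
    heap_.push(n);
  }

  // Returns the best node, or std::nullopt if the heap is empty.
  std::optional<node_t> pop()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    if (heap_.empty()) { return std::nullopt; }
    node_t n = heap_.top();
    heap_.pop();
    return n;
  }

  std::size_t size() const
  {
    std::lock_guard<std::mutex> guard(mutex_);
    return heap_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::priority_queue<node_t, std::vector<node_t>, worse_bound> heap_;
};

// Steal one node from a randomly chosen victim (never from ourselves).
// Returns true if a node was moved into heaps[self].
inline bool steal_one(std::vector<local_heap_t>& heaps, std::size_t self, std::mt19937& rng)
{
  assert(self < heaps.size());
  if (heaps.size() < 2) { return false; }
  std::uniform_int_distribution<std::size_t> pick(0, heaps.size() - 2);
  std::size_t victim = pick(rng);
  if (victim >= self) { ++victim; }  // skip our own index
  std::optional<node_t> stolen = heaps[victim].pop();
  if (!stolen) { return false; }
  heaps[self].push(*stolen);
  return true;
}
```

Only one heap is locked at a time here, so the victim's critical section stays short. The thief always takes the victim's best node, which is a simplification chosen for the sketch.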

This PR also implements a compression scheme for vstatus that uses only 2 bits per entry, which reduces its memory consumption by roughly 4x (previously it used one int8_t per entry). Last but not least, this PR replaces std::deque with a fixed-capacity circular_deque_t for the plunge/dive stacks and the idle-worker list.
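To make the 2-bit packing concrete, here is a minimal sketch that stores four entries per byte. It assumes that at most four distinct status values are needed; the class name `packed_status_t` and its interface are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs values in the range [0, 3] using 2 bits each, i.e. 4 entries per byte.
class packed_status_t {
 public:
  explicit packed_status_t(std::size_t n) : size_(n), bytes_((n + 3) / 4, 0) {}

  std::uint8_t get(std::size_t i) const
  {
    assert(i < size_);
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    return static_cast<std::uint8_t>((bytes_[i / 4] >> shift) & 0x3u);
  }

  void set(std::size_t i, std::uint8_t value)
  {
    assert(i < size_ && value < 4);
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    std::uint8_t& byte   = bytes_[i / 4];
    // Clear the two bits of entry i, then write the new value into them.
    byte = static_cast<std::uint8_t>((byte & ~(0x3u << shift)) |
                                     (static_cast<unsigned>(value) << shift));
  }

  std::size_t size() const { return size_; }
  std::size_t bytes_used() const { return bytes_.size(); }

 private:
  std::size_t size_;
  std::vector<std::uint8_t> bytes_;
};
```

With this layout, 10 entries occupy 3 bytes instead of 10, which is where the roughly 4x reduction comes from.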

MIPLIB results (GH200, 10min):

================================================================================
main (1, #1099) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  228                   +1                 --- |
| Optimal                                                   75                   78                   +3                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.866e+06            1.436e+07           +9.496e+06                +195 |
| Nodes explored (shifted geomean)                        6772            1.205e+04                +5275               +77.9 |
| Relative MIP gap (mean)                               0.3264               0.3415             +0.01506               +4.62 |
| Relative MIP gap (shifted geomean)                    0.1156               0.1131              -0.0025               -2.16 |
| Solve time (mean)                                      444.6                441.5               -3.054              -0.687 |
| Solve time (shifted geomean)                           221.5                219.1               -2.327               -1.05 |
| Primal gap (mean)                                      11.57                11.15              -0.4201               -3.63 |
| Primal gap (shifted geomean)                          0.6324               0.5604             -0.07203               -11.4 |
| Primal integral (mean)                                 32.63                33.02              +0.3805               +1.17 |
| Primal integral (shifted geomean)                      6.346                6.405             +0.05989              +0.944 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored ~3x as many nodes on average (mean) in the same time frame. The number of instances solved to optimality also increased by 3 (from 75 to 78).
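For readers unfamiliar with the aggregates in the table, the sketch below shows how a shifted geometric mean and a relative difference are commonly computed. The shift used by the benchmark script is not stated in this PR, so the `shift` parameter is an assumption, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shifted geometric mean: exp(mean(log(x_i + shift))) - shift.
inline double shifted_geomean(const std::vector<double>& values, double shift)
{
  assert(!values.empty());
  double log_sum = 0.0;
  for (double v : values) {
    assert(v + shift > 0.0);
    log_sum += std::log(v + shift);
  }
  return std::exp(log_sum / static_cast<double>(values.size())) - shift;
}

// Relative difference of run 2 with respect to run 1, in percent.
inline double rel_diff_percent(double run1, double run2)
{
  return 100.0 * (run2 - run1) / run1;
}
```

For example, `rel_diff_percent(6772.0, 12050.0)` gives about 77.9, matching the "Nodes explored (shifted geomean)" row when the rounded table values are used.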

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

bdice and others added 30 commits April 3, 2026 13:51
Remove dependency on rmm::mr::device_memory_resource base class. Resources
now satisfy the cuda::mr::resource concept directly.

- Replace shared_ptr<device_memory_resource> with value types and
  cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include
  (no longer transitively included via CCCL)
…nd deterministic mode.

Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
… shared_ptr to avoid unnecessary copy.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/utilities/cuda_helpers.cuh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	ci/validate_wheel.sh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts:
#	cpp/src/branch_and_bound/branch_and_bound.cpp
#	cpp/src/branch_and_bound/constants.hpp
#	cpp/src/branch_and_bound/pseudo_costs.hpp
#	cpp/src/utilities/omp_helpers.hpp
@nguidotti
Contributor Author

/ok to test 2adec89

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test bc3d50b

nguidotti added 2 commits May 11, 2026 13:30
# Conflicts:
#	cpp/src/branch_and_bound/branch_and_bound.cpp
#	cpp/src/branch_and_bound/constants.hpp
#	cpp/src/branch_and_bound/diving_heuristics.cpp
#	cpp/src/branch_and_bound/mip_node.hpp
#	cpp/src/branch_and_bound/pseudo_costs.cpp
#	cpp/src/branch_and_bound/worker.hpp
#	cpp/src/branch_and_bound/worker_pool.hpp
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test e1ddd8e

nguidotti added 2 commits May 11, 2026 17:24
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test 60de4ad

…ng tasks from that worker finishes when this happens). correctly handle time limit and unexplored counter after B&B.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test 67f539c

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test ef450ba

f_t node_lower_bound = node ? node->lower_bound : std::numeric_limits<f_t>::infinity();
worker->lower_bound = std::min(worker->lower_bound.load(), node_lower_bound);

worker->node_queue.unlock();
Contributor

We can unlock this right after popping the best-first node, right?

Contributor Author

Unfortunately, no. The lower bound is calculated from the nodes in the heap and the node currently being explored by a worker. Once we take a node out of the heap, its lower bound is no longer considered when retrieving the lower bound from the heap. So we need to store it in the worker's lower bound; otherwise it would temporarily vanish from the solver and could cause the solver to terminate early due to the gap convergence rule.

Contributor Author

@nguidotti nguidotti May 12, 2026

These two operations (popping a node from the heap and updating the lower bound of the worker) need to happen "simultaneously", i.e., while the queue is still locked.
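The point of this thread can be illustrated with a small sketch, assuming a simplified worker that stores only lower bounds in its heap. The names `lb_worker_t`, `pop_and_publish` and `observed_lower_bound` are hypothetical. An observer that takes the same lock never sees a state in which the popped node's bound is missing from both the heap and the worker.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <limits>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical worker: the heap holds node lower bounds only (min-heap).
struct lb_worker_t {
  std::mutex queue_mutex;
  std::priority_queue<double, std::vector<double>, std::greater<double>> heap;
  // Lower bound of the node the worker is currently exploring (infinity if none).
  std::atomic<double> lower_bound{std::numeric_limits<double>::infinity()};
};

// Pop the best node and publish its bound in the same critical section.
inline double pop_and_publish(lb_worker_t& w)
{
  std::lock_guard<std::mutex> guard(w.queue_mutex);
  if (w.heap.empty()) { return std::numeric_limits<double>::infinity(); }
  const double node_lower_bound = w.heap.top();
  w.heap.pop();
  w.lower_bound.store(std::min(w.lower_bound.load(), node_lower_bound));
  return node_lower_bound;
}

// Lower bound seen by the rest of the solver: heap minimum or in-flight node.
inline double observed_lower_bound(lb_worker_t& w)
{
  std::lock_guard<std::mutex> guard(w.queue_mutex);
  const double heap_min =
    w.heap.empty() ? std::numeric_limits<double>::infinity() : w.heap.top();
  return std::min(heap_min, w.lower_bound.load());
}
```

If the unlock happened between the pop and the store, `observed_lower_bound` could briefly return the bound of the next node in the heap, which is the early-termination risk described above.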

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated
Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated
max_num_workers_per_type[strategy]);
if (!start_node || upper_bound_.load() < start_node->lower_bound ||
start_node->depth < settings_.diving_settings.min_node_depth) {
bfs_worker->node_queue.unlock();
Contributor

Can't we unlock the queue as soon as we pop from the queue?

Contributor Author

No, we must keep the mutex locked here until we have created a detached copy of the node (i.e., a complete copy of the node with no connection to the main tree) and retrieved the bounds from the tree. This prevents another worker from stealing that node and modifying its contents (or even fathoming it) while the first one is still copying the information, which would be a race condition.
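A minimal sketch of that pattern follows, assuming a tree in which each node records the single branching decision that created it. The types `tree_node_t` and `detached_node_t` and the function `detach` are hypothetical and much simpler than the real `mip_node_t`. The copy and the walk up the tree both happen while the queue mutex is held.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical tree node: stores only the branching decision that created it.
struct tree_node_t {
  const tree_node_t* parent;
  int branch_var;  // -1 at the root
  double branch_lower;
  double branch_upper;
  double lower_bound;
};

// Self-contained copy with no pointers back into the tree.
struct detached_node_t {
  double lower_bound;
  std::vector<double> var_lower;
  std::vector<double> var_upper;
};

// Build the detached copy while holding the queue mutex, so that no other
// worker can steal, modify or fathom the node during the copy.
inline detached_node_t detach(std::mutex& queue_mutex,
                              const tree_node_t& node,
                              const std::vector<double>& root_lower,
                              const std::vector<double>& root_upper)
{
  std::lock_guard<std::mutex> guard(queue_mutex);
  detached_node_t copy{node.lower_bound, root_lower, root_upper};
  // Walk up to the root and apply every branching decision on the path.
  for (const tree_node_t* n = &node; n != nullptr && n->branch_var >= 0; n = n->parent) {
    const std::size_t v = static_cast<std::size_t>(n->branch_var);
    assert(v < copy.var_lower.size() && v < copy.var_upper.size());
    copy.var_lower[v] = std::max(copy.var_lower[v], n->branch_lower);
    copy.var_upper[v] = std::min(copy.var_upper[v], n->branch_upper);
  }
  return copy;
}
```

After `detach` returns, the diving worker can operate on the copy without touching the tree, so the lock can be released safely at that point.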

run_scheduler();
const i_t num_workers = settings_.num_threads;
const i_t num_bfs_workers = std::max(settings_.num_threads / 2, 1);
const i_t num_diving_workers = num_workers - num_bfs_workers;
Contributor

Why don't we leave some threads for other tasks like CPUFJ and RINS?

Contributor Author

@nguidotti nguidotti May 19, 2026

This is handled automatically by OpenMP. More specifically, the best-first workers are long-lived and occupy half of the CPU cores, while the diving, reliability branching, RINS and CPUFJ tasks share the remaining cores. These tasks have a short (or medium) duration, which allows the runtime to switch between them and keep the cores busy.

Contributor Author

@nguidotti nguidotti May 19, 2026

I prioritize the RINS task based on our discussion in Slack

class branch_and_bound_worker_t {
public:
const i_t worker_id;
using float_type = f_t;
Contributor

Do we use float_type and int_type?

Contributor Author

It is mostly used in the worker_pool_t so it can retrieve the i_t and f_t from the underlying worker.

Contributor Author

This is due to the generic template parameter WorkerType in worker_pool_t.
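The pattern being described is the usual member-alias idiom, sketched below. A pool templated only on the worker type recovers the integer and float types from aliases exported by the worker. `typed_worker_t` and `typed_pool_t` are hypothetical stand-ins, not the actual `branch_and_bound_worker_t` and `worker_pool_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical worker that exports its template parameters as member aliases.
template <typename i_t, typename f_t>
struct typed_worker_t {
  using int_type   = i_t;
  using float_type = f_t;
  f_t lower_bound  = 0;
};

// Hypothetical pool: templated on the worker only, so it reads i_t and f_t
// back from the worker's aliases instead of taking them as extra parameters.
template <typename WorkerType>
class typed_pool_t {
 public:
  using i_t = typename WorkerType::int_type;
  using f_t = typename WorkerType::float_type;

  explicit typed_pool_t(i_t num_workers) : workers_(static_cast<std::size_t>(num_workers)) {}

  std::vector<WorkerType>& workers() { return workers_; }

  // Smallest lower bound over all workers in the pool.
  f_t min_lower_bound() const
  {
    assert(!workers_.empty());
    f_t best = workers_.front().lower_bound;
    for (const WorkerType& w : workers_) { best = std::min(best, w.lower_bound); }
    return best;
  }

 private:
  std::vector<WorkerType> workers_;
};
```

Without the aliases, the pool would need `i_t` and `f_t` as additional template parameters, or a partial specialization that pattern-matches on the worker template.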

return steal;
}

while (num_nodes > 0) {
Contributor

@akifcorduk akifcorduk May 12, 2026

To reduce the number of lock/unlock calls and minimize contention, can we do something like this?

std::vector<mip_node_t<i_t, f_t>*> stolen;
stolen.reserve(num_nodes);

other->node_queue.lock();

while (num_nodes > 0 &&
       other->node_queue.best_first_queue_size() > num_nodes) {
  stolen.push_back(other->node_queue.pop_best_first());
  --num_nodes;
}

other->node_queue.unlock();

if (!stolen.empty()) {
  this->node_queue.lock();

  for (auto* node : stolen) {
    this->node_queue.push(node);
  }

  this->node_queue.unlock();
  steal = true;
}

Contributor Author

Unfortunately, no. In terms of contention, I think the original version is better, since it allows two or more workers to pipeline the operations (one may be popping a node while another is pushing), and each operation is quite short. We also need to be careful about how the lower bound is handled here.

Comment thread cpp/src/branch_and_bound/worker.hpp Outdated
nguidotti added 6 commits May 18, 2026 10:49
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…plunge/dive.

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…e of diving

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
@nguidotti
Contributor Author

/ok to test d1f4c12

Labels

improvement, mip, non-breaking


4 participants