Conversation
```cpp
  PoolT& pool, const std::vector<f_t>& incumbent_snapshot, double horizon_start, double horizon_end)
{
  for (auto& worker : pool) {
    worker.set_snapshots(upper_bound_.load(),
```
What does a snapshot consist of? The incumbent and the pseudocosts?
```cpp
std::sort(to_repair.begin(),
          to_repair.end(),
          [](const std::vector<f_t>& a, const std::vector<f_t>& b) { return a < b; });
```
What is this sorting by? The address?
```cpp
if (success) {
  // Queue repaired solution with work unit timestamp (...workstamp?)
  mutex_heuristic_queue_.lock();
  heuristic_solution_queue_.push_back(
```
This tries to repair the solution and, if it can be repaired, puts it into a heuristic queue? Since all of these are tagged with the same `determinism_current_horizon_`, is there a reason why they can't be processed now? That is, why do they need to go into a queue?
```cpp
{
  std::vector<queued_heuristic_solution_t> future_solutions;
  for (auto& sol : heuristic_solution_queue_) {
    if (sol.wut < determinism_current_horizon_) {
```
I keep forgetting what a wut is. It might be better to write this out as a work_unit_timestamp.
```cpp
[](const queued_heuristic_solution_t& a, const queued_heuristic_solution_t& b) {
  if (a.wut != b.wut) { return a.wut < b.wut; }
  if (a.objective != b.objective) { return a.objective < b.objective; }
  return a.solution < b.solution; // edge-case - lexicographical comparison
```
What comparison is being used here? Address of the solution?
```cpp
case bb_event_type_t::NODE_BRANCHED:
case bb_event_type_t::NODE_FATHOMED:
case bb_event_type_t::NODE_INFEASIBLE:
case bb_event_type_t::NODE_NUMERICAL: break;
```
So if you hit any of these events you break out of the loop? I'm not sure I understand what happens then. The idea is that you can't process further solutions until you handle one of these events?
```cpp
}

// Merge integer solutions from BFS workers and update global incumbent
determinism_process_worker_solutions(*determinism_workers_,
```
I'm not sure I understand what's going on here. What is the lambda doing?
```cpp
}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_prune_worker_nodes_vs_incumbent()
```
If I understand correctly, this is fathoming nodes based on whether their lower bound is greater than the upper bound?
Is it necessary to do this work? In the current branch and bound code, we don't go through and clean out the heap each time we find a better incumbent, as that would be quite expensive. Instead we just throw away nodes when we remove them from the heap.
Is this because you are trying to load-balance the workers' queues? If so, is the size of each worker's queue significantly smaller than the global heap?
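For reference, this is roughly what the lazy approach looks like in my head — a minimal sketch only, with hypothetical names (`node_t`, `pop_next_unfathomed`), not the actual members:

```cpp
#include <algorithm>
#include <vector>

struct node_t {          // hypothetical node type, for illustration only
  double lower_bound;
  // ... branching data ...
};

// Lazy fathoming: rather than sweeping the heap every time the incumbent
// improves, stale nodes are discarded only when they are popped.
inline bool pop_next_unfathomed(std::vector<node_t>& heap,  // min-heap on lower_bound
                                double upper_bound,
                                node_t& out)
{
  auto cmp = [](const node_t& a, const node_t& b) { return a.lower_bound > b.lower_bound; };
  while (!heap.empty()) {
    std::pop_heap(heap.begin(), heap.end(), cmp);
    node_t node = heap.back();
    heap.pop_back();
    if (node.lower_bound >= upper_bound) { continue; }  // fathomed by bound: discard on pop
    out = node;
    return true;
  }
  return false;  // heap exhausted
}
```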
```cpp
// Redistribute round-robin
std::vector<size_t> worker_order;
for (size_t w = 0; w < num_workers; ++w) {
```
Is this the same as:
```cpp
std::vector<size_t> worker_order(num_workers);
std::iota(worker_order.begin(), worker_order.end(), 0);
```
```cpp
// Distribute nodes
for (size_t i = 0; i < all_nodes.size(); ++i) {
  size_t worker_idx = worker_order[i % num_workers];
```
I'm not sure I understand what is happening here. I think `worker_order[w] = w`, so is this the same as `size_t worker_idx = i % num_workers;`?
```cpp
f_t branch_and_bound_t<i_t, f_t>::determinism_compute_lower_bound()
{
  // Compute lower bound from BFS worker local structures only
  const f_t inf = std::numeric_limits<f_t>::infinity();
```
Nit: inf is already defined in dual_simplex/types.hpp. So you can use it without recreating from limits.
```cpp
// Compute lower bound from BFS worker local structures only
const f_t inf = std::numeric_limits<f_t>::infinity();
f_t lower_bound = lower_bound_ceiling_.load();
if (!std::isfinite(lower_bound)) lower_bound = inf;
```
I'm not sure I understand what this is doing. If the lower bound is not finite, shouldn't it already be inf?
Are we ever setting the lower_bound_ceiling to -inf or NaN?
```cpp
for (auto* node : worker.backlog.data()) {
  if (node->lower_bound < upper_bound) {
    f_t score = node->objective_estimate;
    if (!std::isfinite(score)) { score = node->lower_bound; }
```
Nit: I'm struggling with the `!isfinite` logic; it's hard for me to read. Maybe we should define an `isinf` helper? Or just use `score >= inf` to check for infinity.
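For example, a tiny helper along these lines (purely a sketch, the name is made up) spells out the same condition:

```cpp
#include <cmath>

// Equivalent to the original check, written out explicitly:
// !std::isfinite(x) is true exactly when x is +inf, -inf, or NaN.
template <typename f_t>
bool is_inf_or_nan(f_t x)
{
  return std::isinf(x) || std::isnan(x);
}

// Hypothetical usage in the loop above:
//   f_t score = node->objective_estimate;
//   if (is_inf_or_nan(score)) { score = node->lower_bound; }
```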
```cpp
// Skip workers that already have enough nodes
if ((int)worker.dive_queue_size() >= target_nodes_per_worker) {
  worker_idx = (worker_idx + 1) % num_workers;
```
Nit: Maybe you could lift this out of the if statement and do it immediately after assigning worker. Since I think in all cases you increment the worker_idx?
```cpp
      break;
    }
  }
  if (all_full) break;
```
You might add some comments here:
```cpp
if (all_full) break;  // break out of the assignment loop: all workers have more than their target number of nodes
continue;             // move on to the next worker; this worker already has more than its target number of nodes
```
```cpp
}

template <typename i_t, typename f_t>
void branch_and_bound_t<i_t, f_t>::determinism_collect_diving_solutions()
```
This also updates the pseudocosts. Would it be better to call it `determinism_collect_diving_solutions_and_update_pseudocosts`? I know that is not a short name. :)
```cpp
if (!feasible) {
  worker.recompute_bounds_and_basis = true;
  continue;
}
```
Do you need to add in the nnzs from bound strengthening to the work estimate?
```cpp
worker.basic_list,
worker.nonbasic_list,
leaf_vstatus,
leaf_edge_norms);
```
Same as above: work not counted?
```cpp
// Determinism heuristic solution queue - solutions received from GPU heuristics
// Stored with work unit timestamp for deterministic ordering
struct queued_heuristic_solution_t {
```
How is `queued_heuristic_solution_t` different from a `queued_integer_solution_t`?
```cpp
namespace cuopt::linear_programming::dual_simplex {

template <typename i_t, typename f_t>
struct backlog_node_compare_t {
```
Ah ok. Here are the definitions of custom comparators.
```cpp
double horizon_end{0.0};
work_limit_context_t work_context;

// Local snapshots of global state
```
Maybe it would be helpful to define a `snapshot_t` type that contains the pseudocosts and the incumbent.
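Something along these lines, purely as a sketch — the field names are guesses based on the members visible in this diff and the PR description, not the actual layout:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bundling of the per-horizon snapshot state (pc_* = pseudocosts).
template <typename f_t>
struct snapshot_t {
  f_t upper_bound;                  // incumbent objective at horizon start
  std::vector<f_t> incumbent;       // incumbent solution at horizon start
  std::vector<f_t> pc_sum_up;       // pseudocost sums, up branches
  std::vector<f_t> pc_sum_down;     // pseudocost sums, down branches
  std::vector<int32_t> pc_num_up;   // pseudocost counts, up branches
  std::vector<int32_t> pc_num_down; // pseudocost counts, down branches
};

// A worker could then take a single argument:
//   void set_snapshot(const snapshot_t<f_t>& snap);
```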
```cpp
{
  const i_t n = this->leaf_problem.num_cols;
  local_upper_bound = global_upper_bound;
  pc_sum_up_snapshot.assign(pc_sum_up, pc_sum_up + n);
```
I'm not sure I understand what is happening here. What do `pc_sum_up` and `pc_num_up` contain? I thought these would be the pseudocost sums and counts for each variable. But you are assigning them based on a single value, so these must be something different.
Oh, these are pointers to arrays you are passing in. Is there a reason to use raw pointers here?
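If C++20 is available in this part of the code (an assumption on my part), spans would carry the length along with the pointer; a rough sketch with hypothetical names:

```cpp
#include <span>
#include <vector>

// Hypothetical alternative to the raw-pointer parameters: the caller's arrays
// are passed as spans, so the size travels with the data and no copy is made
// until the snapshot itself is taken.
template <typename f_t>
void set_pseudocost_snapshot(std::span<const f_t> pc_sum_up,
                             std::span<const int> pc_num_up,
                             std::vector<f_t>& pc_sum_up_snapshot,
                             std::vector<int>& pc_num_up_snapshot)
{
  pc_sum_up_snapshot.assign(pc_sum_up.begin(), pc_sum_up.end());
  pc_num_up_snapshot.assign(pc_num_up.begin(), pc_num_up.end());
}
```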
```cpp
         static_cast<uint64_t>(static_cast<uint32_t>(creation_seq));
}

uint32_t compute_path_hash() const
```
Where is the path hash used?
```cpp
if (work_unit_context) {
  // TEMP;
  work_unit_context->record_work((total_loads + total_stores) / 1e8);
```
Does record_work block and act as a sync point across the threads? If so, it's probably good to call that out in the name of the function. Maybe record_work_and_sync_threads?
```cpp
#include <vector>

namespace cuopt {
struct work_limit_context_t;
```
Why an empty definition?
```diff
   CONCURRENT_LIMIT = 6,
-  UNSET = 7
+  UNSET = 7,
+  WORK_LIMIT = 8
```
It would probably be better to put WORK_LIMIT next to the other limits, and make UNSET the last entry in the status_t
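That is, something like the following sketch (assuming the enum layout shown above; values shifted accordingly, and the actual enum name/underlying type as in the existing header):

```cpp
enum class status_t : int {
  // ... existing entries 0-5 ...
  CONCURRENT_LIMIT = 6,
  WORK_LIMIT       = 7,  // new entry, grouped with the other limits
  UNSET            = 8   // keep UNSET as the last entry
};
```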
```diff
 template <typename i_t, typename f_t, typename VectorI>
 i_t initialize_degree_data(const csc_matrix_t<i_t, f_t>& A,
-                           const std::vector<i_t>& column_list,
+                           const VectorI& column_list,
```
Here and below, why only change the type of column_list and not the rest of the variables?
```diff
   CONCURRENT_LIMIT = 7,
-  UNSET = 8
+  UNSET = 8,
+  WORK_LIMIT = 9
```
Nit: same here, it would be better to place WORK_LIMIT near CONCURRENT_LIMIT and leave UNSET for last.
There is quite a bit of vocabulary in this PR. I made a list of words/phrases as I went along. It might be useful to have some documentation that defines these words, preferably in the repository itself:
I made a first pass through the PR. Here are my initial comments; I'm sure these will change as I spend more time with the code. First, let me say it is an impressive piece of work. It clearly took a lot of effort, and you had to rewrite large portions of the branch and bound and diving code. Figuring out how to synchronize branch and bound across multiple threads is no easy task. And thanks for incorporating the last-minute changes from cuts and reliability branching. From my perspective (I didn't look at the CPUFJ changes), there are two main parts to the PR: 1) the deterministic branch-and-bound framework itself, and 2) the work-estimate instrumentation of the dual simplex code.
For 1), I'm still not sure I understand the PR at a high level. For example, it seems like each BFS worker must sync with the other workers before pruning a node, but BFS workers also maintain their own heaps. Is there a global heap as well? If a sync is necessary before pruning, why maintain a separate heap? I'm also not clear on why we want to prune nodes in the workers' heaps based on the incumbent. It seems like we could wait until the worker pops a node off the heap to prune it. But perhaps that isn't a concern because the worker heaps can't get large?

With any PR this big, I'd like to understand how it was verified, how the non-deterministic code changed, and how the deterministic path compares to the non-deterministic path. Do you have new results after merging cuts and reliability branching? In addition to num feasible and primal integral, MIP gap and num optimal would be good measures to understand the variations. Also, since it touched dual simplex, it would be good to do some runs on LPs to see if there are any changes (e.g. does the number of iterations change on any of the NETLIB problems after this PR).

For both 1) and 2), my biggest concern is that the increased complexity of the code will make it difficult to modify in the future. For 1), I anticipate that we will make many changes to the branch and bound code in the next few months, such as branching on sums of variables, orbital branching/symmetry, restarts, reliability branching using batch PDLP, etc. I'm worried that these types of changes will be difficult to incorporate into the current PR. Some more concrete comments/suggestions on 1):
For 2), I think your approach of an instrumented vector made it possible for you to get work estimates from dual simplex and get this code up and running. That is great. But I'm concerned about my ability to maintain the dual simplex code in this state in the long run. Right now, there is a mix of instrumented vector and std::vector in the code, and I think it will be hard for me to modify functions or fix bugs in this state. Also, I know the compiler is smart and may be able to lift increments outside of a loop so they become O(1) work instead of O(n), but I worry about the overhead associated with logging every array access. I'd prefer to go through the dual simplex code and manually track work estimates and provide these to you. That would introduce minimal overhead and allow me to write new functions using the base types.

Some of the dual simplex changes involved making variables persist across loop iterations. I've created many bugs with persistent state across iterations, so it would be good to double-check that these changes do not result in any iteration differences across the NETLIB LPs or the MIPLIB root relaxations.

Anyway, impressive piece of work. We should discuss more in person on Monday. I will try to make another pass through the PR as well on Monday, but I wanted to give you my first-pass thoughts.
This PR introduces a deterministic execution mode for the parallel branch-and-bound MIP solver.
When `determinism_mode = CUOPT_MODE_DETERMINISTIC`, the solver guarantees bitwise-identical results across runs regardless of thread scheduling variations, as long as the runs are on the same platform with the same environment (same glibc, etc.).

The chosen approach is referred to as Bulk Synchronous Parallel (BSP) in the literature: an execution model where computation proceeds in discrete horizons of virtual time (work units) separated by synchronization barriers.
To cope with the inherent nondeterminism of wall-clock time (due to factors such as cache state, OS scheduling, and CPU throttling), progress is measured in "work units" instead of time: an approximation of an algorithm's execution time built from known, deterministic quantities such as memory operations, problem features, and sparsity.
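To give a flavor of the idea, here is a minimal sketch based on the `record_work` call visible in the diff above; the struct name and normalization constant are made up for illustration, and the actual estimates come from learned models (see the note about `utilities/models/` below):

```cpp
#include <cstdint>

// Hypothetical deterministic work accounting: work is derived from counted
// quantities (e.g. memory traffic) rather than wall-clock time, so identical
// computations are charged identical amounts of work on every run.
struct work_counter_t {
  double work_units = 0.0;

  void record_memory_ops(uint64_t loads, uint64_t stores)
  {
    // Same normalization style as record_work((total_loads + total_stores) / 1e8)
    // in the diff above.
    work_units += static_cast<double>(loads + stores) / 1e8;
  }

  bool horizon_reached(double horizon_end) const { return work_units >= horizon_end; }
};
```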
Workers explore the tree independently and locally within a horizon, collecting events (branching decisions, solutions, pseudo-cost updates) along the way without exchanging them yet.
At each horizon boundary, events are sorted by work-unit timestamp with tie-breaking, history is replayed in deterministic order to update global state, and if need be nodes are redistributed across workers to balance load (which replaces the ramp-up mechanism).
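The tie-breaking shown earlier in the diff gives a flavor of how this ordering is made total. A condensed sketch (the struct is a simplified stand-in for `queued_heuristic_solution_t`; field names follow the diff):

```cpp
#include <algorithm>
#include <vector>

template <typename f_t>
struct queued_solution_t {   // simplified stand-in, for illustration only
  double wut;                // work-unit timestamp
  f_t objective;
  std::vector<f_t> solution;
};

// Sort key: work-unit timestamp, then objective, then a lexicographic
// comparison of the solution vector. The resulting order is total and
// independent of which thread produced each solution first.
template <typename f_t>
void sort_deterministically(std::vector<queued_solution_t<f_t>>& queue)
{
  std::sort(queue.begin(), queue.end(),
            [](const queued_solution_t<f_t>& a, const queued_solution_t<f_t>& b) {
              if (a.wut != b.wut) { return a.wut < b.wut; }
              if (a.objective != b.objective) { return a.objective < b.objective; }
              return a.solution < b.solution;  // lexicographic tie-break
            });
}
```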
Workers operate on local snapshots of shared state (pseudo-costs, upper bound, LP iteration count) taken at horizon start to avoid read races. This trades some accuracy in decisions (pruning may rely on a higher lower bound than in nondeterministic mode) for determinism.
Support for start CPU heuristics is present, based on a "producer" model: CPUFJ starts immediately and begins producing solutions tagged with work-unit timestamps, which are consumed by the B&B thread once the appropriate time is reached. A form of one-way synchronization is implemented to prevent the CPUFJ thread from "falling behind" and producing solutions in the past from the perspective of the B&B thread. CPUFJ is allowed to run ahead, and is biased to run faster than B&B to avoid unnecessary syncs.
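The consumer side of this can be pictured roughly as follows; the gating mirrors the `sol.wut < determinism_current_horizon_` check in the diff above, while the function itself is only a sketch:

```cpp
#include <utility>
#include <vector>

// The B&B thread takes only the queued heuristic solutions whose work-unit
// timestamp (wut) falls before the current horizon; later solutions stay
// queued for a future horizon.
template <typename Solution>
std::vector<Solution> take_due_solutions(std::vector<Solution>& queue, double current_horizon)
{
  std::vector<Solution> due, future;
  for (auto& sol : queue) {
    (sol.wut < current_horizon ? due : future).push_back(sol);
  }
  queue = std::move(future);  // keep only the not-yet-due solutions
  return due;                 // processed now, in deterministic order after sorting
}
```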
B&B BFS workers, diving workers, and a start CPUFJ thread are supported in deterministic mode. GPU Heuristics are disabled in this mode and will be incorporated into the deterministic framework in the future.
Note: files in the `utilities/models/` folders are auto-generated from the LightGBM models trained to predict work units from algorithm features.

The implementation is described in more detail in this document: B&B Determinism
Benchmark results
B&B alone, 10min
Regression testing against main branch:
With changes:
Baseline: