Add exception handling for pdlp in concurrent mode#966
Add exception handling for pdlp in concurrent mode#966Iroy30 wants to merge 24 commits intoNVIDIA:release/26.04from
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds exception-capture and propagation scaffolding to the concurrent PDLP runner in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/pdlp/solve.cu (1)
1190-1192: TODO lacks issue tracking; commented-out code should be cleaned up.The TODO indicates a known intermittent issue but provides no way to track resolution. Additionally, commented-out code with
printfstatements should not remain in production code.Recommendations:
- Create a tracked issue and reference it in the comment (e.g.,
// TODO(#123): ...)- Either remove the commented-out rethrow entirely, or conditionally enable it behind a debug flag if needed for investigation
♻️ Suggested cleanup
- // TODO: Active Issue: PDLP throws an Exception interminttently. - // if (pdlp_exception) { printf("Rethrowing PDLP exception from concurrent mode\n"); - // std::rethrow_exception(pdlp_exception); } + // TODO(`#ISSUE_NUMBER`): PDLP throws exceptions intermittently in concurrent mode. + // Once root cause is fixed, rethrow: std::rethrow_exception(pdlp_exception);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/solve.cu` around lines 1190 - 1192, Remove the dead commented-out rethrow and replace the vague TODO with a tracked issue reference and optional debug guard: create a repository issue for the intermittent PDLP exception and update the comment to include the issue ID (e.g., "// TODO(#<issue>): Intermittent PDLP exception observed when running concurrent mode"), and either delete the commented block that references pdlp_exception/printf or move that logic behind a compile-time/debug flag (e.g., PDLP_DEBUG_RETHROW) so the rethrow of pdlp_exception can be enabled only for debugging; locate the commented code around pdlp_exception in solve.cu and apply the change accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1183: The catch block around run_pdlp currently stores the
exception in pdlp_exception but discards it, losing diagnostics; update the
catch in solve.cu to extract and surface the exception message (from
pdlp_exception or std::current_exception()) — e.g., convert to a string via
std::rethrow_exception / try-catch and log it with the existing logging
facilities and/or set a more informative termination status on sol_pdlp (instead
of only pdlp_termination_status_t::NumericalError) before setting
*settings_pdlp.concurrent_halt = 1; ensure run_pdlp, pdlp_exception, sol_pdlp,
and settings_pdlp.concurrent_halt are referenced so callers receive meaningful
error info.
---
Nitpick comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1190-1192: Remove the dead commented-out rethrow and replace the
vague TODO with a tracked issue reference and optional debug guard: create a
repository issue for the intermittent PDLP exception and update the comment to
include the issue ID (e.g., "// TODO(#<issue>): Intermittent PDLP exception
observed when running concurrent mode"), and either delete the commented block
that references pdlp_exception/printf or move that logic behind a
compile-time/debug flag (e.g., PDLP_DEBUG_RETHROW) so the rethrow of
pdlp_exception can be enabled only for debugging; locate the commented code
around pdlp_exception in solve.cu and apply the change accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 36088b3c-2460-43f0-af07-139729eea9b0
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
|
/ok to test 439bf38 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1190: The code captures exceptions into pdlp_exception in the
run_pdlp try/catch but never rethrows it after thread cleanup, causing swallowed
errors and returning sol_pdlp with pdlp_termination_status_t::NumericalError;
fix by rethrowing pdlp_exception after the concurrent threads are joined (i.e.,
after the thread-join/cleanup section that follows this block) so the caller
observes the original exception, while keeping the existing
settings_pdlp.concurrent_halt update and logging in the catch; alternatively, if
swallowing is intentional remove pdlp_exception and add a clarifying
comment—refer to pdlp_exception, run_pdlp, settings_pdlp.concurrent_halt, and
sol_pdlp when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7636b542-cc14-47d7-88fd-6753bae7ce52
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
| std::exception_ptr pdlp_exception; | ||
| optimization_problem_solution_t<i_t, f_t> sol_pdlp{pdlp_termination_status_t::NumericalError, | ||
| problem.handle_ptr->get_stream()}; | ||
| try { | ||
| sol_pdlp = run_pdlp(problem, settings_pdlp, timer, is_batch_mode); | ||
| } catch (...) { | ||
| pdlp_exception = std::current_exception(); | ||
| *settings_pdlp.concurrent_halt = 1; | ||
| try { | ||
| std::rethrow_exception(pdlp_exception); | ||
| } catch (const std::exception& e) { | ||
| CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()); | ||
| } catch (...) { | ||
| CUOPT_LOG_ERROR("PDLP unknown exception in concurrent mode"); | ||
| } | ||
| } |
There was a problem hiding this comment.
pdlp_exception is captured but never rethrown after thread cleanup.
The exception is captured in pdlp_exception and logged, but the variable is never used after line 1190. After threads join (lines 1193-1195), the function continues and returns sol_pdlp with NumericalError status without rethrowing.
If the intent (per PR description: "rethrowing exceptions") is to propagate the exception after ensuring threads are cleaned up, add a rethrow after the joins:
🛡️ Proposed fix to rethrow after thread cleanup
barrier_thread.join();
+
+ // Rethrow captured exception after threads are safely joined
+ if (pdlp_exception) {
+ std::rethrow_exception(pdlp_exception);
+ }
// copy the dual simplex solution to the deviceIf the current behavior (graceful degradation returning NumericalError status) is intentional, consider removing the unused pdlp_exception variable and directly logging within the catch block, or add a comment clarifying the exception is intentionally swallowed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/pdlp/solve.cu` around lines 1175 - 1190, The code captures exceptions
into pdlp_exception in the run_pdlp try/catch but never rethrows it after thread
cleanup, causing swallowed errors and returning sol_pdlp with
pdlp_termination_status_t::NumericalError; fix by rethrowing pdlp_exception
after the concurrent threads are joined (i.e., after the thread-join/cleanup
section that follows this block) so the caller observes the original exception,
while keeping the existing settings_pdlp.concurrent_halt update and logging in
the catch; alternatively, if swallowing is intentional remove pdlp_exception and
add a clarifying comment—refer to pdlp_exception, run_pdlp,
settings_pdlp.concurrent_halt, and sol_pdlp when making the change.
chris-maes
left a comment
There was a problem hiding this comment.
LGTM. Thanks @Iroy30 . Great job tracking this bug down and fixing it.
|
/ok to test 2903db8 |
Yes we are debugging some CI fails, this is not to be merged as is. Will be reverted. |
|
/ok to test d71bce2 |
|
This is a P0, right? @chris-maes for advise. |
…aCuopt into add_exception_handling_pdlp
|
/ok to test fd04d94 |
|
Let's rethrow the exception and remove swath1 from the incumbent test. |
|
@jameslamb can you approve now that CI changes are removed? |
|
@akifcorduk to try to reenable this test |
|
Sure, dismissed my blocking review and removed myself from the reviewers list. |
|
/ok to test b6ad396 |
|
/ok to test 0c32bb5 |
|
/ok to test e89578a |
Description
Fixes fails when pdlp throws exception in concurrent mode. Still needs evaluation on why pdlp fails.
Issue
Checklist