
Main merge release/26.04 1 #1005

Merged
rgsl888prabhu merged 4 commits into NVIDIA:main from rgsl888prabhu:main-merge-release/26.04_1
Mar 31, 2026

Conversation

@rgsl888prabhu
Collaborator

For #1002

tmckayus and others added 3 commits March 31, 2026 01:23
This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuopt server.

Two documents are included, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, which act as developer guides.

The server is built by build.sh and is called cuopt_grpc_server.

Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.

To try this feature, check out the branch/PR and do the following:

$ build.sh
$ cuopt_grpc_server  # defaults are fine; run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps

All cuopt APIs pick up remote execution when the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server, provided the server is running. Just use cuopt tools and APIs as you normally would for a local solve.
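As an illustration of the env-var mechanism described above, a minimal sketch of how an application-side check might look. The `remote_config` struct and `read_remote_config` helper are hypothetical stand-ins, not the actual cuopt internals; only the variable names CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT come from the PR.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Illustrative stand-in for whatever config object the library uses.
struct remote_config {
  std::string host;
  int port;
};

// Returns a config only when both CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT
// are present and the port parses as a positive integer; otherwise the
// caller falls through to a local solve.
std::optional<remote_config> read_remote_config()
{
  const char* host = std::getenv("CUOPT_REMOTE_HOST");
  const char* port = std::getenv("CUOPT_REMOTE_PORT");
  if (host == nullptr || port == nullptr) { return std::nullopt; }
  const int port_num = std::atoi(port);
  if (port_num <= 0) { return std::nullopt; }
  return remote_config{host, port_num};
}
```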

Authors:
  - Trevor McKay (https://github.com/tmckayus)

Approvers:
  - Rajesh Gandham (https://github.com/rg20)
  - Ishika Roy (https://github.com/Iroy30)
  - Bradley Dice (https://github.com/bdice)

URL: NVIDIA#939
@rgsl888prabhu rgsl888prabhu requested review from a team as code owners March 31, 2026 15:18
@rgsl888prabhu rgsl888prabhu self-assigned this Mar 31, 2026
@rgsl888prabhu rgsl888prabhu added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels Mar 31, 2026
@coderabbitai

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fee6f156-f6a6-4338-884f-44c1dacaafce

📥 Commits

Reviewing files that changed from the base of the PR and between 335e936 and 3b2156c.

📒 Files selected for processing (1)
  • ci/validate_wheel.sh
✅ Files skipped from review due to trivial changes (1)
  • ci/validate_wheel.sh

📝 Walkthrough

Walkthrough

Adds a full gRPC remote-execution system: protobuf service/messages, client library, multi-process gRPC server with shared-memory and pipe IPC, chunked array transfer protocol, TLS/mTLS options, build/test integration, and extensive mapping/serialization helpers for LP/MIP solves.

Changes

Cohort / File(s) Summary
Documentation
GRPC_INTERFACE.md, GRPC_QUICK_START.md, GRPC_SERVER_ARCHITECTURE.md
New docs describing gRPC service contract, quick-start/TLS setup, chunked transfer protocol, client config, and server architecture.
Protobuf / Service Contracts
cpp/src/grpc/cuopt_remote.proto, cpp/src/grpc/cuopt_remote_service.proto
New proto schemas and the CuOptRemoteService gRPC service defining job lifecycle RPCs, chunked upload/download messages, streaming logs/incumbents, and result types.
Client Library
cpp/src/grpc/client/grpc_client.hpp, cpp/src/grpc/client/grpc_client.cpp, cpp/src/grpc/client/solve_remote.cpp
New typed gRPC client (grpc_client_t) and remote solve entrypoints with TLS, connect/poll/wait/cancel/delete APIs, log/incumbent streaming, and unary↔chunked transfer selection.
Problem & Solution Mappers
cpp/src/grpc/grpc_problem_mapper.*, cpp/src/grpc/grpc_solution_mapper.*, cpp/src/grpc/grpc_settings_mapper.*, cpp/src/grpc/grpc_service_mapper.*
Bidirectional mapping between cpu problem/solution/settings and protobuf messages; size estimation and chunked-header/array builders; submit-request builders.
Server Implementation
cpp/src/grpc/server/grpc_server_main.cpp, cpp/src/grpc/server/grpc_service_impl.cpp, cpp/src/grpc/server/grpc_server_types.hpp, cpp/src/grpc/server/grpc_server_threads.cpp, cpp/src/grpc/server/grpc_job_management.cpp, cpp/src/grpc/server/grpc_worker*.cpp, cpp/src/grpc/server/grpc_worker_infra.cpp
New multi-process gRPC server entrypoint, service implementation, shared-memory job/result queues, worker processes, worker monitor/result/incumbent threads, job lifecycle and chunked session management.
Pipe I/O & Serialization
cpp/src/grpc/server/grpc_pipe_io.cpp, cpp/src/grpc/server/grpc_pipe_serialization.hpp, cpp/src/grpc/server/grpc_field_element_size.hpp, cpp/src/grpc/server/grpc_incumbent_proto.hpp
Length-prefixed protobuf pipe I/O, chunked-request/result serialization/deserialization, array element-size lookup, incumbent proto helpers, and safety/limit checks.
Server Logging
cpp/src/grpc/server/grpc_server_logger.{cpp,hpp}
Server-side logger singleton and macros for server logging configuration.
Build / Packaging
cpp/CMakeLists.txt, build.sh, ci/build_wheel_libcuopt.sh, ci/utils/install_protobuf_grpc.sh, python/libcuopt/CMakeLists.txt, conda/environments/*yaml, conda/recipes/libcuopt/recipe.yaml, dependencies.yaml
CMake target cuopt_grpc_server, proto generation integration, build-script target handling, CI install script for protobuf/grpc, conda dependency additions, and packaging adjustments.
Tests & Test Infra
cpp/tests/linear_programming/grpc/*, cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp, python/.../test_cpu_only_execution.py
New extensive gRPC unit/integration tests (mock stubs, pipe serialization, integration server tests), test helpers for injecting mock stubs and capturing logs, and Python/C API test fixtures to spawn/drive the server (TLS/mTLS coverage).
Solver Integration Changes
cpp/src/pdlp/*, cpp/src/mip_heuristics/*, cpp/include/cuopt/linear_programming/solve_remote.hpp, cpp/cuopt_cli.cpp
Removed old remote stubs, updated remote solve signatures and call sites to use new gRPC client, added targeted exception handling and small correctness/sync fixes.
Miscellaneous
cpp/tests/linear_programming/CMakeLists.txt, cpp/src/pdlp/CMakeLists.txt, ci/validate_wheel.sh
Test CMake wiring for grpc tests, small CMake source-list updates, and pydist size threshold tweak.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • Remote execution implementation #939: Implements the same gRPC remote-execution feature set — protos, client/server, worker orchestration, chunked transfers, TLS, and tests (strong code-level overlap).

Suggested reviewers

  • bdice
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/mip_heuristics/solve.cu (1)

442-453: ⚠️ Potential issue | 🟠 Major

Keep remote MIP routing backend-agnostic.

This turns remote execution into a CPU-only path. If a caller holds a GPU-backed optimization_problem_t and CUOPT_REMOTE_* is set, the solve now hard-fails here instead of being serialized and sent to the remote worker.

Based on learnings, in cpp/src/linear_programming/solve.cu, remote vs. local solve routing intentionally checks only is_remote_execution_enabled() without checking backend type so GPU-resident problems can still execute remotely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/solve.cu` around lines 442 - 453, The remote-execution
branch currently forces a CPU-only cast (dynamic_cast to
cpu_optimization_problem_t) which makes remote solves fail for GPU-backed
problems; remove the cpu-specific cast in the is_remote_execution_enabled()
branch and call solve_mip_remote using the generic problem_interface (or an
overload that accepts the base optimization_problem_t<i_t,f_t>) so remote
routing stays backend-agnostic; keep the CPU-only cast only for the local path
where solve_mip(*cpu_prob, settings) is needed and ensure solve_mip_remote's
signature or overload supports serializing/handling GPU-resident problems from
the base type rather than requiring cpu_optimization_problem_t.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

417-421: ⚠️ Potential issue | 🟡 Minor

Fail this test when no warm-start payload is produced.

If get_pdlp_warm_start_data() returns None, the test exits successfully without exercising the round-trip at all. That hides regressions in remote warm-start serialization.

Suggested change
     sol1 = linear_programming.Solve(dm, settings)
     ws = sol1.get_pdlp_warm_start_data()
+    assert ws is not None, "expected PDLP warm-start data from the first solve"
 
-    if ws is not None:
-        settings.set_pdlp_warm_start_data(ws)
-        settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
-        sol2 = linear_programming.Solve(dm, settings)
-        assert sol2.get_primal_solution() is not None
+    settings.set_pdlp_warm_start_data(ws)
+    settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
+    sol2 = linear_programming.Solve(dm, settings)
+    assert sol2.get_primal_solution() is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 417 - 421, The test currently skips the warm-start round-trip when
ws is None; update the test to fail if get_pdlp_warm_start_data() returns None
so regressions are caught: after calling settings.get_pdlp_warm_start_data() (or
wherever ws is obtained), assert ws is not None (or raise an AssertionError)
with a clear message before calling settings.set_pdlp_warm_start_data(ws) and
Solve(dm, settings); keep the existing calls to set_pdlp_warm_start_data,
set_parameter(CUOPT_ITERATION_LIMIT, 200), and linear_programming.Solve to
exercise the full round-trip and then assert sol2.get_primal_solution() is not
None.
🟡 Minor comments (5)
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp-76-86 (1)

76-86: ⚠️ Potential issue | 🟡 Minor

Reset the server log marker when the file can't be opened.

clear() leaves server_log_start_pos_ untouched on open failure. If the log file was rotated/recreated between tests, later reads seek to a stale offset and miss current-test output.

Suggested fix
   void clear()
   {
     std::lock_guard<std::mutex> lock(mutex_);
     client_logs_.clear();
 
     if (!server_log_path_.empty()) {
       std::ifstream file(server_log_path_, std::ios::ate);
-      if (file.is_open()) { server_log_start_pos_ = file.tellg(); }
+      if (file.is_open()) {
+        server_log_start_pos_ = file.tellg();
+      } else {
+        server_log_start_pos_ = 0;
+      }
+    } else {
+      server_log_start_pos_ = 0;
     }
     test_start_marked_ = true;
   }

Based on learnings, **/*test*.{cpp,cu,py} should ensure test isolation so state from previous tests cannot leak into the current one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 76
- 86, The clear() method leaves server_log_start_pos_ unchanged when opening
server_log_path_ fails, which can leave a stale offset; update clear() so that
if server_log_path_ is non-empty but the file fails to open you reset
server_log_start_pos_ to the beginning (e.g., streampos(0) or 0) to avoid
seeking to a stale offset on later reads; locate the logic in clear(), reference
server_log_path_, server_log_start_pos_, and test_start_marked_, and ensure
test_start_marked_ behavior remains the same after the reset.
GRPC_QUICK_START.md-3-8 (1)

3-8: ⚠️ Potential issue | 🟡 Minor

This guide promises the C API, but the example here is C++.

The sample includes C++ headers and calls solve_lp, so C users still do not get a documented path through cuopt_c.h. Either rename these sections to “C++ API” or add a real C example so the supported API surface stays clear. Based on learnings, the officially supported public APIs for cuOpt are C (via cuopt_c.h) and Python.

Also applies to: 151-154, 222-234

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_QUICK_START.md` around lines 3 - 8, The doc promises a C API but the
sample uses C++ headers and calls solve_lp; update the guide so the API surface
is accurate by either renaming the example sections to "C++ API" or replacing
the sample with a real C example that includes cuopt_c.h and demonstrates the
equivalent C entrypoint(s) (e.g., the cuopt C solve function(s) corresponding to
solve_lp), and ensure every other section that currently claims to show C but
uses C++ is corrected likewise.
GRPC_INTERFACE.md-189-193 (1)

189-193: ⚠️ Potential issue | 🟡 Minor

Use the correct gRPC cancellation API here.

IsCancelled() is a server-side API only; ClientContext does not have this method. The documented client-side detection is incorrect. For client-initiated cancellation, the client calls context->TryCancel(). For detecting cancellation on the client side, check the returned Status object for StatusCode::CANCELLED.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_INTERFACE.md` around lines 189 - 193, The docs incorrectly state
client-side detection uses context->IsCancelled(); update the Connection
Handling section to use the correct gRPC APIs: note that IsCancelled() is
server-side only (used on ServerContext), clients should call
ClientContext::TryCancel() to initiate cancellation, and clients detect
cancellation by checking the returned grpc::Status for status.code() ==
grpc::StatusCode::CANCELLED (or StatusCode::CANCELLED) rather than calling
IsCancelled() on ClientContext. Mention the correct symbols:
ClientContext::TryCancel(), ServerContext::IsCancelled(), and grpc::Status /
grpc::StatusCode::CANCELLED.
cpp/src/grpc/client/grpc_client.hpp-76-78 (1)

76-78: ⚠️ Potential issue | 🟡 Minor

Fix the clamping bounds documentation to match server constraints.

The comment claims the value is clamped to [4 MiB, 2 GiB - 1 MiB], but the actual server constraints defined in grpc_server_types.hpp are [4 KiB, 2 GiB - 1 MiB]. Update the comment on line 76 to reflect the correct minimum of 4 KiB, not 4 MiB.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.hpp` around lines 76 - 78, Update the comment
for the max_message_bytes field in grpc_client.hpp to correctly state the
clamping bounds used by the server: change the documented minimum from 4 MiB to
4 KiB so the comment reads that values are clamped to [4 KiB, 2 GiB - 1 MiB]
(consistent with grpc_server_types.hpp); keep the existing explanatory text
about protobuf limits and retain the default value and variable name
max_message_bytes.
cpp/src/grpc/server/grpc_server_types.hpp-257-259 (1)

257-259: ⚠️ Potential issue | 🟡 Minor

Hardcoded shared memory names may cause conflicts with multiple server instances.

The SHM region names are hardcoded constants. If multiple cuopt_grpc_server instances run on the same host (e.g., different ports for different GPUs), they would collide on these shared memory regions.

💡 Consider making SHM names configurable or unique per instance

One approach is to include the port or a unique identifier in the SHM names:

// In ServerConfig:
std::string shm_suffix;  // e.g., "_8765" for port 8765

// At startup, construct names like:
// /cuopt_job_queue_8765
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 257 - 259, SHM names
SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL are hardcoded and will collide
across multiple server instances; make them instance-specific by replacing these
global const char* symbols with configurable strings generated at startup (e.g.,
in ServerConfig or GrpcServer) that append a unique suffix such as the server
port or UUID, update all code that references
SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to use the new per-instance string
members or accessors (e.g., serverConfig.getJobQueueName()), and ensure the
constructed names keep the leading '/' and are used consistently when
creating/opening shared memory regions.
🧹 Nitpick comments (6)
dependencies.yaml (1)

305-310: Consider adding version constraints for gRPC stack dependencies.

The new gRPC dependencies (openssl, c-ares, libgrpc, libprotobuf, libabseil, re2) are added without version constraints. For build reproducibility and to avoid potential ABI incompatibilities between these tightly coupled packages, consider pinning compatible versions (at minimum libgrpc and libprotobuf which have known version coupling requirements).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dependencies.yaml` around lines 305 - 310, The added gRPC-related
dependencies (openssl, c-ares, libgrpc, libprotobuf, libabseil, re2) lack
version constraints; update dependencies.yaml to pin compatible versions (at
minimum for libgrpc and libprotobuf) to ensure ABI compatibility and
reproducible builds, selecting matching libgrpc/libprotobuf pairs and consistent
versions for libabseil, c-ares, re2 and openssl, and express them using the
repository's existing versioning format so the package resolver installs the
intended releases.
ci/build_wheel_libcuopt.sh (1)

20-28: Consider adding fallback handling if neither package manager is available.

The libuuid installation logic silently does nothing if neither dnf nor apt-get is available. While the build will likely fail later with a clearer error from CMake, an explicit warning or error here would provide faster feedback. The same pattern is used for existing installations (e.g., update_rockylinux_repo.sh), so this may be intentional for consistency.

Optional: Add fallback warning
 # Install libuuid (needed by cuopt_grpc_server)
 if command -v dnf &> /dev/null; then
     dnf install -y libuuid-devel
 elif command -v apt-get &> /dev/null; then
     apt-get update && apt-get install -y uuid-dev
+else
+    echo "Warning: Neither dnf nor apt-get found, skipping libuuid installation"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/build_wheel_libcuopt.sh` around lines 20 - 28, The package-manager check
in the if/elif block that uses "command -v dnf" and "command -v apt-get" can
silently skip installing libuuid if neither tool exists; add an else branch
after that if/elif to print a clear error or warning (e.g., echo "error: no
supported package manager found to install libuuid") and exit non-zero (or at
least echo a prominent warning) so the failure is explicit; update the block
that installs libuuid (the conditional using dnf/apt-get) to include this else
handler and keep the subsequent call to bash ci/utils/install_protobuf_grpc.sh
unchanged.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

470-474: Prefer per-test server fixtures, or add an explicit reset between cases.

These class-scoped fixtures reuse one remote server across multiple tests, so any leaked job, session, or worker-side state can bleed into later cases and make them order-dependent. If function scope is too expensive, add an explicit “server is empty” cleanup/assertion before each test instead.

As per coding guidelines, "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."

Also applies to: 516-520, 713-748, 771-814

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 470 - 474, The class-scoped fixture cpu_only_env_with_server (which
calls _start_grpc_server_fixture and _stop_grpc_server) can leak server-side
state between tests; either change its pytest.fixture scope from "class" to
"function" so each test starts/stops a fresh server, or keep class scope but add
an explicit reset/assertion before each test (e.g., call a helper that verifies
the server has no active jobs/sessions or restarts the server via
_stop_grpc_server and _start_grpc_server_fixture) to guarantee isolation for
cpu_only_env_with_server and the other similar fixtures referenced in the file.
cpp/src/grpc/server/grpc_server_types.hpp (1)

83-94: Multi-atomic JobQueueEntry relies on implicit sequencing.

The entry has multiple std::atomic fields that are accessed independently. The correctness relies on setting ready last when publishing and checking ready first when consuming. This works with default memory_order_seq_cst, but the invariants could be documented for maintainability.

Consider adding a brief comment documenting the publish/consume protocol:

// Protocol: writer sets all other fields, then sets ready=true (release).
// Reader checks ready==true (acquire), then reads other fields.
std::atomic<bool> ready;        // Job is ready to be processed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 83 - 94, The
JobQueueEntry uses multiple std::atomic fields with an implicit publish/consume
ordering via ready; add a brief protocol comment above the ready declaration
(e.g., "writer sets other fields then sets ready=true (release); reader checks
ready==true (acquire) then reads other fields") and update the code paths that
publish/consume jobs to use explicit memory orders: use
std::memory_order_release when the producer stores ready=true, use
std::memory_order_acquire when the consumer loads ready, and keep other field
accesses as std::memory_order_relaxed where appropriate (referencing
JobQueueEntry and its members ready, claimed, worker_pid, cancelled,
worker_index, data_sent, is_chunked to locate sites to change).
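The publish/consume protocol the comment describes can be shown with a minimal stand-alone example: the producer fills the payload fields with plain writes, then stores `ready=true` with release ordering; the consumer loads `ready` with acquire ordering before reading the payload. The field names echo the review comment, but this struct is an illustration, not the actual JobQueueEntry.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <utility>

struct entry {
  int32_t worker_index{0};
  int64_t worker_pid{0};
  std::atomic<bool> ready{false};
};

void publish(entry& e, int32_t idx, int64_t pid)
{
  // Plain writes are safe: they happen-before the release store below.
  e.worker_index = idx;
  e.worker_pid   = pid;
  e.ready.store(true, std::memory_order_release);
}

// Spins until ready; the acquire load guarantees the payload writes
// made before the matching release store are visible here.
std::pair<int32_t, int64_t> consume(entry& e)
{
  while (!e.ready.load(std::memory_order_acquire)) { std::this_thread::yield(); }
  return {e.worker_index, e.worker_pid};
}
```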
cpp/src/grpc/grpc_solution_mapper.cpp (2)

333-340: Consider reserving capacity to avoid potential reallocation.

The doubles_to_bytes helper creates an intermediate std::vector<double> tmp by copying from input. For large solution vectors, this causes an extra allocation and copy. Since the output size is known upfront, this could be optimized.

♻️ Optional optimization to avoid intermediate copy when f_t is double
 template <typename f_t>
 std::vector<uint8_t> doubles_to_bytes(const std::vector<f_t>& vec)
 {
-  std::vector<double> tmp(vec.begin(), vec.end());
-  std::vector<uint8_t> bytes(tmp.size() * sizeof(double));
-  std::memcpy(bytes.data(), tmp.data(), bytes.size());
-  return bytes;
+  std::vector<uint8_t> bytes(vec.size() * sizeof(double));
+  if constexpr (std::is_same_v<f_t, double>) {
+    std::memcpy(bytes.data(), vec.data(), bytes.size());
+  } else {
+    std::vector<double> tmp(vec.begin(), vec.end());
+    std::memcpy(bytes.data(), tmp.data(), bytes.size());
+  }
+  return bytes;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 333 - 340, The helper
doubles_to_bytes currently copies into a temporary std::vector<double> tmp which
forces an extra allocation; fix by preallocating the output buffer and avoiding
unnecessary reallocation and copy: when f_t != double, construct tmp with
vec.size() and use tmp.assign(vec.begin(), vec.end()) or reserve tmp.capacity()
before copying, then allocate bytes with bytes.resize(tmp.size()*sizeof(double))
and memcpy from tmp.data(); additionally provide an overload/specialization for
doubles_to_bytes<double> (or a branch when std::is_same<f_t,double>::value) that
bypasses tmp entirely by directly resizing bytes to vec.size()*sizeof(double)
and memcpy from vec.data().

518-527: Silent failure on size validation returns empty vector.

When raw.size() % sizeof(double) != 0, the function returns an empty vector with no error indication. This could mask data corruption issues during transfer. Consider whether logging or an exception would be more appropriate for debugging production issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 518 - 527, The current
branch inside the templated reader (the code that looks up arrays by field_id
and handles the T==float case) silently returns an empty vector when raw.size()
% sizeof(double) != 0, which hides transfer/corruption problems; replace that
silent return with explicit error handling: either log a descriptive error
including field_id, raw.size() and expected element size using the project's
logger (or tracing facility) or throw a std::runtime_error (or a domain-specific
exception) with the same diagnostic text so callers can detect the failure;
update the block referencing arrays, field_id, raw, and the T==float branch to
perform this logging/throw before exiting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 129-141: The cleanup currently only removes a subset of possible
install locations; update the removal block that uses PREFIX to also delete
include/grpcpp, any libgrpc* and libprotobuf* shared objects and symlinks in
both "${PREFIX}/lib" and "${PREFIX}/lib64" (e.g. libgrpc*.so* , libprotobuf*.so*
, libprotoc*.so*), libabsl_* archives in "${PREFIX}/lib" as well as
"${PREFIX}/lib64", and any remaining grpc/protobuf-related pkgconfig or cmake
dirs; in short, broaden the rm globs to cover "${PREFIX}/include/grpcpp",
"${PREFIX}/lib/"lib*.so* and "${PREFIX}/lib64/"lib*.so*, add
"${PREFIX}/lib/"libabsl_*.a, and remove lingering pkgconfig/cmake directories so
no mixed-version headers or shared libs remain before rebuilding.

In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 783-794: After submit_lp succeeds you must ensure the remote job
is cancelled/deleted on every error path: after
start_log_streaming/poll_for_completion/stop_log_streaming, if poll.completed is
false or poll timed out, call cancel_job(sub.job_id) (or delete_job if terminal)
before returning the error; similarly, if get_lp_result<i_t, f_t>(sub.job_id)
fails or result mapping fails, call delete_job(sub.job_id) before returning;
update the error-returning branches around submit_lp, poll_for_completion,
get_lp_result, and the symmetric block at the other location (the 810–820 block)
to always perform the appropriate cancel_job/delete_job cleanup using
sub.job_id.
- Around line 839-915: After StartChunkedUpload (and similarly
StartChunkedDownload) succeeds you must ensure the server session is always
released on any early return; add a scope guard (RAII or lambda-based) right
after StartChunkedUpload that captures upload_id and impl_, which on destruction
(unless dismissed) calls the cleanup RPC (FinishChunkedUpload or an
AbortChunkedUpload/cancel RPC if available) using impl_->stub and ignores/logs
errors; dismiss the guard only after the normal FinishChunkedUpload completes
and job_id_out is set. Apply the same pattern around
StartChunkedDownload/FinishChunkedDownload so any error paths from
SendArrayChunk, validation failures, or other early returns invoke the
best-effort finish/abort to avoid exhausting kMaxChunkedSessions.
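A lambda-based scope guard of the kind this comment suggests is small enough to sketch. The cleanup runs on every exit path unless the guard is dismissed after the happy path completes; in the upload flow it would be constructed right after StartChunkedUpload succeeds (capturing the upload id) and dismissed only once FinishChunkedUpload returns OK. This is a generic sketch, not cuopt's actual guard.

```cpp
#include <functional>
#include <utility>

class scope_guard {
 public:
  explicit scope_guard(std::function<void()> fn) : fn_(std::move(fn)) {}

  // Best-effort cleanup on any exit path; errors would be logged, never thrown.
  ~scope_guard() { if (fn_) fn_(); }

  // Call once the normal finish RPC has succeeded.
  void dismiss() { fn_ = nullptr; }

  scope_guard(const scope_guard&) = delete;
  scope_guard& operator=(const scope_guard&) = delete;

 private:
  std::function<void()> fn_;
};
```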

In `@cpp/src/grpc/grpc_settings_mapper.cpp`:
- Around line 76-121: The PDLP warm-start blob isn't being transferred: update
map_pdlp_settings_to_proto to copy settings.warm_start_data into the protobuf
(use pb_settings->set_warm_start_data(settings.warm_start_data) or
pb_settings->mutable_warm_start_data()->CopyFrom(...) depending on the field
type) and make the symmetric change in the reverse mapper (the function that
converts cuopt::remote::PDLPSolverSettings back to pdlp_solver_settings_t, e.g.
map_proto_to_pdlp_settings or similar) to populate settings.warm_start_data from
the proto, ensuring warm-start state is preserved across remote calls.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 38-45: The parse_incumbent_proto function must validate the
incoming buffer before calling Incumbent::ParseFromArray: ensure that size does
not exceed std::numeric_limits<int>::max() (to avoid int overflow when casting)
and that if size > 0 then data is not nullptr; if either check fails return
false. After these guards, perform the safe static_cast<int>(size) and call
incumbent_msg.ParseFromArray(data, static_cast<int>(size)). Mirror the defensive
checks used in build_incumbent_proto and keep the function signature and return
behavior unchanged.

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 13-27: The code in send_job_data_pipe captures the raw fd from
worker_pipes under worker_pipes_mutex but then releases the lock before writing,
allowing the FD to be closed/reused; to fix, under the same lock duplicate the
descriptor (use dup or dup3) and store the duplicated fd (e.g., dup_fd) then
release worker_pipes_mutex and perform write_to_pipe against dup_fd, closing
dup_fd afterwards; update references to worker_pipes, worker_pipes_mutex,
send_job_data_pipe, and write_to_pipe accordingly so the writes use the
duplicated FD and avoid races.
- Around line 219-329: In cancel_job, after scanning job_queue but before
overwriting it->second, re-read the tracked status under tracker_mutex (e.g.,
auto final_status = it->second.status) and if final_status is a terminal state
(COMPLETED, FAILED, or CANCELLED) return the appropriate code/message without
changing job_tracker or calling delete_log_file; only proceed to set status to
CANCELLED, set error_message, and notify waiters when final_status is
non-terminal. This change references cancel_job, job_tracker, job_queue, and
tracker_mutex to ensure the fallback path does not clobber a terminal state that
was published while the slot was recycled.
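The "re-check under the lock before overwriting" pattern this fix describes looks roughly like the following. The names mirror the comment (job_tracker, tracker_mutex) but the types and statuses here are illustrative stand-ins for the real server code.

```cpp
#include <map>
#include <mutex>
#include <string>

enum class job_status { QUEUED, RUNNING, COMPLETED, FAILED, CANCELLED };

std::mutex tracker_mutex;
std::map<std::string, job_status> job_tracker;

bool is_terminal(job_status s)
{
  return s == job_status::COMPLETED || s == job_status::FAILED ||
         s == job_status::CANCELLED;
}

// Fallback path of cancel_job: returns true only when this call actually
// transitioned the job to CANCELLED. A job already in a terminal state is
// left untouched so a published result is never clobbered.
bool cancel_job_fallback(const std::string& job_id)
{
  std::lock_guard<std::mutex> lock(tracker_mutex);
  auto it = job_tracker.find(job_id);
  if (it == job_tracker.end()) { return false; }
  if (is_terminal(it->second)) { return false; }  // re-check: don't overwrite
  it->second = job_status::CANCELLED;
  return true;
}
```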

In `@cpp/src/grpc/server/grpc_pipe_io.cpp`:
- Around line 38-74: The read_from_pipe() logic uses a single initial poll()
with timeout_ms then switches to blocking ::read, which can hang if the sender
stalls mid-message; also it treats POLLHUP as an immediate error even when
POLLIN is set and data remains. Fix by making reads timeout-aware: either set fd
non-blocking and loop using poll() (or repeated poll() with remaining
timeout_ms) before each ::read, or after a partial ::read use
poll(&pfd,1,remaining_ms) to wait for more data; ensure you update and use a
remaining timeout counter; and change the revents check to allow POLLIN|POLLHUP
(i.e., only treat POLLHUP/POLLERR/POLLNVAL as fatal when POLLIN is not set) so
buffered data is read out before returning an error.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 290-301: shutdown_all currently only sets
shm_ctrl->shutdown_requested and waits, which leaves workers stuck inside
long-running routines like solve_lp() or solve_mip(); update shutdown_all to
perform a hard-stop for active workers by (1) setting keep_running = false and
shm_ctrl->shutdown_requested as now, (2) invoking the worker-level cancellation
path for any running workers (e.g., call the worker interrupt/terminate API or
set a per-worker cancel flag and signal the workers) so threads inside
solve_lp()/solve_mip() can break out immediately, and (3) after signalling
cancellations, proceed to join
result_thread/incumbent_thread/monitor_thread/reaper_thread, call
wait_for_workers(), and then cleanup_shared_memory(); refer to the shutdown_all
lambda, wait_for_workers(), and the long-running routines solve_lp()/solve_mip()
to add and use the cancellation mechanism.
- Around line 228-263: Validate that TLS is enabled before accepting any
TLS/mTLS-specific flags: add a guard that checks config.enable_tls and rejects
when false while any of config.tls_cert_path, config.tls_key_path,
config.tls_root_path are non-empty or config.require_client is true (use
cuopt_expects with error_type_t::ValidationError and a clear message). Place
this check before the existing TLS branch that constructs
grpc::SslServerCredentials (i.e., before the if (config.enable_tls) { ... }
block) so flags like --require-client-cert, --tls-cert, --tls-key, and
--tls-root cannot silently fall back to grpc::InsecureServerCredentials().
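The guard described above reduces to a small pure check; the sketch below uses a hypothetical `TlsConfig` struct and a plain exception in place of cuopt_expects/ValidationError, but the predicate is the same:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical config subset; field names mirror the ones the comment cites.
struct TlsConfig {
  bool enable_tls     = false;
  bool require_client = false;
  std::string tls_cert_path, tls_key_path, tls_root_path;
};

// Reject TLS-specific flags when TLS itself is off, so --tls-cert and friends
// can never silently fall back to insecure credentials.
void validate_tls_flags(const TlsConfig& c)
{
  if (!c.enable_tls && (!c.tls_cert_path.empty() || !c.tls_key_path.empty() ||
                        !c.tls_root_path.empty() || c.require_client)) {
    throw std::invalid_argument(
      "TLS options (--tls-cert/--tls-key/--tls-root/--require-client-cert) "
      "require TLS to be enabled");
  }
}
```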

In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 401-409: The reaper currently uses the session's created timestamp
(it->second.created) causing active downloads to be expired mid-transfer; update
the session model to include a last_activity timestamp (e.g., last_activity) and
change the reaper to compare now - it->second.last_activity against
kSessionTimeoutSeconds, and in GetResultChunk (or any handler that serves/reads
a chunk from chunked_downloads) update that session's last_activity under
chunked_downloads_mutex each time the session is accessed so active clients
reset the timeout; keep using chunked_downloads and chunked_downloads_mutex and
remove reliance on created for expiration logic.
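The session model change amounts to adding one timestamp and refreshing it on access; a minimal sketch (the struct, the 300-second value, and the helper names are assumptions, not the actual server types):

```cpp
#include <chrono>

// Sketch of the suggested model: expiry keys off last_activity, which every
// chunk access refreshes, instead of the fixed created time.
struct ChunkedSession {
  std::chrono::steady_clock::time_point created;
  std::chrono::steady_clock::time_point last_activity;
};

constexpr auto kSessionTimeout = std::chrono::seconds(300);  // assumed value

bool expired(const ChunkedSession& s, std::chrono::steady_clock::time_point now)
{
  return now - s.last_activity > kSessionTimeout;
}

// Call under chunked_downloads_mutex each time the session is accessed.
void touch(ChunkedSession& s)
{
  s.last_activity = std::chrono::steady_clock::now();
}
```

With this in place the reaper compares against `last_activity`, so an active download keeps resetting its own timeout.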

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 795-819: The terminal path currently reads at most one more line
after check_job_status and then sends the job_complete sentinel, which can drop
additional flushed lines and leave byte_offset behind. In the completion branch
(around check_job_status and the uses of in, line, current_offset, next_offset2,
cuopt::remote::LogMessage, and writer->Write), drain the entire remaining
stream by calling std::getline(in, line) in a loop. For each iteration, compute
the byte offset from the stream position (tellg() after the read, if
available), emit a LogMessage with m.set_line(line),
m.set_byte_offset(next_offset), m.set_job_complete(false) via writer->Write(m),
and update current_offset to the last emitted offset. After the loop, emit the
final done message with m.set_job_complete(true) and byte_offset set to that
last current_offset.
- Around line 141-179: SendArrayChunk currently only updates meta.received_bytes
and pushes the raw chunk, letting FinishChunkedUpload queue jobs even if chunks
overlap or leave gaps; fix by enforcing exact coverage and no-overlap for each
field. In SendArrayChunk (use symbols state.field_meta, meta,
meta.received_bytes, meta.element_size, state.chunks, state.total_chunks, and
the incoming ac/raw/elem_offset) validate the incoming chunk against existing
ranges for that field (track per-field ranges in meta or compare against
already-stored chunks) and reject overlaps immediately; store the byte/element
range rather than only adding to received_bytes. In FinishChunkedUpload iterate
state.field_meta entries and ensure for every field
meta.total_elements*meta.element_size equals the sum of non-overlapping received
ranges (and that ranges fully cover [0,total_elements) with no gaps); if any
field is incomplete or has gaps/overlaps, return
INVALID_ARGUMENT/RESOURCE_EXHAUSTED and do not enqueue the job.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 103-138: spawn_worker currently forks a child to create
replacement workers, which is unsafe in a multithreaded server. Do not continue
execution in a forked child of a multithreaded process; instead spawn a fresh
process via fork+exec (or posix_spawn) that execs the same server binary with
an argument to run a single worker, preserving the existing pipe setup/FDs for
the new process. posix_spawn avoids inheriting thread/mutex state entirely.
Ensure spawn_single_worker and the is_replacement=true branch in spawn_worker
call this exec-based spawn path, and keep the same
close_all_worker_pipes/close_worker_pipes_child_ends semantics for FD hygiene.
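An exec-based spawn along those lines might look like this; the `--single-worker` flag and `spawn_worker_process` name are assumptions for illustration, and the real path would also wire up the worker pipes via posix_spawn file actions:

```cpp
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char** environ;

// Sketch: posix_spawn starts a fresh process image, so no thread or mutex
// state from the multithreaded server survives into the child.
pid_t spawn_worker_process(const char* server_binary)
{
  char* const argv[] = {const_cast<char*>(server_binary),
                        const_cast<char*>("--single-worker"),
                        nullptr};
  pid_t pid = -1;
  if (posix_spawn(&pid, server_binary, /*file_actions=*/nullptr,
                  /*attrp=*/nullptr, argv, environ) != 0) {
    return -1;
  }
  return pid;
}
```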

In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 399-477: publish_result can silently drop a completed job when no
result_queue slot is free; change publish_result to return a bool (e.g.,
enqueued) and ensure it does not silently succeed when result_slot stays -1, and
update worker_process to check that return value before clearing the job slot:
if publish_result returns false, either retry/enqueue with a bounded backoff or
propagate an explicit RESULT_ERROR into the result handling path so the server
thread will see a terminal result; refer to publish_result, result_slot,
result_queue, and worker_process (the code that resets the job slot) when making
these changes.
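The bounded-backoff retry suggested above can be factored out as a small helper; `publish_with_backoff` and the callback are hypothetical, with the callback standing in for one attempt to claim a result_queue slot:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Sketch: retry the slot claim with bounded exponential backoff and report
// failure to the caller instead of silently dropping the result.
bool publish_with_backoff(const std::function<bool()>& try_publish,
                          int max_attempts = 5)
{
  auto delay = std::chrono::milliseconds(10);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (try_publish()) return true;  // slot claimed, result enqueued
    std::this_thread::sleep_for(delay);
    delay *= 2;  // bounded: at most max_attempts sleeps
  }
  return false;  // caller must surface RESULT_ERROR, not clear the job slot
}
```

On a `false` return, worker_process would leave the job slot intact and propagate an explicit error so the server thread still observes a terminal result.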

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp`:
- Around line 428-463: Reserve a free loopback port before forking by creating a
temporary socket, binding it to 127.0.0.1:0, calling getsockname to set port_,
and ensuring the socket is not closed-on-exec so the child can inherit the
reservation; then fork() as before, pass the port_str to execl via the "--port"
arg, and in the parent keep the reservation socket open until
tcp_connect_check(port_, 15000) succeeds (close it afterward); update code
around port_, fork(), tcp_connect_check(), server_path_, execl and server_pid_
to implement this reservation/hand-off sequence.
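The reservation step can be sketched as below; `reserve_loopback_port` is a hypothetical helper, and the hand-off details (when to close the reservation socket, clearing close-on-exec, SO_REUSEADDR on the server side) are left to the caller as the comment describes:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Sketch: bind to 127.0.0.1:0, let the kernel pick a free port, and keep the
// socket open so the port stays reserved until the child server is up.
// Returns the port, with the reservation fd handed back via fd_out.
int reserve_loopback_port(int& fd_out)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port        = 0;  // ask the kernel for any free port
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  fd_out = fd;  // caller closes after the server is confirmed listening
  return ntohs(addr.sin_port);
}
```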

In `@GRPC_SERVER_ARCHITECTURE.md`:
- Around line 52-57: The POSIX shared-memory names `/cuopt_job_queue`,
`/cuopt_result_queue`, and `/cuopt_control` are host-global and must be
namespaced per server instance; update all references (including the other
occurrence around lines mentioning 307-314) to generate instance-specific names
(for example by appending PID, port, or a UUID) instead of the fixed literals so
multiple `cuopt_grpc_server` instances can coexist—add a small helper like
generateInstanceShmName(instanceId, baseName) and replace uses of the three
fixed symbols with calls that pass a unique instance identifier derived from the
server startup parameters (PID/port/ENV/UUID).
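The helper itself is a one-liner; a sketch of the suggested `generateInstanceShmName`, with the suffix format being an assumption:

```cpp
#include <string>

// Sketch: namespace each POSIX shm object per server instance so multiple
// cuopt_grpc_server processes can coexist on one host. instance_id would
// typically be derived from the PID, port, or a UUID at startup.
std::string generateInstanceShmName(const std::string& instance_id,
                                    const std::string& base_name)
{
  // base_name arrives as "/cuopt_job_queue" etc.; keep the single leading '/'.
  return base_name + "_" + instance_id;
}
```

Note that POSIX shm names must contain exactly one `/`, at the front, so the suffix must be appended rather than added as a path component.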

---

Outside diff comments:
In `@cpp/src/mip_heuristics/solve.cu`:
- Around line 442-453: The remote-execution branch currently forces a CPU-only
cast (dynamic_cast to cpu_optimization_problem_t) which makes remote solves fail
for GPU-backed problems; remove the cpu-specific cast in the
is_remote_execution_enabled() branch and call solve_mip_remote using the generic
problem_interface (or an overload that accepts the base
optimization_problem_t<i_t,f_t>) so remote routing stays backend-agnostic; keep
the CPU-only cast only for the local path where solve_mip(*cpu_prob, settings)
is needed and ensure solve_mip_remote's signature or overload supports
serializing/handling GPU-resident problems from the base type rather than
requiring cpu_optimization_problem_t.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 417-421: The test currently skips the warm-start round-trip when
ws is None; update the test to fail if get_pdlp_warm_start_data() returns None
so regressions are caught: after calling settings.get_pdlp_warm_start_data() (or
wherever ws is obtained), assert ws is not None (or raise an AssertionError)
with a clear message before calling settings.set_pdlp_warm_start_data(ws) and
Solve(dm, settings); keep the existing calls to set_pdlp_warm_start_data,
set_parameter(CUOPT_ITERATION_LIMIT, 200), and linear_programming.Solve to
exercise the full round-trip and then assert sol2.get_primal_solution() is not
None.

---

Minor comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 76-78: Update the comment for the max_message_bytes field in
grpc_client.hpp to correctly state the clamping bounds used by the server:
change the documented minimum from 4 MiB to 4 KiB so the comment reads that
values are clamped to [4 KiB, 2 GiB - 1 MiB] (consistent with
grpc_server_types.hpp); keep the existing explanatory text about protobuf limits
and retain the default value and variable name max_message_bytes.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 257-259: SHM names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL
are hardcoded and will collide across multiple server instances; make them
instance-specific by replacing these global const char* symbols with
configurable strings generated at startup (e.g., in ServerConfig or GrpcServer)
that append a unique suffix such as the server port or UUID, update all code
that references SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to use the new
per-instance string members or accessors (e.g., serverConfig.getJobQueueName()),
and ensure the constructed names keep the leading '/' and are used consistently
when creating/opening shared memory regions.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 76-86: The clear() method leaves server_log_start_pos_ unchanged
when opening server_log_path_ fails, which can leave a stale offset; update
clear() so that if server_log_path_ is non-empty but the file fails to open you
reset server_log_start_pos_ to the beginning (e.g., streampos(0) or 0) to avoid
seeking to a stale offset on later reads; locate the logic in clear(), reference
server_log_path_, server_log_start_pos_, and test_start_marked_, and ensure
test_start_marked_ behavior remains the same after the reset.

In `@GRPC_INTERFACE.md`:
- Around line 189-193: The docs incorrectly state client-side detection uses
context->IsCancelled(); update the Connection Handling section to use the
correct gRPC APIs: note that IsCancelled() is server-side only (used on
ServerContext), clients should call ClientContext::TryCancel() to initiate
cancellation, and clients detect cancellation by checking the returned
grpc::Status for status.code() == grpc::StatusCode::CANCELLED (or
StatusCode::CANCELLED) rather than calling IsCancelled() on ClientContext.
Mention the correct symbols: ClientContext::TryCancel(),
ServerContext::IsCancelled(), and grpc::Status / grpc::StatusCode::CANCELLED.

In `@GRPC_QUICK_START.md`:
- Around line 3-8: The doc promises a C API but the sample uses C++ headers and
calls solve_lp; update the guide so the API surface is accurate by either
renaming the example sections to "C++ API" or replacing the sample with a real C
example that includes cuopt_c.h and demonstrates the equivalent C entrypoint(s)
(e.g., the cuopt C solve function(s) corresponding to solve_lp), and ensure
every other section that currently claims to show C but uses C++ is corrected
likewise.

---

Nitpick comments:
In `@ci/build_wheel_libcuopt.sh`:
- Around line 20-28: The package-manager check in the if/elif block that uses
"command -v dnf" and "command -v apt-get" can silently skip installing libuuid
if neither tool exists; add an else branch after that if/elif to print a clear
error or warning (e.g., echo "error: no supported package manager found to
install libuuid") and exit non-zero (or at least echo a prominent warning) so
the failure is explicit; update the block that installs libuuid (the conditional
using dnf/apt-get) to include this else handler and keep the subsequent call to
bash ci/utils/install_protobuf_grpc.sh unchanged.

In `@cpp/src/grpc/grpc_solution_mapper.cpp`:
- Around line 333-340: The helper doubles_to_bytes currently copies into a
temporary std::vector<double> tmp which forces an extra allocation; fix by
preallocating the output buffer and avoiding unnecessary reallocation and copy:
when f_t != double, construct tmp with vec.size() and use
tmp.assign(vec.begin(), vec.end()) or reserve tmp.capacity() before copying,
then allocate bytes with bytes.resize(tmp.size()*sizeof(double)) and memcpy from
tmp.data(); additionally provide an overload/specialization for
doubles_to_bytes<double> (or a branch when std::is_same<f_t,double>::value) that
bypasses tmp entirely by directly resizing bytes to vec.size()*sizeof(double)
and memcpy from vec.data().
- Around line 518-527: The current branch inside the templated reader (the code
that looks up arrays by field_id and handles the T==float case) silently returns
an empty vector when raw.size() % sizeof(double) != 0, which hides
transfer/corruption problems; replace that silent return with explicit error
handling: either log a descriptive error including field_id, raw.size() and
expected element size using the project's logger (or tracing facility) or throw
a std::runtime_error (or a domain-specific exception) with the same diagnostic
text so callers can detect the failure; update the block referencing arrays,
field_id, raw, and the T==float branch to perform this logging/throw before
exiting.
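The doubles_to_bytes fast path (first comment above) can be sketched with `if constexpr`; the `std::string` output type and the exact signature are assumptions made to keep the example self-contained:

```cpp
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Sketch: when f_t is already double, memcpy straight from the input with no
// temporary; otherwise widen once through a preallocated temporary.
template <typename f_t>
std::string doubles_to_bytes(const std::vector<f_t>& vec)
{
  std::string bytes;
  if constexpr (std::is_same_v<f_t, double>) {
    bytes.resize(vec.size() * sizeof(double));
    if (!vec.empty()) std::memcpy(&bytes[0], vec.data(), bytes.size());
  } else {
    std::vector<double> tmp(vec.begin(), vec.end());  // single allocation
    bytes.resize(tmp.size() * sizeof(double));
    if (!tmp.empty()) std::memcpy(&bytes[0], tmp.data(), bytes.size());
  }
  return bytes;
}
```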

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 83-94: The JobQueueEntry uses multiple std::atomic fields with an
implicit publish/consume ordering via ready; add a brief protocol comment above
the ready declaration (e.g., "writer sets other fields then sets ready=true
(release); reader checks ready==true (acquire) then reads other fields") and
update the code paths that publish/consume jobs to use explicit memory orders:
use std::memory_order_release when the producer stores ready=true, use
std::memory_order_acquire when the consumer loads ready, and keep other field
accesses as std::memory_order_relaxed where appropriate (referencing
JobQueueEntry and its members ready, claimed, worker_pid, cancelled,
worker_index, data_sent, is_chunked to locate sites to change).
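The release/acquire protocol described above can be shown on a trimmed-down entry; `Entry` is a two-field stand-in for JobQueueEntry, not the real type:

```cpp
#include <atomic>
#include <thread>

// Sketch of the documented protocol: the writer fills payload fields
// (relaxed), then publishes with ready.store(release); the reader observes
// ready.load(acquire) before touching the payload.
struct Entry {
  std::atomic<int> worker_pid{0};
  std::atomic<bool> ready{false};
};

void publish(Entry& e, int pid)
{
  e.worker_pid.store(pid, std::memory_order_relaxed);
  e.ready.store(true, std::memory_order_release);  // publishes prior writes
}

int consume(Entry& e)
{
  while (!e.ready.load(std::memory_order_acquire)) {}  // spin until published
  return e.worker_pid.load(std::memory_order_relaxed);  // safe after acquire
}
```

The acquire load synchronizes with the release store, so every relaxed write the producer made before setting `ready` is visible to the consumer afterwards.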

In `@dependencies.yaml`:
- Around line 305-310: The added gRPC-related dependencies (openssl, c-ares,
libgrpc, libprotobuf, libabseil, re2) lack version constraints; update
dependencies.yaml to pin compatible versions (at minimum for libgrpc and
libprotobuf) to ensure ABI compatibility and reproducible builds, selecting
matching libgrpc/libprotobuf pairs and consistent versions for libabseil,
c-ares, re2 and openssl, and express them using the repository's existing
versioning format so the package resolver installs the intended releases.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 470-474: The class-scoped fixture cpu_only_env_with_server (which
calls _start_grpc_server_fixture and _stop_grpc_server) can leak server-side
state between tests; either change its pytest.fixture scope from "class" to
"function" so each test starts/stops a fresh server, or keep class scope but add
an explicit reset/assertion before each test (e.g., call a helper that verifies
the server has no active jobs/sessions or restarts the server via
_stop_grpc_server and _start_grpc_server_fixture) to guarantee isolation for
cpu_only_env_with_server and the other similar fixtures referenced in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 741d5bb2-9920-447b-a562-335486d8ad08

📥 Commits

Reviewing files that changed from the base of the PR and between 2cf8b32 and 335e936.

📒 Files selected for processing (59)
  • GRPC_INTERFACE.md
  • GRPC_QUICK_START.md
  • GRPC_SERVER_ARCHITECTURE.md
  • build.sh
  • ci/build_wheel_libcuopt.sh
  • ci/utils/install_protobuf_grpc.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_cli.cpp
  • cpp/include/cuopt/linear_programming/solve_remote.hpp
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cpp
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cpp
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cpp
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_io.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_logger.cpp
  • cpp/src/grpc/server/grpc_server_logger.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/mip_heuristics/presolve/probing_cache.cu
  • cpp/src/mip_heuristics/presolve/third_party_presolve.cpp
  • cpp/src/mip_heuristics/solve.cu
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/cpu_optimization_problem.cpp
  • cpp/src/pdlp/optimization_problem.cu
  • cpp/src/pdlp/solve.cu
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • dependencies.yaml
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (4)
  • cpp/src/pdlp/cpu_optimization_problem.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/optimization_problem.cu
  • cpp/src/pdlp/solve_remote.cu

Comment on lines +129 to +141
# Clean up any previous installations to avoid ABI mismatches
# (notably Abseil LTS namespaces like absl::lts_20220623 vs absl::lts_20250512)
echo "Cleaning up previous installations..."
rm -rf \
"${PREFIX}/lib/cmake/grpc" "${PREFIX}/lib64/cmake/grpc" \
"${PREFIX}/lib/cmake/protobuf" "${PREFIX}/lib64/cmake/protobuf" \
"${PREFIX}/lib/cmake/absl" "${PREFIX}/lib64/cmake/absl" \
"${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" "${PREFIX}/include/grpc" \
"${PREFIX}/bin/grpc_cpp_plugin" "${PREFIX}/bin/protoc" "${PREFIX}/bin/protoc-"* || true
rm -f \
"${PREFIX}/lib/"libgrpc*.a "${PREFIX}/lib/"libgpr*.a "${PREFIX}/lib/"libaddress_sorting*.a "${PREFIX}/lib/"libre2*.a "${PREFIX}/lib/"libupb*.a \
"${PREFIX}/lib64/"libabsl_*.a "${PREFIX}/lib64/"libprotobuf*.so* "${PREFIX}/lib64/"libprotoc*.so* \
"${PREFIX}/lib/"libprotobuf*.a "${PREFIX}/lib/"libprotoc*.a || true

⚠️ Potential issue | 🟠 Major

The cleanup pass misses several common install layouts.

This block removes only part of the old tree. include/grpcpp, shared libgrpc*/libprotobuf* under lib, and libabsl_* under lib can all survive, which means CMake may still find a mixed-version installation and reproduce the ABI mismatch this cleanup is trying to prevent.

Expand cleanup across both header and library layouts
 rm -rf \
   "${PREFIX}/lib/cmake/grpc" "${PREFIX}/lib64/cmake/grpc" \
   "${PREFIX}/lib/cmake/protobuf" "${PREFIX}/lib64/cmake/protobuf" \
   "${PREFIX}/lib/cmake/absl" "${PREFIX}/lib64/cmake/absl" \
-  "${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" "${PREFIX}/include/grpc" \
+  "${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" \
+  "${PREFIX}/include/grpc" "${PREFIX}/include/grpcpp" \
   "${PREFIX}/bin/grpc_cpp_plugin" "${PREFIX}/bin/protoc" "${PREFIX}/bin/protoc-"* || true
-rm -f \
-  "${PREFIX}/lib/"libgrpc*.a "${PREFIX}/lib/"libgpr*.a "${PREFIX}/lib/"libaddress_sorting*.a "${PREFIX}/lib/"libre2*.a "${PREFIX}/lib/"libupb*.a \
-  "${PREFIX}/lib64/"libabsl_*.a "${PREFIX}/lib64/"libprotobuf*.so* "${PREFIX}/lib64/"libprotoc*.so* \
-  "${PREFIX}/lib/"libprotobuf*.a "${PREFIX}/lib/"libprotoc*.a || true
+for libdir in "${PREFIX}/lib" "${PREFIX}/lib64"; do
+  rm -f \
+    "${libdir}/"libgrpc*.a "${libdir}/"libgrpc*.so* \
+    "${libdir}/"libgpr*.a "${libdir}/"libgpr*.so* \
+    "${libdir}/"libaddress_sorting*.a "${libdir}/"libaddress_sorting*.so* \
+    "${libdir}/"libre2*.a "${libdir}/"libre2*.so* \
+    "${libdir}/"libupb*.a "${libdir}/"libupb*.so* \
+    "${libdir}/"libabsl_*.a "${libdir}/"libabsl_*.so* \
+    "${libdir}/"libprotobuf*.a "${libdir}/"libprotobuf*.so* \
+    "${libdir}/"libprotoc*.a "${libdir}/"libprotoc*.so*
+done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/utils/install_protobuf_grpc.sh` around lines 129 - 141, The cleanup
currently only removes a subset of possible install locations; update the
removal block that uses PREFIX to also delete include/grpcpp, any libgrpc* and
libprotobuf* shared objects and symlinks in both "${PREFIX}/lib" and
"${PREFIX}/lib64" (e.g. libgrpc*.so* , libprotobuf*.so* , libprotoc*.so*),
libabsl_* archives in "${PREFIX}/lib" as well as "${PREFIX}/lib64", and any
remaining grpc/protobuf-related pkgconfig or cmake dirs; in short, broaden the
rm globs to cover "${PREFIX}/include/grpcpp", "${PREFIX}/lib/"lib*.so* and
"${PREFIX}/lib64/"lib*.so*, add "${PREFIX}/lib/"libabsl_*.a, and remove
lingering pkgconfig/cmake directories so no mixed-version headers or shared libs
remain before rebuilding.

Comment on lines +783 to +794
auto sub = submit_lp(problem, settings);
if (!sub.success) { return {.error_message = sub.error_message}; }

start_log_streaming(sub.job_id);
auto poll = poll_for_completion(sub.job_id);
stop_log_streaming();

if (!poll.completed) { return {.error_message = poll.error_message}; }

auto result = get_lp_result<i_t, f_t>(sub.job_id);
if (result.success) { delete_job(sub.job_id); }


⚠️ Potential issue | 🟠 Major

Cancel/delete the remote job on every non-success exit path.

After submission, these helpers only call delete_job() when the solve fully succeeds. If polling times out, result download fails, or result mapping fails, the remote job/result/log file is left behind on the server and can eventually exhaust server slots. Add cleanup logic that cancels outstanding jobs and deletes terminal ones before returning an error.

Also applies to: 810-820

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 783 - 794, After submit_lp
succeeds you must ensure the remote job is cancelled/deleted on every error
path: after start_log_streaming/poll_for_completion/stop_log_streaming, if
poll.completed is false or poll timed out, call cancel_job(sub.job_id) (or
delete_job if terminal) before returning the error; similarly, if
get_lp_result<i_t, f_t>(sub.job_id) fails or result mapping fails, call
delete_job(sub.job_id) before returning; update the error-returning branches
around submit_lp, poll_for_completion, get_lp_result, and the symmetric block at
the other location (the 810–820 block) to always perform the appropriate
cancel_job/delete_job cleanup using sub.job_id.

Comment on lines +839 to +915
// --- 1. StartChunkedUpload ---
std::string upload_id;
{
grpc::ClientContext context;
set_rpc_deadline(context, kDefaultRpcTimeoutSeconds);
cuopt::remote::StartChunkedUploadRequest request;
*request.mutable_problem_header() = header;

cuopt::remote::StartChunkedUploadResponse response;
auto status = impl_->stub->StartChunkedUpload(&context, request, &response);

if (!status.ok()) {
last_error_ = "StartChunkedUpload failed: " + status.error_message();
return false;
}

upload_id = response.upload_id();
if (response.max_message_bytes() > 0) {
server_max_message_bytes_.store(response.max_message_bytes(), std::memory_order_relaxed);
}
}

GRPC_CLIENT_DEBUG_LOG(config_, "[grpc_client] ChunkedUpload started, upload_id=" << upload_id);

// --- 2. Build chunk requests directly from problem arrays ---
int64_t chunk_data_budget = config_.chunk_size_bytes;
if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
int64_t srv_max = server_max_message_bytes_.load(std::memory_order_relaxed);
if (srv_max > 0 && chunk_data_budget > srv_max * 9 / 10) { chunk_data_budget = srv_max * 9 / 10; }

auto chunk_requests = build_array_chunk_requests(problem, upload_id, chunk_data_budget);

// --- 3. Send each chunk request ---
int total_chunks = 0;
int64_t total_bytes_sent = 0;

for (auto& chunk_request : chunk_requests) {
grpc::ClientContext chunk_context;
set_rpc_deadline(chunk_context, kDefaultRpcTimeoutSeconds);
cuopt::remote::SendArrayChunkResponse chunk_response;
auto status = impl_->stub->SendArrayChunk(&chunk_context, chunk_request, &chunk_response);

if (!status.ok()) {
last_error_ = "SendArrayChunk failed: " + status.error_message();
return false;
}

total_bytes_sent += chunk_request.chunk().data().size();
++total_chunks;
}

GRPC_CLIENT_DEBUG_LOG(config_,
"[grpc_client] ChunkedUpload sent " << total_chunks << " chunk requests");

// --- 4. FinishChunkedUpload ---
{
grpc::ClientContext context;
set_rpc_deadline(context, kDefaultRpcTimeoutSeconds);
cuopt::remote::FinishChunkedUploadRequest request;
request.set_upload_id(upload_id);

cuopt::remote::SubmitJobResponse response;
auto status = impl_->stub->FinishChunkedUpload(&context, request, &response);

if (!status.ok()) {
last_error_ = "FinishChunkedUpload failed: " + status.error_message();
return false;
}

job_id_out = response.job_id();
}

GRPC_CLIENT_THROUGHPUT_LOG(config_, "upload_chunked", total_bytes_sent, upload_t0);
GRPC_CLIENT_DEBUG_LOG(
config_,
"[grpc_client] ChunkedUpload complete: " << total_chunks << " chunks, job_id=" << job_id_out);
return true;

⚠️ Potential issue | 🟠 Major

Release chunked upload/download sessions when a transfer aborts.

Once StartChunkedUpload or StartChunkedDownload succeeds, any later RPC or validation failure returns immediately without telling the server to drop the session. A handful of interrupted transfers can fill kMaxChunkedSessions until the reaper runs and block unrelated clients. This needs an abort/cleanup path, or at least a best-effort finish/abort guard, around the whole transfer.

Also applies to: 1010-1133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 839 - 915, After
StartChunkedUpload (and similarly StartChunkedDownload) succeeds you must ensure
the server session is always released on any early return; add a scope guard
(RAII or lambda-based) right after StartChunkedUpload that captures upload_id
and impl_, which on destruction (unless dismissed) calls the cleanup RPC
(FinishChunkedUpload or an AbortChunkedUpload/cancel RPC if available) using
impl_->stub and ignores/logs errors; dismiss the guard only after the normal
FinishChunkedUpload completes and job_id_out is set. Apply the same pattern
around StartChunkedDownload/FinishChunkedDownload so any error paths from
SendArrayChunk, validation failures, or other early returns invoke the
best-effort finish/abort to avoid exhausting kMaxChunkedSessions.
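The RAII guard the prompt describes is a standard pattern; a minimal sketch (the `ScopeGuard` name and shape are assumptions, and in the client the callback would issue the best-effort abort/finish RPC, ignoring its errors):

```cpp
#include <functional>
#include <utility>

// Minimal scope guard for best-effort session cleanup: the callback runs on
// any early return unless dismiss() is reached on the success path.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> on_exit)
    : on_exit_(std::move(on_exit))
  {
  }
  ~ScopeGuard()
  {
    if (on_exit_) on_exit_();
  }
  void dismiss() { on_exit_ = nullptr; }
  ScopeGuard(const ScopeGuard&)            = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> on_exit_;
};
```

In the upload path this would be constructed right after StartChunkedUpload succeeds, capturing the upload_id, and dismissed only once FinishChunkedUpload has returned a job_id, so every early `return false` releases the server-side session.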

Comment on lines +76 to +121
void map_pdlp_settings_to_proto(const pdlp_solver_settings_t<i_t, f_t>& settings,
cuopt::remote::PDLPSolverSettings* pb_settings)
{
// Termination tolerances (all names match cuOpt API)
pb_settings->set_absolute_gap_tolerance(settings.tolerances.absolute_gap_tolerance);
pb_settings->set_relative_gap_tolerance(settings.tolerances.relative_gap_tolerance);
pb_settings->set_primal_infeasible_tolerance(settings.tolerances.primal_infeasible_tolerance);
pb_settings->set_dual_infeasible_tolerance(settings.tolerances.dual_infeasible_tolerance);
pb_settings->set_absolute_dual_tolerance(settings.tolerances.absolute_dual_tolerance);
pb_settings->set_relative_dual_tolerance(settings.tolerances.relative_dual_tolerance);
pb_settings->set_absolute_primal_tolerance(settings.tolerances.absolute_primal_tolerance);
pb_settings->set_relative_primal_tolerance(settings.tolerances.relative_primal_tolerance);

// Limits
pb_settings->set_time_limit(settings.time_limit);
// Avoid emitting a huge number when the iteration limit is the library default.
// Use -1 sentinel for "unset/use server defaults".
if (settings.iteration_limit == std::numeric_limits<i_t>::max()) {
pb_settings->set_iteration_limit(-1);
} else {
pb_settings->set_iteration_limit(static_cast<int64_t>(settings.iteration_limit));
}

// Solver configuration
pb_settings->set_log_to_console(settings.log_to_console);
pb_settings->set_detect_infeasibility(settings.detect_infeasibility);
pb_settings->set_strict_infeasibility(settings.strict_infeasibility);
pb_settings->set_pdlp_solver_mode(to_proto_pdlp_mode(settings.pdlp_solver_mode));
pb_settings->set_method(to_proto_method(settings.method));
pb_settings->set_presolver(static_cast<int32_t>(settings.presolver));
pb_settings->set_dual_postsolve(settings.dual_postsolve);
pb_settings->set_crossover(settings.crossover);
pb_settings->set_num_gpus(settings.num_gpus);

pb_settings->set_per_constraint_residual(settings.per_constraint_residual);
pb_settings->set_cudss_deterministic(settings.cudss_deterministic);
pb_settings->set_folding(settings.folding);
pb_settings->set_augmented(settings.augmented);
pb_settings->set_dualize(settings.dualize);
pb_settings->set_ordering(settings.ordering);
pb_settings->set_barrier_dual_initial_point(settings.barrier_dual_initial_point);
pb_settings->set_eliminate_dense_columns(settings.eliminate_dense_columns);
pb_settings->set_pdlp_precision(static_cast<int32_t>(settings.pdlp_precision));
pb_settings->set_save_best_primal_so_far(settings.save_best_primal_so_far);
pb_settings->set_first_primal_feasible(settings.first_primal_feasible);
}

⚠️ Potential issue | 🟠 Major

Don't drop warm_start_data in the PDLP settings mapper.

PDLPSolverSettings exposes warm_start_data, but neither conversion copies it. Any remote solve that goes through these helpers silently loses PDLP warm-start state, so iterative workflows restart cold even when prior state is provided.

Also applies to: 123-181

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 76 - 121, The PDLP
warm-start blob isn't being transferred: update map_pdlp_settings_to_proto to
copy settings.warm_start_data into the protobuf (use
pb_settings->set_warm_start_data(settings.warm_start_data) or
pb_settings->mutable_warm_start_data()->CopyFrom(...) depending on the field
type) and make the symmetric change in the reverse mapper (the function that
converts cuopt::remote::PDLPSolverSettings back to pdlp_solver_settings_t, e.g.
map_proto_to_pdlp_settings or similar) to populate settings.warm_start_data from
the proto, ensuring warm-start state is preserved across remote calls.

Comment on lines +38 to +45
inline bool parse_incumbent_proto(const uint8_t* data,
size_t size,
std::string& job_id,
double& objective,
std::vector<double>& assignment)
{
cuopt::remote::Incumbent incumbent_msg;
if (!incumbent_msg.ParseFromArray(data, static_cast<int>(size))) { return false; }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
# Check if file exists and get actual contents
if [ -f "cpp/src/grpc/server/grpc_incumbent_proto.hpp" ]; then
  cat -n "cpp/src/grpc/server/grpc_incumbent_proto.hpp"
else
  echo "File not found. Searching for similar files..."
  fd "grpc_incumbent_proto" --type f
fi
```

Repository: NVIDIA/cuopt

Length of output: 2393



Guard the protobuf size conversion and pointer validation before parsing.

ParseFromArray takes an int, but size is blindly cast from size_t without bounds checking. Additionally, there is no validation that data is non-null when size is nonzero. A malformed or oversized payload can overflow the cast and cause undefined behavior during protobuf deserialization. This is inconsistent with the sender side (build_incumbent_proto line 32), which correctly validates the size before serializing.

Suggested fix:

```diff
 inline bool parse_incumbent_proto(const uint8_t* data,
                                   size_t size,
                                   std::string& job_id,
                                   double& objective,
                                   std::vector<double>& assignment)
 {
+  if ((data == nullptr && size != 0) ||
+      size > static_cast<size_t>(std::numeric_limits<int>::max())) {
+    return false;
+  }
   cuopt::remote::Incumbent incumbent_msg;
   if (!incumbent_msg.ParseFromArray(data, static_cast<int>(size))) { return false; }
```

Apply the same defensive validation pattern used in build_incumbent_proto (line 32) to prevent integer overflow and null pointer dereference in untrusted IPC paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 38 - 45, The
parse_incumbent_proto function must validate the incoming buffer before calling
Incumbent::ParseFromArray: ensure that size does not exceed
std::numeric_limits<int>::max() (to avoid int overflow when casting) and that if
size > 0 then data is not nullptr; if either check fails return false. After
these guards, perform the safe static_cast<int>(size) and call
incumbent_msg.ParseFromArray(data, static_cast<int>(size)). Mirror the defensive
checks used in build_incumbent_proto and keep the function signature and return
behavior unchanged.
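The pre-parse guard can be factored into a small helper, independent of protobuf. This is a sketch of the validation logic only; `buffer_ok_for_parse` is a hypothetical name, not an existing cuOpt function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Reject null data with a nonzero size, and sizes that would overflow
// the int parameter that ParseFromArray takes.
inline bool buffer_ok_for_parse(const uint8_t* data, size_t size)
{
  if (data == nullptr && size != 0) { return false; }
  if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return false; }
  return true;
}
```

Calling this before the `static_cast<int>(size)` makes the cast provably safe and mirrors the sender-side check in `build_incumbent_proto`.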

Comment on lines +795 to +819
```cpp
    // Check whether the job has finished. If so, drain any final
    // bytes the solver may have flushed after our last read, then
    // send the job_complete sentinel and close the stream.
    std::string msg;
    JobStatus s = check_job_status(job_id, msg);
    if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
      std::streampos before2 = in.tellg();
      if (before2 >= 0) { current_offset = static_cast<int64_t>(before2); }
      if (std::getline(in, line)) {
        std::streampos after2 = in.tellg();
        int64_t next_offset2 = current_offset + static_cast<int64_t>(line.size());
        if (after2 >= 0) { next_offset2 = static_cast<int64_t>(after2); }
        cuopt::remote::LogMessage m;
        m.set_line(line);
        m.set_byte_offset(next_offset2);
        m.set_job_complete(false);
        writer->Write(m);
      }

      cuopt::remote::LogMessage done;
      done.set_line("");
      done.set_byte_offset(current_offset);
      done.set_job_complete(true);
      writer->Write(done);
      return Status::OK;
```
⚠️ Potential issue | 🟠 Major

Drain the full log tail before sending job_complete.

The terminal path reads at most one more line and then emits the sentinel. If the worker flushes multiple lines at exit, every line after the first is lost, and the final byte_offset can also lag behind the last emitted line. Keep reading until EOF and advance the offset from the last line actually written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 795 - 819, The
terminal path currently reads at most one more line after check_job_status and
then sends the job_complete sentinel, which can drop any additional flushed
lines and leave byte_offset behind; change the logic in the completion branch
(around check_job_status, the use of in, line, current_offset, next_offset2,
cuopt::remote::LogMessage and writer->Write) to drain the entire remaining
stream by repeatedly calling std::getline(in, line) in a loop, for each
iteration compute the correct byte offset from the stream position (use tellg()
before/after or after if available) and emit a LogMessage with m.set_line(line),
m.set_byte_offset(next_offset), m.set_job_complete(false) via writer->Write(m),
update current_offset to the last emitted byte offset, and after the loop emit
the final done message with m.set_job_complete(true) and byte_offset set to that
last current_offset.
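The drain loop can be sketched in isolation. `std::istringstream` stands in for the log `ifstream` and a vector stands in for `writer->Write()`; the function name and signature are illustrative, not the real server code.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Keep reading until EOF instead of taking at most one line, and track the
// offset of the last line actually emitted so the job_complete sentinel
// carries an accurate byte_offset.
inline int64_t drain_remaining_lines(std::istream& in,
                                     int64_t current_offset,
                                     std::vector<std::string>& emitted)
{
  std::string line;
  while (std::getline(in, line)) {
    std::streampos after = in.tellg();
    int64_t next_offset = current_offset + static_cast<int64_t>(line.size());
    if (after >= 0) { next_offset = static_cast<int64_t>(after); }
    emitted.push_back(line);  // in the server this is a LogMessage write
    current_offset = next_offset;
  }
  return current_offset;  // offset to stamp on the job_complete sentinel
}
```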

Comment on lines +103 to +138
```cpp
pid_t spawn_worker(int worker_id, bool is_replacement)
{
  std::lock_guard<std::mutex> lock(worker_pipes_mutex);

  if (is_replacement) { close_worker_pipes_server(worker_id); }

  if (!create_worker_pipes(worker_id)) {
    SERVER_LOG_ERROR("[Server] Failed to create pipes for %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    return -1;
  }

  pid_t pid = fork();
  if (pid < 0) {
    SERVER_LOG_ERROR("[Server] Failed to fork %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    close_all_worker_pipes(worker_pipes[worker_id]);
    return -1;
  } else if (pid == 0) {
    // Child: close all fds belonging to other workers.
    for (int j = 0; j < static_cast<int>(worker_pipes.size()); ++j) {
      if (j != worker_id) { close_all_worker_pipes(worker_pipes[j]); }
    }
    // Close the server-side ends of this worker's pipes (child uses the other ends).
    close_and_reset(worker_pipes[worker_id].to_worker_fd);
    close_and_reset(worker_pipes[worker_id].from_worker_fd);
    close_and_reset(worker_pipes[worker_id].incumbent_from_worker_fd);
    worker_process(worker_id);
    _exit(0);
  }

  close_worker_pipes_child_ends(worker_id);
  return pid;
}
```
⚠️ Potential issue | 🔴 Critical

Do not respawn workers with fork() after the server is multithreaded.

The initial startup fork is fine, but spawn_single_worker() routes the worker-monitor thread back through this same path after gRPC and the background threads are already live. Forking a multithreaded process and continuing in the child without exec() is unsafe; inherited library and mutex state can deadlock or corrupt the replacement worker immediately.

Based on learnings, "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 159-161

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 103 - 138,
spawn_worker currently forks a child to create replacement workers which is
unsafe in a multithreaded server; change the replacement-worker path so you do
not continue execution in a forked child of a multithreaded process — instead
spawn a fresh process via fork+exec (or posix_spawn) and exec the same server
binary with an argument to run a single worker, preserving the existing pipe
setup/FDs for the new process, or use posix_spawn to avoid inheriting
thread/mutex state; ensure spawn_single_worker and the is_replacement=true
branch in spawn_worker call this exec-based spawn path, and maintain the same
close_all_worker_pipes/close_worker_pipes_child_ends semantics for FD hygiene.

Comment on lines +399 to +477
```cpp
static void publish_result(const SolveResult& sr, const std::string& job_id, int worker_id)
{
  int64_t result_total_bytes = 0;
  if (sr.success) {
    for (const auto& [fid, data] : sr.arrays) {
      result_total_bytes += data.size();
    }
  }

  // Same CAS protocol as store_simple_result (see comment there).
  int result_slot = -1;
  for (size_t i = 0; i < MAX_RESULTS; ++i) {
    if (result_queue[i].ready.load(std::memory_order_acquire)) continue;
    bool expected = false;
    if (!result_queue[i].claimed.compare_exchange_strong(
          expected, true, std::memory_order_acq_rel)) {
      continue;
    }
    if (result_queue[i].ready.load(std::memory_order_acquire)) {
      result_queue[i].claimed.store(false, std::memory_order_release);
      continue;
    }
    result_slot = static_cast<int>(i);
    ResultQueueEntry& result = result_queue[i];
    copy_cstr(result.job_id, job_id);
    result.status = sr.success ? RESULT_SUCCESS : RESULT_ERROR;
    result.data_size = sr.success ? std::max<uint64_t>(result_total_bytes, 1) : 0;
    result.worker_index.store(worker_id, std::memory_order_relaxed);
    if (!sr.success) { copy_cstr(result.error_message, sr.error_message); }
    result.retrieved.store(false, std::memory_order_relaxed);
    result.ready.store(true, std::memory_order_release);
    result.claimed.store(false, std::memory_order_release);
    if (config.verbose) {
      SERVER_LOG_DEBUG(
        "[Worker %d] Enqueued result metadata for job %s in result_slot=%d status=%d data_size=%lu",
        worker_id,
        job_id.c_str(),
        result_slot,
        static_cast<int>(result.status),
        result.data_size);
    }
    break;
  }

  // Stream the full solution payload through the worker's result pipe.
  // The server thread reads the other end when the client calls
  // GetResult / DownloadChunk.
  if (sr.success && result_slot >= 0) {
    int write_fd = worker_pipes[worker_id].worker_write_fd;
    if (config.verbose) {
      SERVER_LOG_DEBUG("[Worker %d] Streaming result (%zu arrays, %ld bytes) to pipe for job %s",
                       worker_id,
                       sr.arrays.size(),
                       result_total_bytes,
                       job_id.c_str());
    }
    auto pipe_result_t0 = std::chrono::steady_clock::now();
    bool write_success = write_result_to_pipe(write_fd, sr.header, sr.arrays);
    if (write_success && config.verbose) {
      log_pipe_throughput("pipe_result_send", result_total_bytes, pipe_result_t0);
    }
    if (!write_success) {
      SERVER_LOG_ERROR("[Worker %d] Failed to write result to pipe", worker_id);
      result_queue[result_slot].status = RESULT_ERROR;
      copy_cstr(result_queue[result_slot].error_message, "Failed to write result to pipe");
    } else if (config.verbose) {
      SERVER_LOG_DEBUG(
        "[Worker %d] Finished writing result payload for job %s", worker_id, job_id.c_str());
    }
  } else if (config.verbose) {
    SERVER_LOG_DEBUG(
      "[Worker %d] No result payload write needed for job %s (success=%d, result_slot=%d, "
      "payload_bytes=%ld)",
      worker_id,
      job_id.c_str(),
      static_cast<int>(sr.success),
      result_slot,
      result_total_bytes);
  }
```
⚠️ Potential issue | 🟠 Major

Don’t silently lose completed jobs when the result queue is full.

If the CAS loop never finds a free result_queue slot, result_slot stays -1 and this function just falls through without publishing anything. worker_process() then immediately resets the job slot on Lines 534-535, so the solve can disappear without any terminal result ever reaching the server thread.

Also applies to: 534-535

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 399 - 477, publish_result
can silently drop a completed job when no result_queue slot is free; change
publish_result to return a bool (e.g., enqueued) and ensure it does not silently
succeed when result_slot stays -1, and update worker_process to check that
return value before clearing the job slot: if publish_result returns false,
either retry/enqueue with a bounded backoff or propagate an explicit
RESULT_ERROR into the result handling path so the server thread will see a
terminal result; refer to publish_result, result_slot, result_queue, and
worker_process (the code that resets the job slot) when making these changes.
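The suggested change can be sketched with a simplified slot-claim loop that reports whether a slot was actually published, so the caller knows not to reset the job slot when the queue is full. `Slot` and `try_publish` are simplified stand-ins for `ResultQueueEntry` and `publish_result`, not the real cuOpt types.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Simplified result-queue entry: same claimed/ready CAS protocol as the
// review excerpt, minus the metadata fields.
struct Slot {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  int payload = 0;
};

constexpr size_t kMaxResults = 4;

// Returns true only when a slot was claimed and published; false means the
// queue is full and the caller must retry or surface an explicit error.
inline bool try_publish(std::array<Slot, kMaxResults>& q, int payload)
{
  for (auto& slot : q) {
    if (slot.ready.load(std::memory_order_acquire)) continue;
    bool expected = false;
    if (!slot.claimed.compare_exchange_strong(expected, true, std::memory_order_acq_rel)) {
      continue;
    }
    if (slot.ready.load(std::memory_order_acquire)) {  // lost a race after claiming
      slot.claimed.store(false, std::memory_order_release);
      continue;
    }
    slot.payload = payload;
    slot.ready.store(true, std::memory_order_release);
    slot.claimed.store(false, std::memory_order_release);
    return true;  // published
  }
  return false;  // queue full: do NOT clear the job slot yet
}
```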

Comment on lines +428 to +463
```cpp
  port_ = 18500;
  const char* env_base = std::getenv("CUOPT_TEST_PORT_BASE");
  if (env_base) { port_ = std::atoi(env_base) + 500; }

  server_pid_ = fork();
  if (server_pid_ < 0) {
    skip_reason_ = "fork() failed";
    return;
  }

  if (server_pid_ == 0) {
    std::string port_str = std::to_string(port_);
    std::string log_file = "/tmp/cuopt_c_api_test_server_" + port_str + ".log";
    int fd = open(log_file.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
      dup2(fd, STDOUT_FILENO);
      dup2(fd, STDERR_FILENO);
      close(fd);
    }
    execl(server_path_.c_str(),
          server_path_.c_str(),
          "--port",
          port_str.c_str(),
          "--workers",
          "1",
          nullptr);
    _exit(127);
  }

  if (!tcp_connect_check(port_, 15000)) {
    skip_reason_ = "cuopt_grpc_server failed to start within 15 seconds";
    kill(server_pid_, SIGKILL);
    waitpid(server_pid_, nullptr, 0);
    server_pid_ = -1;
    return;
  }
```
⚠️ Potential issue | 🟠 Major

Pick a free port instead of probing a fixed one.

This fixture treats “something is listening on 127.0.0.1:port_” as proof that the child server started. If 18500 (or CUOPT_TEST_PORT_BASE + 500) is already occupied, tcp_connect_check() succeeds immediately, the child can still die on bind, and the tests run against the wrong service. Reserve the port before fork() or have the child report its bound port back to the parent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 -
463, Reserve a free loopback port before forking by creating a temporary socket,
binding it to 127.0.0.1:0, calling getsockname to set port_, and ensuring the
socket is not closed-on-exec so the child can inherit the reservation; then
fork() as before, pass the port_str to execl via the "--port" arg, and in the
parent keep the reservation socket open until tcp_connect_check(port_, 15000)
succeeds (close it afterward); update code around port_, fork(),
tcp_connect_check(), server_path_, execl and server_pid_ to implement this
reservation/hand-off sequence.
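The kernel-assigned-port half of the suggestion can be sketched directly: bind to `127.0.0.1:0` so the kernel picks a free port, then read it back with `getsockname`. The helper name is illustrative; in the fixture, the caller would keep the fd open across `fork()`/`execl` so nothing else grabs the port in the meantime.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Reserve a free loopback port. On success, stores the open socket fd in
// *out_fd (caller closes it once the server is confirmed up) and returns
// the port number; returns -1 on failure.
inline int reserve_loopback_port(int* out_fd)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port        = 0;  // let the kernel choose
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    close(fd);
    return -1;
  }
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) {
    close(fd);
    return -1;
  }
  *out_fd = fd;
  return ntohs(addr.sin_port);
}
```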

Comment on lines +52 to +57
| Segment | Purpose |
|---------|---------|
| `/cuopt_job_queue` | Job metadata (ID, type, size, status) |
| `/cuopt_result_queue` | Result metadata (ID, status, size, error) |
| `/cuopt_control` | Server control flags (shutdown, worker count) |

⚠️ Potential issue | 🟠 Major

Namespace the shared-memory segments per server instance.

/cuopt_job_queue, /cuopt_result_queue, and /cuopt_control are host-global POSIX shm names. A second cuopt_grpc_server on the same machine will either fail to start or attach to the first server's control blocks, so these need a per-instance suffix (port/PID/user) or an explicit single-instance-per-host restriction.

Also applies to: 307-314

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_SERVER_ARCHITECTURE.md` around lines 52 - 57, The POSIX shared-memory
names `/cuopt_job_queue`, `/cuopt_result_queue`, and `/cuopt_control` are
host-global and must be namespaced per server instance; update all references
(including the other occurrence around lines mentioning 307-314) to generate
instance-specific names (for example by appending PID, port, or a UUID) instead
of the fixed literals so multiple `cuopt_grpc_server` instances can coexist—add
a small helper like generateInstanceShmName(instanceId, baseName) and replace
uses of the three fixed symbols with calls that pass a unique instance
identifier derived from the server startup parameters (PID/port/ENV/UUID).
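The proposed helper is a one-liner; this sketch uses the PID as the instance id (`generate_instance_shm_name` is the review's suggested name, not an existing API). POSIX shm names must begin with `/` and contain no further slashes.

```cpp
#include <cassert>
#include <string>
#include <unistd.h>

// Build a per-instance POSIX shared-memory name, e.g.
// generate_instance_shm_name("cuopt_job_queue", getpid()) -> "/cuopt_job_queue_12345"
inline std::string generate_instance_shm_name(const std::string& base, int instance_id)
{
  return "/" + base + "_" + std::to_string(instance_id);
}
```

The same id (PID, port, or UUID) must be used for all three segments and passed to the workers, since they attach by name.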

@rgsl888prabhu rgsl888prabhu merged commit f1225ef into NVIDIA:main Mar 31, 2026
103 of 106 checks passed

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change
