Main merge release/26.04 1#1005
Conversation
This change replaces the `solve_lp_remote` and `solve_mip_remote` stubs with real routines that use an embedded gRPC client to communicate with a remote cuOpt server. Two documents, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, are included as developer guides. The server is built by build.sh and is called cuopt_grpc_server. Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.

To try this feature, check out the branch/PR and do the following:

```shell
$ build.sh
$ cuopt_grpc_server   # defaults are fine, run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps
```

All cuOpt APIs will pick up remote execution if the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server when the env vars are set and the server is running. Just use cuOpt tools and APIs as you normally would for a local solve.

Authors:
- Trevor McKay (https://github.com/tmckayus)

Approvers:
- Rajesh Gandham (https://github.com/rg20)
- Ishika Roy (https://github.com/Iroy30)
- Bradley Dice (https://github.com/bdice)

URL: NVIDIA#939
…VIDIA#992)

## Issue

Authors:
- Alice Boucher (https://github.com/aliceb-nv)

Approvers:
- Akif ÇÖRDÜK (https://github.com/akifcorduk)

URL: NVIDIA#992
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration — Configuration used: Path: .coderabbit.yaml, Review profile: CHILL, Plan: Pro

📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Adds a full gRPC remote-execution system: protobuf service/messages, client library, multi-process gRPC server with shared-memory and pipe IPC, chunked array transfer protocol, TLS/mTLS options, build/test integration, and extensive mapping/serialization helpers for LP/MIP solves.

Changes
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/mip_heuristics/solve.cu (1)
442-453: ⚠️ Potential issue | 🟠 Major

Keep remote MIP routing backend-agnostic.
This turns remote execution into a CPU-only path. If a caller holds a GPU-backed `optimization_problem_t` and `CUOPT_REMOTE_*` is set, the solve now hard-fails here instead of being serialized and sent to the remote worker. Based on learnings, in cpp/src/linear_programming/solve.cu, remote vs. local solve routing intentionally checks only `is_remote_execution_enabled()` without checking backend type so GPU-resident problems can still execute remotely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/solve.cu` around lines 442 - 453, The remote-execution branch currently forces a CPU-only cast (dynamic_cast to cpu_optimization_problem_t) which makes remote solves fail for GPU-backed problems; remove the cpu-specific cast in the is_remote_execution_enabled() branch and call solve_mip_remote using the generic problem_interface (or an overload that accepts the base optimization_problem_t<i_t,f_t>) so remote routing stays backend-agnostic; keep the CPU-only cast only for the local path where solve_mip(*cpu_prob, settings) is needed and ensure solve_mip_remote's signature or overload supports serializing/handling GPU-resident problems from the base type rather than requiring cpu_optimization_problem_t.

python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)
417-421: ⚠️ Potential issue | 🟡 Minor

Fail this test when no warm-start payload is produced.
If `get_pdlp_warm_start_data()` returns `None`, the test exits successfully without exercising the round-trip at all. That hides regressions in remote warm-start serialization.

Suggested change
```diff
 sol1 = linear_programming.Solve(dm, settings)
 ws = sol1.get_pdlp_warm_start_data()
+assert ws is not None, "expected PDLP warm-start data from the first solve"
-if ws is not None:
-    settings.set_pdlp_warm_start_data(ws)
-    settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
-    sol2 = linear_programming.Solve(dm, settings)
-    assert sol2.get_primal_solution() is not None
+settings.set_pdlp_warm_start_data(ws)
+settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
+sol2 = linear_programming.Solve(dm, settings)
+assert sol2.get_primal_solution() is not None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 417 - 421, The test currently skips the warm-start round-trip when ws is None; update the test to fail if get_pdlp_warm_start_data() returns None so regressions are caught: after calling settings.get_pdlp_warm_start_data() (or wherever ws is obtained), assert ws is not None (or raise an AssertionError) with a clear message before calling settings.set_pdlp_warm_start_data(ws) and Solve(dm, settings); keep the existing calls to set_pdlp_warm_start_data, set_parameter(CUOPT_ITERATION_LIMIT, 200), and linear_programming.Solve to exercise the full round-trip and then assert sol2.get_primal_solution() is not None.
🟡 Minor comments (5)
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)
76-86: ⚠️ Potential issue | 🟡 Minor

Reset the server log marker when the file can't be opened.
`clear()` leaves `server_log_start_pos_` untouched on open failure. If the log file was rotated/recreated between tests, later reads seek to a stale offset and miss current-test output.

Suggested fix
```diff
 void clear() {
   std::lock_guard<std::mutex> lock(mutex_);
   client_logs_.clear();
   if (!server_log_path_.empty()) {
     std::ifstream file(server_log_path_, std::ios::ate);
-    if (file.is_open()) { server_log_start_pos_ = file.tellg(); }
+    if (file.is_open()) {
+      server_log_start_pos_ = file.tellg();
+    } else {
+      server_log_start_pos_ = 0;
+    }
+  } else {
+    server_log_start_pos_ = 0;
   }
   test_start_marked_ = true;
 }
```

Based on learnings, `**/*test*.{cpp,cu,py}` should ensure test isolation so state from previous tests cannot leak into the current one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 76 - 86, The clear() method leaves server_log_start_pos_ unchanged when opening server_log_path_ fails, which can leave a stale offset; update clear() so that if server_log_path_ is non-empty but the file fails to open you reset server_log_start_pos_ to the beginning (e.g., streampos(0) or 0) to avoid seeking to a stale offset on later reads; locate the logic in clear(), reference server_log_path_, server_log_start_pos_, and test_start_marked_, and ensure test_start_marked_ behavior remains the same after the reset.

GRPC_QUICK_START.md (1)
3-8: ⚠️ Potential issue | 🟡 Minor

This guide promises the C API, but the example here is C++.
The sample includes C++ headers and calls `solve_lp`, so C users still do not get a documented path through `cuopt_c.h`. Either rename these sections to "C++ API" or add a real C example so the supported API surface stays clear. Based on learnings, the officially supported public APIs for cuOpt are C (via `cuopt_c.h`) and Python.

Also applies to: 151-154, 222-234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@GRPC_QUICK_START.md` around lines 3 - 8, The doc promises a C API but the sample uses C++ headers and calls solve_lp; update the guide so the API surface is accurate by either renaming the example sections to "C++ API" or replacing the sample with a real C example that includes cuopt_c.h and demonstrates the equivalent C entrypoint(s) (e.g., the cuopt C solve function(s) corresponding to solve_lp), and ensure every other section that currently claims to show C but uses C++ is corrected likewise.

GRPC_INTERFACE.md (1)
189-193: ⚠️ Potential issue | 🟡 Minor

Use the correct gRPC cancellation API here.
`IsCancelled()` is a server-side API only; `ClientContext` does not have this method. The documented client-side detection is incorrect. For client-initiated cancellation, the client calls `context->TryCancel()`. For detecting cancellation on the client side, check the returned `Status` object for `StatusCode::CANCELLED`.
Verify each finding against the current code and only fix it if needed. In `@GRPC_INTERFACE.md` around lines 189 - 193, The docs incorrectly state client-side detection uses context->IsCancelled(); update the Connection Handling section to use the correct gRPC APIs: note that IsCancelled() is server-side only (used on ServerContext), clients should call ClientContext::TryCancel() to initiate cancellation, and clients detect cancellation by checking the returned grpc::Status for status.code() == grpc::StatusCode::CANCELLED (or StatusCode::CANCELLED) rather than calling IsCancelled() on ClientContext. Mention the correct symbols: ClientContext::TryCancel(), ServerContext::IsCancelled(), and grpc::Status / grpc::StatusCode::CANCELLED.

cpp/src/grpc/client/grpc_client.hpp (1)
76-78: ⚠️ Potential issue | 🟡 Minor

Fix the clamping bounds documentation to match server constraints.
The comment claims the value is clamped to [4 MiB, 2 GiB - 1 MiB], but the actual server constraints defined in `grpc_server_types.hpp` are [4 KiB, 2 GiB - 1 MiB]. Update the comment on line 76 to reflect the correct minimum of 4 KiB, not 4 MiB.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.hpp` around lines 76 - 78, Update the comment for the max_message_bytes field in grpc_client.hpp to correctly state the clamping bounds used by the server: change the documented minimum from 4 MiB to 4 KiB so the comment reads that values are clamped to [4 KiB, 2 GiB - 1 MiB] (consistent with grpc_server_types.hpp); keep the existing explanatory text about protobuf limits and retain the default value and variable name max_message_bytes.

cpp/src/grpc/server/grpc_server_types.hpp (1)
257-259: ⚠️ Potential issue | 🟡 Minor

Hardcoded shared memory names may cause conflicts with multiple server instances.
The SHM region names are hardcoded constants. If multiple cuopt_grpc_server instances run on the same host (e.g., different ports for different GPUs), they would collide on these shared memory regions.
💡 Consider making SHM names configurable or unique per instance
One approach is to include the port or a unique identifier in the SHM names:
```cpp
// In ServerConfig:
std::string shm_suffix;  // e.g., "_8765" for port 8765

// At startup, construct names like:
// /cuopt_job_queue_8765
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 257 - 259, SHM names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL are hardcoded and will collide across multiple server instances; make them instance-specific by replacing these global const char* symbols with configurable strings generated at startup (e.g., in ServerConfig or GrpcServer) that append a unique suffix such as the server port or UUID, update all code that references SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to use the new per-instance string members or accessors (e.g., serverConfig.getJobQueueName()), and ensure the constructed names keep the leading '/' and are used consistently when creating/opening shared memory regions.
🧹 Nitpick comments (6)
dependencies.yaml (1)
305-310: Consider adding version constraints for gRPC stack dependencies.

The new gRPC dependencies (`openssl`, `c-ares`, `libgrpc`, `libprotobuf`, `libabseil`, `re2`) are added without version constraints. For build reproducibility and to avoid potential ABI incompatibilities between these tightly coupled packages, consider pinning compatible versions (at minimum `libgrpc` and `libprotobuf`, which have known version coupling requirements).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dependencies.yaml` around lines 305 - 310, The added gRPC-related dependencies (openssl, c-ares, libgrpc, libprotobuf, libabseil, re2) lack version constraints; update dependencies.yaml to pin compatible versions (at minimum for libgrpc and libprotobuf) to ensure ABI compatibility and reproducible builds, selecting matching libgrpc/libprotobuf pairs and consistent versions for libabseil, c-ares, re2 and openssl, and express them using the repository's existing versioning format so the package resolver installs the intended releases.

ci/build_wheel_libcuopt.sh (1)
20-28: Consider adding fallback handling if neither package manager is available.

The `libuuid` installation logic silently does nothing if neither `dnf` nor `apt-get` is available. While the build will likely fail later with a clearer error from CMake, an explicit warning or error here would provide faster feedback. The same pattern is used for existing installations (e.g., update_rockylinux_repo.sh), so this may be intentional for consistency.

Optional: Add fallback warning
```diff
 # Install libuuid (needed by cuopt_grpc_server)
 if command -v dnf &> /dev/null; then
   dnf install -y libuuid-devel
 elif command -v apt-get &> /dev/null; then
   apt-get update && apt-get install -y uuid-dev
+else
+  echo "Warning: Neither dnf nor apt-get found, skipping libuuid installation"
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/build_wheel_libcuopt.sh` around lines 20 - 28, The package-manager check in the if/elif block that uses "command -v dnf" and "command -v apt-get" can silently skip installing libuuid if neither tool exists; add an else branch after that if/elif to print a clear error or warning (e.g., echo "error: no supported package manager found to install libuuid") and exit non-zero (or at least echo a prominent warning) so the failure is explicit; update the block that installs libuuid (the conditional using dnf/apt-get) to include this else handler and keep the subsequent call to bash ci/utils/install_protobuf_grpc.sh unchanged.

python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)
470-474: Prefer per-test server fixtures, or add an explicit reset between cases.

These class-scoped fixtures reuse one remote server across multiple tests, so any leaked job, session, or worker-side state can bleed into later cases and make them order-dependent. If function scope is too expensive, add an explicit "server is empty" cleanup/assertion before each test instead.
As per coding guidelines, "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."
Also applies to: 516-520, 713-748, 771-814
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 470 - 474, The class-scoped fixture cpu_only_env_with_server (which calls _start_grpc_server_fixture and _stop_grpc_server) can leak server-side state between tests; either change its pytest.fixture scope from "class" to "function" so each test starts/stops a fresh server, or keep class scope but add an explicit reset/assertion before each test (e.g., call a helper that verifies the server has no active jobs/sessions or restarts the server via _stop_grpc_server and _start_grpc_server_fixture) to guarantee isolation for cpu_only_env_with_server and the other similar fixtures referenced in the file.

cpp/src/grpc/server/grpc_server_types.hpp (1)
83-94: Multi-atomic JobQueueEntry relies on implicit sequencing.

The entry has multiple `std::atomic` fields that are accessed independently. The correctness relies on setting `ready` last when publishing and checking `ready` first when consuming. This works with the default `memory_order_seq_cst`, but the invariants could be documented for maintainability.

Consider adding a brief comment documenting the publish/consume protocol:
```cpp
// Protocol: writer sets all other fields, then sets ready=true (release).
// Reader checks ready==true (acquire), then reads other fields.
std::atomic<bool> ready;  // Job is ready to be processed
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 83 - 94, The JobQueueEntry uses multiple std::atomic fields with an implicit publish/consume ordering via ready; add a brief protocol comment above the ready declaration (e.g., "writer sets other fields then sets ready=true (release); reader checks ready==true (acquire) then reads other fields") and update the code paths that publish/consume jobs to use explicit memory orders: use std::memory_order_release when the producer stores ready=true, use std::memory_order_acquire when the consumer loads ready, and keep other field accesses as std::memory_order_relaxed where appropriate (referencing JobQueueEntry and its members ready, claimed, worker_pid, cancelled, worker_index, data_sent, is_chunked to locate sites to change).cpp/src/grpc/grpc_solution_mapper.cpp (2)
333-340: Consider reserving capacity to avoid potential reallocation.

The `doubles_to_bytes` helper creates an intermediate `std::vector<double> tmp` by copying from input. For large solution vectors, this causes an extra allocation and copy. Since the output size is known upfront, this could be optimized.

♻️ Optional optimization to avoid intermediate copy when f_t is double
```diff
 template <typename f_t>
 std::vector<uint8_t> doubles_to_bytes(const std::vector<f_t>& vec)
 {
-  std::vector<double> tmp(vec.begin(), vec.end());
-  std::vector<uint8_t> bytes(tmp.size() * sizeof(double));
-  std::memcpy(bytes.data(), tmp.data(), bytes.size());
-  return bytes;
+  std::vector<uint8_t> bytes(vec.size() * sizeof(double));
+  if constexpr (std::is_same_v<f_t, double>) {
+    std::memcpy(bytes.data(), vec.data(), bytes.size());
+  } else {
+    std::vector<double> tmp(vec.begin(), vec.end());
+    std::memcpy(bytes.data(), tmp.data(), bytes.size());
+  }
+  return bytes;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 333 - 340, The helper doubles_to_bytes currently copies into a temporary std::vector<double> tmp which forces an extra allocation; fix by preallocating the output buffer and avoiding unnecessary reallocation and copy: when f_t != double, construct tmp with vec.size() and use tmp.assign(vec.begin(), vec.end()) or reserve tmp.capacity() before copying, then allocate bytes with bytes.resize(tmp.size()*sizeof(double)) and memcpy from tmp.data(); additionally provide an overload/specialization for doubles_to_bytes<double> (or a branch when std::is_same<f_t,double>::value) that bypasses tmp entirely by directly resizing bytes to vec.size()*sizeof(double) and memcpy from vec.data().
518-527: Silent failure on size validation returns empty vector.

When `raw.size() % sizeof(double) != 0`, the function returns an empty vector with no error indication. This could mask data corruption issues during transfer. Consider whether logging or an exception would be more appropriate for debugging production issues.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/grpc_solution_mapper.cpp` around lines 518 - 527, The current branch inside the templated reader (the code that looks up arrays by field_id and handles the T==float case) silently returns an empty vector when raw.size() % sizeof(double) != 0, which hides transfer/corruption problems; replace that silent return with explicit error handling: either log a descriptive error including field_id, raw.size() and expected element size using the project's logger (or tracing facility) or throw a std::runtime_error (or a domain-specific exception) with the same diagnostic text so callers can detect the failure; update the block referencing arrays, field_id, raw, and the T==float branch to perform this logging/throw before exiting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/utils/install_protobuf_grpc.sh`:
- Around line 129-141: The cleanup currently only removes a subset of possible
install locations; update the removal block that uses PREFIX to also delete
include/grpcpp, any libgrpc* and libprotobuf* shared objects and symlinks in
both "${PREFIX}/lib" and "${PREFIX}/lib64" (e.g. libgrpc*.so* , libprotobuf*.so*
, libprotoc*.so*), libabsl_* archives in "${PREFIX}/lib" as well as
"${PREFIX}/lib64", and any remaining grpc/protobuf-related pkgconfig or cmake
dirs; in short, broaden the rm globs to cover "${PREFIX}/include/grpcpp",
"${PREFIX}/lib/"lib*.so* and "${PREFIX}/lib64/"lib*.so*, add
"${PREFIX}/lib/"libabsl_*.a, and remove lingering pkgconfig/cmake directories so
no mixed-version headers or shared libs remain before rebuilding.
In `@cpp/src/grpc/client/grpc_client.cpp`:
- Around line 783-794: After submit_lp succeeds you must ensure the remote job
is cancelled/deleted on every error path: after
start_log_streaming/poll_for_completion/stop_log_streaming, if poll.completed is
false or poll timed out, call cancel_job(sub.job_id) (or delete_job if terminal)
before returning the error; similarly, if get_lp_result<i_t, f_t>(sub.job_id)
fails or result mapping fails, call delete_job(sub.job_id) before returning;
update the error-returning branches around submit_lp, poll_for_completion,
get_lp_result, and the symmetric block at the other location (the 810–820 block)
to always perform the appropriate cancel_job/delete_job cleanup using
sub.job_id.
- Around line 839-915: After StartChunkedUpload (and similarly
StartChunkedDownload) succeeds you must ensure the server session is always
released on any early return; add a scope guard (RAII or lambda-based) right
after StartChunkedUpload that captures upload_id and impl_, which on destruction
(unless dismissed) calls the cleanup RPC (FinishChunkedUpload or an
AbortChunkedUpload/cancel RPC if available) using impl_->stub and ignores/logs
errors; dismiss the guard only after the normal FinishChunkedUpload completes
and job_id_out is set. Apply the same pattern around
StartChunkedDownload/FinishChunkedDownload so any error paths from
SendArrayChunk, validation failures, or other early returns invoke the
best-effort finish/abort to avoid exhausting kMaxChunkedSessions.
In `@cpp/src/grpc/grpc_settings_mapper.cpp`:
- Around line 76-121: The PDLP warm-start blob isn't being transferred: update
map_pdlp_settings_to_proto to copy settings.warm_start_data into the protobuf
(use pb_settings->set_warm_start_data(settings.warm_start_data) or
pb_settings->mutable_warm_start_data()->CopyFrom(...) depending on the field
type) and make the symmetric change in the reverse mapper (the function that
converts cuopt::remote::PDLPSolverSettings back to pdlp_solver_settings_t, e.g.
map_proto_to_pdlp_settings or similar) to populate settings.warm_start_data from
the proto, ensuring warm-start state is preserved across remote calls.
In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 38-45: The parse_incumbent_proto function must validate the
incoming buffer before calling Incumbent::ParseFromArray: ensure that size does
not exceed std::numeric_limits<int>::max() (to avoid int overflow when casting)
and that if size > 0 then data is not nullptr; if either check fails return
false. After these guards, perform the safe static_cast<int>(size) and call
incumbent_msg.ParseFromArray(data, static_cast<int>(size)). Mirror the defensive
checks used in build_incumbent_proto and keep the function signature and return
behavior unchanged.
In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 13-27: The code in send_job_data_pipe captures the raw fd from
worker_pipes under worker_pipes_mutex but then releases the lock before writing,
allowing the FD to be closed/reused; to fix, under the same lock duplicate the
descriptor (use dup or dup3) and store the duplicated fd (e.g., dup_fd) then
release worker_pipes_mutex and perform write_to_pipe against dup_fd, closing
dup_fd afterwards; update references to worker_pipes, worker_pipes_mutex,
send_job_data_pipe, and write_to_pipe accordingly so the writes use the
duplicated FD and avoid races.
- Around line 219-329: In cancel_job, after scanning job_queue but before
overwriting it->second, re-read the tracked status under tracker_mutex (e.g.,
auto final_status = it->second.status) and if final_status is a terminal state
(COMPLETED, FAILED, or CANCELLED) return the appropriate code/message without
changing job_tracker or calling delete_log_file; only proceed to set status to
CANCELLED, set error_message, and notify waiters when final_status is
non-terminal. This change references cancel_job, job_tracker, job_queue, and
tracker_mutex to ensure the fallback path does not clobber a terminal state that
was published while the slot was recycled.
In `@cpp/src/grpc/server/grpc_pipe_io.cpp`:
- Around line 38-74: The read_from_pipe() logic uses a single initial poll()
with timeout_ms then switches to blocking ::read, which can hang if the sender
stalls mid-message; also it treats POLLHUP as an immediate error even when
POLLIN is set and data remains. Fix by making reads timeout-aware: either set fd
non-blocking and loop using poll() (or repeated poll() with remaining
timeout_ms) before each ::read, or after a partial ::read use
poll(&pfd,1,remaining_ms) to wait for more data; ensure you update and use a
remaining timeout counter; and change the revents check to allow POLLIN|POLLHUP
(i.e., only treat POLLHUP/POLLERR/POLLNVAL as fatal when POLLIN is not set) so
buffered data is read out before returning an error.
In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 290-301: shutdown_all currently only sets
shm_ctrl->shutdown_requested and waits, which leaves workers stuck inside
long-running routines like solve_lp() or solve_mip(); update shutdown_all to
perform a hard-stop for active workers by (1) setting keep_running = false and
shm_ctrl->shutdown_requested as now, (2) invoking the worker-level cancellation
path for any running workers (e.g., call the worker interrupt/terminate API or
set a per-worker cancel flag and signal the workers) so threads inside
solve_lp()/solve_mip() can break out immediately, and (3) after signalling
cancellations, proceed to join
result_thread/incumbent_thread/monitor_thread/reaper_thread, call
wait_for_workers(), and then cleanup_shared_memory(); refer to the shutdown_all
lambda, wait_for_workers(), and the long-running routines solve_lp()/solve_mip()
to add and use the cancellation mechanism.
- Around line 228-263: Validate that TLS is enabled before accepting any
TLS/mTLS-specific flags: add a guard that checks config.enable_tls and rejects
when false while any of config.tls_cert_path, config.tls_key_path,
config.tls_root_path are non-empty or config.require_client is true (use
cuopt_expects with error_type_t::ValidationError and a clear message). Place
this check before the existing TLS branch that constructs
grpc::SslServerCredentials (i.e., before the if (config.enable_tls) { ... }
block) so flags like --require-client-cert, --tls-cert, --tls-key, and
--tls-root cannot silently fall back to grpc::InsecureServerCredentials().
In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 401-409: The reaper currently uses the session's created timestamp
(it->second.created) causing active downloads to be expired mid-transfer; update
the session model to include a last_activity timestamp (e.g., last_activity) and
change the reaper to compare now - it->second.last_activity against
kSessionTimeoutSeconds, and in GetResultChunk (or any handler that serves/reads
a chunk from chunked_downloads) update that session's last_activity under
chunked_downloads_mutex each time the session is accessed so active clients
reset the timeout; keep using chunked_downloads and chunked_downloads_mutex and
remove reliance on created for expiration logic.
In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 795-819: The terminal path currently reads at most one more line
after check_job_status and then sends the job_complete sentinel, which can drop
any additional flushed lines and leave byte_offset behind; change the logic in
the completion branch (around check_job_status, the use of in, line,
current_offset, next_offset2, cuopt::remote::LogMessage and writer->Write) to
drain the entire remaining stream by repeatedly calling std::getline(in, line)
in a loop, for each iteration compute the correct byte offset from the stream
position (use tellg() before/after or after if available) and emit a LogMessage
with m.set_line(line), m.set_byte_offset(next_offset), m.set_job_complete(false)
via writer->Write(m), update current_offset to the last emitted byte offset, and
after the loop emit the final done message with m.set_job_complete(true) and
byte_offset set to that last current_offset.
- Around line 141-179: SendArrayChunk currently only updates meta.received_bytes
and pushes the raw chunk, letting FinishChunkedUpload queue jobs even if chunks
overlap or leave gaps; fix by enforcing exact coverage and no-overlap for each
field. In SendArrayChunk (use symbols state.field_meta, meta,
meta.received_bytes, meta.element_size, state.chunks, state.total_chunks, and
the incoming ac/raw/elem_offset) validate the incoming chunk against existing
ranges for that field (track per-field ranges in meta or compare against
already-stored chunks) and reject overlaps immediately; store the byte/element
range rather than only adding to received_bytes. In FinishChunkedUpload iterate
state.field_meta entries and ensure for every field
meta.total_elements*meta.element_size equals the sum of non-overlapping received
ranges (and that ranges fully cover [0,total_elements) with no gaps); if any
field is incomplete or has gaps/overlaps, return
INVALID_ARGUMENT/RESOURCE_EXHAUSTED and do not enqueue the job.
In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 103-138: spawn_worker currently forks a child to create
replacement workers which is unsafe in a multithreaded server; change the
replacement-worker path so you do not continue execution in a forked child of a
multithreaded process — instead spawn a fresh process via fork+exec (or
posix_spawn) and exec the same server binary with an argument to run a single
worker, preserving the existing pipe setup/FDs for the new process, or use
posix_spawn to avoid inheriting thread/mutex state; ensure spawn_single_worker
and the is_replacement=true branch in spawn_worker call this exec-based spawn
path, and maintain the same close_all_worker_pipes/close_worker_pipes_child_ends
semantics for FD hygiene.
In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 399-477: publish_result can silently drop a completed job when no
result_queue slot is free; change publish_result to return a bool (e.g.,
enqueued) and ensure it does not silently succeed when result_slot stays -1, and
update worker_process to check that return value before clearing the job slot:
if publish_result returns false, either retry/enqueue with a bounded backoff or
propagate an explicit RESULT_ERROR into the result handling path so the server
thread will see a terminal result; refer to publish_result, result_slot,
result_queue, and worker_process (the code that resets the job slot) when making
these changes.
In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp`:
- Around line 428-463: Reserve a free loopback port before forking by creating a
temporary socket, binding it to 127.0.0.1:0, calling getsockname to set port_,
and ensuring the socket is not closed-on-exec so the child can inherit the
reservation; then fork() as before, pass the port_str to execl via the "--port"
arg, and in the parent keep the reservation socket open until
tcp_connect_check(port_, 15000) succeeds (close it afterward); update code
around port_, fork(), tcp_connect_check(), server_path_, execl and server_pid_
to implement this reservation/hand-off sequence.
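The reservation step itself is a standard bind-to-port-zero idiom; a minimal sketch (the fork/exec hand-off and any `SO_REUSEADDR` handling needed for the child to bind the same port are left out):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Bind to 127.0.0.1:0 and read back the kernel-assigned port with
// getsockname(). Keeping the returned fd open holds the reservation;
// the caller closes it once the child server is confirmed up.
int reserve_loopback_port(int* port_out)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { return -1; }
  sockaddr_in addr{};
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port        = 0;  // let the kernel pick a free port
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  *port_out = ntohs(addr.sin_port);
  return fd;
}
```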
In `@GRPC_SERVER_ARCHITECTURE.md`:
- Around line 52-57: The POSIX shared-memory names `/cuopt_job_queue`,
`/cuopt_result_queue`, and `/cuopt_control` are host-global and must be
namespaced per server instance; update all references (including the other
occurrence around lines mentioning 307-314) to generate instance-specific names
(for example by appending PID, port, or a UUID) instead of the fixed literals so
multiple `cuopt_grpc_server` instances can coexist—add a small helper like
generateInstanceShmName(instanceId, baseName) and replace uses of the three
fixed symbols with calls that pass a unique instance identifier derived from the
server startup parameters (PID/port/ENV/UUID).
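An illustrative version of the proposed helper (name and suffix scheme are assumptions; PID is used here, but a port or UUID works equally well):

```cpp
#include <cassert>
#include <string>

// Build a per-instance POSIX shm name: keep the required leading '/'
// and append a unique suffix so multiple server instances can coexist.
std::string generate_instance_shm_name(const std::string& base_name,
                                       long instance_id)
{
  std::string name = base_name;
  if (name.empty() || name[0] != '/') { name.insert(name.begin(), '/'); }
  return name + "_" + std::to_string(instance_id);
}
```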
---
Outside diff comments:
In `@cpp/src/mip_heuristics/solve.cu`:
- Around line 442-453: The remote-execution branch currently forces a CPU-only
cast (dynamic_cast to cpu_optimization_problem_t) which makes remote solves fail
for GPU-backed problems; remove the cpu-specific cast in the
is_remote_execution_enabled() branch and call solve_mip_remote using the generic
problem_interface (or an overload that accepts the base
optimization_problem_t<i_t,f_t>) so remote routing stays backend-agnostic; keep
the CPU-only cast only for the local path where solve_mip(*cpu_prob, settings)
is needed and ensure solve_mip_remote's signature or overload supports
serializing/handling GPU-resident problems from the base type rather than
requiring cpu_optimization_problem_t.
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 417-421: The test currently skips the warm-start round-trip when
ws is None; update the test to fail if get_pdlp_warm_start_data() returns None
so regressions are caught: after calling settings.get_pdlp_warm_start_data() (or
wherever ws is obtained), assert ws is not None (or raise an AssertionError)
with a clear message before calling settings.set_pdlp_warm_start_data(ws) and
Solve(dm, settings); keep the existing calls to set_pdlp_warm_start_data,
set_parameter(CUOPT_ITERATION_LIMIT, 200), and linear_programming.Solve to
exercise the full round-trip and then assert sol2.get_primal_solution() is not
None.
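The tightened test shape can be sketched as a small helper that fails loudly instead of skipping; `settings`, `solve`, and `dm` stand in for the real cuOpt objects and are not the actual API surface:

```python
def check_warm_start_roundtrip(settings, solve, dm):
    """Assert the warm-start round-trip works end to end."""
    ws = settings.get_pdlp_warm_start_data()
    # Fail, don't skip: a None here means the warm-start path regressed.
    assert ws is not None, "get_pdlp_warm_start_data() returned None"
    settings.set_pdlp_warm_start_data(ws)
    sol2 = solve(dm, settings)
    assert sol2.get_primal_solution() is not None
    return sol2
```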
---
Minor comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 76-78: Update the comment for the max_message_bytes field in
grpc_client.hpp to correctly state the clamping bounds used by the server:
change the documented minimum from 4 MiB to 4 KiB so the comment reads that
values are clamped to [4 KiB, 2 GiB - 1 MiB] (consistent with
grpc_server_types.hpp); keep the existing explanatory text about protobuf limits
and retain the default value and variable name max_message_bytes.
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 257-259: SHM names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL
are hardcoded and will collide across multiple server instances; make them
instance-specific by replacing these global const char* symbols with
configurable strings generated at startup (e.g., in ServerConfig or GrpcServer)
that append a unique suffix such as the server port or UUID, update all code
that references SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to use the new
per-instance string members or accessors (e.g., serverConfig.getJobQueueName()),
and ensure the constructed names keep the leading '/' and are used consistently
when creating/opening shared memory regions.
In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 76-86: The clear() method leaves server_log_start_pos_ unchanged
when opening server_log_path_ fails, which can leave a stale offset; update
clear() so that if server_log_path_ is non-empty but the file fails to open you
reset server_log_start_pos_ to the beginning (e.g., streampos(0) or 0) to avoid
seeking to a stale offset on later reads; locate the logic in clear(), reference
server_log_path_, server_log_start_pos_, and test_start_marked_, and ensure
test_start_marked_ behavior remains the same after the reset.
In `@GRPC_INTERFACE.md`:
- Around line 189-193: The docs incorrectly state client-side detection uses
context->IsCancelled(); update the Connection Handling section to use the
correct gRPC APIs: note that IsCancelled() is server-side only (used on
ServerContext), clients should call ClientContext::TryCancel() to initiate
cancellation, and clients detect cancellation by checking the returned
grpc::Status for status.code() == grpc::StatusCode::CANCELLED (or
StatusCode::CANCELLED) rather than calling IsCancelled() on ClientContext.
Mention the correct symbols: ClientContext::TryCancel(),
ServerContext::IsCancelled(), and grpc::Status / grpc::StatusCode::CANCELLED.
In `@GRPC_QUICK_START.md`:
- Around line 3-8: The doc promises a C API but the sample uses C++ headers and
calls solve_lp; update the guide so the API surface is accurate by either
renaming the example sections to "C++ API" or replacing the sample with a real C
example that includes cuopt_c.h and demonstrates the equivalent C entrypoint(s)
(e.g., the cuopt C solve function(s) corresponding to solve_lp), and ensure
every other section that currently claims to show C but uses C++ is corrected
likewise.
---
Nitpick comments:
In `@ci/build_wheel_libcuopt.sh`:
- Around line 20-28: The package-manager check in the if/elif block that uses
"command -v dnf" and "command -v apt-get" can silently skip installing libuuid
if neither tool exists; add an else branch after that if/elif to print a clear
error or warning (e.g., echo "error: no supported package manager found to
install libuuid") and exit non-zero (or at least echo a prominent warning) so
the failure is explicit; update the block that installs libuuid (the conditional
using dnf/apt-get) to include this else handler and keep the subsequent call to
bash ci/utils/install_protobuf_grpc.sh unchanged.
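The explicit fallback could look like this (the `pick_installer` helper name is illustrative, not what the script currently defines):

```shell
# Echo the first available command from a candidate list, or fail
# loudly instead of silently skipping the libuuid install.
pick_installer() {
  for c in "$@"; do
    if command -v "$c" >/dev/null 2>&1; then
      echo "$c"
      return 0
    fi
  done
  echo "error: no supported package manager found to install libuuid" >&2
  return 1
}
```

Usage would be `mgr="$(pick_installer dnf apt-get)" || exit 1` before the install step.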
In `@cpp/src/grpc/grpc_solution_mapper.cpp`:
- Around line 333-340: The helper doubles_to_bytes currently copies into a
temporary std::vector<double> tmp which forces an extra allocation; fix by
preallocating the output buffer and avoiding unnecessary reallocation and copy:
when f_t != double, construct tmp with vec.size() and use
tmp.assign(vec.begin(), vec.end()) or reserve tmp.capacity() before copying,
then allocate bytes with bytes.resize(tmp.size()*sizeof(double)) and memcpy from
tmp.data(); additionally provide an overload/specialization for
doubles_to_bytes<double> (or a branch when std::is_same<f_t,double>::value) that
bypasses tmp entirely by directly resizing bytes to vec.size()*sizeof(double)
and memcpy from vec.data().
- Around line 518-527: The current branch inside the templated reader (the code
that looks up arrays by field_id and handles the T==float case) silently returns
an empty vector when raw.size() % sizeof(double) != 0, which hides
transfer/corruption problems; replace that silent return with explicit error
handling: either log a descriptive error including field_id, raw.size() and
expected element size using the project's logger (or tracing facility) or throw
a std::runtime_error (or a domain-specific exception) with the same diagnostic
text so callers can detect the failure; update the block referencing arrays,
field_id, raw, and the T==float branch to perform this logging/throw before
exiting.
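The shape suggested for `doubles_to_bytes` in the first comment above can be sketched like so (a standalone approximation, not the repository's actual helper):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Bypass the temporary entirely when f_t is already double; otherwise
// do a single sized conversion pass before the memcpy.
template <typename f_t>
std::vector<uint8_t> doubles_to_bytes(const std::vector<f_t>& vec)
{
  std::vector<uint8_t> bytes;
  if constexpr (std::is_same_v<f_t, double>) {
    bytes.resize(vec.size() * sizeof(double));
    if (!vec.empty()) { std::memcpy(bytes.data(), vec.data(), bytes.size()); }
  } else {
    std::vector<double> tmp(vec.begin(), vec.end());  // one sized allocation
    bytes.resize(tmp.size() * sizeof(double));
    if (!tmp.empty()) { std::memcpy(bytes.data(), tmp.data(), bytes.size()); }
  }
  return bytes;
}
```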
In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 83-94: The JobQueueEntry uses multiple std::atomic fields with an
implicit publish/consume ordering via ready; add a brief protocol comment above
the ready declaration (e.g., "writer sets other fields then sets ready=true
(release); reader checks ready==true (acquire) then reads other fields") and
update the code paths that publish/consume jobs to use explicit memory orders:
use std::memory_order_release when the producer stores ready=true, use
std::memory_order_acquire when the consumer loads ready, and keep other field
accesses as std::memory_order_relaxed where appropriate (referencing
JobQueueEntry and its members ready, claimed, worker_pid, cancelled,
worker_index, data_sent, is_chunked to locate sites to change).
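A minimal model of that publish/consume protocol (a toy `Entry`, not the real `JobQueueEntry`): the writer fills plain fields, then stores `ready=true` with release; the reader loads `ready` with acquire before touching anything else:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Entry {
  std::atomic<bool> ready{false};
  int worker_index{0};  // plain field, published via `ready`
};

int run_once()
{
  Entry e;
  std::thread producer([&] {
    e.worker_index = 42;                             // payload write
    e.ready.store(true, std::memory_order_release);  // publish
  });
  while (!e.ready.load(std::memory_order_acquire)) {}  // consume
  int seen = e.worker_index;  // safe: acquire synchronized with release
  producer.join();
  return seen;
}
```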
In `@dependencies.yaml`:
- Around line 305-310: The added gRPC-related dependencies (openssl, c-ares,
libgrpc, libprotobuf, libabseil, re2) lack version constraints; update
dependencies.yaml to pin compatible versions (at minimum for libgrpc and
libprotobuf) to ensure ABI compatibility and reproducible builds, selecting
matching libgrpc/libprotobuf pairs and consistent versions for libabseil,
c-ares, re2 and openssl, and express them using the repository's existing
versioning format so the package resolver installs the intended releases.
In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`:
- Around line 470-474: The class-scoped fixture cpu_only_env_with_server (which
calls _start_grpc_server_fixture and _stop_grpc_server) can leak server-side
state between tests; either change its pytest.fixture scope from "class" to
"function" so each test starts/stops a fresh server, or keep class scope but add
an explicit reset/assertion before each test (e.g., call a helper that verifies
the server has no active jobs/sessions or restarts the server via
_stop_grpc_server and _start_grpc_server_fixture) to guarantee isolation for
cpu_only_env_with_server and the other similar fixtures referenced in the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 741d5bb2-9920-447b-a562-335486d8ad08
📒 Files selected for processing (59)
- GRPC_INTERFACE.md
- GRPC_QUICK_START.md
- GRPC_SERVER_ARCHITECTURE.md
- build.sh
- ci/build_wheel_libcuopt.sh
- ci/utils/install_protobuf_grpc.sh
- conda/environments/all_cuda-129_arch-aarch64.yaml
- conda/environments/all_cuda-129_arch-x86_64.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/cuopt_cli.cpp
- cpp/include/cuopt/linear_programming/solve_remote.hpp
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cpp
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cpp
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cpp
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_io.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_logger.cpp
- cpp/src/grpc/server/grpc_server_logger.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/mip_heuristics/presolve/probing_cache.cu
- cpp/src/mip_heuristics/presolve/third_party_presolve.cpp
- cpp/src/mip_heuristics/solve.cu
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/cpu_optimization_problem.cpp
- cpp/src/pdlp/optimization_problem.cu
- cpp/src/pdlp/solve.cu
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- dependencies.yaml
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (4)
- cpp/src/pdlp/cpu_optimization_problem.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/optimization_problem.cu
- cpp/src/pdlp/solve_remote.cu
```shell
# Clean up any previous installations to avoid ABI mismatches
# (notably Abseil LTS namespaces like absl::lts_20220623 vs absl::lts_20250512)
echo "Cleaning up previous installations..."
rm -rf \
  "${PREFIX}/lib/cmake/grpc" "${PREFIX}/lib64/cmake/grpc" \
  "${PREFIX}/lib/cmake/protobuf" "${PREFIX}/lib64/cmake/protobuf" \
  "${PREFIX}/lib/cmake/absl" "${PREFIX}/lib64/cmake/absl" \
  "${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" "${PREFIX}/include/grpc" \
  "${PREFIX}/bin/grpc_cpp_plugin" "${PREFIX}/bin/protoc" "${PREFIX}/bin/protoc-"* || true
rm -f \
  "${PREFIX}/lib/"libgrpc*.a "${PREFIX}/lib/"libgpr*.a "${PREFIX}/lib/"libaddress_sorting*.a "${PREFIX}/lib/"libre2*.a "${PREFIX}/lib/"libupb*.a \
  "${PREFIX}/lib64/"libabsl_*.a "${PREFIX}/lib64/"libprotobuf*.so* "${PREFIX}/lib64/"libprotoc*.so* \
  "${PREFIX}/lib/"libprotobuf*.a "${PREFIX}/lib/"libprotoc*.a || true
```
The cleanup pass misses several common install layouts.
This block removes only part of the old tree: `include/grpcpp`, shared `libgrpc*`/`libprotobuf*` under `lib`, and `libabsl_*` under `lib` can all survive, which means CMake may still find a mixed-version installation and reproduce the ABI mismatch this cleanup is trying to prevent.
Expand cleanup across both header and library layouts:

```diff
 rm -rf \
   "${PREFIX}/lib/cmake/grpc" "${PREFIX}/lib64/cmake/grpc" \
   "${PREFIX}/lib/cmake/protobuf" "${PREFIX}/lib64/cmake/protobuf" \
   "${PREFIX}/lib/cmake/absl" "${PREFIX}/lib64/cmake/absl" \
-  "${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" "${PREFIX}/include/grpc" \
+  "${PREFIX}/include/absl" "${PREFIX}/include/google/protobuf" \
+  "${PREFIX}/include/grpc" "${PREFIX}/include/grpcpp" \
   "${PREFIX}/bin/grpc_cpp_plugin" "${PREFIX}/bin/protoc" "${PREFIX}/bin/protoc-"* || true
-rm -f \
-  "${PREFIX}/lib/"libgrpc*.a "${PREFIX}/lib/"libgpr*.a "${PREFIX}/lib/"libaddress_sorting*.a "${PREFIX}/lib/"libre2*.a "${PREFIX}/lib/"libupb*.a \
-  "${PREFIX}/lib64/"libabsl_*.a "${PREFIX}/lib64/"libprotobuf*.so* "${PREFIX}/lib64/"libprotoc*.so* \
-  "${PREFIX}/lib/"libprotobuf*.a "${PREFIX}/lib/"libprotoc*.a || true
+for libdir in "${PREFIX}/lib" "${PREFIX}/lib64"; do
+  rm -f \
+    "${libdir}/"libgrpc*.a "${libdir}/"libgrpc*.so* \
+    "${libdir}/"libgpr*.a "${libdir}/"libgpr*.so* \
+    "${libdir}/"libaddress_sorting*.a "${libdir}/"libaddress_sorting*.so* \
+    "${libdir}/"libre2*.a "${libdir}/"libre2*.so* \
+    "${libdir}/"libupb*.a "${libdir}/"libupb*.so* \
+    "${libdir}/"libabsl_*.a "${libdir}/"libabsl_*.so* \
+    "${libdir}/"libprotobuf*.a "${libdir}/"libprotobuf*.so* \
+    "${libdir}/"libprotoc*.a "${libdir}/"libprotoc*.so*
+done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci/utils/install_protobuf_grpc.sh` around lines 129 - 141, The cleanup
currently only removes a subset of possible install locations; update the
removal block that uses PREFIX to also delete include/grpcpp, any libgrpc* and
libprotobuf* shared objects and symlinks in both "${PREFIX}/lib" and
"${PREFIX}/lib64" (e.g. libgrpc*.so* , libprotobuf*.so* , libprotoc*.so*),
libabsl_* archives in "${PREFIX}/lib" as well as "${PREFIX}/lib64", and any
remaining grpc/protobuf-related pkgconfig or cmake dirs; in short, broaden the
rm globs to cover "${PREFIX}/include/grpcpp", "${PREFIX}/lib/"lib*.so* and
"${PREFIX}/lib64/"lib*.so*, add "${PREFIX}/lib/"libabsl_*.a, and remove
lingering pkgconfig/cmake directories so no mixed-version headers or shared libs
remain before rebuilding.
```cpp
auto sub = submit_lp(problem, settings);
if (!sub.success) { return {.error_message = sub.error_message}; }

start_log_streaming(sub.job_id);
auto poll = poll_for_completion(sub.job_id);
stop_log_streaming();

if (!poll.completed) { return {.error_message = poll.error_message}; }

auto result = get_lp_result<i_t, f_t>(sub.job_id);
if (result.success) { delete_job(sub.job_id); }
```
Cancel/delete the remote job on every non-success exit path.
After submission, these helpers only call delete_job() when the solve fully succeeds. If polling times out, result download fails, or result mapping fails, the remote job/result/log file is left behind on the server and can eventually exhaust server slots. Add cleanup logic that cancels outstanding jobs and deletes terminal ones before returning an error.
Also applies to: 810-820
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cpp` around lines 783 - 794, After submit_lp
succeeds you must ensure the remote job is cancelled/deleted on every error
path: after start_log_streaming/poll_for_completion/stop_log_streaming, if
poll.completed is false or poll timed out, call cancel_job(sub.job_id) (or
delete_job if terminal) before returning the error; similarly, if
get_lp_result<i_t, f_t>(sub.job_id) fails or result mapping fails, call
delete_job(sub.job_id) before returning; update the error-returning branches
around submit_lp, poll_for_completion, get_lp_result, and the symmetric block at
the other location (the 810–820 block) to always perform the appropriate
cancel_job/delete_job cleanup using sub.job_id.
```cpp
// --- 1. StartChunkedUpload ---
std::string upload_id;
{
  grpc::ClientContext context;
  set_rpc_deadline(context, kDefaultRpcTimeoutSeconds);
  cuopt::remote::StartChunkedUploadRequest request;
  *request.mutable_problem_header() = header;

  cuopt::remote::StartChunkedUploadResponse response;
  auto status = impl_->stub->StartChunkedUpload(&context, request, &response);

  if (!status.ok()) {
    last_error_ = "StartChunkedUpload failed: " + status.error_message();
    return false;
  }

  upload_id = response.upload_id();
  if (response.max_message_bytes() > 0) {
    server_max_message_bytes_.store(response.max_message_bytes(), std::memory_order_relaxed);
  }
}

GRPC_CLIENT_DEBUG_LOG(config_, "[grpc_client] ChunkedUpload started, upload_id=" << upload_id);

// --- 2. Build chunk requests directly from problem arrays ---
int64_t chunk_data_budget = config_.chunk_size_bytes;
if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
int64_t srv_max = server_max_message_bytes_.load(std::memory_order_relaxed);
if (srv_max > 0 && chunk_data_budget > srv_max * 9 / 10) { chunk_data_budget = srv_max * 9 / 10; }

auto chunk_requests = build_array_chunk_requests(problem, upload_id, chunk_data_budget);

// --- 3. Send each chunk request ---
int total_chunks        = 0;
int64_t total_bytes_sent = 0;

for (auto& chunk_request : chunk_requests) {
  grpc::ClientContext chunk_context;
  set_rpc_deadline(chunk_context, kDefaultRpcTimeoutSeconds);
  cuopt::remote::SendArrayChunkResponse chunk_response;
  auto status = impl_->stub->SendArrayChunk(&chunk_context, chunk_request, &chunk_response);

  if (!status.ok()) {
    last_error_ = "SendArrayChunk failed: " + status.error_message();
    return false;
  }

  total_bytes_sent += chunk_request.chunk().data().size();
  ++total_chunks;
}

GRPC_CLIENT_DEBUG_LOG(config_,
                      "[grpc_client] ChunkedUpload sent " << total_chunks << " chunk requests");

// --- 4. FinishChunkedUpload ---
{
  grpc::ClientContext context;
  set_rpc_deadline(context, kDefaultRpcTimeoutSeconds);
  cuopt::remote::FinishChunkedUploadRequest request;
  request.set_upload_id(upload_id);

  cuopt::remote::SubmitJobResponse response;
  auto status = impl_->stub->FinishChunkedUpload(&context, request, &response);

  if (!status.ok()) {
    last_error_ = "FinishChunkedUpload failed: " + status.error_message();
    return false;
  }

  job_id_out = response.job_id();
}

GRPC_CLIENT_THROUGHPUT_LOG(config_, "upload_chunked", total_bytes_sent, upload_t0);
GRPC_CLIENT_DEBUG_LOG(
  config_,
  "[grpc_client] ChunkedUpload complete: " << total_chunks << " chunks, job_id=" << job_id_out);
return true;
```
Release chunked upload/download sessions when a transfer aborts.
Once StartChunkedUpload or StartChunkedDownload succeeds, any later RPC or validation failure returns immediately without telling the server to drop the session. A handful of interrupted transfers can fill kMaxChunkedSessions until the reaper runs and block unrelated clients. This needs an abort/cleanup path, or at least a best-effort finish/abort guard, around the whole transfer.
Also applies to: 1010-1133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/client/grpc_client.cpp` around lines 839 - 915, After
StartChunkedUpload (and similarly StartChunkedDownload) succeeds you must ensure
the server session is always released on any early return; add a scope guard
(RAII or lambda-based) right after StartChunkedUpload that captures upload_id
and impl_, which on destruction (unless dismissed) calls the cleanup RPC
(FinishChunkedUpload or an AbortChunkedUpload/cancel RPC if available) using
impl_->stub and ignores/logs errors; dismiss the guard only after the normal
FinishChunkedUpload completes and job_id_out is set. Apply the same pattern
around StartChunkedDownload/FinishChunkedDownload so any error paths from
SendArrayChunk, validation failures, or other early returns invoke the
best-effort finish/abort to avoid exhausting kMaxChunkedSessions.
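A generic version of the guard proposed above can be sketched as follows; in the client it would capture `upload_id` and issue a best-effort abort RPC (errors ignored), and `dismiss()` would be called only after the normal finish succeeds:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Minimal RAII guard: runs a cleanup callable on every exit path
// unless dismissed after the operation completed normally.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeGuard()
  {
    if (fn_) { fn_(); }
  }
  void dismiss() { fn_ = nullptr; }

  ScopeGuard(const ScopeGuard&)            = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> fn_;
};
```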
```cpp
void map_pdlp_settings_to_proto(const pdlp_solver_settings_t<i_t, f_t>& settings,
                                cuopt::remote::PDLPSolverSettings* pb_settings)
{
  // Termination tolerances (all names match cuOpt API)
  pb_settings->set_absolute_gap_tolerance(settings.tolerances.absolute_gap_tolerance);
  pb_settings->set_relative_gap_tolerance(settings.tolerances.relative_gap_tolerance);
  pb_settings->set_primal_infeasible_tolerance(settings.tolerances.primal_infeasible_tolerance);
  pb_settings->set_dual_infeasible_tolerance(settings.tolerances.dual_infeasible_tolerance);
  pb_settings->set_absolute_dual_tolerance(settings.tolerances.absolute_dual_tolerance);
  pb_settings->set_relative_dual_tolerance(settings.tolerances.relative_dual_tolerance);
  pb_settings->set_absolute_primal_tolerance(settings.tolerances.absolute_primal_tolerance);
  pb_settings->set_relative_primal_tolerance(settings.tolerances.relative_primal_tolerance);

  // Limits
  pb_settings->set_time_limit(settings.time_limit);
  // Avoid emitting a huge number when the iteration limit is the library default.
  // Use -1 sentinel for "unset/use server defaults".
  if (settings.iteration_limit == std::numeric_limits<i_t>::max()) {
    pb_settings->set_iteration_limit(-1);
  } else {
    pb_settings->set_iteration_limit(static_cast<int64_t>(settings.iteration_limit));
  }

  // Solver configuration
  pb_settings->set_log_to_console(settings.log_to_console);
  pb_settings->set_detect_infeasibility(settings.detect_infeasibility);
  pb_settings->set_strict_infeasibility(settings.strict_infeasibility);
  pb_settings->set_pdlp_solver_mode(to_proto_pdlp_mode(settings.pdlp_solver_mode));
  pb_settings->set_method(to_proto_method(settings.method));
  pb_settings->set_presolver(static_cast<int32_t>(settings.presolver));
  pb_settings->set_dual_postsolve(settings.dual_postsolve);
  pb_settings->set_crossover(settings.crossover);
  pb_settings->set_num_gpus(settings.num_gpus);

  pb_settings->set_per_constraint_residual(settings.per_constraint_residual);
  pb_settings->set_cudss_deterministic(settings.cudss_deterministic);
  pb_settings->set_folding(settings.folding);
  pb_settings->set_augmented(settings.augmented);
  pb_settings->set_dualize(settings.dualize);
  pb_settings->set_ordering(settings.ordering);
  pb_settings->set_barrier_dual_initial_point(settings.barrier_dual_initial_point);
  pb_settings->set_eliminate_dense_columns(settings.eliminate_dense_columns);
  pb_settings->set_pdlp_precision(static_cast<int32_t>(settings.pdlp_precision));
  pb_settings->set_save_best_primal_so_far(settings.save_best_primal_so_far);
  pb_settings->set_first_primal_feasible(settings.first_primal_feasible);
}
```
Don't drop warm_start_data in the PDLP settings mapper.
PDLPSolverSettings exposes warm_start_data, but neither conversion copies it. Any remote solve that goes through these helpers silently loses PDLP warm-start state, so iterative workflows restart cold even when prior state is provided.
Also applies to: 123-181
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 76 - 121, The PDLP
warm-start blob isn't being transferred: update map_pdlp_settings_to_proto to
copy settings.warm_start_data into the protobuf (use
pb_settings->set_warm_start_data(settings.warm_start_data) or
pb_settings->mutable_warm_start_data()->CopyFrom(...) depending on the field
type) and make the symmetric change in the reverse mapper (the function that
converts cuopt::remote::PDLPSolverSettings back to pdlp_solver_settings_t, e.g.
map_proto_to_pdlp_settings or similar) to populate settings.warm_start_data from
the proto, ensuring warm-start state is preserved across remote calls.
```cpp
inline bool parse_incumbent_proto(const uint8_t* data,
                                  size_t size,
                                  std::string& job_id,
                                  double& objective,
                                  std::vector<double>& assignment)
{
  cuopt::remote::Incumbent incumbent_msg;
  if (!incumbent_msg.ParseFromArray(data, static_cast<int>(size))) { return false; }
```
🧩 Analysis chain

🏁 Script executed:

```shell
# Check if file exists and get actual contents
if [ -f "cpp/src/grpc/server/grpc_incumbent_proto.hpp" ]; then
  cat -n "cpp/src/grpc/server/grpc_incumbent_proto.hpp"
else
  echo "File not found. Searching for similar files..."
  fd "grpc_incumbent_proto" --type f
fi
```

🏁 Script executed:

```shell
# Search for other instances of ParseFromArray to understand the pattern
rg "ParseFromArray" --type cpp --type hpp -B 2 -A 2
```

🏁 Script executed:

```shell
# Check what numeric_limits header provides and standard int max
rg "#include.*limits" --type cpp --type hpp | head -20
```

🏁 Script executed:

```shell
# Look for other protobuf parsing patterns in the codebase
rg "ParseFromArray|protobuf" --type cpp --type hpp | head -30
```
Guard the protobuf size conversion and pointer validation before parsing.
ParseFromArray takes an int, but size is blindly cast from size_t without bounds checking. Additionally, there is no validation that data is non-null when size is nonzero. A malformed or oversized payload can overflow the cast and cause undefined behavior during protobuf deserialization. This is inconsistent with the sender side (build_incumbent_proto line 32), which correctly validates the size before serializing.
Suggested fix

```diff
 inline bool parse_incumbent_proto(const uint8_t* data,
                                   size_t size,
                                   std::string& job_id,
                                   double& objective,
                                   std::vector<double>& assignment)
 {
+  if ((data == nullptr && size != 0) ||
+      size > static_cast<size_t>(std::numeric_limits<int>::max())) {
+    return false;
+  }
   cuopt::remote::Incumbent incumbent_msg;
   if (!incumbent_msg.ParseFromArray(data, static_cast<int>(size))) { return false; }
```

Apply the same defensive validation pattern used in build_incumbent_proto (line 32) to prevent integer overflow and null pointer dereference in untrusted IPC paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 38 - 45, The
parse_incumbent_proto function must validate the incoming buffer before calling
Incumbent::ParseFromArray: ensure that size does not exceed
std::numeric_limits<int>::max() (to avoid int overflow when casting) and that if
size > 0 then data is not nullptr; if either check fails return false. After
these guards, perform the safe static_cast<int>(size) and call
incumbent_msg.ParseFromArray(data, static_cast<int>(size)). Mirror the defensive
checks used in build_incumbent_proto and keep the function signature and return
behavior unchanged.
```cpp
// Check whether the job has finished. If so, drain any final
// bytes the solver may have flushed after our last read, then
// send the job_complete sentinel and close the stream.
std::string msg;
JobStatus s = check_job_status(job_id, msg);
if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
  std::streampos before2 = in.tellg();
  if (before2 >= 0) { current_offset = static_cast<int64_t>(before2); }
  if (std::getline(in, line)) {
    std::streampos after2 = in.tellg();
    int64_t next_offset2  = current_offset + static_cast<int64_t>(line.size());
    if (after2 >= 0) { next_offset2 = static_cast<int64_t>(after2); }
    cuopt::remote::LogMessage m;
    m.set_line(line);
    m.set_byte_offset(next_offset2);
    m.set_job_complete(false);
    writer->Write(m);
  }

  cuopt::remote::LogMessage done;
  done.set_line("");
  done.set_byte_offset(current_offset);
  done.set_job_complete(true);
  writer->Write(done);
  return Status::OK;
```
Drain the full log tail before sending job_complete.
The terminal path reads at most one more line and then emits the sentinel. If the worker flushes multiple lines at exit, every line after the first is lost, and the final byte_offset can also lag behind the last emitted line. Keep reading until EOF and advance the offset from the last line actually written.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 795 - 819, The
terminal path currently reads at most one more line after check_job_status and
then sends the job_complete sentinel, which can drop any additional flushed
lines and leave byte_offset behind; change the logic in the completion branch
(around check_job_status, the use of in, line, current_offset, next_offset2,
cuopt::remote::LogMessage and writer->Write) to drain the entire remaining
stream by repeatedly calling std::getline(in, line) in a loop, for each
iteration compute the correct byte offset from the stream position (use tellg()
before/after or after if available) and emit a LogMessage with m.set_line(line),
m.set_byte_offset(next_offset), m.set_job_complete(false) via writer->Write(m),
update current_offset to the last emitted byte offset, and after the loop emit
the final done message with m.set_job_complete(true) and byte_offset set to that
last current_offset.
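The drain-to-EOF loop the comment asks for can be sketched as a small helper. This is a minimal sketch, not the actual fix: `drain_log_tail` is a hypothetical function name, and it returns (line, offset) pairs instead of writing `LogMessage`s so the offset bookkeeping is easy to check in isolation. Each offset is taken from `tellg()` when available, falling back to a computed offset (line length plus one for the newline) when `tellg()` fails at EOF.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Drain every remaining line from `in`, starting at byte `offset`.
// Returns (line, byte_offset_after_line) pairs; the caller would emit one
// LogMessage per pair and use the last offset in the job_complete sentinel.
std::vector<std::pair<std::string, int64_t>> drain_log_tail(std::istream& in, int64_t offset)
{
  std::vector<std::pair<std::string, int64_t>> out;
  std::string line;
  while (std::getline(in, line)) {
    std::streampos after = in.tellg();
    // Fallback if tellg() is unavailable (e.g. after hitting EOF on an
    // unterminated final line): advance by line length + 1 for '\n'.
    int64_t next = offset + static_cast<int64_t>(line.size()) + 1;
    if (after >= 0) { next = static_cast<int64_t>(after); }
    out.emplace_back(line, next);
    offset = next;  // the sentinel's byte_offset tracks the last emitted line
  }
  return out;
}
```

Because the loop runs until `std::getline` fails, every line the worker flushed at exit is emitted, and the final offset reflects the last line actually written rather than lagging one line behind.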
```cpp
pid_t spawn_worker(int worker_id, bool is_replacement)
{
  std::lock_guard<std::mutex> lock(worker_pipes_mutex);

  if (is_replacement) { close_worker_pipes_server(worker_id); }

  if (!create_worker_pipes(worker_id)) {
    SERVER_LOG_ERROR("[Server] Failed to create pipes for %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    return -1;
  }

  pid_t pid = fork();
  if (pid < 0) {
    SERVER_LOG_ERROR("[Server] Failed to fork %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    close_all_worker_pipes(worker_pipes[worker_id]);
    return -1;
  } else if (pid == 0) {
    // Child: close all fds belonging to other workers.
    for (int j = 0; j < static_cast<int>(worker_pipes.size()); ++j) {
      if (j != worker_id) { close_all_worker_pipes(worker_pipes[j]); }
    }
    // Close the server-side ends of this worker's pipes (child uses the other ends).
    close_and_reset(worker_pipes[worker_id].to_worker_fd);
    close_and_reset(worker_pipes[worker_id].from_worker_fd);
    close_and_reset(worker_pipes[worker_id].incumbent_from_worker_fd);
    worker_process(worker_id);
    _exit(0);
  }

  close_worker_pipes_child_ends(worker_id);
  return pid;
}
```
Do not respawn workers with fork() after the server is multithreaded.
The initial startup fork is fine, but spawn_single_worker() routes the worker-monitor thread back through this same path after gRPC and the background threads are already live. Forking a multithreaded process and continuing in the child without exec() is unsafe; inherited library and mutex state can deadlock or corrupt the replacement worker immediately.
Based on learnings, "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."
Also applies to: 159-161
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 103 - 138,
spawn_worker currently forks a child to create replacement workers which is
unsafe in a multithreaded server; change the replacement-worker path so you do
not continue execution in a forked child of a multithreaded process — instead
spawn a fresh process via fork+exec (or posix_spawn) and exec the same server
binary with an argument to run a single worker, preserving the existing pipe
setup/FDs for the new process, or use posix_spawn to avoid inheriting
thread/mutex state; ensure spawn_single_worker and the is_replacement=true
branch in spawn_worker call this exec-based spawn path, and maintain the same
close_all_worker_pipes/close_worker_pipes_child_ends semantics for FD hygiene.
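An exec-based respawn path can be sketched with `posix_spawn`, which avoids ever running user code in a forked copy of the multithreaded server. This is only a sketch of the suggested direction: `spawn_worker_exec` is a hypothetical helper, and the `--single-worker` style flag the real server binary would need is assumed, not taken from the codebase.

```cpp
#include <cassert>
#include <spawn.h>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

extern char** environ;

// Spawn a fresh process image instead of fork()ing the multithreaded
// server. posix_spawn() performs the equivalent of fork+exec internally,
// so no inherited mutex or library state survives into the new worker.
pid_t spawn_worker_exec(const std::string& binary, const std::vector<std::string>& args)
{
  std::vector<char*> argv;
  argv.push_back(const_cast<char*>(binary.c_str()));
  for (const auto& a : args) { argv.push_back(const_cast<char*>(a.c_str())); }
  argv.push_back(nullptr);

  pid_t pid = -1;
  // A real implementation would pass posix_spawn_file_actions_t here to
  // dup2 the worker's pipe fds into place before exec.
  int rc = posix_spawn(&pid, binary.c_str(), nullptr, nullptr, argv.data(), environ);
  return rc == 0 ? pid : -1;
}
```

The replacement-worker branch would then call something like `spawn_worker_exec(server_binary, {"--single-worker", std::to_string(worker_id)})`, keeping the existing pipe creation and fd-hygiene steps around the call.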
```cpp
static void publish_result(const SolveResult& sr, const std::string& job_id, int worker_id)
{
  int64_t result_total_bytes = 0;
  if (sr.success) {
    for (const auto& [fid, data] : sr.arrays) {
      result_total_bytes += data.size();
    }
  }

  // Same CAS protocol as store_simple_result (see comment there).
  int result_slot = -1;
  for (size_t i = 0; i < MAX_RESULTS; ++i) {
    if (result_queue[i].ready.load(std::memory_order_acquire)) continue;
    bool expected = false;
    if (!result_queue[i].claimed.compare_exchange_strong(
          expected, true, std::memory_order_acq_rel)) {
      continue;
    }
    if (result_queue[i].ready.load(std::memory_order_acquire)) {
      result_queue[i].claimed.store(false, std::memory_order_release);
      continue;
    }
    result_slot = static_cast<int>(i);
    ResultQueueEntry& result = result_queue[i];
    copy_cstr(result.job_id, job_id);
    result.status = sr.success ? RESULT_SUCCESS : RESULT_ERROR;
    result.data_size = sr.success ? std::max<uint64_t>(result_total_bytes, 1) : 0;
    result.worker_index.store(worker_id, std::memory_order_relaxed);
    if (!sr.success) { copy_cstr(result.error_message, sr.error_message); }
    result.retrieved.store(false, std::memory_order_relaxed);
    result.ready.store(true, std::memory_order_release);
    result.claimed.store(false, std::memory_order_release);
    if (config.verbose) {
      SERVER_LOG_DEBUG(
        "[Worker %d] Enqueued result metadata for job %s in result_slot=%d status=%d data_size=%lu",
        worker_id,
        job_id.c_str(),
        result_slot,
        static_cast<int>(result.status),
        result.data_size);
    }
    break;
  }

  // Stream the full solution payload through the worker's result pipe.
  // The server thread reads the other end when the client calls
  // GetResult / DownloadChunk.
  if (sr.success && result_slot >= 0) {
    int write_fd = worker_pipes[worker_id].worker_write_fd;
    if (config.verbose) {
      SERVER_LOG_DEBUG("[Worker %d] Streaming result (%zu arrays, %ld bytes) to pipe for job %s",
                       worker_id,
                       sr.arrays.size(),
                       result_total_bytes,
                       job_id.c_str());
    }
    auto pipe_result_t0 = std::chrono::steady_clock::now();
    bool write_success = write_result_to_pipe(write_fd, sr.header, sr.arrays);
    if (write_success && config.verbose) {
      log_pipe_throughput("pipe_result_send", result_total_bytes, pipe_result_t0);
    }
    if (!write_success) {
      SERVER_LOG_ERROR("[Worker %d] Failed to write result to pipe", worker_id);
      result_queue[result_slot].status = RESULT_ERROR;
      copy_cstr(result_queue[result_slot].error_message, "Failed to write result to pipe");
    } else if (config.verbose) {
      SERVER_LOG_DEBUG(
        "[Worker %d] Finished writing result payload for job %s", worker_id, job_id.c_str());
    }
  } else if (config.verbose) {
    SERVER_LOG_DEBUG(
      "[Worker %d] No result payload write needed for job %s (success=%d, result_slot=%d, "
      "payload_bytes=%ld)",
      worker_id,
      job_id.c_str(),
      static_cast<int>(sr.success),
      result_slot,
      result_total_bytes);
  }
```
Don’t silently lose completed jobs when the result queue is full.
If the CAS loop never finds a free result_queue slot, result_slot stays -1 and this function just falls through without publishing anything. worker_process() then immediately resets the job slot on Lines 534-535, so the solve can disappear without any terminal result ever reaching the server thread.
Also applies to: 534-535
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 399 - 477, publish_result
can silently drop a completed job when no result_queue slot is free; change
publish_result to return a bool (e.g., enqueued) and ensure it does not silently
succeed when result_slot stays -1, and update worker_process to check that
return value before clearing the job slot: if publish_result returns false,
either retry/enqueue with a bounded backoff or propagate an explicit
RESULT_ERROR into the result handling path so the server thread will see a
terminal result; refer to publish_result, result_slot, result_queue, and
worker_process (the code that resets the job slot) when making these changes.
```cpp
  port_ = 18500;
  const char* env_base = std::getenv("CUOPT_TEST_PORT_BASE");
  if (env_base) { port_ = std::atoi(env_base) + 500; }

  server_pid_ = fork();
  if (server_pid_ < 0) {
    skip_reason_ = "fork() failed";
    return;
  }

  if (server_pid_ == 0) {
    std::string port_str = std::to_string(port_);
    std::string log_file = "/tmp/cuopt_c_api_test_server_" + port_str + ".log";
    int fd = open(log_file.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
      dup2(fd, STDOUT_FILENO);
      dup2(fd, STDERR_FILENO);
      close(fd);
    }
    execl(server_path_.c_str(),
          server_path_.c_str(),
          "--port",
          port_str.c_str(),
          "--workers",
          "1",
          nullptr);
    _exit(127);
  }

  if (!tcp_connect_check(port_, 15000)) {
    skip_reason_ = "cuopt_grpc_server failed to start within 15 seconds";
    kill(server_pid_, SIGKILL);
    waitpid(server_pid_, nullptr, 0);
    server_pid_ = -1;
    return;
  }
```
Pick a free port instead of probing a fixed one.
This fixture treats “something is listening on 127.0.0.1:port_” as proof that the child server started. If 18500 (or CUOPT_TEST_PORT_BASE + 500) is already occupied, tcp_connect_check() succeeds immediately, the child can still die on bind, and the tests run against the wrong service. Reserve the port before fork() or have the child report its bound port back to the parent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 -
463, Reserve a free loopback port before forking by creating a temporary socket,
binding it to 127.0.0.1:0, calling getsockname to set port_, and ensuring the
socket is not closed-on-exec so the child can inherit the reservation; then
fork() as before, pass the port_str to execl via the "--port" arg, and in the
parent keep the reservation socket open until tcp_connect_check(port_, 15000)
succeeds (close it afterward); update code around port_, fork(),
tcp_connect_check(), server_path_, execl and server_pid_ to implement this
reservation/hand-off sequence.
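The bind-to-port-0 reservation described above can be sketched as follows. This is a hypothetical helper (`reserve_loopback_port` is not in the codebase): it binds a loopback TCP socket to port 0 so the kernel assigns a free port, then reads the assignment back with `getsockname()`. The fixture would keep the returned fd open across `fork()`/`execl()` (with close-on-exec cleared) so the child server inherits the reservation before `tcp_connect_check()` runs.

```cpp
#include <cassert>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Bind 127.0.0.1:0 and report the kernel-assigned port. On success the
// open fd is returned through out_fd (the caller owns it and closes it
// once the child server is confirmed up); on failure returns -1.
int reserve_loopback_port(int* out_fd)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_in addr{};
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port        = 0;  // 0 = let the kernel pick a free port
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  *out_fd = fd;
  return ntohs(addr.sin_port);
}
```

The fixture would then call this before `fork()`, pass the returned port as the `--port` argument, and treat a later successful `tcp_connect_check()` as evidence the child (not some unrelated process) is listening.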
| Segment | Purpose |
|---------|---------|
| `/cuopt_job_queue` | Job metadata (ID, type, size, status) |
| `/cuopt_result_queue` | Result metadata (ID, status, size, error) |
| `/cuopt_control` | Server control flags (shutdown, worker count) |
Namespace the shared-memory segments per server instance.
/cuopt_job_queue, /cuopt_result_queue, and /cuopt_control are host-global POSIX shm names. A second cuopt_grpc_server on the same machine will either fail to start or attach to the first server's control blocks, so these need a per-instance suffix (port/PID/user) or an explicit single-instance-per-host restriction.
Also applies to: 307-314
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@GRPC_SERVER_ARCHITECTURE.md` around lines 52 - 57, The POSIX shared-memory
names `/cuopt_job_queue`, `/cuopt_result_queue`, and `/cuopt_control` are
host-global and must be namespaced per server instance; update all references
(including the other occurrence around lines mentioning 307-314) to generate
instance-specific names (for example by appending PID, port, or a UUID) instead
of the fixed literals so multiple `cuopt_grpc_server` instances can coexist—add
a small helper like generateInstanceShmName(instanceId, baseName) and replace
uses of the three fixed symbols with calls that pass a unique instance
identifier derived from the server startup parameters (PID/port/ENV/UUID).
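The suggested helper is a pure string transformation, sketched below under the assumption that the instance id is derived from port and PID (the function name and signature are illustrative, not from the codebase). POSIX shm names must begin with `/` and contain no further slashes, so the id is appended with underscores.

```cpp
#include <cassert>
#include <string>

// Derive a per-instance POSIX shm name from a base segment name, e.g.
// "/cuopt_job_queue" + port 8765 + pid 1234 -> "/cuopt_job_queue_8765_1234".
// Every shm_open() call in server and worker would use this instead of the
// fixed literal, so two servers on one host cannot attach to each other's
// control blocks.
std::string instance_shm_name(const std::string& base, int port, int pid)
{
  return base + "_" + std::to_string(port) + "_" + std::to_string(pid);
}
```

The port alone is usually enough to disambiguate live servers; including the PID also keeps a restarted server from colliding with stale segments left by a crashed predecessor (which should still be `shm_unlink`ed on startup or shutdown).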
For #1002