Indirect agent connection improvements#13345
Conversation
…ements. - Enhances the Host connecting logic to avoid connecting storm (where Agent opens multiple sockets against Management Server). - Implements HostConnectProcess task where Host upon connection checks whether lock is available, traces Host connecting progress, status and timeout. - Introduces AgentConnectStatusCommand, where Host checks whether lock for the Host is available (i.e. "previous" connect process is finished). - Implementes logic to check whether Management Server has lock against Host (exposed MySQL DB lock presence via API) - Removes synchronization on Host disconnect process, double-disconnect logic in clustered Management Server environment, added early removal from ping map (in case of combination ping timeout delay + synchronized disconnect process the Agent Manager submits more disconnect requests) - Introduces parameterized connection and status check timeouts - Implements backoff algorithm abstraction - can be used either constant backoff timeout or exponential with jitter to wait between connection Host attempts to Management Server - Implements ServerAttache to be used on the Agent side of communication (similar to Attache on Management Server side) - Enhances/Adds logs significantly to Host Agent and Agent Manager logic to trace Host connecting and disconnecting process, including ids, names, context UUIDs and timings (how much time took overall initialization/deinitialization) - Adds logs to communication between Management Servers (PDU requests) - Adds DB indexes to improve search performance, uses IDEMPOTENT_ADD_INDEX for safer DB schema updates
- Bug 1 fix (AgentManagerImpl.java) — GlobalLock.isLockAvailable() now only runs when the host status is not alive. This eliminates one IS_FREE_LOCK DB query per ping per healthy host, which is the direct cause of listHosts/listNetworks degradation. - Bug 2 fix (HostConnectProcess.java) — shutdown() → shutdownNow(). Old thread pools from prior connect cycles are now interrupted immediately instead of draining their queued tasks, preventing thread accumulation during reconnect storms. - Bug 3 fix (AgentManagerImpl.java) — Lock timeout in handleDisconnectWithoutInvestigation now logs a warn instead of silently discarding the disconnect event. - Bug 4 fix (ServerAttache.java) — Alarm ScheduledFuture handles are now tracked in _alarmFutures and cancelled when the corresponding listener is unregistered or all commands are cancelled on disconnect. - Bug 5 fix (AgentAttache.java) - Fix Alarm ScheduledFuture handles in AgentAttache as well
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #13345 +/- ##
============================================
+ Coverage 18.10% 18.13% +0.02%
- Complexity 16750 16800 +50
============================================
Files 6037 6047 +10
Lines 542798 544321 +1523
Branches 66457 66606 +149
============================================
+ Hits 98298 98695 +397
- Misses 433453 434530 +1077
- Partials 11047 11096 +49
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR is a broad refactor/enhancement of indirect agent connection handling across both Management Server and Agent sides. It introduces a host-side connect/status-check process, adds configurable backoff and host-status heuristics, and expands logging/lock-check mechanisms to reduce reconnect storms and improve observability.
Changes:
- Adds agent-side connection orchestration (status polling + startup submission) and a server-side
AgentConnectStatus*command/answer pair to coordinate connect progress. - Introduces configurable backoff (factory + exponential-with-jitter implementation) and propagates configuration from Management Server to Agent during startup.
- Refactors NIO connection lifecycle and adds supporting utilities/logging, plus DB lock availability checks and DB indexes for performance.
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| utils/src/test/java/com/cloud/utils/testcase/NioTest.java | Test logging cleanup (exception logging). |
| utils/src/test/java/com/cloud/utils/backoff/impl/ConstantTimeBackoffTest.java | Updates config key name for constant backoff. |
| utils/src/main/java/com/cloud/utils/nio/NioServer.java | NIO server init tweaks and docstring. |
| utils/src/main/java/com/cloud/utils/nio/NioConnection.java | Connection lifecycle refactor, selector loop/logging, reject logic. |
| utils/src/main/java/com/cloud/utils/nio/NioClient.java | Client connect/handshake logging and expanded cleanup. |
| utils/src/main/java/com/cloud/utils/nio/Link.java | Adds local-port tracking, richer toString, termination helpers/logging changes. |
| utils/src/main/java/com/cloud/utils/nio/HandlerFactory.java | Changes new-connection registration API to InetSocketAddress. |
| utils/src/main/java/com/cloud/utils/net/NetUtils.java | Null guard in hostname-to-IP resolution helper. |
| utils/src/main/java/com/cloud/utils/LogUtils.java | Adds host logging helper with optional reverse lookup. |
| utils/src/main/java/com/cloud/utils/exception/CSExceptionErrorCode.java | Adds error-code mapping for ConnectionException. |
| utils/src/main/java/com/cloud/utils/DateUtil.java | Adds formatMillis duration formatter. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoffMBean.java | New MBean interface for exponential-jitter backoff. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ExponentialWithJitterBackoff.java | New exponential-with-jitter backoff implementation. |
| utils/src/main/java/com/cloud/utils/backoff/impl/ConstantTimeBackoff.java | Migrates to namespaced config keys + exposes configuration. |
| utils/src/main/java/com/cloud/utils/backoff/BackoffFactory.java | New backoff factory for selecting/configuring algorithms. |
| utils/src/main/java/com/cloud/utils/backoff/BackoffAlgorithm.java | Adds getConfiguration() API for propagating settings. |
| server/src/main/java/org/apache/cloudstack/agent/lb/IndirectAgentLBServiceImpl.java | Adds notes/timeout tweaks around agent migration command dispatch. |
| framework/db/src/main/java/com/cloud/utils/db/GlobalLock.java | Refactors lock bookkeeping/logging and adds lock-availability query. |
| framework/db/src/main/java/com/cloud/utils/db/DbUtil.java | Adds IS_FREE_LOCK helper and improves logging around lock DB connections. |
| framework/config/src/main/java/org/apache/cloudstack/framework/config/ConfigKeyUtil.java | New utility to parse key=value;... configuration strings. |
| framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDaoImpl.java | Adds DAO method to list all MS hosts including removed. |
| framework/cluster/src/main/java/com/cloud/cluster/dao/ManagementServerHostDao.java | Adds DAO API for “including removed” listing. |
| framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletImpl.java | Adds request logging and more detailed RemoteException messages. |
| framework/cluster/src/main/java/com/cloud/cluster/ClusterServiceServletHttpHandler.java | Adds debug logging of inbound request line/body. |
| engine/schema/src/main/resources/META-INF/db/schema-42210to42300.sql | Adds indexes via idempotent add-index procedure calls. |
| engine/schema/src/main/resources/META-INF/db/procedures/cloud.idempotent_add_index.sql | Adds/defines IDEMPOTENT_ADD_INDEX procedure. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/ClusteredAgentManagerImplTest.java | Expands tests for disconnect broadcast behavior and lock usage. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/AgentManagerImplTest.java | Adds tests for config keys and deregister/disconnect behaviors. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/ClusteredAgentManagerImpl.java | Refactors attache creation/removal and event handling behavior. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java | Major connect/disconnect refactor, new config keys, status checks, backoff propagation. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java | Cancels listener alarm futures on unregister/cancel paths. |
| core/src/main/java/org/apache/cloudstack/threadcontext/ThreadContextUtil.java | New helper for propagating Log4j ThreadContext across threads. |
| core/src/main/java/com/cloud/resource/ServerResource.java | Changes default isExitOnFailures() behavior. |
| core/src/main/java/com/cloud/agent/transport/Request.java | Improves JSON deserialization error logging. |
| core/src/main/java/com/cloud/agent/api/StartupAnswer.java | Adds params + agent-side status-check delay transport fields. |
| core/src/main/java/com/cloud/agent/api/AgentConnectStatusCommand.java | New command for host connect-status checks. |
| core/src/main/java/com/cloud/agent/api/AgentConnectStatusAnswer.java | New answer carrying lock availability and host status. |
| agent/src/test/java/com/cloud/agent/HostConnectProcessTest.java | New test around scheduling the agent connect process. |
| agent/src/test/java/com/cloud/agent/AgentTest.java | Updates tests for new link/logging and reconnect helpers. |
| agent/src/main/java/com/cloud/agent/SynchronousListener.java | New blocking listener for synchronous waits on answers. |
| agent/src/main/java/com/cloud/agent/ServerListener.java | New listener interface for agent-side ServerAttache callbacks. |
| agent/src/main/java/com/cloud/agent/ServerAttache.java | New agent-side counterpart to MS Attache for command/answer flow. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds new agent properties for async timeouts and status-check delay. |
| agent/src/main/java/com/cloud/agent/IAgentShell.java | Adds setter for dynamically updated backoff algorithm. |
| agent/src/main/java/com/cloud/agent/HostConnectProcess.java | New connect/status-check orchestration loop on the agent side. |
| agent/src/main/java/com/cloud/agent/AgentShell.java | Uses backoff factory + persists configuration; relaxes version handling. |
| agent/src/main/java/com/cloud/agent/Agent.java | Major reconnect/startup refactor; integrates HostConnectProcess and ServerAttache. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18146 |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18149 |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18154 |
Description
Continued from #13028
This PR improves the Indirect agent connection handling, has the following improvements.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Checked the indirect agents connections (and re-connections) with KVM hosts, SSVM & CPVM.
How did you try to break this feature and the system with this change?