HDDS-15382. Close idle connection for Datanode GRPC server #10371
ChenSammi wants to merge 2 commits
Pull request overview
This PR hardens the Datanode gRPC server against connection exhaustion and lingering idle clients by introducing server-side idle/keepalive handling and by lowering the default socket backlog to align with IPC defaults.
Changes:
- Configure the Datanode gRPC server to enforce a max idle connection time and to use HTTP/2 keepalive pings/timeouts.
- Reduce the default `hdds.datanode.grpc.so.backlog` from 4096 to 256.
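For readers unfamiliar with these settings, the following is a minimal sketch of how the three timeouts fit together on a gRPC Netty server builder. It is not the patch itself: the class name `IdleAwareServerSketch` is hypothetical, it assumes `grpc-netty` is on the classpath under the unshaded `io.grpc` package (the Ozone code base may use a shaded package instead), and it registers no services. The durations are the ones shown in the diff snippets in this PR.

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical class name; illustration only, not the XceiverServerGrpc code.
public class IdleAwareServerSketch {

  static Server build(int port) {
    return NettyServerBuilder.forPort(port)
        // Gracefully close (GOAWAY) a connection that has had no
        // outstanding RPCs for 15 minutes.
        .maxConnectionIdle(15, TimeUnit.MINUTES)
        // After 5 minutes without read activity, send an HTTP/2 PING
        // to check that the peer is still reachable.
        .keepAliveTime(5, TimeUnit.MINUTES)
        // If the PING is not acknowledged within 30 seconds,
        // treat the connection as dead and close it.
        .keepAliveTimeout(30, TimeUnit.SECONDS)
        .build();
  }

  public static void main(String[] args) throws IOException {
    // Port 0 asks the OS for any free port; no services are registered.
    Server server = build(0).start();
    System.out.println("listening on port " + server.getPort());
    server.shutdownNow();
  }
}
```

Note that `maxConnectionIdle` only counts time with no RPCs in flight, so a long-lived client that keeps issuing requests is not disconnected by it.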
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/XceiverServerGrpc.java | Adds gRPC Netty server connection idle and keepalive settings. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java | Lowers the default gRPC server socket backlog and updates the config default. |
.executor(readExecutors)
// If a client does not send an actual functional business RPC for 15 minutes,
// the server kicks them off with a GOAWAY frame.
.maxConnectionIdle(15, TimeUnit.MINUTES)
Should we also set `maxConnectionAge()` to 1 hour?
For a long-running application, an Ozone client can exist for hours, even days.
// If the server fires a ping and the client fails to respond with a
// PING ACK within 30 seconds, the server assumes the socket is a dead
// "zombie connection" and immediately destroys the TCP socket.
.keepAliveTimeout(30, TimeUnit.SECONDS)
15s is a little aggressive.
// if the network wire or client machine is still alive.
.keepAliveTime(5, TimeUnit.MINUTES)
1min is a little aggressive.
What changes were proposed in this pull request?
HDDS-15149 limits the number of pending connections allowed for the Datanode gRPC server.
This JIRA adds another defensive protection: close a connection if it has been idle for 15 minutes.
It also reduces so.backlog from the default 4096 to 256. 4096 seems excessive, and the IPC server's so.backlog is also 256.
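As a rough illustration of what the backlog value controls, here is a minimal sketch using plain `java.net.ServerSocket` rather than the Netty channel option the Datanode gRPC server would use. The class name `BacklogSketch` and its `open` helper are hypothetical. The backlog is the requested maximum length of the queue of connections that the OS has accepted but the application has not yet picked up; its exact semantics are implementation specific, and the OS may cap the requested value.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical class name; illustration only.
public class BacklogSketch {

  // New default proposed in this PR (previously 4096).
  static final int BACKLOG = 256;

  // Opens a listening socket on the loopback address with the
  // requested pending-connection queue length.
  static ServerSocket open(int port, int backlog) throws IOException {
    return new ServerSocket(port, backlog, InetAddress.getLoopbackAddress());
  }

  public static void main(String[] args) throws IOException {
    // Port 0 asks the OS for any free port.
    try (ServerSocket socket = open(0, BACKLOG)) {
      System.out.println("listening on port " + socket.getLocalPort()
          + " with requested backlog " + BACKLOG);
    }
  }
}
```

A smaller backlog means that, under a connection flood, excess connection attempts are refused or dropped at the OS level sooner instead of queuing up behind a busy server.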
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-15382
How was this patch tested?
Existing unit tests.