RATIS-2385. Don't keep failed requests in the RetryCache.#1339
RATIS-2385. Don't keep failed requests in the RetryCache.#1339ss77892 wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
@ss77892 , thanks a lot for working on this!
You are right that we should clean up the cache entries for NotLeaderException.
- Indeed, we should also clean up all the cache entries for non-StateMachineException.
- BTW, the current code does not check the return value from CompletableFuture.complete(..)/completeExceptionally(..).
Could you also fix the problems above?
See also https://issues.apache.org/jira/secure/attachment/13080683/1339_review.patch
|
@ss77892 , on second thought, we should not invalidate any cache entries. Ratis retry cache is designed to have unique cache entries.
Such design makes thing simple since all the replies are always fixed. In contrast, if we invalidate a NotLeaderException entry. It could cause problem like below:
|
|
@szetszwo We have another leadership check at the beginning, so we don't store the entry in the cache if we are not the leader at the moment. So, it would behave exactly in the same way even without the patch. The patch removes an inconsistency in behavior that might occur when leadership changes during the event processing (the node was the leader when it received the event, but it's not a leader when it wants to add the event to the log). |
Could you point out the code? |
What changes were proposed in this pull request?
If the leader has changed after we placed the request in the RetryCache but before we wrote it to the log file, we should remove it from the cache.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2385
How was this patch tested?
Manual tests with Ozone. Under load with a continuously changing leader, NPEs occur during retry operations. No exception occurs; the patch and retry operations complete successfully.