RATIS-2385. Don't keep failed requests in the RetryCache. by ss77892 · Pull Request #1339 · apache/ratis

ss77892 · 2026-02-03T18:58:21Z

What changes were proposed in this pull request?

If the leader has changed after we placed the request in the RetryCache but before we wrote it to the log file, we should remove it from the cache.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2385

How was this patch tested?

Manual tests with Ozone. Under load with a continuously changing leader, NPEs occur during retry operations. No exception occurs; the patch and retry operations complete successfully.

szetszwo

@ss77892 , thanks a lot for working on this!

You are right that we should clean up the cache entries for NotLeaderException.

Indeed, we should also clean up all the cache entries for non-StateMachineException.
BTW, the current code does not check the return value from CompletableFuture.complete(..)/completeExceptionally(..).

Could you also fix the problems above?

See also https://issues.apache.org/jira/secure/attachment/13080683/1339_review.patch

szetszwo · 2026-02-07T16:39:26Z

@ss77892 , on second thought, we should not invalidate any cache entries. Ratis retry cache is designed to have unique cache entries.

If a call has replied normally or exceptionally, the calls with the same ClientInvocationId should always receive the same result.
For retires, the client always uses a different ClientInvocationId.

Such design makes thing simple since all the replies are always fixed.

In contrast, if we invalidate a NotLeaderException entry. It could cause problem like below:

Client sends a call.
Server replies NotLeaderException and invalidates the entry.
Client has not received the NotLeaderException reply (due to slow network). It retries with the same ClientInvocationId.
Server replies replies again with a different result.
Client receives the second result.
Client receives the NotLeaderException

ss77892 · 2026-02-07T17:31:04Z

@szetszwo We have another leadership check at the beginning, so we don't store the entry in the cache if we are not the leader at the moment. So, it would behave exactly in the same way even without the patch. The patch removes an inconsistency in behavior that might occur when leadership changes during the event processing (the node was the leader when it received the event, but it's not a leader when it wants to add the event to the log).

szetszwo · 2026-02-07T23:04:27Z

... We have another leadership check at the beginning, ...

Could you point out the code?

RATIS-2385. Don't keep failed requests in the RetryCache.

4c8bb96

ss77892 requested a review from szetszwo February 4, 2026 17:42

szetszwo reviewed Feb 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RATIS-2385. Don't keep failed requests in the RetryCache.#1339

RATIS-2385. Don't keep failed requests in the RetryCache.#1339
ss77892 wants to merge 1 commit intoapache:masterfrom
ss77892:RATIS-2385

ss77892 commented Feb 3, 2026

Uh oh!

szetszwo left a comment •

edited

Loading

Uh oh!

szetszwo commented Feb 7, 2026

Uh oh!

ss77892 commented Feb 7, 2026

Uh oh!

szetszwo commented Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ss77892 commented Feb 3, 2026

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

szetszwo left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szetszwo commented Feb 7, 2026

Uh oh!

ss77892 commented Feb 7, 2026

Uh oh!

szetszwo commented Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

szetszwo left a comment •

edited

Loading