DOC-6252 sections about failover behaviour when all endpoints are unhealthy #2768
andy-stark-redis merged 3 commits into main
Conversation
Thanks @dwdougherty !
ggivo
left a comment
LGTM from Jedis perspective
> in the [Retry configuration]({{< relref "#retry-configuration" >}}) section). However, if the client exhausts
> all the available failover attempts before any endpoint becomes healthy again, commands will throw a `JedisPermanentlyNotAvailableException`. The client won't recover automatically from this situation, so you
> should handle it by reconnecting with the `MultiDBClient` builder after a suitable delay (see
> [Failover configuration](#failover-configuration) for a connection example).
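The reconnect-after-delay handling that the quoted section recommends can be sketched in plain Java. This is a self-contained illustration only: the exception class below is a local stand-in for Jedis' `JedisPermanentlyNotAvailableException`, and `runCommand` is a hypothetical command that fails while all endpoints are unhealthy and then succeeds, so no Jedis dependency is needed to run it.

```java
public class FailoverRetrySketch {
    // Stand-in for JedisPermanentlyNotAvailableException, so this
    // sketch compiles and runs without the Jedis library.
    static class PermanentlyNotAvailable extends RuntimeException {}

    private int failuresLeft;

    FailoverRetrySketch(int failuresLeft) { this.failuresLeft = failuresLeft; }

    // Hypothetical command: throws while endpoints are "unhealthy", then succeeds.
    String runCommand() {
        if (failuresLeft-- > 0) throw new PermanentlyNotAvailable();
        return "OK";
    }

    // Retry the command with a pause between attempts. In a real app you
    // would rebuild the client with the MultiDBClient builder inside the
    // catch block instead of simply retrying the same command.
    String runWithRecovery(int maxAttempts, long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return runCommand();
            } catch (PermanentlyNotAvailable e) {
                Thread.sleep(delayMillis); // wait before reconnecting
            }
        }
        return null; // gave up: surface the outage to the caller
    }

    public static void main(String[] args) throws InterruptedException {
        FailoverRetrySketch sketch = new FailoverRetrySketch(2);
        System.out.println(sketch.runWithRecovery(5, 10)); // prints OK
    }
}
```

The delay and attempt count are placeholders; in practice they should match the failover attempt/delay settings discussed elsewhere in this PR.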
On a second look, I don’t think this is technically correct.
Even after a JedisPermanentlyNotAvailableException, if an endpoint becomes healthy again, the client can recover.
JedisPermanentlyNotAvailableException just means that there were no healthy connections for a configured amount of time, so we treat it as a permanent error at that moment. It doesn’t necessarily mean the client is incapable of recovering later.
It also looks like we’re missing an integration test for this scenario — e.g. recovery after a JedisPermanentlyNotAvailableException has already been thrown.
@atakavci — any concerns if we clarify this behavior in the docs around JedisPermanentlyNotAvailableException, i.e. that the client can recover?
@ggivo , agreed.
JedisPermanentlyNotAvailableException is how Jedis signals to the application that the "all unhealthy" state has been stable for some period of time and the configured number of attempts (with the configured delay) is already exhausted. Upon receiving this type of exception, the application can decide how to react to a consistent/stable availability issue.
@atakavci @ggivo OK, so after the app gets a JedisPermanentlyNotAvailableException does Jedis still keep trying to find a healthy endpoint automatically in the background (so if you try a command again a bit later then it might succeed)? Or do you have to add some code to handle this explicitly from the app (eg, use isHealthy to check all the current endpoints and then use setActiveDatabase to start using a healthy endpoint if you can find one)?
@andy-stark-redis
thank you for raising the question; it looks like this was a bit of a gray area.
When a client instance hits the all-unhealthy case:
- If failback is enabled, it will automatically recover and switch to a healthy database on the first run of the periodic failback execution, without user intervention.
- If failback is disabled, the user will need to verify a healthy endpoint and explicitly call setActiveDatabase to switch to it.
You can check the test I am introducing with this PR.
@ggivo please let me know what you think of it.
Beyond the question, it also made me think it could be a good improvement to trigger a check of whether the all-unhealthy situation is resolved, either on a health state change, on any incoming command request, or maybe both. I'll take a closer look when I find the time.
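The failback-disabled path described above (verify a healthy endpoint, then explicitly call setActiveDatabase) could look roughly like the following. Only the method names `isHealthy` and `setActiveDatabase` come from this discussion; `ClientStub` and its constructor are hypothetical stand-ins, not the real MultiDBClient API, so the sketch stays self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ManualFailbackSketch {
    // Hypothetical stand-in for a multi-database client; only the
    // isHealthy/setActiveDatabase method names mirror the discussion.
    static class ClientStub {
        private final Map<String, Boolean> health;
        String active;

        ClientStub(Map<String, Boolean> health, String active) {
            this.health = health;
            this.active = active;
        }

        boolean isHealthy(String endpoint) {
            return health.getOrDefault(endpoint, false);
        }

        void setActiveDatabase(String endpoint) { active = endpoint; }
    }

    // With failback disabled: scan the known endpoints and switch
    // to the first one that reports healthy, if any.
    static Optional<String> recover(ClientStub client, List<String> endpoints) {
        for (String endpoint : endpoints) {
            if (client.isHealthy(endpoint)) {
                client.setActiveDatabase(endpoint);
                return Optional.of(endpoint);
            }
        }
        return Optional.empty(); // still all unhealthy; try again later
    }

    public static void main(String[] args) {
        ClientStub client = new ClientStub(
                Map.of("east", false, "west", true), "east");
        recover(client, List.of("east", "west"));
        System.out.println(client.active); // prints west
    }
}
```

With failback enabled, none of this is needed: per the comment above, the periodic failback execution performs the equivalent switch automatically.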
Fixed. Thanks for the clarification.
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR (security scan by Jit).
Added info about this based on customer feedback. The corresponding section for the Lettuce geo failover page will be added in a separate PR.
Note
Low Risk
Low risk documentation-only changes; the main risk is incorrect exception/option naming that could mislead users configuring failover.
Overview
Expands the client-side geographic failover docs to better describe health check strategies (ping, lag-aware via REST API, and custom) in the main overview.
Adds new guidance for Jedis and redis-py on what happens when all endpoints are unhealthy, including the exceptions thrown, how long the client keeps probing based on failover attempt/delay settings, and suggested retry/reconnect handling. Also clarifies redis-py troubleshooting advice around timeouts and `LagAwareHealthCheck` configuration.
Written by Cursor Bugbot for commit 817863f.