DOC-6252 sections about failover behaviour when all endpoints are unhealthy #2768

Merged

andy-stark-redis merged 3 commits into main from DOC-6252-failover-no-dbs on Feb 27, 2026

Conversation

andy-stark-redis (Contributor) commented on Feb 10, 2026

Added info about this based on customer feedback. The corresponding section for the Lettuce geo failover page will be added in a separate PR.


Note: Low Risk

Low-risk, documentation-only changes; the main risk is incorrect exception/option naming that could mislead users configuring failover.

Overview
Expands the client-side geographic failover docs to better describe health check strategies (ping, lag-aware via REST API, and custom) in the main overview.

Adds new guidance for Jedis and redis-py on what happens when all endpoints are unhealthy, including the exceptions thrown, how long the client keeps probing based on failover attempt/delay settings, and suggested retry/reconnect handling. Also clarifies redis-py troubleshooting advice around timeouts and LagAwareHealthCheck configuration.
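The "how long the client keeps probing" point above can be illustrated with a small, self-contained sketch. The names here (`maxAttempts`, `delayMs`, the health-check supplier) are illustrative stand-ins, not the actual Jedis configuration options: a client that retries up to N times with a fixed delay between probes gives up after roughly N × delay milliseconds, at which point it surfaces a "permanently not available" style error to the application.

```java
// Illustrative sketch only: a bounded probe loop, not the real Jedis internals.
import java.util.function.BooleanSupplier;

public class FailoverProbe {
    /** Probes until an endpoint is healthy or the attempt budget is exhausted.
     *  Returns true if an endpoint became healthy within the budget. */
    public static boolean probe(BooleanSupplier isAnyEndpointHealthy,
                                int maxAttempts, long delayMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isAnyEndpointHealthy.getAsBoolean()) {
                return true; // recovered before the budget ran out
            }
            Thread.sleep(delayMs); // wait before the next probe
        }
        // Budget exhausted: a real client would now throw a
        // "permanently not available" style exception to the application.
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated health check that starts succeeding on the third probe.
        int[] calls = {0};
        boolean recovered = probe(() -> ++calls[0] >= 3, 5, 10);
        System.out.println("recovered=" + recovered + " probes=" + calls[0]);
    }
}
```

With 5 attempts and a 10 ms delay, the worst-case probing window in this sketch is about 50 ms; the documented attempt/delay settings bound the real client's window the same way.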

Written by Cursor Bugbot for commit 817863f. This will update automatically on new commits.

andy-stark-redis requested a review from a team on February 10, 2026 16:21
andy-stark-redis self-assigned this on Feb 10, 2026
andy-stark-redis added the clients (Client library docs) label on Feb 10, 2026
github-actions bot commented on Feb 10, 2026

DOC-6252

@dwdougherty (Collaborator) left a comment


LGTM.

@andy-stark-redis (Contributor, Author)

Thanks @dwdougherty !

@ggivo (Contributor) left a comment


LGTM from Jedis perspective

Comment on lines +422 to +426
in the [Retry configuration]({{< relref "#retry-configuration" >}}) section). However, if the client exhausts
all the available failover attempts before any endpoint becomes healthy again, commands will throw a `JedisPermanentlyNotAvailableException`. The client won't recover automatically from this situation, so you
should handle it by reconnecting with the `MultiDBClient` builder after a suitable delay (see
[Failover configuration](#failover-configuration) for a connection example).

@ggivo (Contributor) commented on Feb 11, 2026


On a second look, I don’t think this is technically correct.

Even after a JedisPermanentlyNotAvailableException, if an endpoint becomes healthy again, the client can recover.

JedisPermanentlyNotAvailableException just means that there were no healthy connections for a configured amount of time, so we treat it as a permanent error at that moment. It doesn’t necessarily mean the client is incapable of recovering later.

It also looks like we’re missing an integration test for this scenario — e.g. recovery after a JedisPermanentlyNotAvailableException has already been thrown.

@atakavci — any concerns if we clarify this behavior in the docs around JedisPermanentlyNotAvailableException, given that it can recover?

@atakavci (Contributor)


@ggivo , agreed.
JedisPermanentlyNotAvailableException is the way Jedis signals to the application that the "all unhealthy" state has been stable for some period of time and the configured number of attempts (with the configured delay between them) is already exhausted. Upon receiving this type of exception, the application can decide how to react to a consistent, stable availability issue.

@andy-stark-redis (Contributor, Author)

@atakavci @ggivo OK, so after the app gets a JedisPermanentlyNotAvailableException, does Jedis still keep trying to find a healthy endpoint automatically in the background (so that if you try a command again a bit later, it might succeed)? Or do you have to add some code to handle this explicitly in the app (e.g., use isHealthy to check all the current endpoints and then use setActiveDatabase to start using a healthy endpoint if you can find one)?

@atakavci (Contributor)

@andy-stark-redis
Thank you for raising the question; it looks like this was a gray area.

When a client instance hits the all-unhealthy case:

  • If failback is enabled, it will automatically recover and switch to a healthy database on the first run of the periodic failback execution, without user intervention.
  • If failback is disabled, the user will need to verify a healthy endpoint and explicitly call setActiveDatabase to switch to it.

You can check the test I am introducing with this PR.
@ggivo please let me know what you think of it.

Beyond the question, it also made me think it could be a good improvement to trigger a check for whether the all-unhealthy situation is resolved, either on a health state change or on any incoming command request, or maybe both. I'll take a closer look when I find the time.
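The failback-disabled case described above can be sketched as application-side control flow. This is a simplified stand-in, not the real Jedis API: the endpoint list, the health predicate, and the "switch" step are simulated, and in a real app the health check and switch would go through the client's own isHealthy/setActiveDatabase calls.

```java
// Hypothetical stand-ins for the real client API, illustrating the
// "failback disabled" recovery flow: after a permanent-unavailability
// error, scan endpoints for a healthy one and switch to it explicitly.
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class ManualFailback {
    /** Returns the first healthy endpoint, if any. */
    static Optional<String> findHealthy(List<String> endpoints,
                                        Predicate<String> isHealthy) {
        return endpoints.stream().filter(isHealthy).findFirst();
    }

    public static void main(String[] args) {
        List<String> endpoints = List.of("db-east:6379", "db-west:6379");
        // Simulated health state: only db-west is currently healthy.
        Predicate<String> isHealthy = e -> e.startsWith("db-west");

        // After catching the permanent-unavailability exception, the app
        // verifies endpoints itself and switches to a healthy one, or
        // schedules a later retry if none is available yet.
        findHealthy(endpoints, isHealthy).ifPresentOrElse(
            healthy -> System.out.println("switching to " + healthy),
            () -> System.out.println("no healthy endpoint yet; retry later"));
    }
}
```

With failback enabled, none of this is needed: per the comment above, the periodic failback execution performs the equivalent of this scan-and-switch automatically.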

@andy-stark-redis (Contributor, Author)

Fixed. Thanks for the clarification.

jit-ci bot commented on Feb 27, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

andy-stark-redis merged commit eb8720c into main on Feb 27, 2026
5 checks passed
andy-stark-redis deleted the DOC-6252-failover-no-dbs branch on February 27, 2026 15:22

Labels

clients Client library docs

4 participants