
SE.Redis Connection Storms & High CPU #3036

@Pavanlalit

Description


When StackExchange.Redis connects to a Redis instance through a proxy layer (e.g., Envoy, HAProxy, or cloud-managed Redis proxies like Azure Managed Redis), a backend node failure triggers an infinite, high-frequency connection storm that bypasses the library's default exponential backoff algorithm.

This effectively lets the client DDoS the proxy with hundreds of connection attempts per second per instance, leading to SNAT port exhaustion, ThreadPool starvation, and CPU spikes at the proxy layer.

To Reproduce

The issue stems from how cloud proxies handle load-shedding when their backend shards are dead or unresponsive (e.g., hitting 100% CPU).

  1. Client (SE.Redis) sits behind a TCP/TLS Proxy. The Proxy sits in front of the Redis server.
  2. The backend Redis server becomes completely unresponsive.
  3. The Proxy process is still healthy, so it accepts the client's incoming TCP handshake (SYN -> SYN-ACK).
  4. SE.Redis treats this as a successful connection establishment and resets its internal reconnect-backoff counters to zero.
  5. The Proxy attempts to route the connection/initial commands to the backend Redis node, realizes the node is dead, and violently sheds the load by dropping the socket (SO_LINGER = 0).
  6. The client OS receives a TCP RST (surfaced as SocketException 10054: An existing connection was forcibly closed by the remote host).
  7. Because the failure counter was just reset to zero (due to the successful TCP proxy handshake), SE.Redis immediately attempts to reconnect without any delay.
  8. The loop repeats infinitely.
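The interaction between steps 4 and 7 can be illustrated with a generic backoff function (this is not StackExchange.Redis's actual implementation; the base delay and cap below are arbitrary). Because the proxy's SYN-ACK resets the consecutive-failure counter before each drop, every retry is computed from a zero failure count and therefore gets a zero delay:

```python
def exponential_backoff_ms(consecutive_failures: int,
                           base_ms: int = 1000,
                           cap_ms: int = 30_000) -> int:
    """Delay before the next reconnect attempt, growing with the
    number of consecutive failures (capped)."""
    if consecutive_failures == 0:
        # Counter was just reset by the "successful" TCP handshake:
        # no penalty, so the next attempt is scheduled immediately.
        return 0
    return min(cap_ms, base_ms * 2 ** (consecutive_failures - 1))

# Normal outage: the counter climbs, so delays grow: 1000, 2000, 4000 ms, ...
# Proxy scenario: the counter is reset to 0 before every failure,
# so every retry is scheduled with a 0 ms delay -> connection storm.
```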

Expected behavior

The ConnectionMultiplexer should recognize that the connection was dropped during the initial routing/handshake phase (before actual Redis commands or heartbeats successfully flow) and should correctly apply the default Reconnect Retry Policy (exponential backoff).

Since we did not explicitly configure a custom ReconnectRetryPolicy in our ConfigurationOptions, we expected the default backoff to prevent a storm. Could the library avoid treating a mere TCP SYN-ACK from a proxy as a fully healthy connection that warrants resetting the consecutive-failure penalty?

Actual behavior

The library enters a tight, infinite loop. Our proxy-level telemetry showed operations/sec spiking from a baseline of ~50k to over 100,000, consisting entirely of failed connection handshakes, while actual GET/SET commands stayed flat at zero.

Environment details

  • StackExchange.Redis Version: v2.11.8
  • OS: Windows / Linux
  • .NET Version: .NET 8.0
  • Architecture: Distributed clients connecting via an intermediate proxy to a clustered Redis backend.

Additional context

We verified this behavior locally using a simple Python TCP relay to mimic the proxy. If the relay accepts the connection but forcefully closes it (SO_LINGER enabled with l_linger = 0) before routing any data, StackExchange.Redis falls into the exact same connection-storm loop.
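A minimal sketch of that relay (the port number is arbitrary; point the client at it to reproduce). It completes the TCP handshake, then closes the socket with SO_LINGER set to zero so the peer receives an RST rather than a clean FIN:

```python
import socket
import struct
import threading

def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a listening socket; port=0 picks a free ephemeral port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)
    return srv

def shed_connection(srv: socket.socket) -> None:
    """Accept one connection (completing the SYN -> SYN-ACK handshake),
    then close it with l_onoff=1, l_linger=0 so close() emits an RST --
    mimicking the proxy violently shedding load for a dead backend."""
    conn, _ = srv.accept()
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()  # client sees SocketException 10054 / ECONNRESET

if __name__ == "__main__":
    srv = make_listener(port=6380)  # hypothetical local proxy port
    while True:
        shed_connection(srv)
```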

We also confirmed that if we explicitly inject a custom IReconnectRetryPolicy into the configuration options that forces a delay or rejects retries, the library does obey it and mitigates the storm. This isolates the issue to how the default state machine handles the reset counter upon receiving a proxy's SYN-ACK followed by an immediate connection drop.
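For reference, the logic of the mitigating policy can be sketched language-agnostically (the real mitigation is a C# IReconnectRetryPolicy implementation; the method shape and the 5-second floor here are illustrative only). The key idea is to gate retries purely on elapsed time, ignoring the resettable failure counter:

```python
class MinDelayRetryPolicy:
    """Sketch of a retry policy that ignores the library's (resettable)
    consecutive-failure counter and gates purely on the time elapsed
    since the last reconnect attempt."""

    def __init__(self, min_delay_ms: int = 5000):
        self.min_delay_ms = min_delay_ms

    def should_retry(self, current_retry_count: int,
                     ms_since_last_retry: int) -> bool:
        # Even if current_retry_count was reset to 0 by the proxy's
        # SYN-ACK, the elapsed-time floor still throttles the loop.
        return ms_since_last_retry >= self.min_delay_ms
```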
