fix: Properly shutdown quickwit-serve when subcomponents panic or otherwise error.#6196

Open
philip-wernersbach wants to merge 1 commit intoquickwit-oss:mainfrom
philip-wernersbach:patch-1
@philip-wernersbach philip-wernersbach commented Mar 9, 2026

Description

Before this change, the if let Err block logs the error and then discards it, so execution continues on to the shutdown_handle.await call. When tokio::try_join! returns an error (for example, when any of the three components behind the three JoinHandle arguments panics), the shutdown_handle is not guaranteed to have completed, so the program keeps waiting for a SIGTERM even though some components are no longer running.

Context

We are seeing the following chitchat panic in prod, in our metastore pods. After the panic message is printed, the ERROR quickwit_serve: server failed: Chitchat server panicked message is printed, and the program then waits for a SIGTERM. No further log messages are printed until the SIGTERM occurs. Meanwhile, our quickwit-indexer pods do not work, because chitchat communication with the metastore is broken.

thread 'main_runtime_thread' (17) panicked at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/chitchat-0.10.0/src/state.rs:605:17:
assertion failed: monotonic_property_after >= monotonic_property_before
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2026-03-06T13:59:04.262Z ERROR quickwit_serve: server failed: Chitchat server panicked

How was this PR tested?

Built a custom Docker image with this fix, tested in prod:

  1. Happy path: SIGTERM still shuts down pods:
2026-03-10T14:45:27.543Z  INFO quickwit_cli::service: SIGTERM received
2026-03-10T14:45:27.543Z  INFO quickwit_serve::rest: REST server shutdown signal received
2026-03-10T14:45:27.543Z  INFO quickwit_serve::rest: gracefully shutdown
2026-03-10T14:45:27.545Z  INFO quickwit_serve: waiting for services to shutdown
2026-03-10T14:45:28.546Z  INFO quickwit_cli::service: quickwit successfully terminated
  2. Error path: chitchat panic causes pod to shut down:
thread 'main_runtime_thread' (8) panicked at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/chitchat-0.10.0/src/state.rs:605:17:
assertion failed: monotonic_property_after >= monotonic_property_before
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2026-03-10T02:54:34.891Z ERROR quickwit_serve: server failed: Chitchat server panicked
2026-03-10T02:54:34.891Z  INFO quickwit_serve: waiting for services to shutdown
2026-03-10T02:54:34.891Z  INFO quickwit_serve::rest: REST server shutdown signal received
2026-03-10T02:54:34.891Z  INFO quickwit_serve::rest: gracefully shutdown
2026-03-10T02:54:35.892Z  INFO quickwit_cli::service: #Quickwit successfully terminated

@philip-wernersbach philip-wernersbach force-pushed the patch-1 branch 2 times, most recently from 904ac62 to c46f2a5 Compare March 9, 2026 21:24
@philip-wernersbach
Author

Added results from a test in prod.
