Fix 408 timeout caused by ignoring DATA frames after GOAWAY#12936
Fix 408 timeout caused by ignoring DATA frames after GOAWAY#12936yknoya wants to merge 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes HTTP/2 408/504 timeout errors that occur when ATS sends a GOAWAY frame due to excessive stream error rates. Previously, after sending a GOAWAY with ENHANCE_YOUR_CALM, ATS would stop processing all incoming frames, causing in-flight requests with pending DATA frames to time out. The fix continues processing frames per RFC 9113 §6.8, which requires minimally processing HEADERS/CONTINUATION (for HPACK table consistency) and counting DATA frames toward the connection flow-control window, even for streams beyond the last-stream-ID.
Changes:
- Modified the frame-reading loop in
Http2CommonSession.ccto only stop on fatal errors (notENHANCE_YOUR_CALM), and moved GOAWAY sending to happen before frame reading so that subsequent frames can be processed with GOAWAY-aware filtering. - Added GOAWAY-aware frame filtering in
Http2ConnectionState::rcv_frameand individual frame handlers (rcv_data_frame,rcv_headers_frame,rcv_continuation_frame) to discard frames for refused streams while maintaining HPACK table consistency and flow-control accounting. - Added
goaway_sentandlast_stream_id_txstate tracking toHttp2ConnectionStateandreset_header_after_decoding/free_stream_after_decodingflags toHttp2Streamfor managing temporary streams created solely for HPACK decoding.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/proxy/http2/Http2ConnectionState.cc |
Added GOAWAY-aware filtering in rcv_frame, rcv_data_frame, rcv_headers_frame, and rcv_continuation_frame; sets goaway_sent/last_stream_id_tx in send_goaway_frame |
src/proxy/http2/Http2CommonSession.cc |
Changed error check to allow continued frame processing after ENHANCE_YOUR_CALM GOAWAY; moved GOAWAY sending before frame reading; changed break to return 0 for fatal errors |
include/proxy/http2/Http2ConnectionState.h |
Added goaway_sent, last_stream_id_tx members and get_goaway_sent() accessor |
include/proxy/http2/Http2Stream.h |
Added reset_header_after_decoding and free_stream_after_decoding public flags |
You can also share your feedback on Copilot code review. Take the survey.
| if (stream->free_stream_after_decoding) { | ||
| stream->initiating_close(); | ||
| } |
There was a problem hiding this comment.
Same issue as in rcv_headers_frame: initiating_close() may send an RST_STREAM(NO_ERROR) if the stream is in a writeable state, and then the HTTP2_ERROR_CLASS_STREAM return at line 1144 causes rcv_frame() to send a second RST_STREAM(REFUSED_STREAM). This results in duplicate RST_STREAM frames for the same stream.
| if (this->connection_state.get_stream_error_rate() > std::min(1.0, Http2::stream_error_rate_threshold * 2.0) && | ||
| !this->connection_state.get_goaway_sent()) { | ||
| ip_port_text_buffer ipb; | ||
| const char *peer_ip = ats_ip_ntop(this->get_proxy_session()->get_remote_addr(), ipb, sizeof(ipb)); | ||
| SiteThrottledWarning("HTTP/2 session error peer_ip=%s session_id=%" PRId64 | ||
| " closing a connection, because its stream error rate (%f) exceeded the threshold (%f)", | ||
| peer_ip, this->get_connection_id(), this->connection_state.get_stream_error_rate(), | ||
| Http2::stream_error_rate_threshold); | ||
| err = Http2ErrorCode::HTTP2_ERROR_ENHANCE_YOUR_CALM; | ||
| this->connection_state.send_goaway_frame(this->connection_state.get_latest_stream_id_in(), | ||
| Http2ErrorCode::HTTP2_ERROR_ENHANCE_YOUR_CALM); | ||
| this->set_half_close_local_flag(true); |
There was a problem hiding this comment.
This PR fixes a significant behavioral issue (408 timeouts caused by ignoring DATA frames after GOAWAY), but doesn't include any test coverage for the new behavior. Given the existing test infrastructure in tests/gold_tests/h2/ and the complexity of this fix (continuing to process frames after GOAWAY, minimal HPACK processing for HEADERS/CONTINUATION on refused streams, flow-control counting for DATA), it would be valuable to add an integration test that validates the core scenario: sending multiple POST requests over a single HTTP/2 connection where the stream error rate threshold is exceeded, and verifying that in-flight requests with pending DATA frames complete successfully rather than timing out.
|
|
||
| // If this was an outbound connection and the state was already closed, just clear the | ||
| // headers after processing. We just processed the heaer blocks to keep the dynamic table in | ||
| // We just processed the heaer blocks to keep the dynamic table in |
There was a problem hiding this comment.
Typo: "heaer" should be "header" in the comment.
| Http2ErrorCode result = stream->decode_header_blocks(*this->local_hpack_handle, | ||
| this->acknowledged_local_settings.get(HTTP2_SETTINGS_HEADER_TABLE_SIZE)); | ||
|
|
||
| // We just processed the heaer blocks to keep the dynamic table in |
There was a problem hiding this comment.
Typo: "heaer" should be "header" in the comment.
| if (free_stream_after_decoding) { | ||
| THREAD_FREE(stream, http2StreamAllocator, this_ethread()); | ||
| stream->initiating_close(); | ||
| } |
There was a problem hiding this comment.
When initiating_close() is called on the temporary stream here, the stream's state has already transitioned from IDLE to OPEN (via change_state at line 481), making is_state_writeable() return true. This causes initiating_close() to send an RST_STREAM with HTTP2_ERROR_NO_ERROR. Then, when this function returns the HTTP2_ERROR_CLASS_STREAM error at line 512, rcv_frame() sends a second RST_STREAM with HTTP2_ERROR_REFUSED_STREAM (line 1556 of rcv_frame). This results in duplicate RST_STREAM frames for the same stream. Consider either not calling initiating_close() and instead using THREAD_FREE directly (as the old code did for the outbound case), or setting the stream state to CLOSED before calling initiating_close() to prevent the extra RST_STREAM.
Issue
When multiple POST requests are sent over a single HTTP/2 connection, 408/504 (timeout) responses may occur if the HTTP/2 stream error rate exceeds the configured threshold.
When the stream error rate exceeds the threshold, ATS sends a GOAWAY frame in the following code path. However, a timeout occurs between sending the GOAWAY frame and closing the connection with the client.
trafficserver/src/proxy/http2/Http2CommonSession.cc
Lines 371 to 393 in 00604e1
Root Cause
The issue is caused by ATS canceling HTTP/2 frame reception processing after sending the GOAWAY frame.
trafficserver/src/proxy/http2/Http2CommonSession.cc
Lines 363 to 369 in 00604e1
Specifically, while handling multiple POST requests, the timeout occurs in the following sequence:
I reviewed RFC 9113 regarding the handling of frames received after sending a GOAWAY frame. The following description indicates that canceling the reception of all frames violates the RFC:
https://datatracker.ietf.org/doc/html/rfc9113#name-goaway
Fix
ATS has been modified to continue receiving frames even after the HTTP/2 stream error rate exceeds the threshold and a GOAWAY frame is sent.
However, if the stream ID of a received frame is greater than the Last-Stream-ID specified at the time the GOAWAY frame was sent, frame processing is limited to HEADERS, PUSH_PROMISE, CONTINUATION, and DATA frames.
In such cases, ATS performs only the minimal processing required by RFC 9113 and sends an RST_STREAM frame.