Skip to content

Don't pool origin connections with unconsumed request body#12926

Open
JakeChampion wants to merge 4 commits intoapache:masterfrom
JakeChampion:jake/100
Open

Don't pool origin connections with unconsumed request body#12926
JakeChampion wants to merge 4 commits intoapache:masterfrom
JakeChampion:jake/100

Conversation

@JakeChampion
Copy link
Contributor

@JakeChampion JakeChampion commented Feb 27, 2026

When an origin responds before consuming a request body sent with Expect: 100-continue, ATS was returning the origin connection to the pool with leftover body data in the TCP stream
The next request reusing that connection would see the leftover bytes as a corrupted response

Fix by recording server_request_body_bytes when aborting the tunnel early, then checking it in both connection pooling paths to prevent reuse when a request body was in flight

When an origin responds before consuming a request body sent with `Expect: 100-continue`, ATS was returning the origin connection to the pool with leftover body data in the TCP stream
The next request reusing that connection would see the leftover bytes as a corrupted response

Fix by recording `server_request_body_bytes` when aborting the tunnel early, then checking it in both connection pooling paths to prevent reuse when a request body was in flight
@JakeChampion JakeChampion changed the title Add tests for 100-continue connection pool corruption scenario Don't pool origin connections with unconsumed request body Feb 27, 2026
@JakeChampion JakeChampion marked this pull request as ready for review February 27, 2026 16:45
@brbzull0 brbzull0 self-requested a review February 27, 2026 18:44
@brbzull0
Copy link
Contributor

I'll have a look at this. @JakeChampion are you on slack?

@JakeChampion
Copy link
Contributor Author

I'll have a look at this. @JakeChampion are you on slack?

I am not, I requested an invite some time last year but I don't think it's been accepted

@brbzull0
Copy link
Contributor

I'll have a look at this. @JakeChampion are you on slack?

I am not, I requested an invite some time last year but I don't think it's been accepted

Ohh, sorry about that. I've just had a look and couldn't find the email. Please send another email to the users@trafficserver.apache.org list and I'll help you with that.

Thanks.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a connection pooling corruption issue in ATS where an origin connection could be returned to the pool with unread request-body bytes still in the TCP stream (notably with Expect: 100-continue and an early origin response), causing subsequent reused-connection requests to see corrupted data.

Changes:

  • Track incomplete origin request-body forwarding via a new HttpSM flag and use it to block origin connection pooling.
  • Record server_request_body_bytes when aborting a live request-body tunnel early.
  • Add a new gold test that reproduces the corruption scenario using a custom client/origin pair.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/proxy/http/HttpSM.cc Records request-body bytes on early tunnel abort; blocks pooling when request-body is incomplete; adds an additional Expect: 100-continue pooling guard.
include/proxy/http/HttpSM.h Adds server_request_body_incomplete flag to HttpSM.
tests/gold_tests/post/expect-100-continue-corruption.test.py New gold test harness for the corruption regression.
tests/gold_tests/post/corruption_origin.py Origin simulator that replies early and detects reuse/corruption on the same TCP connection.
tests/gold_tests/post/corruption_client.py Client that sends POST (with Expect: 100-continue) followed by a second request to detect corruption.

Comment on lines +62 to +66
if b'100' in resp1_data.split(b'\r\n')[0]:
after_100 = resp1_data.split(b'\r\n\r\n', 1)[1] if b'\r\n\r\n' in resp1_data else b''
if b'\r\n\r\n' not in after_100:
after_100 += wait_for_headers_complete(sock)
resp1_data = after_100
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 100-Continue detection is very loose (b'100' in <status-line>), which could mis-detect in edge cases (e.g., unusual reason phrases or malformed input). It would be more robust to check for an exact status code prefix such as HTTP/1.1 100 (or parse the status code as an integer) before stripping the 100 response.

Copilot uses AI. Check for mistakes.
Comment on lines +2088 to +2097
} else if (!server_request_body_incomplete && server_request_body_bytes > 0 &&
t_state.hdr_info.client_request.m_100_continue_sent) {
// When ATS proactively sent 100 Continue to the client
// (send_100_continue_response), the body tunnel was set up before the
// origin confirmed it would accept the body. The tunnel may have
// completed before the origin responded, but the origin might not have
// consumed the body data. Prevent connection pooling to avoid the next
// request on this connection seeing leftover body bytes as corruption.
server_request_body_incomplete = true;
t_state.current.server->keep_alive = HTTPKeepAlive::NO_KEEPALIVE;
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else if branch marks server_request_body_incomplete true for any transaction where ATS sent a proactive 100-Continue and server_request_body_bytes > 0. Since server_request_body_bytes is normally set to the bytes successfully written to the origin (e.g., in tunnel_handler_post_server), this condition will typically be true for all such POST/PUT requests, effectively disabling origin connection pooling in the common case (not just the early-response case). Consider tightening the condition so it only triggers when an early origin response is actually detected (e.g., response arrives while the request-body tunnel is still active / before request-body forwarding completes), or gate it on response statuses where skipping the body is expected.

Copilot uses AI. Check for mistakes.
Comment on lines 3203 to 3207
if (t_state.current.server->keep_alive == HTTPKeepAlive::KEEPALIVE && server_entry->eos == false &&
plugin_tunnel_type == HttpPluginTunnel_t::NONE && t_state.txn_conf->keep_alive_enabled_out == 1) {
plugin_tunnel_type == HttpPluginTunnel_t::NONE && t_state.txn_conf->keep_alive_enabled_out == 1 &&
!server_request_body_incomplete) {
close_connection = false;
} else {
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new server_request_body_incomplete condition, connections closed for this reason will fall into the existing metric path that increments origin_shutdown_tunnel_server_plugin_tunnel (the else branch’s final case), even when plugin_tunnel_type == NONE. This will misattribute the shutdown reason in stats. Consider adding a separate metric/counter for this new shutdown condition or adjusting the metric selection logic to account for server_request_body_incomplete explicitly.

Copilot uses AI. Check for mistakes.
Comment on lines 513 to +515
int64_t client_request_body_bytes = 0;
int64_t server_request_body_bytes = 0;
bool server_request_body_incomplete = false;
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server_request_body_incomplete is an HttpSM member and is only ever set to true in this change; it’s not cleared when the state machine issues a new origin request on a different server session (e.g., redirects, retries, or other multi-origin flows handled within the same HttpSM). That can cause subsequent, unrelated server sessions to be forcibly closed / not pooled. Consider resetting this flag at the start of each new server request / server session (alongside other per-origin-request state), rather than keeping it sticky for the entire HttpSM lifetime.

Copilot uses AI. Check for mistakes.
Comment on lines +123 to +143
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('', args.port))
sock.listen(5)
sock.settimeout(args.timeout + 5)

result = {'corrupted': False, 'new_connection': False}
threads = []
connections_handled = 0

try:
while connections_handled < 10:
try:
conn, _ = sock.accept()
t = threading.Thread(target=handle_connection, args=(conn, args, result))
t.daemon = True
t.start()
threads.append(t)
connections_handled += 1
except socket.timeout:
break
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The origin process can add ~10s of idle time to the test run: after handling the expected connections it continues blocking in accept() until sock.settimeout(args.timeout + 5) expires. This can make the gold test noticeably slower and more timing-sensitive. Consider exiting once the expected interactions have occurred (e.g., after the POST connection closes and a second connection is observed), or reduce the accept timeout / use an event to stop the accept loop promptly.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants