
fix: wait for hijacked connections to close during drain#16625

Open
immanuwell wants to merge 2 commits into knative:main from immanuwell:fix/hijack-tracker-drain

@immanuwell (Contributor)

Fixes #

Proposed Changes

  • keep hijacked connections counted as in-flight until the wrapped net.Conn is closed
  • preserve Flush() and Unwrap() on the tracker wrapper
  • add regression tests that exercise a real Hijack() path, not just a blocked handler

Repro:

  1. Open a websocket to a Knative Service.
  2. Send TERM to the queue-proxy while that socket is still open.
  3. Before this patch, the handler can return right after Hijack(), so HijackedDrainer.Drain() may see zero in-flight requests too early and exit, cutting the websocket off prematurely.
  4. With this patch, drain waits until the hijacked connection is closed or the existing 60s cap is reached.

Related: follow-up to #16362.

Release Note

queue-proxy now waits for hijacked connections to close before finishing drain

@knative-prow (bot) added the labels size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) on May 29, 2026
@knative-prow (bot) commented May 29, 2026

Hi @immanuwell. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow (bot) commented May 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: immanuwell
Once this PR has been reviewed and has the lgtm label, please assign skonto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow (bot) requested review from dprotaso and skonto on May 29, 2026 14:50
@codecov (bot) commented May 29, 2026

Codecov Report

❌ Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.34%. Comparing base (516bc43) to head (751dd09).

Files with missing lines: pkg/http/handler/hijack.go, 84.00% patch coverage (2 lines missing, 2 partials) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16625      +/-   ##
==========================================
- Coverage   80.36%   80.34%   -0.03%     
==========================================
  Files         217      217              
  Lines       13568    13592      +24     
==========================================
+ Hits        10904    10920      +16     
- Misses       2301     2307       +6     
- Partials      363      365       +2     


Review thread on pkg/http/handler/hijack.go (marked outdated)
@immanuwell force-pushed the fix/hijack-tracker-drain branch from a95113d to 751dd09 on June 7, 2026 17:49
@immanuwell requested a review from dprotaso on June 7, 2026 17:49
@dprotaso (Member) commented Jun 7, 2026

Quoting the description: "Before this patch, the handler can return right after Hijack(), so HijackedDrainer.Drain() may see zero inflight too early and exit."

Are any of our handlers in Serving returning after a hijack? I would be surprised if they were; that's why this change seems unnecessary to me.
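For context on this question, here is a minimal sketch of the pattern it asks about: a hypothetical handler (not one taken from Serving) that hijacks the connection, hands it to a goroutine and returns. A per-request counter that is incremented on entry and decremented on return is already back at zero while the client connection is still open, which is the window the PR's tracker is meant to cover. The function name inflightAfterHandlerReturn is illustrative only.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// inflightAfterHandlerReturn is hypothetical. It serves one request with a
// handler that hijacks the connection and returns immediately, then reports
// the per-request counter as seen right after the handler has returned.
func inflightAfterHandlerReturn() int64 {
	var inflight atomic.Int64
	handlerDone := make(chan struct{})
	connDone := make(chan struct{})

	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer close(handlerDone) // deferred first, so it runs after the decrement
		inflight.Add(1)
		defer inflight.Add(-1)

		conn, _, err := http.NewResponseController(w).Hijack()
		if err != nil {
			close(connDone)
			return
		}
		// Hand the raw connection to a goroutine and return right away.
		go func() {
			defer close(connDone)
			defer conn.Close()
			io.Copy(io.Discard, conn) // serve until the client hangs up
		}()
	})

	srv := httptest.NewServer(h)
	defer srv.Close()

	client, err := net.Dial("tcp", srv.Listener.Addr().String())
	if err != nil {
		panic(err)
	}
	fmt.Fprint(client, "GET / HTTP/1.1\r\nHost: example\r\n\r\n")

	<-handlerDone
	seen := inflight.Load() // the client connection is still open here

	client.Close()
	<-connDone
	return seen
}

func main() {
	// prints "in-flight after handler returned: 0"
	fmt.Println("in-flight after handler returned:", inflightAfterHandlerReturn())
}
```

Whether any handler in the queue-proxy actually follows this pattern is exactly what the question above is asking; the sketch only shows what a drainer would observe if one did.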

@kunalworldwide left a review comment

The approach is solid: wrapping the ResponseWriter to intercept Hijack() and tracking the returned net.Conn's lifetime via trackedConn.Close() is the right pattern. Without this, the drainer's in-flight count drops to zero as soon as the handler returns after hijacking, even though the websocket connection is still active.

A few observations:

  1. Double-close protection: closed.CompareAndSwap(false, true) correctly prevents double-decrementing the inflight counter if Close() is called multiple times. Good.

  2. Flush assertion: w.ResponseWriter.(http.Flusher).Flush() will panic if the wrapped writer doesn't implement Flusher. This is fine for the queue-proxy's use case (the underlying writer is always a *http.response, which implements Flusher), but a checked type assertion (the comma-ok form) would make the wrapper safer for reuse.

  3. Inflight double-counting: After Hijack(), the handler's defer s.inflight.Add(-1) fires AND the hijacked conn holds its own +1. So during the window between handler return and conn close, inflight is still >= 1 from the conn tracker, which is the desired behavior. The handler's original +1/-1 pair cancels out cleanly. Logic checks out.

  4. Test coverage: The test creates a real hijackable writer and verifies the drain blocks until the conn is closed. Good regression test for the original bug.

LGTM.


Labels

needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet


3 participants