[Cosmos] Fix /pkranges change-feed drain loop#47245
Conversation
When a user-supplied feed_range overlaps K physical partition key ranges (for example, after a server-side split), __QueryFeed issues one POST per overlapping range and merges the partial results. Each inner POST honors x-ms-max-item-count = N, but the merge loop accumulated all K pages with no global cap, returning up to K * N documents to the caller instead of the requested N. Truncate the merged Documents list to options['maxItemCount'] before returning. Apply the fix to both the sync and async client connections. Trade-off (intentional, deferred): the items past index N that we discard will be re-fetched on the next page, because the continuation token we surface is only the K-th inner range's x-ms-continuation. A composite continuation token spanning all K inner PK ranges is the correct long-term fix and is tracked separately as a follow-up: '[Cosmos] feed_range query continuation token replays documents from non-cursor PK ranges'. Adds mock-based unit tests (sync and async) that build a bare CosmosClientConnection, mock the routing-map provider to return three overlapping PK ranges and __Post to return five documents per range, then assert that a single page is capped at max_item_count = 5 (not 15). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address Copilot review on PR Azure#46469: truncating the merged page while surfacing the last inner range's x-ms-continuation can cause silent data loss on resume (the token has advanced past truncated documents from earlier ranges). Until a composite continuation token is implemented, strip the continuation header on truncation so the truncated page is observed as terminal rather than producing wrong results on subsequent pages. - _cosmos_client_connection.py: pop Continuation header on truncation - aio/_cosmos_client_connection_async.py: mirror on self.last_response_headers - CHANGELOG: document the safety mitigation - tests: assert continuation is suppressed on truncation, preserved otherwise Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The async PartitionKeyRangeCache._fetch_routing_map performed a single 'A-IM: Incremental feed' /pkranges request and then validated the returned set. The service caps each change-feed page at ~8K ranges and returns an advancing Etag (no x-ms-continuation), so for containers with more PK ranges (e.g. 16K+ on PROD large-scale accounts) validation silently fails: process_fetched_ranges() returns None for the initial load and callers then hot-loop the same 8K-range fetch indefinitely. Mirror the .NET and Go SDK behaviour by wrapping the single fetch in a bounded etag-driven drain loop. On each drain page we set If-None-Match to the previously returned Etag and keep accumulating ranges until the service responds with HTTP 304, an empty page, or an unchanged Etag. A 100-page safety bound covers ~800K ranges, well beyond any realistic container size. Validated against ffcf-large-container-2 (16,384 PK ranges, 163.8M RU/s). Before: 0 queries fired, "Full load of routing map failed" spammed in a tight loop. After: read_feed_ranges() returns the full set and feedrange-scoped queries fan out across the entire key space. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses two distinct Cosmos client issues. First, it patches __QueryFeed (sync and async) so that when a feed_range overlaps multiple physical PK ranges, the merged page is truncated to the user-requested max_item_count (previously up to K * max_item_count documents could be returned), and the misleading single-range continuation token is suppressed when truncation occurs. Second, the async _fetch_routing_map is rewritten to actually drain the change-feed of partition key ranges page-by-page (advancing If-None-Match, terminating on 304 / empty page / non-advancing ETag, with a 100-page safety bound), instead of issuing only one PK-range read.
Changes:
- Truncate merged feed_range query pages to
maxItemCount, keep_countconsistent, and strip the now-misleading continuation header on truncation; document the deferred composite-continuation follow-up. - Replace the single-shot PK-range read in async
_fetch_routing_mapwith a bounded per-page drain loop that advancesIf-None-Matchand handles 304/empty/non-advancing-ETag terminations. - Add sync and async unit tests for the
max_item_counttruncation behavior and update CHANGELOG.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds entry for the max_item_count fix (drain-loop fix not yet documented). |
| sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py | Sync __QueryFeed truncation block + continuation suppression. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py | Async mirror of the truncation block. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py | Async PK-range drain loop with bounded retries and ETag advancement. |
| sdk/cosmos/azure-cosmos/tests/test_query_feed_range_max_item_count.py | New sync unit tests for the truncation/continuation behavior. |
| sdk/cosmos/azure-cosmos/tests/test_query_feed_range_max_item_count_async.py | New async mirror of the unit tests. |
…x-pkranges-drain-loop # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py
…ation tests - Mirror async drain-loop fix in sync routing_map_provider so /pkranges change-feed paginates correctly when the service returns multiple pages per refresh (sync path was previously susceptible to the same incomplete routing map seen in async). - Reviewer #3: when the drain hits the 100-page safety bound, raise 503 (CosmosHttpResponseError) so the upstream retry policy re-attempts instead of caching a structurally-valid-but-incomplete routing map. - Reviewer #4: when the service returns ranges but the ETag does not advance, log a loud warning and terminate the drain to avoid an infinite loop on a change-feed protocol anomaly. - Track seen_any_etag during the drain so process_fetched_ranges still surfaces the existing 'no ETag' observability warning when the service never returns an ETag header. - Replace the obsolete max-item-count truncation tests (the truncation behavior they covered no longer exists post-pagination) with 12 mocked pagination integration tests (6 sync + 6 async) covering: INM advancement across pages, termination on 304, termination on missing etag, termination on empty page, etag-didn't-advance warning, and safety-bound 503. - Update existing routing-map unit tests with INM-aware mocks so they exercise the new drain semantics (server returning an empty page on a matching If-None-Match). - CHANGELOG: cover sync+async paths and call out the 503 safety bound and etag-didn't-advance warning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- http_constants.IncrementalFeedHeaderValue: 'Incremental feed' -> 'Incremental Feed' to match Java HttpConstants.A_IMHeaderValues.INCREMENTAL_FEED and Go cosmosHeaderValuesChangeFeed wire values. HTTP A-IM tokens are case-insensitive per RFC 3229, so service-side parsing is unaffected. - Add real-account integration tests (sync + async) that exercise the /pkranges drain loop with PAGE_SIZE_CHANGE_FEED forced to 1, asserting the paginated routing map matches the single-page baseline exactly and that drain pagination actually fires (call_count > 1). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
We can remove the telemetry casing part, as that is already addressed here - #47243 |
- Bump azure-cosmos to 4.16.1 and add 4.16.1 (Unreleased) section in CHANGELOG.md for the /pkranges drain-loop fix (PR Azure#47245). - Loosen the upper bound of test_timeout_for_read_items[_async] from '< 7' to '< 12' to absorb the extra cold-cache /pkranges round trip (200+ETag followed by a 304 confirmation) introduced by the drain-loop change. CosmosClientTimeoutError is still raised; the lower bound (> 5) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x-pkranges-drain-loop
kushagraThapar
left a comment
There was a problem hiding this comment.
Hi @tvaron3 — thanks for chasing this one down; the bug story in the description is a great read and the drain loop itself looks solid (etag advancement, the anomaly guard, the safety bound, and the cache-untouched-on-error invariant all hang together well). The 12 mocked tests plus the PAGE_SIZE_CHANGE_FEED="1" integration tests give me high confidence the loop behaves as advertised.
I have a few non-blocking observations that came up while I was reading the diff in context — sharing them as comments rather than a change request since the loop's correctness isn't in question, and I'd love your take on them when you have a moment. They cluster into five themes:
- The new public 503 contract — could you help me think through whether the safety-bound 503 should carry a
sub_statusso callers (and our own retry policy) can tell it apart from a real service 503? - Sync / async drift risk — the two new drain bodies are very close to verbatim; we've been bitten by this exact pattern before in
__QueryFeedand there's a nice anti-drift comment convention already in the repo we could reuse. - A possible 304 log-noise regression — I think the post-PR 304 path may unintentionally trigger a downstream
"returned no ETag"WARNING that didn't fire pre-PR. - The 304-via-exception branch — I'm curious whether this branch is reachable in production, since the request layer only raises on
status_code >= 400. Could you help me understand the scenario it's defending against? - The "upstream retry policy can re-attempt" comment — could we tighten the wording? It looks like roughly half the callers of
get_overlapping_rangesaren't wrapped in_retry_utility.Execute, so the retry coverage is conditional.
I've already noted the CHANGELOG/version placement separately on that thread, so this review skips re-litigating that one. None of the items below block — happy to chat about any of them on the PR or offline. Thanks again!
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
kushagraThapar
left a comment
There was a problem hiding this comment.
Quick follow-up to yesterday's review (#pullrequestreview-4395480054) — two test-coverage observations that came out of a cross-SDK comparison with the Java, .NET, Rust, and Go SDKs. Additive, not re-litigation — the five items from yesterday still stand on their own merits; nothing here changes that.
Both items below are not blocking — they're about defending against future regressions on the wire-protocol and drain-termination contracts this PR introduces. Happy to land 4.16.1 as-is and keep these in a follow-up issue if that's easier on the release cadence.
(Side note for transparency: I had a third item on a possible caller-headers aliasing risk in _drain_partition_key_ranges, but re-verified at HEAD 89303813464: prepare_fetch_options_and_headers in _routing/_routing_map_provider_common.py:255-268 does headers = kwargs.get('headers', {}).copy() and then kwargs['headers'] = headers on every page (the helper is called inside the for-loop on both sync line 376 and async line 408). So the caller's dict is genuinely not aliased on any page — dropping that one, sorry for the noise.)
Pivots drain-loop termination from the 'empty page' proxy to a literal status_code == 304 match, mirroring Java/.NET/Go peer SDKs more closely. - Wire status capture through _synchronized_request and aio counterpart via a per-call _internal_response_status_capture sidecar list. - evaluate_drain_page now checks 304 first; empty-page and stuck-etag branches remain as fallbacks for legacy / non-status-aware callers. - Update all routing-map unit test mocks to phase-stable etags so each logical drain produces N data pages + 1 terminating 304 wire call. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…mplete status sidecar wiring
- Delete test_pk_range_drain_integration{,_async}.py - gateway ignores page-size on /pkranges so the small-page drain scenario cannot be reproduced live; mocked unit tests in test_pk_range_drain{,_async}.py provide adequate coverage.
- Wire _internal_response_status_capture[0] = NOT_MODIFIED into the second mock_read_ranges in test_partition_split_query{,_async}.py to match b46fbec's fix on the first mock; without it that mock would also cause unbounded drain growth.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…o match sidecar typing The /pkranges drain loop reads the response status from a List[Optional[int]] sidecar (first slot is None until populated by _synchronized_request / _asynchronous_request). Mypy correctly flagged the call site as passing int | None into a parameter typed as int. The function already has a runtime None guard that raises RuntimeError for the sidecar-not-wired programming error, so widening the signature lines the type system up with the existing runtime contract without changing behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run python - cosmos - ci |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
LGTM thanks Tomas - I see some places where we might increase readability by splitting things out into named methods as opposed to extending on the existing ones/ small nits like the ones I pointed out (aligning with some of Kushagra's comments), but agree that all of that can be taken care of in a follow-up and shouldn't stop us from moving forward.
…with strict status_code contract Adds a module-level tolerant shim around evaluate_drain_page in both sync and async unit-test files. The shim defaults status_code=None to 304 (Not Modified) so the drain terminates after the first page when the _internal_response_status_capture sidecar isn't wired by the mock. Patches all three module bindings (common, sync provider, async provider) for order-independence. Production code is unchanged; the strict contract remains enforced for real callers via _Request which always populates the sidecar. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
- Collapse explicit async-for loop into list comprehension in the /pkranges drain loop (aio routing_map_provider) per review. - Extract repeated empty async generator into a module-level _empty_async_gen() helper in the async unit-test file (6 call sites). No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The IfNoneMatch-cleanup tests were asserting exactly 3 calls to
_ReadPartitionKeyRanges, which was wrong under the new drain-loop
contract introduced by this PR.
Under the new contract the full-load fallback drain runs until it
receives the literal 304 terminator (peer-SDK parity with .NET v3,
Java, and Go). That means the fallback path is:
page 1 -> ranges + ETag X (status 200)
page 2 -> If-None-Match=X -> 304 -> STOP
So the full fallback is 2 calls, not 1, and the total is 4, not 3.
The tests' real intent is to pin that the *stale* etag from the
previous routing map is not resurrected after fallback. Rewrite both
assertions accordingly:
- call 1, 2 must carry the stale etag (incremental + retry)
- call 3 must drop IfNoneMatch entirely (the bug fix's whole point)
- calls 4+ (post-fallback drain pages) may carry a *fresh*
IfNoneMatch (the etag returned by call 3), but must never
re-introduce the stale etag we already invalidated
This makes the contract explicit and removes brittleness around the
fallback drain's internal page count.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run python - pullrequest |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
LGTM Good test coverage added - Thanks!
Description
Live-site hotfix for a
/pkrangeschange-feed pagination bug that causedPartitionKeyRangeCache._fetch_routing_mapto hot-loop on "Full load of routing map failed" for containers with more than ~8K physical partition key ranges.Fixes the sync and async
/pkrangeschange-feed refresh where some containers could fail to build a complete routing map.The bug
PartitionKeyRangeCache._fetch_routing_mapcalls_ReadPartitionKeyRanges, which dispatches through_DefaultQueryExecutionContext. That execution context only knows how to paginate viax-ms-continuation— but the/pkrangeschange-feed endpoint paginates viaETag/If-None-Match/304 Not Modified.Result: the pager fetched exactly one page (capped server-side at ~8K ranges) and stopped. The cache then validated a partial routing map, which failed downstream coverage checks, which triggered another refresh, which fetched the same partial first page, which failed again — a tight loop on large containers.
The fix
Both
_routing/routing_map_provider.py(sync) and_routing/aio/routing_map_provider.py(async) now wrap_ReadPartitionKeyRangesin an explicit ETag-driven drain loop inside_fetch_routing_map. The termination predicate matches peer SDKs (JavaFeedResponse#isNoChanges, .NET v3while (lastStatusCode != HttpStatusCode.NotModified), Goif result.notModified) literally:If-None-Match: <running etag>per request.ETagvia aCaseInsensitiveDictpassed through_internal_response_headers_capture=._internal_response_status_capture=sidecar list populated by_Requestso we can terminate on the literal HTTP 304 rather than inferring it from an emptyItemPagedpage.status_code == 304. Empty pages with status 200, the server echoing back the same ETag with more ranges, etc. all continue draining. The old multi-predicate logic (empty-page fallback, "etag did not advance" warning) has been removed because no peer SDK has those guards._ROUTING_MAP_DRAIN_MAX_PAGES = 100. If exhausted, raisesCosmosHttpResponseError(status=503, sub_status=ROUTING_MAP_DRAIN_LIMIT_EXCEEDED)instead of feeding the partially-accumulated ranges intoprocess_fetched_ranges(which would form a structurally-valid-but-incomplete map and poison the cache). The cache stays untouched so the next call retries clean.evaluate_drain_pagein_routing/_routing_map_provider_common.pyis the shared pure decision function used by both providers, so the sync and async paths cannot drift.Also bundled (small adjacent correctness fixes)
A-IMheader aligned with peer SDKs. The change-feed read now sendsA-IM: Incremental Feed(matching Java / .NET / Go) so the server actually treats the request as a change-feed call and responds with304on no-change.Test coverage
tests/test_pk_range_drain.py(12 tests)_retry_utility.Execute's per-page retry without restarting the draintests/test_pk_range_drain_async.py(10 tests)tests/test_pk_range_drain_integration.pyPAGE_SIZE_CHANGE_FEED="1"on a multi-partition container; asserts the drain loop issues >1 page and produces the same routing map as the default-page-size baseline (no drift, complete cover of["", "FF"))tests/test_pk_range_drain_integration_async.pytests/test_routing_map_provider_unit{,_async}.pyIf-None-Match == last_etagso single-page test scenarios terminate naturally under the new outer loopVersion / changelog
azure-cosmosbumped to4.16.1(hotfix).CHANGELOG.mdentry: "Fixed a bug in the sync and async/pkrangeschange-feed refresh where some containers could fail to build a complete routing map."