feat: Early query cancellation based on per-segment sampling #19235
mshahid6 wants to merge 1 commit into apache:master
Conversation
log.noStackTrace().warn(e, "Query interrupted, cancelling pending results for query [%s]", query.getId());
GuavaUtils.cancelAll(true, future, futures);
if (extrapolationCancelled != null && extrapolationCancelled.get()) {
  int completed = completedSegments.get();
Check warning (Code scanning / CodeQL): Dereferenced variable may be null
  }
}
},
Execs.directExecutor()
I have some concerns over noisy-neighbor/variability that might cause this to fire more often than not. Another concern is that, since this is operating on the servicing Jetty thread, it might cause interrupts to occur on the thread blocking on .get():
I think we should be careful about scheduling async tasks on the main executor aside from the primary timeout. The main thread's job is to wait for completion of all futures, but if there are other competing tasks it needs to service, I want to make sure that:
a) Under contention we cannot possibly get an InterruptedException (to go service a callback) and bail out of processing a valid query.
b) We don't delay processing of the future group because we are busy servicing a callback for one processing future.
Do we have a way of validating this won't happen?
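One way to see the direct-executor behavior in isolation (a hypothetical, self-contained sketch using java.util.concurrent rather than Druid's Execs or Guava futures): with a same-thread executor, a callback registered before completion runs inline on whichever thread completes the future, so per-segment callbacks would piggyback on processing threads rather than a dedicated executor.

```java
import java.util.concurrent.CompletableFuture;

public final class DirectExecutorDemo {
  // With a direct (same-thread) executor, a callback registered before completion
  // runs inline on whichever thread completes the future.
  public static String callbackThreadName() {
    try {
      CompletableFuture<String> f = new CompletableFuture<>();
      // Runnable::run is a same-thread executor, analogous in spirit to Execs.directExecutor().
      CompletableFuture<String> cb =
          f.thenApplyAsync(v -> Thread.currentThread().getName(), Runnable::run);
      Thread worker = new Thread(() -> f.complete("done"), "segment-worker");
      worker.start();
      worker.join();
      return cb.get(); // callback ran on the completing thread, "segment-worker"
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

This suggests the extrapolation callback would mostly run on segment-processing threads, but it does not by itself rule out the registering thread servicing an already-completed future.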
| int completed = completedSegments.get(); | ||
| if (completed >= samplingWindow) { | ||
| long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - queryStartNanos); | ||
| long extrapolatedMs = elapsedMs * totalSegments / completed; |
I'm also worried that this calculation is not accounting for parallelism correctly and is "overkill." It approximates the time taken to complete a query as effectively avg segment time * total segments, which is not usually the case (it assumes purely serial execution). We have 100+ threads processing segments, and queries can sometimes hit 100s-1000s of segments per historical.
For example, you could have a query that hits 300 segments, each taking 1100ms, with a total query timeout of 300s; we would preemptively kill this query even though, assuming moderate contention, it would complete in ~84s.
We should instead incorporate the parallelism of the threadpool into the calculation to avoid killing ok queries. For example, something like (sum of extrapolated segment times) / (thread pool parallelism).
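A minimal sketch of that suggestion (class and method names here are illustrative, not the PR's code): estimate the average per-segment time from the sample, multiply by the total segment count to get the sum of extrapolated segment times, and divide by the pool's parallelism before comparing against the timeout.

```java
public final class ParallelAwareExtrapolation {
  // Sum of extrapolated per-segment times divided by pool parallelism,
  // approximating wall-clock time under full thread utilization.
  public static long projectWallClockMs(long avgSegmentMs, int totalSegments, int parallelism) {
    long totalWorkMs = avgSegmentMs * totalSegments; // sum of extrapolated segment times
    return totalWorkMs / Math.max(1, parallelism);   // guard against a zero/negative pool size
  }
}
```

For the 300-segment example above, 1100ms * 300 / 4 threads projects to 82500ms (~82.5s), well under the 300s timeout, whereas the serial formula projects 330s and would cancel the query.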
Description
Added early query cancellation on historicals by sampling segment completion times and extrapolating whether the query will exceed its timeout. After a configurable number of segments complete, elapsed time is measured and used to project total query time. If the projection exceeds the remaining timeout, all pending segment futures are cancelled immediately.
The new query context parameter is perSegmentSamplingWindow (the number of segments to complete before extrapolating), with a default value of 0, i.e. disabled. It can be set per-query via the query context, or as a system default via druid.query.default.context.perSegmentSamplingWindow.
Example
After 5 segments complete, if wall-clock extrapolation suggests the query will exceed 30 seconds, it fails fast instead of waiting for the full timeout.
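The check described above can be sketched as follows (a self-contained illustration; the actual logic lives in ChainedExecutionQueryRunner, and the method name here is hypothetical):

```java
public final class EarlyCancelCheck {
  // Returns true when the wall-clock extrapolation projects the query past its timeout,
  // in which case pending segment futures would be cancelled.
  public static boolean shouldCancel(long elapsedMs, int completed, int samplingWindow,
                                     int totalSegments, long timeoutMs) {
    if (samplingWindow <= 0 || completed < samplingWindow) {
      return false; // feature disabled (default window of 0), or too few samples yet
    }
    long extrapolatedMs = elapsedMs * totalSegments / completed;
    return extrapolatedMs > timeoutMs;
  }
}
```

For instance, if the first 5 of 100 segments took 6s, the projection is 6s * 100 / 5 = 120s; with a 30s timeout the query fails fast instead of running out the clock.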
Key changed/added classes in this PR
ChainedExecutionQueryRunner
QueryContexts/QueryContext
ChainedExecutionQueryRunnerExtrapolationTest
Limitations
This PR has: