fix: prevent unparser stack overflow on deeply nested expressions#23058
Open
adriangb wants to merge 2 commits into
Open
fix: prevent unparser stack overflow on deeply nested expressions#23058adriangb wants to merge 2 commits into
adriangb wants to merge 2 commits into
Conversation
Adds `test_deeply_nested_expr_does_not_overflow_stack`, gated on the `recursive_protection` feature, which unparses a deeply nested expression through both a dialect scalar-function override (`array_has` on `PostgreSqlDialect`) and a plain binary-operator chain. Even with `recursive_protection` enabled this aborts the process with a stack overflow today (issue apache#23056): the per-level stack cost of the unparser exceeds the `recursive` crate's default red zone, so the stack-growing trampoline never engages in time. The fix follows in the next commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QyHc4z88pC8nQLpFx7z2vM
…d core Makes the regression test from the previous commit pass. Although `expr_to_sql_inner` is annotated with `#[recursive]`, two things defeated the stack-growth protection: 1. The unparser installed no `StackGuard`, so the `recursive` crate used its default 128 KiB red zone. In debug builds a single `expr_to_sql_inner` frame is ~130 KiB, i.e. larger than the red zone, so `stacker::maybe_grow` could pass its check and then overflow before reaching the next checkpoint. The planner already installs `StackGuard::new(256 * 1024)` in `query.rs` for the same reason. 2. Several internal recursion sites (scalar-function arguments, arrays, maps, and the `array_has` / `date_part` dialect overrides) recursed through the public, un-annotated `expr_to_sql`, and `remove_unnecessary_nesting` (used in pretty mode) was not annotated. The fix keeps the recursive design but: - installs a `StackGuard` with a 256 KiB red zone once at the top-level `expr_to_sql`, matching the planner; - adds an annotated `pub(crate) expr_to_sql_with_nesting` recursion core that all internal recursion sites call (preserving pretty-mode behavior), so every nesting level is a stack-growth checkpoint; - annotates `remove_unnecessary_nesting`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QyHc4z88pC8nQLpFx7z2vM
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
Unparser::expr_to_sqloverflows the OS stack on deeply nested expressions, and — surprisingly — it does so even with therecursive_protectionfeature enabled (the default for thedatafusioncrate).While investigating I found the root cause is subtler than "a function is missing
#[recursive]":expr_to_sql_inneris already annotated and is re-entered on the function-argument / dialect-override paths, so the trampoline does fire. The real problem is that therecursivecrate's default red zone is 128 KiB, but a singleexpr_to_sql_innerstack frame in debug builds is ~130 KiB — larger than the red zone.stacker::maybe_growonly grows whenremaining < red_zone, so a frame can pass the check and then overflow before the next checkpoint. The unparser installed noStackGuard, unlike the planner, which already does exactly this inquery.rs:with a comment noting the same debug-frame-size issue (PR #13310).
Secondary leaks: several internal recursion sites recursed through the public, un-annotated
expr_to_sql(function arguments,make_array,array_element,named_struct,map, and thearray_has/date_partdialect overrides), andremove_unnecessary_nesting(pretty mode) was not annotated at all — it would be the next overflow site once the first is fixed.What changes are included in this PR?
The recursive design is kept (an iterative rewrite of the tree-building unparser would be a large, high-risk change for no benefit over the trampoline). The fix mirrors the planner:
StackGuard::new(256 * 1024)once at the top-levelexpr_to_sql.pub(crate) expr_to_sql_with_nestingrecursion core that all internal recursion sites call. This keeps pretty-mode behavior identical (each level still runsremove_unnecessary_nesting) while making every nesting level a stack-growth checkpoint.remove_unnecessary_nestingwith#[recursive].The PR is split into two commits for review: the failing regression test first, then the fix.
Are these changes tested?
Yes.
test_deeply_nested_expr_does_not_overflow_stack(gated onrecursive_protection) unparses a depth-2000array_haschain onPostgreSqlDialectand a depth-2000 binary chain on a 2 MiB thread. It aborts the process onmainand passes with the fix. I verified empirically that with the default 128 KiB red zone the test still aborts, and that 256 KiB carries both paths past depth 5000 on a realistic thread.Note: the guarantee holds on realistically-sized thread stacks (≥1–2 MiB). The 512 KiB-stack figure in the issue's repro is too small to bootstrap the heavier
array_hasoverride path even after the fix, but that is the same memory floor the planner side has and is not a configuration used by DataFusion in practice.Are there any user-facing changes?
No public API changes. Deeply nested expressions that previously aborted the process now unparse successfully when
recursive_protectionis enabled.🤖 Generated with Claude Code
https://claude.ai/code/session_01QyHc4z88pC8nQLpFx7z2vM