expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987
Open
def- wants to merge 3 commits into
Open
expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987def- wants to merge 3 commits into
def- wants to merge 3 commits into
Conversation
… reduce The generic variadic fold in `reduce` replaced a call with any operand's literal error unconditionally. That is wrong for a non-strict function: AND/OR have a dominating operand (`false`/`true`) that absorbs an erroring operand at runtime, so `false AND <error>` evaluates to `false`, not an error. Folding `col AND <error>` to the error made `reduce` introduce an error into a query that evaluates successfully — a correctness bug. Guard error-propagation with `func.propagates_nulls()`, mirroring the adjacent null-propagation guard (the other non-strict variadic, Coalesce, already bails out earlier). Found by the mir_scalar_reduce fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The string matcher (`is_match_subpatterns`, used for patterns with up to MAX_SUBPATTERNS subpatterns) resolves each `%` by searching for the following suffix and back-tracking over every candidate position. With two or more `%` the back-tracking nests, so matching an adversarial pattern like `%a%a%a` against a long run of `a`s costs time proportional to len(text)^(number of %) — a fuzzer (like_pattern_escape) hit a single input that ran for 1667s. The regex engine matches these patterns in linear time with no back-tracking, so route any pattern with more than one `many` (`%`) subpattern to it, alongside the existing case-insensitive and too-many-subpatterns cases. The string matcher now only handles patterns with at most one `%`, where it is near-linear. The regex and string matchers produce identical results, so this changes only performance. Found by fuzzing. Adds a regression test covering the pathological shape (instant after the fix) and confirming the routing still matches correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fuzzing the CSV decoder turned up two ways untrusted source bytes could
panic the decode operator (crashing source ingestion), and one silent
data-corruption path:
1. The error paths (wrong column count, invalid UTF-8) returned without
resetting output_cursor/ends_cursor, unlike the success path. csv_core
resets its own per-record output position, so after a malformed record
the next record read back stale, non-monotonic field offsets — slicing
output[ends[i]..ends[i+1]] then panicked (start > end), or silently
produced rows stitched from leftover buffer bytes. One bad row corrupted
the rest of the source. Fixed by clearing the buffers after every
completed record, success or error.
2. The whole record's concatenated field bytes were validated as UTF-8 in
one shot, then sliced as a &str at each field boundary. A delimiter can
sit between two bytes that, with the delimiter removed, form one
multi-byte character: the concatenation validates, but a field boundary
lands mid-character, panicking the &str slice (and even when it didn't
panic, bytes were silently merged across fields). Fixed by validating
each field's bytes as UTF-8 independently; a field that isn't valid
UTF-8 on its own is now a decode error.
To make the decoder fuzzable without mz-storage's heavy dependency tree
(rocksdb/rdkafka/aws), the pure CsvDecoderState moves to mz-storage-types
alongside the CsvEncoding it consumes; mz-storage re-imports it. Adds
regression tests for both panics.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
antiguru
reviewed
Jun 11, 2026
antiguru
left a comment
Member
There was a problem hiding this comment.
I only now realized this is still draft, so take my reviews as hints. I couldn't review the CSV decoder change as I can't see the actual diff.
| // for the existing case-insensitive / too-many-subpatterns reasons). | ||
| let many_subpatterns = subpatterns.iter().filter(|s| s.many).count(); | ||
| let use_regex = case_insensitive || subpatterns.len() > MAX_SUBPATTERNS || many_subpatterns > 1; | ||
| let matcher_impl = match use_regex { |
Member
There was a problem hiding this comment.
What's even the point of having our own string matcher? I feel we could switch to the regex-rewriting for most (all?) cases.
Contributor
Author
There was a problem hiding this comment.
It'd be up to 700x slower in extreme cases based on trying it out.
| // — e.g. `false AND <error>` evaluates to `false` — so folding the whole | ||
| // call to the error here would introduce an error the evaluated expression | ||
| // never raises. (Coalesce, also non-strict, bailed out above.) | ||
| if func.propagates_nulls() { |
Contributor
Author
There was a problem hiding this comment.
@ggevay Are you familiar with this code?
antiguru
reviewed
Jun 12, 2026
antiguru
left a comment
Member
There was a problem hiding this comment.
The regex change LGTM. I find it difficult to review the CSV decoder change. You could split this into three PRs?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three correctness/robustness fixes found by fuzzing:
reduce(false AND <error>evaluates tofalse, so folding it to the error introduced a spurious error into a query that succeeds).%patterns.Each fix has a regression test. Found by #36982
🤖 Generated with Claude Code