expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs by def- · Pull Request #36987 · MaterializeInc/materialize

def- · 2026-06-11T11:39:29Z

Three correctness/robustness fixes found by fuzzing:

expr: don't propagate an operand's error through non-strict AND/OR in reduce (false AND <error> evaluates to false, so folding it to the error introduced a spurious error into a query that succeeds).
expr: avoid super-linear LIKE matching on multi-% patterns.
storage: fix CSV source-decoder panics on malformed untrusted input (cursor reset + per-field UTF-8 validation).

Each fix has a regression test. Found by #36982

🤖 Generated with Claude Code

… reduce The generic variadic fold in `reduce` replaced a call with any operand's literal error unconditionally. That is wrong for a non-strict function: AND/OR have a dominating operand (`false`/`true`) that absorbs an erroring operand at runtime, so `false AND <error>` evaluates to `false`, not an error. Folding `col AND <error>` to the error made `reduce` introduce an error into a query that evaluates successfully — a correctness bug. Guard error-propagation with `func.propagates_nulls()`, mirroring the adjacent null-propagation guard (the other non-strict variadic, Coalesce, already bails out earlier). Found by the mir_scalar_reduce fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The string matcher (`is_match_subpatterns`, used for patterns with up to MAX_SUBPATTERNS subpatterns) resolves each `%` by searching for the following suffix and back-tracking over every candidate position. With two or more `%` the back-tracking nests, so matching an adversarial pattern like `%a%a%a` against a long run of `a`s costs time proportional to len(text)^(number of %) — a fuzzer (like_pattern_escape) hit a single input that ran for 1667s. The regex engine matches these patterns in linear time with no back-tracking, so route any pattern with more than one `many` (`%`) subpattern to it, alongside the existing case-insensitive and too-many-subpatterns cases. The string matcher now only handles patterns with at most one `%`, where it is near-linear. The regex and string matchers produce identical results, so this changes only performance. Found by fuzzing. Adds a regression test covering the pathological shape (instant after the fix) and confirming the routing still matches correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fuzzing the CSV decoder turned up two ways untrusted source bytes could panic the decode operator (crashing source ingestion), and one silent data-corruption path: 1. The error paths (wrong column count, invalid UTF-8) returned without resetting output_cursor/ends_cursor, unlike the success path. csv_core resets its own per-record output position, so after a malformed record the next record read back stale, non-monotonic field offsets — slicing output[ends[i]..ends[i+1]] then panicked (start > end), or silently produced rows stitched from leftover buffer bytes. One bad row corrupted the rest of the source. Fixed by clearing the buffers after every completed record, success or error. 2. The whole record's concatenated field bytes were validated as UTF-8 in one shot, then sliced as a &str at each field boundary. A delimiter can sit between two bytes that, with the delimiter removed, form one multi-byte character: the concatenation validates, but a field boundary lands mid-character, panicking the &str slice (and even when it didn't panic, bytes were silently merged across fields). Fixed by validating each field's bytes as UTF-8 independently; a field that isn't valid UTF-8 on its own is now a decode error. To make the decoder fuzzable without mz-storage's heavy dependency tree (rocksdb/rdkafka/aws), the pure CsvDecoderState moves to mz-storage-types alongside the CsvEncoding it consumes; mz-storage re-imports it. Adds regression tests for both panics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

antiguru

I only now realized this is still draft, so take my reviews as hints. I couldn't review the CSV decoder change as I can't see the actual diff.

antiguru · 2026-06-11T12:10:54Z

+    // for the existing case-insensitive / too-many-subpatterns reasons).
+    let many_subpatterns = subpatterns.iter().filter(|s| s.many).count();
+    let use_regex = case_insensitive || subpatterns.len() > MAX_SUBPATTERNS || many_subpatterns > 1;
+    let matcher_impl = match use_regex {


What's even the point of having our own string matcher? I feel we could switch to the regex-rewriting for most (all?) cases.

It'd be up to 700x slower in extreme cases based on trying it out.

antiguru · 2026-06-11T12:13:21Z

+    // — e.g. `false AND <error>` evaluates to `false` — so folding the whole
+    // call to the error here would introduce an error the evaluated expression
+    // never raises. (Coalesce, also non-strict, bailed out above.)
+    if func.propagates_nulls() {


Needs optimizer review.

@ggevay Are you familiar with this code?

antiguru

The regex change LGTM. I find it difficult to review the CSV decoder change. You could split this into three PRs?

def- and others added 3 commits June 11, 2026 10:57

def- mentioned this pull request Jun 11, 2026

tests: Add the cargo-fuzz fuzzing suite #36982

Draft

antiguru reviewed Jun 11, 2026

View reviewed changes

def- marked this pull request as ready for review June 12, 2026 13:29

def- requested review from a team as code owners June 12, 2026 13:29

antiguru reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987

expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987
def- wants to merge 3 commits into
MaterializeInc:mainfrom
def-:fuzz-expr-storage

def- commented Jun 11, 2026 •

edited

Loading

Uh oh!

antiguru left a comment

Uh oh!

antiguru Jun 11, 2026

Uh oh!

def- Jun 12, 2026

Uh oh!

antiguru Jun 11, 2026

Uh oh!

def- Jun 12, 2026

Uh oh!

antiguru left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

def- commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antiguru left a comment

Choose a reason for hiding this comment

Uh oh!

antiguru Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

def- Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

antiguru Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

def- Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

antiguru left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

def- commented Jun 11, 2026 •

edited

Loading