Skip to content

expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987

Open
def- wants to merge 3 commits into
MaterializeInc:mainfrom
def-:fuzz-expr-storage
Open

expr, storage: Fix reduce error-propagation, LIKE, and CSV decoder bugs#36987
def- wants to merge 3 commits into
MaterializeInc:mainfrom
def-:fuzz-expr-storage

Conversation

@def-

@def- def- commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Three correctness/robustness fixes found by fuzzing:

  • expr: don't propagate an operand's error through non-strict AND/OR in reduce (false AND <error> evaluates to false, so folding it to the error introduced a spurious error into a query that succeeds).
  • expr: avoid super-linear LIKE matching on multi-% patterns.
  • storage: fix CSV source-decoder panics on malformed untrusted input (cursor reset + per-field UTF-8 validation).

Each fix has a regression test. Found by #36982

🤖 Generated with Claude Code

def- and others added 3 commits June 11, 2026 10:57
… reduce

The generic variadic fold in `reduce` replaced a call with any operand's literal
error unconditionally. That is wrong for a non-strict function: AND/OR have a
dominating operand (`false`/`true`) that absorbs an erroring operand at runtime,
so `false AND <error>` evaluates to `false`, not an error. Folding
`col AND <error>` to the error made `reduce` introduce an error into a query
that evaluates successfully — a correctness bug. Guard error-propagation with
`func.propagates_nulls()`, mirroring the adjacent null-propagation guard (the
other non-strict variadic, Coalesce, already bails out earlier). Found by the
mir_scalar_reduce fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The string matcher (`is_match_subpatterns`, used for patterns with up to
MAX_SUBPATTERNS subpatterns) resolves each `%` by searching for the
following suffix and back-tracking over every candidate position. With
two or more `%` the back-tracking nests, so matching an adversarial
pattern like `%a%a%a` against a long run of `a`s costs time proportional
to len(text)^(number of %) — a fuzzer (like_pattern_escape) hit a single
input that ran for 1667s.

The regex engine matches these patterns in linear time with no
back-tracking, so route any pattern with more than one `many` (`%`)
subpattern to it, alongside the existing case-insensitive and
too-many-subpatterns cases. The string matcher now only handles patterns
with at most one `%`, where it is near-linear. The regex and string
matchers produce identical results, so this changes only performance.

Found by fuzzing. Adds a regression test covering the pathological shape
(instant after the fix) and confirming the routing still matches
correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fuzzing the CSV decoder turned up two ways untrusted source bytes could
panic the decode operator (crashing source ingestion), and one silent
data-corruption path:

  1. The error paths (wrong column count, invalid UTF-8) returned without
     resetting output_cursor/ends_cursor, unlike the success path. csv_core
     resets its own per-record output position, so after a malformed record
     the next record read back stale, non-monotonic field offsets — slicing
     output[ends[i]..ends[i+1]] then panicked (start > end), or silently
     produced rows stitched from leftover buffer bytes. One bad row corrupted
     the rest of the source. Fixed by clearing the buffers after every
     completed record, success or error.

  2. The whole record's concatenated field bytes were validated as UTF-8 in
     one shot, then sliced as a &str at each field boundary. A delimiter can
     sit between two bytes that, with the delimiter removed, form one
     multi-byte character: the concatenation validates, but a field boundary
     lands mid-character, panicking the &str slice (and even when it didn't
     panic, bytes were silently merged across fields). Fixed by validating
     each field's bytes as UTF-8 independently; a field that isn't valid
     UTF-8 on its own is now a decode error.

To make the decoder fuzzable without mz-storage's heavy dependency tree
(rocksdb/rdkafka/aws), the pure CsvDecoderState moves to mz-storage-types
alongside the CsvEncoding it consumes; mz-storage re-imports it. Adds
regression tests for both panics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@antiguru antiguru left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only now realized this is still draft, so take my reviews as hints. I couldn't review the CSV decoder change as I can't see the actual diff.

// for the existing case-insensitive / too-many-subpatterns reasons).
let many_subpatterns = subpatterns.iter().filter(|s| s.many).count();
let use_regex = case_insensitive || subpatterns.len() > MAX_SUBPATTERNS || many_subpatterns > 1;
let matcher_impl = match use_regex {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's even the point of having our own string matcher? I feel we could switch to the regex-rewriting for most (all?) cases.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be up to 700x slower in extreme cases based on trying it out.

// — e.g. `false AND <error>` evaluates to `false` — so folding the whole
// call to the error here would introduce an error the evaluated expression
// never raises. (Coalesce, also non-strict, bailed out above.)
if func.propagates_nulls() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs optimizer review.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggevay Are you familiar with this code?

@def- def- marked this pull request as ready for review June 12, 2026 13:29
@def- def- requested review from a team as code owners June 12, 2026 13:29

@antiguru antiguru left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex change LGTM. I find it difficult to review the CSV decoder change. You could split this into three PRs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants