You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer
Background
DataFusion already has excellent property-based fuzzers for runtime correctness — sort_fuzz, aggregation_fuzzer, join_fuzz, and friends in datafusion/core/tests/fuzz_cases/. They generate random queries against valid data and cross-check results, which is exactly the right tool for catching aggregation/sort/join semantic bugs.
What we don't yet have is byte-level (libFuzzer / cargo-fuzz) coverage. That's a different attack surface: feeding arbitrary bytes into entry points like DFParser::parse_sql exercises the SQL tokenizer/parser, the dialect-extension code, and any error-recovery paths the parser walks while building the AST. These are the kinds of bugs that show up as panics or pathological allocations rather than wrong results.
This is the same delta arrow-rs#5332 is closing on the format-parser side; DataFusion benefits from byte-level fuzzing for its own reasons.
Concrete proposal
Add per-crate fuzz/ directories with cargo-fuzz harnesses, starting with the smallest blast radius and growing outward as we land. Two are already prototyped on my fork at fuzz/initial-harnesses:
datafusion/sql/fuzz/parse_sql — DFParser::parse_sql over UTF-8-validated &[u8]. 22 s sanity run reaches 3,520 edges / 12,788 features / 2,342 corpus entries from an empty start, RSS peak 117 MiB, no crashes.
datafusion/sql/fuzz/parse_expr — DFParser::parse_sql_into_expr (different production set). 22 s sanity reaches 3,980 edges / 14,500 features / 2,492 corpus entries at 9,090 runs/s, no crashes.
Then in subsequent PRs, depending on what shakes out: a LogicalPlanBuilder harness in datafusion/expr, a SessionContext::sql end-to-end harness, and possibly per-format datasource readers (csv/json/parquet/arrow/avro), though those overlap with the Fuzz tests for Arrow/Parquet arrow-rs#5332 work and we'd want to dedupe.
Wire ClusterFuzzLite into GitHub Actions so PR-time fuzz + nightly batch + corpus pruning run automatically. Catches regressions before they merge instead of relying on me to remember to run cargo-fuzz locally.
Open the OSS-Fuzz integration PR (projects/datafusion/) once the upstream fuzz/ lands and the harnesses are stable. Mirrors the layout I'm proposing for arrow-rs in google/oss-fuzz. Coverage target ≥80% on harness-touched modules.
Why this is worth doing
DataFusion is widely used as a SQL engine inside other projects (InfluxDB, Ballista, Coralogix, DataFusion Comet, ...). A panic or pathological allocation in DFParser::parse_sql can DoS any service that exposes a SQL endpoint to untrusted input.
The SQL parser walks an attacker-controlled token stream through DataFusion's extensions on top of sqlparser-rs. The sqlparser crate itself does not appear to be fuzzed in OSS-Fuzz, so DataFusion sits behind two unfuzzed parser layers right now.
Coverage cost is modest: cargo-fuzz harnesses are typically tens of lines each, and the OSS-Fuzz/ClusterFuzzLite plumbing is a one-time setup.
Status
I'm prototyping this in my fork at masumi-ryugo/datafusion@fuzz/initial-harnesses and have a working parse_sql harness building under nightly cargo-fuzz. Happy to send the upstream PR once the layout question is settled.
Layout preference: per-crate fuzz/ dirs (gitoxide-style — independent buildable units, easy to add per-crate seed corpora) vs. a single top-level fuzz/ workspace? I'd lean per-crate but happy to do whatever you prefer.
primary_contact / auto_ccs for OSS-Fuzz project.yaml: OSS-Fuzz wants Google-account emails on file for crash notifications. Who from the PMC would want to be on that list? I can default to a single contact if a PMC alias isn't preferred.
xref apache/arrow-rs#5332 (sibling work for arrow-rs side, same author, same OSS-Fuzz integration pattern).
Add cargo-fuzz / OSS-Fuzz coverage for DataFusion's parser and analyzer
Background
DataFusion already has excellent property-based fuzzers for runtime correctness —
sort_fuzz,aggregation_fuzzer,join_fuzz, and friends indatafusion/core/tests/fuzz_cases/. They generate random queries against valid data and cross-check results, which is exactly the right tool for catching aggregation/sort/join semantic bugs.What we don't yet have is byte-level (libFuzzer / cargo-fuzz) coverage. That's a different attack surface: feeding arbitrary bytes into entry points like
DFParser::parse_sqlexercises the SQL tokenizer/parser, the dialect-extension code, and any error-recovery paths the parser walks while building the AST. These are the kinds of bugs that show up as panics or pathological allocations rather than wrong results.This is the same delta arrow-rs#5332 is closing on the format-parser side; DataFusion benefits from byte-level fuzzing for its own reasons.
Concrete proposal
Add per-crate
fuzz/directories with cargo-fuzz harnesses, starting with the smallest blast radius and growing outward as we land. Two are already prototyped on my fork atfuzz/initial-harnesses:datafusion/sql/fuzz/parse_sql—DFParser::parse_sqlover UTF-8-validated&[u8]. 22 s sanity run reaches 3,520 edges / 12,788 features / 2,342 corpus entries from an empty start, RSS peak 117 MiB, no crashes.datafusion/sql/fuzz/parse_expr—DFParser::parse_sql_into_expr(different production set). 22 s sanity reaches 3,980 edges / 14,500 features / 2,492 corpus entries at 9,090 runs/s, no crashes.Then in subsequent PRs, depending on what shakes out: a
LogicalPlanBuilderharness indatafusion/expr, aSessionContext::sqlend-to-end harness, and possibly per-format datasource readers (csv/json/parquet/arrow/avro), though those overlap with the Fuzz tests for Arrow/Parquet arrow-rs#5332 work and we'd want to dedupe.Wire ClusterFuzzLite into GitHub Actions so PR-time fuzz + nightly batch + corpus pruning run automatically. Catches regressions before they merge instead of relying on me to remember to run cargo-fuzz locally.
Open the OSS-Fuzz integration PR (
projects/datafusion/) once the upstreamfuzz/lands and the harnesses are stable. Mirrors the layout I'm proposing for arrow-rs in google/oss-fuzz. Coverage target ≥80% on harness-touched modules.Why this is worth doing
DFParser::parse_sqlcan DoS any service that exposes a SQL endpoint to untrusted input.sqlparser-rs. Thesqlparsercrate itself does not appear to be fuzzed in OSS-Fuzz, so DataFusion sits behind two unfuzzed parser layers right now.Status
I'm prototyping this in my fork at masumi-ryugo/datafusion@fuzz/initial-harnesses and have a working
parse_sqlharness building under nightly cargo-fuzz. Happy to send the upstream PR once the layout question is settled.Two questions for the maintainers (cc @alamb @andygrove @comphead @ozankabak):
Layout preference: per-crate
fuzz/dirs (gitoxide-style — independent buildable units, easy to add per-crate seed corpora) vs. a single top-levelfuzz/workspace? I'd lean per-crate but happy to do whatever you prefer.primary_contact/auto_ccsfor OSS-Fuzzproject.yaml: OSS-Fuzz wants Google-account emails on file for crash notifications. Who from the PMC would want to be on that list? I can default to a single contact if a PMC alias isn't preferred.xref apache/arrow-rs#5332 (sibling work for arrow-rs side, same author, same OSS-Fuzz integration pattern).