feat: JVM UDF fallback for date_format by andygrove · Pull Request #4373 · apache/datafusion-comet

andygrove · 2026-05-20T17:37:50Z

Summary

Adds a JVM UDF fallback so that date_format always runs inside Comet:

Native (DataFusion to_char) when the format is a literal in the strftime-mappable whitelist and the session timezone is UTC, or when spark.comet.expression.DateFormatClass.allowIncompatible=true.
JVM UDF (org.apache.comet.udf.DateFormatUDF) in every other case: non-UTC timezone, non-literal format, or a format outside the whitelist. The UDF wraps Spark's own DateFormatClass and is invoked through the existing JvmScalarUdf framework, so results match Spark by construction.
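
For readers skimming the routing rule, here is a minimal sketch of the decision as described in the two cases above. It is not the code in CometDateFormat.convert: the class name, the `Path` enum, the `choose` method, and the contents of the whitelist are all hypothetical, and the real serde may apply additional conditions (for example when allowIncompatible is set but the format is not a literal).

```java
import java.util.Set;

/** Hypothetical sketch of the native-vs-JVM-UDF routing described in the summary. */
final class DateFormatRouting {
    enum Path { NATIVE_TO_CHAR, JVM_UDF }

    // Illustrative entries only; the actual strftime-mappable whitelist lives in Comet.
    static final Set<String> NATIVE_FORMATS = Set.of("yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss");

    /**
     * @param literalFormat     the format string when it is a literal, or null when it is not
     * @param sessionTimeZone   the session timezone id
     * @param allowIncompatible value of spark.comet.expression.DateFormatClass.allowIncompatible
     */
    static Path choose(String literalFormat, String sessionTimeZone, boolean allowIncompatible) {
        // Taken literally from the description: the opt-in flag forces the native path.
        if (allowIncompatible) {
            return Path.NATIVE_TO_CHAR;
        }
        boolean whitelisted = literalFormat != null && NATIVE_FORMATS.contains(literalFormat);
        boolean utc = "UTC".equals(sessionTimeZone);
        // Every other case stays inside Comet via the JVM UDF instead of falling back to Spark.
        return (whitelisted && utc) ? Path.NATIVE_TO_CHAR : Path.JVM_UDF;
    }
}
```

Under these assumptions, `choose("yyyy-MM-dd", "America/Los_Angeles", false)` and `choose(null, "UTC", false)` both select the JVM UDF path, while the same whitelisted literal with a UTC session selects native.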

Unlike the regexp engine config, no new user-visible knob is introduced — the JVM UDF is a transparent correctness fallback rather than an opt-in engine.

Behavior change

Cases that previously fell back to Spark (and broke up the Comet pipeline with row-group transitions) now stay inside the Comet operator. The native path is unchanged for the UTC + whitelisted-format case.

Implementation notes

DateFormatUDF caches one DateFormatClass instance per (format, timezone). Constructing with a literal format makes Spark's formatterOption lazy-val resolve to a reusable formatter, so the per-row work is just eval(InternalRow(micros)).
For scalar (literal-folded) formats — the common case — the cache lookup is hoisted out of the per-row loop to eliminate Tuple2 + HashMap.get allocations on the hot path.
The serde now always returns Compatible from getSupportLevel; the native-vs-UDF decision is made inside convert.
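
The caching and hoisting idea from the notes above can be illustrated with a small self-contained sketch. It deliberately uses java.time.format.DateTimeFormatter as a stand-in for Spark's DateFormatClass, so it does not reproduce Spark's formatting semantics; the class and method names are hypothetical and only the shape of the technique (one cached formatter per (format, timezone), looked up once per batch when the format is scalar) is the point.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one cached formatter per (format, timezone) pair. */
final class FormatterCacheSketch {
    private final Map<Map.Entry<String, String>, DateTimeFormatter> cache = new HashMap<>();

    /** Returns the cached formatter for the pair, building it on first use. */
    DateTimeFormatter lookup(String format, String zone) {
        return cache.computeIfAbsent(
            Map.entry(format, zone),
            k -> DateTimeFormatter.ofPattern(k.getKey()).withZone(ZoneId.of(k.getValue())));
    }

    /** Scalar format: the lookup happens once, outside the per-row loop. */
    String[] formatAll(long[] micros, String format, String zone) {
        DateTimeFormatter formatter = lookup(format, zone); // hoisted: no key allocation per row
        String[] out = new String[micros.length];
        for (int i = 0; i < micros.length; i++) {
            out[i] = formatter.format(microsToInstant(micros[i]));
        }
        return out;
    }

    /** Converts microseconds since the epoch to an Instant; floor ops handle negative values. */
    static Instant microsToInstant(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }
}
```

A per-row format column would instead call `lookup` inside the loop, which is where the key allocation and map lookup mentioned above come from.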

Test plan

CometTemporalExpressionSuite (Spark 3.5): all 27 tests pass
Three existing fallback-reason tests rewritten to assert the JVM UDF path now runs inside Comet (checkSparkAnswerAndOperator)
date_format - timestamp_ntz input now runs checkSparkAnswerAndOperator for all timezones (previously only UTC)
allowIncompatible test fixed: corrected config key (expression, not expr) and switched from answer-comparison to operator-only assertion since the native path may legitimately diverge from Spark for non-UTC timezones
Spark 3.4 / 4.0 profile sanity
Benchmark UTC native path vs JVM UDF path on a representative workload

Notes for reviewers

This is a draft pending the items above and any feedback on the routing strategy in CometDateFormat.convert.

CometDateFormat now picks between native to_char and a new DateFormatUDF that wraps Spark's DateFormatClass. Native is used when the format is a literal in the strftime-mappable whitelist and the timezone is UTC, or when spark.comet.expression.DateFormatClass.allowIncompatible is set. All other cases (non-UTC timezone, non-literal format, format outside the whitelist) now run inside Comet via the JVM UDF instead of falling back to Spark. Unlike the regexp engine config, there's no new user-facing knob: the JVM UDF is a transparent correctness fallback that delegates to Spark's own implementation, so behavior matches Spark by construction.

andygrove added 2 commits May 20, 2026 11:37

test: update date_format.sql to assert in-Comet execution

aca31e7

The JVM UDF fallback makes date_format always run inside Comet, so the expect_fallback(Non-UTC timezone) assertions no longer hold. Switch the four queries to plain coverage-and-answer checks.

andygrove wants to merge 2 commits into apache:main from andygrove:feat/date-format-jvm-udf
