feat: JVM UDF fallback for date_format#4373

Draft
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/date-format-jvm-udf

@andygrove
Member

Summary

Adds a JVM UDF fallback so that date_format always runs inside Comet:

  • Native (DataFusion to_char) when the format is a literal in the strftime-mappable whitelist and the session timezone is UTC, or when spark.comet.expression.DateFormatClass.allowIncompatible=true.
  • JVM UDF (org.apache.comet.udf.DateFormatUDF) in every other case: non-UTC timezone, non-literal format, or a format outside the whitelist. The UDF wraps Spark's own DateFormatClass and is invoked through the existing JvmScalarUdf framework, so results match Spark by construction.
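The routing rule above can be sketched roughly as follows. This is an illustrative sketch, not the code in CometDateFormat.convert: the class name, the `choose` signature, the whitelist contents, and the plain string comparison against "UTC" are all hypothetical. It follows the summary's wording literally, so `allowIncompatible` alone selects the native path; whether the real serde still requires a mappable literal format in that case is not stated here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the native-vs-UDF decision described in the summary.
final class DateFormatRouting {
    enum Path { NATIVE, JVM_UDF }

    // Placeholder whitelist: the real set of strftime-mappable formats lives in Comet
    // and is not listed in this PR description.
    static final Set<String> WHITELIST = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList("yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss")));

    // literalFormat is null when the format argument is not a literal.
    static Path choose(String literalFormat, String sessionTimeZone, boolean allowIncompatible) {
        // Assumption: the opt-in flag forces the native path regardless of the other inputs.
        if (allowIncompatible) {
            return Path.NATIVE;
        }
        boolean mappable = literalFormat != null && WHITELIST.contains(literalFormat);
        // Assumption: a simple equality check stands in for the real UTC test.
        boolean utc = "UTC".equals(sessionTimeZone);
        return (mappable && utc) ? Path.NATIVE : Path.JVM_UDF;
    }
}
```

Under these assumptions, `choose("yyyy-MM-dd", "UTC", false)` selects the native path, while a non-UTC timezone, a null (non-literal) format, or a format outside the placeholder whitelist selects the JVM UDF.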

Unlike the regexp engine config, no new user-visible knob is introduced — the JVM UDF is a transparent correctness fallback rather than an opt-in engine.

Behavior change

Cases that previously fell back to Spark (and broke up the Comet pipeline with columnar-to-row transitions) now stay inside the Comet operator. The native path is unchanged for the UTC + whitelisted-format case.

Implementation notes

  • DateFormatUDF caches one DateFormatClass instance per (format, timezone). Constructing with a literal format makes Spark's formatterOption lazy-val resolve to a reusable formatter, so the per-row work is just eval(InternalRow(micros)).
  • For scalar (literal-folded) formats, which are the common case, the cache lookup is hoisted out of the per-row loop, eliminating the per-row Tuple2 allocation and HashMap.get lookup on the hot path.
  • The serde now always returns Compatible from getSupportLevel; the native-vs-UDF decision is made inside convert.
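The caching and hoisting described in the first two notes might look something like the sketch below. It is not DateFormatUDF itself: `java.time.format.DateTimeFormatter` is swapped in for Spark's DateFormatClass so the example is self-contained, and the class and method names are hypothetical. Note that java.time pattern semantics are not guaranteed to match Spark's for every format.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one cached formatter per (format, timezone) pair.
final class FormatterCacheSketch {
    private static final Map<Map.Entry<String, String>, DateTimeFormatter> CACHE =
        new ConcurrentHashMap<>();

    // Builds the formatter on first use and reuses it afterwards.
    static DateTimeFormatter formatterFor(String format, String timeZone) {
        return CACHE.computeIfAbsent(
            new SimpleImmutableEntry<>(format, timeZone),
            k -> DateTimeFormatter.ofPattern(k.getKey()).withZone(ZoneId.of(k.getValue())));
    }

    // Scalar-format case: the key allocation and map lookup happen once per batch,
    // not once per row.
    static String[] formatAll(long[] micros, String format, String timeZone) {
        DateTimeFormatter formatter = formatterFor(format, timeZone); // hoisted out of the loop
        String[] out = new String[micros.length];
        for (int i = 0; i < micros.length; i++) {
            // Convert microseconds since the epoch to an Instant; floorDiv/floorMod
            // keep pre-1970 (negative) values correct.
            Instant t = Instant.ofEpochSecond(
                Math.floorDiv(micros[i], 1_000_000L),
                Math.floorMod(micros[i], 1_000_000L) * 1_000L);
            out[i] = formatter.format(t);
        }
        return out;
    }
}
```

For a per-row (non-literal) format column, the same `formatterFor` lookup would have to move back inside the loop, which is the cost the hoisting avoids in the scalar case.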

Test plan

  • CometTemporalExpressionSuite (Spark 3.5): all 27 tests pass
  • Three existing fallback-reason tests rewritten to assert the JVM UDF path now runs inside Comet (checkSparkAnswerAndOperator)
  • date_format - timestamp_ntz input now runs checkSparkAnswerAndOperator for all timezones (previously only UTC)
  • allowIncompatible test fixed: corrected config key (expression, not expr) and switched from answer-comparison to operator-only assertion since the native path may legitimately diverge from Spark for non-UTC timezones
  • Spark 3.4 / 4.0 profile sanity
  • Benchmark UTC native path vs JVM UDF path on a representative workload

Notes for reviewers

This is a draft pending the items above and any feedback on the routing strategy in CometDateFormat.convert.

andygrove added 2 commits May 20, 2026 11:37
CometDateFormat now picks between native to_char and a new DateFormatUDF
that wraps Spark's DateFormatClass. Native is used when the format is a
literal in the strftime-mappable whitelist and the timezone is UTC, or
when spark.comet.expression.DateFormatClass.allowIncompatible is set.
All other cases (non-UTC timezone, non-literal format, format outside
the whitelist) now run inside Comet via the JVM UDF instead of falling
back to Spark.

Unlike the regexp engine config, there's no new user-facing knob: the
JVM UDF is a transparent correctness fallback that delegates to Spark's
own implementation, so behavior matches Spark by construction.

The JVM UDF fallback makes date_format always run inside Comet, so the
expect_fallback(Non-UTC timezone) assertions no longer hold. Switch the
four queries to plain coverage-and-answer checks.