feat: JVM UDF fallback for date_format #4373
Draft
andygrove wants to merge 2 commits into
CometDateFormat now picks between native to_char and a new DateFormatUDF that wraps Spark's DateFormatClass. Native is used when the format is a literal in the strftime-mappable whitelist and the timezone is UTC, or when spark.comet.expression.DateFormatClass.allowIncompatible is set. All other cases (non-UTC timezone, non-literal format, format outside the whitelist) now run inside Comet via the JVM UDF instead of falling back to Spark. Unlike the regexp engine config, there's no new user-facing knob: the JVM UDF is a transparent correctness fallback that delegates to Spark's own implementation, so behavior matches Spark by construction.
The JVM UDF fallback makes date_format always run inside Comet, so the expect_fallback(Non-UTC timezone) assertions no longer hold. Switch the four queries to plain coverage-and-answer checks.
Summary
Adds a JVM UDF fallback so that date_format always runs inside Comet:
- Native (to_char) when the format is a literal in the strftime-mappable whitelist and the session timezone is UTC, or when spark.comet.expression.DateFormatClass.allowIncompatible=true.
- JVM UDF (org.apache.comet.udf.DateFormatUDF) in every other case: non-UTC timezone, non-literal format, or a format outside the whitelist. The UDF wraps Spark's own DateFormatClass and is invoked through the existing JvmScalarUdf framework, so results match Spark by construction.
Unlike the regexp engine config, no new user-visible knob is introduced: the JVM UDF is a transparent correctness fallback rather than an opt-in engine.
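The routing rule described above can be sketched roughly as follows. This is an illustrative Python model, not the Scala code in CometDateFormat.convert: the function name `choose_path`, the `NATIVE_FORMATS` set, and its two entries are hypothetical stand-ins, and the sketch takes the description at face value in letting allowIncompatible force the native path in every case.

```python
from typing import Optional

# Hypothetical stand-in for the strftime-mappable whitelist; the real
# whitelist is defined in Comet and is not reproduced in this description.
NATIVE_FORMATS = frozenset({"yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss"})


def choose_path(literal_format: Optional[str],
                timezone: str,
                allow_incompatible: bool = False) -> str:
    """Return "native" or "jvm_udf" for one date_format call.

    literal_format is None when the format argument is not a literal.
    """
    # Assumption: the allowIncompatible flag short-circuits every other check.
    if allow_incompatible:
        return "native"
    # Native only for a whitelisted literal format under a UTC session timezone.
    if (literal_format is not None
            and literal_format in NATIVE_FORMATS
            and timezone == "UTC"):
        return "native"
    # Everything else stays in Comet via the JVM UDF rather than falling back.
    return "jvm_udf"
```

Under these assumptions, `choose_path("yyyy-MM-dd", "UTC")` takes the native path, while a non-UTC timezone, a non-literal format, or a format outside the set all route to the JVM UDF.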
Behavior change
Cases that previously fell back to Spark (and broke up the Comet pipeline with row-group transitions) now stay inside the Comet operator. The native path is unchanged for the UTC + whitelisted-format case.
Implementation notes
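As a rough illustration of the per-(format, timezone) caching described in these notes, here is a minimal Python sketch. It is not the DateFormatUDF implementation: `formatter_for` is a hypothetical name, it uses Python strftime patterns and fixed UTC offsets instead of Spark patterns and zone IDs, and the unbounded cache is an assumption, since the notes do not say whether the real cache is bounded.

```python
from datetime import datetime, timedelta, timezone
from functools import lru_cache
from typing import Callable

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@lru_cache(maxsize=None)  # assumption: unbounded, keyed by (fmt, utc_offset_hours)
def formatter_for(fmt: str, utc_offset_hours: int) -> Callable[[int], str]:
    """Build one formatter per (format, timezone) key and reuse it afterwards."""
    tz = timezone(timedelta(hours=utc_offset_hours))

    def format_micros(micros: int) -> str:
        # Per-row work only: convert epoch microseconds and format them.
        moment = EPOCH + timedelta(microseconds=micros)
        return moment.astimezone(tz).strftime(fmt)

    return format_micros
```

Repeated calls with the same key return the same formatter object, so the setup cost is paid once per key rather than once per row; `formatter_for.cache_info()` shows how many formatters were actually built.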
- DateFormatUDF caches one DateFormatClass instance per (format, timezone). Constructing with a literal format makes Spark's formatterOption lazy-val resolve to a reusable formatter, so the per-row work is just eval(InternalRow(micros)).
- [...] Compatible from getSupportLevel; the native-vs-UDF decision is made inside convert.
Test plan
- CometTemporalExpressionSuite (Spark 3.5): all 27 tests pass
- [...] checkSparkAnswerAndOperator)
- date_format - timestamp_ntz input now runs checkSparkAnswerAndOperator for all timezones (previously only UTC)
- allowIncompatible test fixed: corrected config key (expression, not expr) and switched from answer-comparison to operator-only assertion since the native path may legitimately diverge from Spark for non-UTC timezones
Notes for reviewers
This is a draft pending the items above and any feedback on the routing strategy in CometDateFormat.convert.