feat: add user-facing CometUDF registration [experimental]#4387

Closed
andygrove wants to merge 1 commit into apache:main from andygrove:register-comet-udf

@andygrove
Member

@andygrove andygrove commented May 21, 2026

Which issue does this PR close?

Part of #4193

Supersedes #4233 (closed).

Rationale for this change

apache/main already ships the underlying JVM UDF framework: the CometUDF trait, the JvmScalarUdf proto, native dispatch via CometUdfBridge, and the Janino codegen dispatcher (#4267) for automatic ScalaUDF handling. What's missing is a way for end users to plug their own vectorized CometUDF implementation in directly, so they can hand-tune a columnar kernel for a specific function instead of going through codegen.

What changes are included in this PR?

  • CometUDFRegistry (new): a thread-safe registry mapping a Spark UDF name to a user-supplied CometUDF implementation class.
  • CometScalaUDF.convert checks the registry first; if a registered name matches, it emits a JvmScalarUdf proto targeting the user class directly with the children expressions as args. Unregistered UDFs continue through the codegen dispatcher (when enabled).
  • @org.apache.spark.annotation.Unstable on CometUDF and CometUDFRegistry to signal that the user-facing surface may evolve.
  • New user-guide page custom_comet_udfs.md documenting the contract, registration, routing precedence, and cluster deployment.
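
The routing precedence described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the PR's implementation: `SketchUdf`, `PlusOneSketch`, `SketchUdfRegistry`, `SketchRouter`, and the `Route` cases are all hypothetical names, and the real `CometScalaUDF.convert` emits a `JvmScalarUdf` proto rather than returning a value like this.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the CometUDF trait. The real trait's methods are
// not shown in this PR description, so none are assumed here.
trait SketchUdf

// Hypothetical user-supplied implementation class.
class PlusOneSketch extends SketchUdf

// Where a UDF call ends up after planning (illustrative only).
sealed trait Route
case class UserClass(className: String) extends Route // registered implementation
case object Codegen extends Route                     // codegen dispatcher
case object SparkFallback extends Route               // evaluated by Spark

object SketchUdfRegistry {
  // ConcurrentHashMap gives thread-safe put/get/remove without explicit locks.
  private val entries = new ConcurrentHashMap[String, Class[_ <: SketchUdf]]()

  def register(name: String, cls: Class[_ <: SketchUdf]): Unit = {
    entries.put(name, cls)
  }

  def unregister(name: String): Unit = {
    entries.remove(name)
  }

  def isRegistered(name: String): Boolean = entries.containsKey(name)

  // Option(...) turns the null returned for a missing key into None.
  def lookup(name: String): Option[Class[_ <: SketchUdf]] =
    Option(entries.get(name))
}

object SketchRouter {
  // Registry first, then codegen when enabled, otherwise fall back to Spark.
  def route(udfName: String, codegenEnabled: Boolean): Route =
    SketchUdfRegistry.lookup(udfName) match {
      case Some(cls)              => UserClass(cls.getName)
      case None if codegenEnabled => Codegen
      case None                   => SparkFallback
    }
}
```

Keeping the lookup in one place means a registered name always wins over codegen, which matches the "checks the registry first" behavior the PR describes.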

User-facing API:

spark.udf.register("plus_one", (x: Int) => x + 1)
CometUDFRegistry.register("plus_one", classOf[com.example.PlusOneUdf])
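
To illustrate what a hand-tuned columnar kernel for `plus_one` might look like compared with the row-at-a-time lambda, here is a rough sketch over plain Scala arrays. It deliberately avoids the real `CometUDF` trait and Arrow vector types, whose signatures are not shown in this PR: `IntColumn` and `PlusOneKernel` are hypothetical names, and the boolean validity mask is an assumed simplification of a null bitmap.

```scala
// Hypothetical columnar batch: values plus a validity mask (true = non-null).
final case class IntColumn(values: Array[Int], valid: Array[Boolean])

object PlusOneKernel {
  // Row-at-a-time form, equivalent to the (x: Int) => x + 1 lambda.
  val rowWise: Int => Int = x => x + 1

  // Column-at-a-time form: one call processes the whole batch in a tight loop.
  def evaluate(in: IntColumn): IntColumn = {
    val n = in.values.length
    val out = new Array[Int](n)
    var i = 0
    while (i < n) {
      // Null slots keep the default 0 and stay marked invalid.
      if (in.valid(i)) out(i) = in.values(i) + 1
      i += 1
    }
    // Copy the mask so the output does not alias the input's validity array.
    IntColumn(out, in.valid.clone())
  }
}
```

A possible usage, again only as an illustration:

```scala
val result = PlusOneKernel.evaluate(
  IntColumn(Array(1, 2, 3), Array(true, false, true)))
// result.values holds 2, 0, 4 and the middle slot remains null.
```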

How are these changes tested?

CometRegisteredUdfSuite:

  • Registered CometUDF runs on the native path end-to-end (`checkSparkAnswerAndOperator`).
  • An unregistered ScalaUDF falls back to Spark when codegen is disabled.
  • register / isRegistered / unregister round-trip.

[skip ci] on this commit while iterating; will drop the tag once the design is settled.

[skip ci]

Adds `CometUDFRegistry` so users can register a vectorized `CometUDF`
implementation against a Spark UDF name. The `CometScalaUDF` serde now
checks the registry first and emits a `JvmScalarUdf` targeting the
registered class directly, bypassing the Janino codegen dispatcher.
Unregistered UDFs continue through codegen when enabled, otherwise fall
back to Spark.

The `CometUDF` trait stays as defined on apache/main (the JVM UDF
framework already handles per-task instantiation and caching via
`CometUdfBridge`). `CometUDF` and `CometUDFRegistry` are marked
`@org.apache.spark.annotation.Unstable`.
@andygrove andygrove marked this pull request as draft May 21, 2026 16:27
@andygrove andygrove changed the title from "feat: add user-facing CometUDF registration" to "feat: add user-facing CometUDF registration [experimental]" May 21, 2026
@andygrove
Member Author

@mbutrovich Do you think this approach is still needed, or is the existing Spark UDF support just as good now?

@mbutrovich
Contributor

> @mbutrovich Do you think this approach is still needed, or is the existing Spark UDF support just as good now?

I'm not sure it's worth maintaining as an access path. In theory, if someone could adapt their UDF logic to be applied directly to Arrow types, then a user could elide a type change if they reimplemented as a CometUDF instead of programming against Spark UDF signature types. However, #4267 already codegens in optimizations like unsafe Arrow access (which users are unlikely to do), etc. So I think we're already in good shape from a performance standpoint. I also don't imagine users would want to maintain/debug both a CometUDF and the general ScalaUDF implementations when the former possibly provides minimal performance benefit.

@andygrove
Member Author

> > @mbutrovich Do you think this approach is still needed, or is the existing Spark UDF support just as good now?
>
> I'm not sure it's worth maintaining as an access path. In theory, if someone could adapt their UDF logic to be applied directly to Arrow types, then a user could elide a type change if they reimplemented as a CometUDF instead of programming against Spark UDF signature types. However, #4267 already codegens in optimizations like unsafe Arrow access (which users are unlikely to do), etc. So I think we're already in good shape from a performance standpoint. I also don't imagine users would want to maintain/debug both a CometUDF and the general ScalaUDF implementations when the former possibly provides minimal performance benefit.

Makes sense. Thanks, I'll go ahead and close this.

@andygrove andygrove closed this May 21, 2026