Support Levenshtein func with threshold parameter for Spark 4+ #2279

@lyne7-sc

Describe the bug

The native Levenshtein function currently delegates to DataFusion's built-in levenshtein, which only supports 2 arguments (left, right). Starting from Spark 4, the Levenshtein expression supports an optional third threshold parameter: when the edit distance exceeds the threshold, it returns -1 instead of the actual distance.

The native implementation falls back to DataFusion's levenshtein, which ignores the threshold parameter, so 3-argument calls return results that do not match Spark's.

To Reproduce

Because the native implementation lacked threshold support, the "string Levenshtein distance" test was excluded in Spark 4.0/4.1 test settings.

Expected behavior

  • levenshtein('kitten', 'sitting') → 3
  • levenshtein('kitten', 'sitting', 2) → -1 (distance 3 > threshold 2)
  • levenshtein('kitten', 'sitting', 5) → 3 (distance 3 ≤ threshold 5)
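
For reference, the expected semantics can be sketched in Rust roughly as follows. This is only an illustration of the behavior described in this issue, not the actual native implementation: the function names `levenshtein` and `levenshtein_with_threshold` are hypothetical, the distance is computed over Unicode scalar values (an assumption about how characters are counted), and a real implementation would likely stop early once the threshold is exceeded rather than computing the full distance first.

```rust
/// Plain Levenshtein distance using a two-row dynamic-programming table.
/// Assumption: characters are compared as Unicode scalar values (`char`).
fn levenshtein(left: &str, right: &str) -> i32 {
    let a: Vec<char> = left.chars().collect();
    let b: Vec<char> = right.chars().collect();

    // prev[j] holds the distance between the first i chars of `a`
    // and the first j chars of `b`; the initial row is i = 0.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];

    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()] as i32
}

/// Hypothetical 3-argument variant: returns -1 when the distance
/// exceeds `threshold`, otherwise the actual distance.
fn levenshtein_with_threshold(left: &str, right: &str, threshold: i32) -> i32 {
    let distance = levenshtein(left, right);
    if distance > threshold {
        -1
    } else {
        distance
    }
}
```

With this sketch, `levenshtein_with_threshold("kitten", "sitting", 2)` yields -1 and `levenshtein_with_threshold("kitten", "sitting", 5)` yields 3, matching the cases listed above.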
