Describe the bug
The native Levenshtein function currently delegates to DataFusion's built-in levenshtein, which supports only two arguments (left, right). Starting from Spark 4, the Levenshtein expression accepts an optional third threshold parameter: when the edit distance exceeds the threshold, it returns -1 instead of the actual distance.
Because DataFusion's levenshtein ignores the threshold, the native implementation returns results that differ from Spark's on 3-argument calls.
To Reproduce
Because the native implementation lacked threshold support, the "string Levenshtein distance" test was excluded in Spark 4.0/4.1 test settings.
Expected behavior
levenshtein('kitten', 'sitting') → 3
levenshtein('kitten', 'sitting', 2) → -1 (distance 3 > threshold 2)
levenshtein('kitten', 'sitting', 5) → 3 (distance 3 ≤ threshold 5)
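The expected semantics above can be illustrated with a minimal Python sketch. This is only a reference model of the behavior described in this issue, not the native implementation or Spark's code: the function name and signature are hypothetical, it computes the full distance before applying the threshold (a real implementation could stop early), and it does not model NULL inputs or negative thresholds.

```python
from typing import Optional


def levenshtein(left: str, right: str, threshold: Optional[int] = None) -> int:
    """Reference model: edit distance, or -1 if it exceeds the threshold."""
    # Classic two-row dynamic programming over the edit-distance table.
    prev = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        curr = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    distance = prev[-1]

    # Assumed threshold rule, taken from the examples in this issue:
    # return -1 only when the distance is strictly greater than the threshold.
    if threshold is not None and distance > threshold:
        return -1
    return distance


print(levenshtein("kitten", "sitting"))     # 2-argument form
print(levenshtein("kitten", "sitting", 2))  # distance exceeds threshold
print(levenshtein("kitten", "sitting", 5))  # distance within threshold
```

A 2-argument-only implementation corresponds to always calling this with threshold=None, which is why the 3-argument results diverge from Spark's.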
Screenshots
Additional context