Spark soundex function implementation #20725
Open
kazantsev-maksim wants to merge 23 commits into apache:main
Contributor
davidlghellin
left a comment
I'd be happy to add the SLT tests for these edge cases if you'd like — I already have them validated against Spark JVM. Just let me know!
query T
SELECT soundex('Datafusion');
----
D312
Contributor
Hey! I had actually started working on a Spark soundex implementation too and didn't realize there was already a PR for it. Happy to see this moving forward!
I had put together a battery of edge-case tests validated against Spark JVM that might be useful. The current SLT coverage is a bit thin — there are some tricky Soundex behaviors that are easy to get wrong:
from pyspark.sql import SparkSession

# Assumption: outside the pyspark shell no `spark` session exists yet, so one is created here.
spark = SparkSession.builder.getOrCreate()

tests = [
# H/W transparency (must NOT separate same codes)
("H/W transparency", "SELECT soundex('Ashcroft') AS result"),
# Separators (digit, space, vowel MUST separate same codes)
("Digit separates same-code", "SELECT soundex('B1B') AS result"),
("Space separates same-code", "SELECT soundex('B B') AS result"),
("Vowel separates same-code", "SELECT soundex('BAB') AS result"),
# Non-alpha first character (returns input unchanged)
("Non-alpha first char", "SELECT soundex('#hello') AS result"),
("Space first char", "SELECT soundex(' hello') AS result"),
("Only spaces", "SELECT soundex(' ') AS result"),
("Tab prefix", "SELECT soundex('\thello') AS result"),
("Emoji prefix", "SELECT soundex('😀hello') AS result"),
("Only digits", "SELECT soundex('123') AS result"),
("Starts with digit", "SELECT soundex('1abc') AS result"),
# Basic behavior
("Single character", "SELECT soundex('A') AS result"),
("All same-code letters", "SELECT soundex('BFPV') AS result"),
("Similar names Robert", "SELECT soundex('Robert') AS result"),
("Similar names Rupert", "SELECT soundex('Rupert') AS result"),
("NULL", "SELECT soundex(NULL) AS result"),
("Empty string", "SELECT soundex('') AS result"),
# Case insensitivity
("Lowercase", "SELECT soundex('robert') AS result"),
("Mixed case same", "SELECT soundex('rObErT') AS result"),
# Unicode
("Unicode umlaut", "SELECT soundex('Müller') AS result"),
# Truncation (only first 3 codes after initial)
("Long string", "SELECT soundex('Abcdefghijklmnop') AS result"),
# Extra edge cases
("Adjacent same codes collapse", "SELECT soundex('Lloyd') AS result"),
("W between same codes", "SELECT soundex('BWB') AS result"),
("H between same codes", "SELECT soundex('BHB') AS result"),
("Double letters", "SELECT soundex('Tymczak') AS result"),
("All vowels after first", "SELECT soundex('Aeiou') AS result"),
("First char digit rest alpha", "SELECT soundex('1Robert') AS result"),
("Hyphen in name", "SELECT soundex('Smith-Jones') AS result"),
("Single non-alpha", "SELECT soundex('#') AS result"),
("Newline prefix", "SELECT soundex('\nhello') AS result"),
]
for label, sql in tests:
    r = spark.sql(sql).collect()
    print(f"{label}: {repr(r[0].result)}")
# Multi-row column test
print("\nColumn test:")
spark.sql("""
    SELECT soundex(name) AS result
    FROM VALUES ('Robert'), ('Rupert'), (NULL), (''), ('123') AS t(name)
""").show()

Spark-3.5
Contributor
Author
Big thanks to @davidlghellin for the test cases.
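For readers following the thread, the behaviors named in the test labels above (H/W transparency, separators resetting the previous code, a non-letter first character returning the input unchanged, truncation to three codes) can be sketched in plain Python. This is only an illustration under those stated rules: it is not the Rust code in this PR and not Spark's own implementation, and the names `soundex_sketch` and `_CODES` are made up for this example.

```python
from typing import Optional

# Soundex digit for each letter A..Z.
# '0' marks vowels and Y (they separate equal codes), '7' marks H and W (transparent).
_CODES = "01230127022455012623017202"


def _is_ascii_letter(ch: str) -> bool:
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def soundex_sketch(value: Optional[str]) -> Optional[str]:
    """Hypothetical reference sketch of the rules named in the test labels."""
    if value is None or value == "":
        return value  # NULL stays NULL, empty string stays empty
    if not _is_ascii_letter(value[0]):
        return value  # non-letter first character: input returned unchanged

    result = [value[0].upper()]
    last = _CODES[ord(result[0]) - ord("A")]
    for ch in value[1:]:
        if not _is_ascii_letter(ch):
            last = "0"  # digits, spaces, punctuation act as separators
            continue
        code = _CODES[ord(ch.upper()) - ord("A")]
        if code == "7":
            continue  # H and W do not separate equal codes
        if code != "0" and code != last:
            result.append(code)
            if len(result) == 4:
                break  # keep only the first three codes after the initial
        last = code
    return "".join(result).ljust(4, "0")


print(soundex_sketch("Datafusion"))  # D312, matching the SLT case quoted earlier
print(soundex_sketch("Robert"), soundex_sketch("Rupert"))  # R163 R163
print(soundex_sketch("Ashcroft"))  # A261
```

A sketch like this is handy for cross-checking SLT expectations by hand, but the values validated against Spark JVM remain the source of truth.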
Which issue does this PR close?
N/A
Rationale for this change
Add a new Spark function: https://spark.apache.org/docs/latest/api/sql/index.html#soundex
What changes are included in this PR?
Are these changes tested?
Yes, tests added as part of this PR.
Are there any user-facing changes?
No, this is a new function.