fix: preserve spaces as token boundaries in English TN tagger #365
Merged
The English TN normalizer previously deleted all input spaces before
tagging (delete(" ").star + tagger.star), which destroyed token
boundaries and caused "4x faster" to be misparsed as a single measure
token producing "four times degrees Fahrenheit aster" (wenet-e2e/wetext#15).
Refactor the tagger composition to preserve spaces as token boundaries
(NeMo-style), using closure(punct) + classify + closure(punct) as
token units with delete(SPACE) | punct as inter-token separators.
Also remove single-letter f from unit_alternatives.tsv since 4°F
is the correct Fahrenheit input format, not 4f. Lower range tagger
weight to 1.0 so "4x" is matched as range ("four times") rather than
serial ("four x"), and fraction to 0.99 to preserve "3/4" as "three
quarters".
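The weight changes can be pictured with a small pure-Python sketch. It assumes the usual tropical-semiring convention that the lowest total weight wins. Only the 1.0 (range) and 0.99 (fraction) weights come from this commit; the competing weights, the competing reading for "3/4", and the names `CANDIDATES` and `pick` are hypothetical.

```python
# Candidate readings per token: (rule name, spoken form, weight).
# Weights 1.0 (range) and 0.99 (fraction) are taken from the commit message;
# the competing entries are hypothetical and exist only to show the tie-break.
CANDIDATES = {
    "4x": [
        ("range", "four times", 1.0),
        ("serial", "four x", 1.1),  # hypothetical weight
    ],
    "3/4": [
        ("fraction", "three quarters", 0.99),
        ("other", "<competing reading>", 1.0),  # hypothetical competitor
    ],
}


def pick(token):
    """Return the spoken form of the lowest-weight candidate for a token."""
    rule, spoken, weight = min(CANDIDATES[token], key=lambda c: c[2])
    return spoken


print(pick("4x"))   # range beats serial because 1.0 < 1.1
print(pick("3/4"))  # fraction beats the competitor because 0.99 < 1.0
```

This is only a lookup table, not the actual WFST composition; it shows why nudging one rule's weight below another's changes which reading is emitted.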
Summary
- Remove `f` from `unit_alternatives.tsv` (Fahrenheit should use `°F`, not bare `f`)
Background
Issue wenet-e2e/wetext#15 reported "4x faster" → "four times degrees Fahrenheit aster", a major normalization failure. The root cause was that the tagger deleted all input spaces (`delete(" ").star + tagger.star`) before matching, which destroyed token boundaries. Without spaces, "4xfaster" was matched as a single measure token in which `4x` → cardinal + range ("four times") and `f` → unit ("degrees Fahrenheit"), leaving "aster" as leftover text.
The fix follows NeMo's approach: preserve spaces as boundaries between classified tokens, so each rule can only consume up to a space boundary and cannot cross into adjacent words.
Test plan
- `python3 -m tn --language en --text "4x faster"` → "four times faster" (not "four times degrees Fahrenheit aster")
- `python3 -m tn --language en --text "4x"` → "four times"
- `python3 -m tn --language en --text "3/4"` → "three quarters"
- `python3 -m tn --language en --text "hello, world"` → "hello, world"
- `python3 -m tn --language en --text "4°F"` → "four degrees Fahrenheit"
- `pytest tn/english/test/`: 114 passed
- `pytest itn/english/test/`: 487 passed (no regression)
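To illustrate why the first test case changes, here is a toy pure-Python model of the failure described in the Background. It is not the pynini grammar: `measure_prefix`, the regex, and both `normalize_*` functions are hypothetical stand-ins. They only show how a rule can run across a word boundary once spaces are deleted, and how requiring a rule to consume a whole space-delimited token prevents that.

```python
import re

# Toy lookup tables; the real rules are WFSTs built from TSV files.
DIGITS = dict(zip("0123456789", ["zero", "one", "two", "three", "four",
                                 "five", "six", "seven", "eight", "nine"]))
UNIT_F = "degrees Fahrenheit"  # the bare-"f" alternative this PR removes


def measure_prefix(text):
    """Toy measure rule: digit + 'x' + optional bare 'f' unit at the start of text.

    Returns (spoken form, number of characters consumed) or None.
    """
    m = re.match(r"([0-9])x(f?)", text)
    if m is None:
        return None
    spoken = DIGITS[m.group(1)] + " times"
    if m.group(2):
        spoken += " " + UNIT_F
    return spoken, m.end()


def normalize_without_boundaries(text):
    """Old behaviour (toy): delete every space, then match rules greedily."""
    squashed = text.replace(" ", "")
    out, i = [], 0
    while i < len(squashed):
        hit = measure_prefix(squashed[i:])
        if hit:
            out.append(hit[0])
            i += hit[1]
        else:
            # Fall back to copying a run of letters (or one other character).
            m = re.match(r"[A-Za-z]+|.", squashed[i:])
            out.append(m.group(0))
            i += m.end()
    return " ".join(out)


def normalize_with_boundaries(text):
    """New behaviour (toy): a rule must consume exactly one space-delimited token."""
    out = []
    for token in text.split(" "):
        hit = measure_prefix(token)
        if hit and hit[1] == len(token):
            out.append(hit[0])
        else:
            out.append(token)
    return " ".join(out)


print(normalize_without_boundaries("4x faster"))
print(normalize_with_boundaries("4x faster"))
```

In the first function the measure rule swallows the `f` of "faster" because nothing marks where "4x" ends. In the second, "faster" is its own token, so the rule never sees it.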