OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4)#1103
krickert wants to merge 11 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
@rzo1 split it up like we discussed. OPENNLP-1850 is now four stacked PRs instead of the one big #1101 (closed):
All your licensing points are addressed, riding with whichever stack ships the data file: attribution in Each stack builds and tests green on its own. Thanks again for steering this, the split is much cleaner to review.
Pull request overview
This PR introduces the “foundation” layer for OPENNLP-1850’s Unicode normalization stack: new value types in opennlp-api (normalized text + offset mapping) and a runtime normalization engine built around CodePointSet/CharClass, Unicode whitespace/dash reference tables, and a TextNormalizer builder, along with legal/NOTICE updates for bundling Unicode UTS #39 confusables.txt.
Changes:
- Add core normalization primitives (CodePointSet, CharClass) and multiple CharSequenceNormalizer implementations (whitespace, dashes, quotes, digits, ellipsis, bullets, invisibles, case/accent folds, confusable skeleton).
- Add API value types for normalization output and offset mapping (NormalizedText, OffsetMap).
- Add extensive JUnit tests for the new normalization components and update legal/NOTICE/RAT configuration for the bundled Unicode data file.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/license/NOTICE.template | Adds NOTICE template entry for bundled Unicode UTS #39 data file. |
| rat-excludes | Adds RAT exclusion entry for confusables.txt. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeWhitespaceTest.java | Tests Unicode whitespace reference data and membership behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeDashTest.java | Tests Unicode dash reference data and normalization defaults. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeCharSequenceNormalizerTest.java | Tests composition/behavior of multiple normalizers in pipelines. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TextNormalizerTest.java | Tests TextNormalizer builder and default search pipeline. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/SetBasedNormalizerTest.java | Tests quote/digit/invisible/ellipsis/bullet normalizers. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizerTest.java | Tests German umlaut/eszett transliteration normalizer. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/DimensionTest.java | Tests Dimension default normalizer wiring. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CodePointSetTest.java | Tests CodePointSet creation, parsing, and set operations. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CharClassTest.java | Tests CharClass operations including offset-mapped outputs. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizerTest.java | Tests locale-safe and locale-specific case folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizerTest.java | Tests script-gated accent folding and stroke/ligature mapping. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/WhitespaceCharSequenceNormalizer.java | Adds Unicode whitespace collapsing+trimming normalizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeWhitespace.java | Adds Unicode whitespace reference table + O(1) membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeDash.java | Adds Unicode dash reference table + O(1) membership and defaults. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java | Adds fluent builder for composing normalizer pipelines. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/QuoteCharSequenceNormalizer.java | Adds typographic quote folding to ASCII. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfkcCharSequenceNormalizer.java | Adds NFKC normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfcCharSequenceNormalizer.java | Adds NFC normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/InvisibleCharSequenceNormalizer.java | Adds stripping of invisible/bidi control characters. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizer.java | Adds DIN-style German umlaut/eszett expansion normalizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/EllipsisCharSequenceNormalizer.java | Adds ellipsis/leader expansion to ASCII dots. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Adds normalization “dimension” ladder with lazy defaults. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DigitCharSequenceNormalizer.java | Adds Unicode decimal digit folding to ASCII digits. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java | Adds Unicode dash folding to ASCII hyphen-minus. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/ConfusableSkeletonCharSequenceNormalizer.java | Adds UTS #39 confusable skeleton normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Confusables.java | Adds confusables resource loader and skeleton algorithm implementation. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CodePointSet.java | Adds immutable O(1) membership code point set + parser/fromFile. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CharClass.java | Adds cursor-based normalization/splitting/collapsing + offset mapping. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizer.java | Adds locale-aware lowercasing normalizer with ROOT singleton. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/BulletCharSequenceNormalizer.java | Adds bullet replacement normalizer (bullets -> space). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizer.java | Adds script-gated diacritic fold with optional stroke/ligature mapping. |
| opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java | Adds normalized↔original offset mapping structure and builder. |
| opennlp-api/src/main/java/opennlp/tools/util/normalizer/NormalizedText.java | Adds NormalizedText record bundling original/normalized + OffsetMap. |
| opennlp-api/pom.xml | Adds a test-scoped JUnit dependency. |
| NOTICE | Adds NOTICE entry for bundled Unicode UTS #39 data file. |
| LICENSE | Adds Unicode License V3 text for bundled confusables data file. |
Thx for the PR. Here are some suggestions:
You pointed out a gap in the API, re: NormalizedText/OffsetMap, that I'm going to fix right now. Reworking it now; I think the new shape will be much better and will match how it works in other packages. Addressing all your points now. Thanks for the feedback, keep it coming.
Dimension javadoc forward-references Term/TermAnalyzer. Offset mapping isn't reachable through the builder. OffsetMap buffer growth overflows past ~2^30. Confusables.load() has no per-line guard. Nit: serialVersionUID = 1L vs random longs; builder() returns its own mutable builder.
Status since the last review. Offset-model items addressed; additive commits, so inline threads stay anchored.
…s, Dimension)

Adds the dependency-free normalization layer that the tokenizer, Term model, and DL integration build on in later stacks. No segmentation, no behavioral changes.
- opennlp-api: the offset-preserving value types NormalizedText and OffsetMap alongside the existing CharSequenceNormalizer contract.
- opennlp-runtime util/normalizer: the CharClass / CodePointSet engine and the Unicode White_Space / Dash sets; the normalizer rungs (NFC, NFKC, whitespace, dash, case fold, accent fold, invisible, quote, digit, ellipsis, bullet, German umlaut, confusable skeleton); the Dimension enum that single-sources the char-level steps; and the TextNormalizer builder over them.
- Bundles the Unicode confusables.txt (UTS #39) data file with its Unicode License V3 attribution across NOTICE, NOTICE.template, LICENSE, and rat-excludes.
- OffsetMap.toNormalizedOffset: use floor semantics so an original offset that falls inside a collapsed run maps to that run's single normalized character instead of the following one; correct the javadoc to match.
- CharClass.collapse: document that each run collapses to one replacement with no trimming (an all-whitespace string becomes one space; compose with trim for emptiness).
- Confusables: parse the prototype hex tokens with a cursor scan instead of a regex split, honoring the stated no-regex contract and avoiding Pattern compilation for every line during static init.
- CaseFoldCharSequenceNormalizer: fail loud with a clear message on a null locale in both the factory and the constructor instead of defaulting; the locale-independent default is the no-arg getInstance().
- LICENSE: align the embedded Unicode License V3 copyright year with the bundled data file header and NOTICE (1991-2025).
- Tests: add OffsetMapTest (the api module's first test) and ConfusableSkeletonTest, extend CharClassTest with offset cases for collapsed runs of mixed Unicode whitespace, tabs, newlines, single-char/all-whitespace/empty inputs, and surrogate pairs, and cover null-locale rejection.
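The floor semantics and the no-trimming collapse described in this commit can be illustrated with a minimal sketch. `CollapseOffsetsSketch` and its members are hypothetical names, not the PR's `OffsetMap` or `CharClass` code, and `Character.isWhitespace` is used as a JDK stand-in for the Unicode White_Space set (it does not treat U+00A0 as whitespace, for example).

```java
import java.util.Arrays;

// Hypothetical illustration only; not the OffsetMap/CharClass API from the PR.
final class CollapseOffsetsSketch {
  final String normalized;
  // originalStart[i] = offset in the input where normalized char i begins
  final int[] originalStart;

  private CollapseOffsetsSketch(String normalized, int[] originalStart) {
    this.normalized = normalized;
    this.originalStart = originalStart;
  }

  // Collapses each whitespace run to a single space, without trimming.
  static CollapseOffsetsSketch collapse(CharSequence input) {
    StringBuilder out = new StringBuilder();
    int[] starts = new int[input.length()]; // output is never longer than input
    int i = 0;
    while (i < input.length()) {
      int cp = Character.codePointAt(input, i);
      if (Character.isWhitespace(cp)) {
        int runStart = i;
        while (i < input.length()) {
          int c = Character.codePointAt(input, i);
          if (!Character.isWhitespace(c)) {
            break;
          }
          i += Character.charCount(c);
        }
        starts[out.length()] = runStart;
        out.append(' ');
      } else {
        int width = Character.charCount(cp);
        for (int k = 0; k < width; k++) {
          starts[out.length()] = i + k;
          out.append(input.charAt(i + k));
        }
        i += width;
      }
    }
    return new CollapseOffsetsSketch(out.toString(), Arrays.copyOf(starts, out.length()));
  }

  // Floor semantics: the last normalized index whose original start is <= originalOffset.
  int toNormalizedOffset(int originalOffset) {
    int pos = Arrays.binarySearch(originalStart, originalOffset);
    return pos >= 0 ? pos : Math.max(0, -pos - 2);
  }

  public static void main(String[] args) {
    CollapseOffsetsSketch c = collapse("a \t\n b");
    // Offset 3 (the newline) lies inside the collapsed run, so it maps to the run's space.
    System.out.println(c.normalized + " -> " + c.toNormalizedOffset(3));
  }
}
```

With the input `"a \t\n b"`, the four whitespace characters at offsets 1 to 4 become one space at normalized index 1, and any original offset inside that run maps to index 1 rather than to the following `b`.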
…ennlp-api

Replaces the never-released OffsetMap/NormalizedText (introduced earlier in this same unmerged work, so there is nothing to keep compatible) with Alignment, a bidirectional edit-sequence model (equal/replace runs) that maps span to span (toOriginalSpan/toNormalizedSpan), composes with andThen, and represents deletions and length-growing folds correctly. AlignedText carries it. The CharClass engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moves from opennlp-runtime to opennlp-api so opennlp-dl can reach it without depending on the whole tools runtime. CharClass exposes normalize/collapse/collapsePreserving/trim/removeAll Aligned variants; the per-character offset machinery and appendMapped helper are gone.
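To make the equal/replace run idea concrete, here is a minimal sketch of an edit-sequence alignment that maps a normalized span back to an original span. `AlignmentSketch`, its `equal`/`replace` builder methods, and the `int[]` return shape are assumptions for illustration; only the name `toOriginalSpan` comes from the commit message, and the real Alignment signature may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an equal/replace run model; not the PR's Alignment class.
final class AlignmentSketch {
  private static final class Run {
    final int origLen;
    final int normLen;
    final boolean equal;

    Run(int origLen, int normLen, boolean equal) {
      this.origLen = origLen;
      this.normLen = normLen;
      this.equal = equal;
    }
  }

  private final List<Run> runs = new ArrayList<>();

  // A run copied unchanged: same length on both sides.
  AlignmentSketch equal(int len) {
    runs.add(new Run(len, len, true));
    return this;
  }

  // A run rewritten: normLen may be 0 (deletion) or larger than origLen (expansion).
  AlignmentSketch replace(int origLen, int normLen) {
    runs.add(new Run(origLen, normLen, false));
    return this;
  }

  // Maps the normalized span [start, end) to the smallest original span covering it.
  int[] toOriginalSpan(int start, int end) {
    if (start >= end) {
      throw new IllegalArgumentException("span must be non-empty");
    }
    int origPos = 0;
    int normPos = 0;
    int lo = Integer.MAX_VALUE;
    int hi = -1;
    for (Run r : runs) {
      int normEnd = normPos + r.normLen;
      if (normPos < end && normEnd > start) {
        // Equal runs map offset by offset; replace runs map as a whole.
        int runLo = r.equal ? origPos + Math.max(start, normPos) - normPos : origPos;
        int runHi = r.equal ? origPos + Math.min(end, normEnd) - normPos : origPos + r.origLen;
        lo = Math.min(lo, runLo);
        hi = Math.max(hi, runHi);
      }
      origPos += r.origLen;
      normPos = normEnd;
    }
    if (hi < 0) {
      throw new IllegalArgumentException("span outside normalized text");
    }
    return new int[] {lo, hi};
  }

  public static void main(String[] args) {
    // "a…b" normalized to "a...b": the ellipsis is one original char, three normalized chars.
    AlignmentSketch a = new AlignmentSketch().equal(1).replace(1, 3).equal(1);
    System.out.println(Arrays.toString(a.toOriginalSpan(1, 4)));
  }
}
```

A deletion is a replace run with a normalized length of zero, so a normalized span that straddles it widens to include the deleted original characters.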
Confusables.load parsed each line's hex inside the static initializer with no per-line guard, so a malformed bundled line surfaced as an opaque ExceptionInInitializerError. Wrap the per-line parse and rethrow with the line number, mirroring CodePointSet.fromFile.
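The per-line guard can be sketched as follows. This is an illustration under assumptions, not the PR's `Confusables.load`: the class and method names are hypothetical, the input is a list of lines rather than the bundled resource, and the line shape (`source ; prototype ; type # comment`) follows the general layout of confusables.txt. Tokens are scanned with `indexOf` and `StringTokenizer` rather than a regex split.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration of a per-line parse guard; not the PR's Confusables loader.
final class GuardedTableLoaderSketch {

  static Map<Integer, String> load(List<String> lines) {
    Map<Integer, String> table = new LinkedHashMap<>();
    int lineNumber = 0;
    for (String line : lines) {
      lineNumber++;
      int hash = line.indexOf('#');
      String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
      if (data.isEmpty()) {
        continue; // blank or comment-only line
      }
      try {
        parseLine(data, table);
      } catch (RuntimeException e) {
        // Rethrow with the line number so a bad data line is easy to locate.
        throw new IllegalArgumentException(
            "Malformed entry at line " + lineNumber + ": " + line, e);
      }
    }
    return table;
  }

  private static void parseLine(String data, Map<Integer, String> table) {
    int first = data.indexOf(';');
    if (first < 0) {
      throw new IllegalArgumentException("missing ';' separator");
    }
    int second = data.indexOf(';', first + 1);
    String target = second < 0 ? data.substring(first + 1) : data.substring(first + 1, second);
    int source = Integer.parseInt(data.substring(0, first).trim(), 16);
    StringBuilder prototype = new StringBuilder();
    StringTokenizer tokens = new StringTokenizer(target, " \t");
    while (tokens.hasMoreTokens()) {
      prototype.appendCodePoint(Integer.parseInt(tokens.nextToken(), 16));
    }
    table.put(source, prototype.toString());
  }

  public static void main(String[] args) {
    System.out.println(load(List.of("# header", "0041 ; 0061 ; MA # sample entry")));
  }
}
```

If such a loader runs inside a static initializer, the wrapped exception still surfaces through `ExceptionInInitializerError`, but its cause now names the offending line.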
Restore the coverage the removed *Mapped tests had, now against the Aligned variants, and extend it: mixed Unicode-whitespace / tab / newline collapse runs, empty/single/all-whitespace inputs, surrogate-pair offsets, normalize identity, supplementary pass-through and expansion to a supplementary replacement, leading/trailing deletions in removeAll, all-whitespace trim, and a collapsePreserving run without a kept line break. Alignment gains builder-growth, three-stage andThen composition, and reverse-span-across-expansion tests.
… TextNormalizer.Builder

Give the new CharSequenceNormalizer rungs generated-style serialVersionUID values to match the existing classes instead of 1L, and split TextNormalizer's mutable builder state into a nested TextNormalizer.Builder so builder() returns a dedicated Builder rather than a TextNormalizer that is its own builder. Fluent usage (builder().nfc()...build()) is unchanged.
Add OffsetAwareNormalizer, a capability interface for rungs that can report an Alignment from their normalized output back to the input, and TextNormalizer.Builder.buildAligned() to compose a chain of them into one AlignedText that maps a span found in the fully normalized text back to original character offsets. The cursor-based rungs implement it: whitespace (collapse then trim), dashes, and invisible-control stripping, plus a new line-break-preserving whitespace rung that collapses horizontal runs to a space while keeping line and paragraph breaks as a single newline. Rungs that delegate to java.text.Normalizer (NFC/NFKC) or a fold table cannot report their edits and are rejected loudly by buildAligned(). This gives the previously engine-only Alignment.andThen and the CharClass *Aligned operations (collapseAligned, trimAligned, removeAllAligned, collapsePreservingAligned) real production consumers.
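The line-break-preserving rung can be pictured with a small sketch. It assumes one reading of the description: a whitespace run that contains a line or paragraph break becomes a single newline, and any other run becomes a single space. The class name, the `isLineBreak` set, and the use of `Character.isWhitespace` in place of the Unicode White_Space set are all assumptions, and the alignment bookkeeping is left out.

```java
// Hypothetical illustration; not the PR's collapsePreserving implementation.
final class LineBreakPreservingCollapseSketch {

  // Assumed break set: LF, CR, LINE SEPARATOR, PARAGRAPH SEPARATOR.
  static boolean isLineBreak(int cp) {
    return cp == '\n' || cp == '\r' || cp == 0x2028 || cp == 0x2029;
  }

  static String collapsePreserving(CharSequence input) {
    StringBuilder out = new StringBuilder(input.length());
    int i = 0;
    while (i < input.length()) {
      int cp = Character.codePointAt(input, i);
      if (!Character.isWhitespace(cp)) {
        out.appendCodePoint(cp);
        i += Character.charCount(cp);
        continue;
      }
      // Consume the whole whitespace run, remembering whether it held a break.
      boolean sawBreak = false;
      while (i < input.length()) {
        int c = Character.codePointAt(input, i);
        if (!Character.isWhitespace(c)) {
          break;
        }
        sawBreak |= isLineBreak(c);
        i += Character.charCount(c);
      }
      out.append(sawBreak ? '\n' : ' ');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(collapsePreserving("first   line \r\n\n  second\tline"));
  }
}
```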
Quote, Digit, Ellipsis, Bullet, and GermanUmlaut now implement OffsetAwareNormalizer, so an offset-aware TextNormalizer.buildAligned() pipeline can include them. Each is a plain per-code-point substitution (the expanding ones such as the ellipsis to three dots, the eszett to ss, and the contracting supplementary digit to one ASCII digit record a replace edit), so the alignment maps a match back to the exact source code point. Only the folds that route through java.text.Normalizer (NFC, NFKC, accent fold, confusable skeleton) or JDK case mapping (case fold) still cannot report their edits and remain rejected by buildAligned().
Add the capital eszett (U+1E9E) to the German umlaut transliteration so it expands to SS, matching the lowercase eszett to ss. Generalize the buildAligned() rejection message and javadoc so they describe the rule (per-code-point folds are offset-aware; folds that route through java.text.Normalizer or JDK case mapping are not) instead of an enumerated list that fell out of date as more folds became offset-aware. Scope the Confusables javadoc to the skeleton transform and skeleton-equality test it implements, noting that the restriction-level, mixed-script, and bidirectional mechanisms of UTS #39 are out of scope. Use a single String for both sides of the identity AlignedText produced by an empty aligned pipeline so the alignment lengths cannot diverge from the stored original for a CharSequence whose length() differs from its toString(). Tests: capital-eszett fold and its 1->2 offset mapping; a leading-insertion andThen composition proving the zero-width-at-origin branch maps correctly; a systematic plain-equals-aligned parity battery across every CharClass operation; and buildAligned() rejection when the non-alignable rung is not first and for each kind of non-alignable fold.
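The 1->2 expansion and its offset mapping can be sketched as a per-character substitution that records, for every output character, the input index it came from. `GermanFoldSketch` is hypothetical and covers only the eszett mappings named in the commits plus the lowercase umlauts in their conventional ae/oe/ue transliteration; the capital umlauts are omitted because the thread does not state how the PR cases them.

```java
import java.util.Arrays;

// Hypothetical illustration of an expanding per-character fold; not the PR's normalizer.
final class GermanFoldSketch {
  final String normalized;
  // sourceIndex[i] = index of the input char that produced normalized char i
  final int[] sourceIndex;

  private GermanFoldSketch(String normalized, int[] sourceIndex) {
    this.normalized = normalized;
    this.sourceIndex = sourceIndex;
  }

  // Returns the expansion for a folded character, or null if it passes through.
  static String expansionOf(char c) {
    switch (c) {
      case '\u00E4': return "ae"; // a with diaeresis
      case '\u00F6': return "oe"; // o with diaeresis
      case '\u00FC': return "ue"; // u with diaeresis
      case '\u00DF': return "ss"; // lowercase eszett
      case '\u1E9E': return "SS"; // capital eszett
      default: return null;
    }
  }

  static GermanFoldSketch fold(CharSequence input) {
    StringBuilder out = new StringBuilder();
    int[] src = new int[input.length() * 2]; // each char expands to at most two
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      String expansion = expansionOf(c);
      if (expansion == null) {
        src[out.length()] = i;
        out.append(c);
      } else {
        for (int k = 0; k < expansion.length(); k++) {
          src[out.length()] = i; // both output chars point at the one source char
          out.append(expansion.charAt(k));
        }
      }
    }
    return new GermanFoldSketch(out.toString(), Arrays.copyOf(src, out.length()));
  }

  public static void main(String[] args) {
    GermanFoldSketch f = fold("Stra\u00DFe");
    System.out.println(f.normalized + " " + Arrays.toString(f.sourceIndex));
  }
}
```

A match on `ss` in the folded text then resolves to the single eszett in the input, which is the property the alignment tests pin down.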
Soften the Confusables class-javadoc opening to say it follows the skeleton algorithm of UTS #39, consistent with the paragraph scoping the rest of the report out. Normalize the input of an aligned pipeline to a String once, so the stored original and the per-stage alignment lengths agree even for a CharSequence whose length() differs from its toString(); the empty pipeline already did this. Add a forward-mapping test that a toNormalizedSpan over a span ending inside a deleted run stops at the last kept character rather than over-covering the next.
The fold matches the precomposed umlaut and eszett code points; a base letter followed by a combining diaeresis (U+0308) is not a member and passes through unchanged. Note in the javadoc that the input should be NFC-composed first if it may contain decomposed forms, and add a test that pins both behaviors.
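Both behaviors can be reproduced with the JDK alone. In this sketch, `foldPrecomposedUmlautA` is a hypothetical one-character stand-in for the fold (it is not the PR's normalizer), while `java.text.Normalizer` with `Normalizer.Form.NFC` is the standard JDK API for composition.

```java
import java.text.Normalizer;

// Hypothetical stand-in for the fold, used only to show why NFC should run first.
final class ComposeBeforeFoldSketch {

  // Matches only the precomposed code point U+00E4.
  static String foldPrecomposedUmlautA(String s) {
    return s.replace("\u00E4", "ae");
  }

  // Composes base letter + combining diaeresis into U+00E4, then folds.
  static String composeThenFold(String s) {
    return foldPrecomposedUmlautA(Normalizer.normalize(s, Normalizer.Form.NFC));
  }

  public static void main(String[] args) {
    String decomposed = "Ba\u0308r"; // 'a' followed by U+0308 COMBINING DIAERESIS
    System.out.println(foldPrecomposedUmlautA(decomposed).equals(decomposed)); // unchanged
    System.out.println(composeThenFold(decomposed));
  }
}
```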
090593f to 9f2622e
Part 1/4 of OPENNLP-1850, splitting #1101 into reviewable stacked PRs.
Dependency-free normalization layer: opennlp-api value types (NormalizedText, OffsetMap) plus the opennlp-runtime util/normalizer engine — CharClass/CodePointSet, the Unicode White_Space/Dash sets, the normalizer rungs, the Dimension step ladder, the TextNormalizer builder, and the bundled Unicode confusables.txt (UTS #39) with its License V3 attribution. No tokenizer, Term, or DL changes.
Stack: foundation (this) <- tokenizer <- DL <- docs.