OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) by krickert · Pull Request #1103 · apache/opennlp

krickert · 2026-06-20T12:36:34Z

Part 1/4 of OPENNLP-1850, splitting #1101 into reviewable stacked PRs.

Dependency-free normalization layer: opennlp-api value types (NormalizedText, OffsetMap) plus the opennlp-runtime util/normalizer engine — CharClass/CodePointSet, the Unicode White_Space/Dash sets, the normalizer rungs, the Dimension step ladder, the TextNormalizer builder, and the bundled Unicode confusables.txt (UTS #39) with its License V3 attribution. No tokenizer, Term, or DL changes.

Stack: foundation (this) <- tokenizer <- DL <- docs.

krickert · 2026-06-20T12:37:29Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer + layered Term model
#1105 (3/4): Offset-safe input normalization in the DL components
#1106 (4/4): Documentation

Supersedes #1101.

krickert · 2026-06-20T12:42:48Z

@rzo1 split it up like we discussed. OPENNLP-1850 is now four stacked PRs instead of the one big #1101 (closed):

#1103, normalization foundation (base: main): the dependency-free layer, meaning the CharClass engine, the normalizer rungs, Dimension, TextNormalizer, and confusables. Lands first, lowest risk.
#1104, UAX #29 tokenizer + Term model (on #1103).
#1105, DL input normalization (on #1104): the behavioral change, isolated for the focused review you wanted. It only depends on the foundation, so once #1103 lands it can re-target straight to main, no need to wait on the tokenizer.
#1106, docs (on #1105).

All your licensing points are addressed, riding with whichever stack ships the data file: attribution in NOTICE.template so it survives regen, full Unicode License V3 text in LICENSE, the four paths in rat-excludes, and the ExtendedPictographic.txt wording fixed to "filtered subset of emoji-data.txt," not "unmodified."

Each stack builds and tests green on its own. Thanks again for steering this, the split is much cleaner to review.

Copilot

Pull request overview

This PR introduces the “foundation” layer for OPENNLP-1850’s Unicode normalization stack: new value types in opennlp-api (normalized text + offset mapping) and a runtime normalization engine built around CodePointSet/CharClass, Unicode whitespace/dash reference tables, and a TextNormalizer builder, along with legal/NOTICE updates for bundling Unicode UTS #39 confusables.txt.

Changes:

Add core normalization primitives (CodePointSet, CharClass) and multiple CharSequenceNormalizer implementations (whitespace, dashes, quotes, digits, ellipsis, bullets, invisibles, case/accent folds, confusable skeleton).
Add API value types for normalization output and offset mapping (NormalizedText, OffsetMap).
Add extensive JUnit tests for the new normalization components and update legal/NOTICE/RAT configuration for the bundled Unicode data file.
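
To make the "O(1) membership" idea behind CodePointSet concrete, here is a minimal sketch backed by java.util.BitSet. The class name SimpleCodePointSet and its methods are hypothetical and are not the PR's API; the ranges are the code points that carry the Unicode White_Space property.

```java
import java.util.BitSet;

/** Hypothetical bit-set backed code point set; not the PR's CodePointSet. */
final class SimpleCodePointSet {

    private final BitSet bits = new BitSet();

    private SimpleCodePointSet() { }

    /** Builds a set from inclusive (low, high) pairs. */
    static SimpleCodePointSet ofRanges(int... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("bounds must come in (low, high) pairs");
        }
        SimpleCodePointSet set = new SimpleCodePointSet();
        for (int i = 0; i < bounds.length; i += 2) {
            set.bits.set(bounds[i], bounds[i + 1] + 1); // BitSet.set uses an exclusive upper bound
        }
        return set;
    }

    /** Constant-time membership test: a single bit lookup. */
    boolean contains(int codePoint) {
        return codePoint >= 0 && bits.get(codePoint);
    }

    int size() {
        return bits.cardinality();
    }

    /** Code points with the Unicode White_Space property. */
    static final SimpleCodePointSet WHITE_SPACE = ofRanges(
        0x0009, 0x000D, 0x0020, 0x0020, 0x0085, 0x0085, 0x00A0, 0x00A0,
        0x1680, 0x1680, 0x2000, 0x200A, 0x2028, 0x2029, 0x202F, 0x202F,
        0x205F, 0x205F, 0x3000, 0x3000);
}
```

With this sketch, `SimpleCodePointSet.WHITE_SPACE.contains(0x00A0)` is true for NO-BREAK SPACE, while U+200B ZERO WIDTH SPACE is not a member because it does not carry White_Space.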

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
|---|---|
| src/license/NOTICE.template | Adds NOTICE template entry for bundled Unicode UTS #39 data file. |
| rat-excludes | Adds RAT exclusion entry for `confusables.txt`. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeWhitespaceTest.java | Tests Unicode whitespace reference data and membership behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeDashTest.java | Tests Unicode dash reference data and normalization defaults. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeCharSequenceNormalizerTest.java | Tests composition/behavior of multiple normalizers in pipelines. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TextNormalizerTest.java | Tests `TextNormalizer` builder and default search pipeline. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/SetBasedNormalizerTest.java | Tests quote/digit/invisible/ellipsis/bullet normalizers. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizerTest.java | Tests German umlaut/eszett transliteration normalizer. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/DimensionTest.java | Tests `Dimension` default normalizer wiring. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CodePointSetTest.java | Tests `CodePointSet` creation, parsing, and set operations. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CharClassTest.java | Tests `CharClass` operations including offset-mapped outputs. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizerTest.java | Tests locale-safe and locale-specific case folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizerTest.java | Tests script-gated accent folding and stroke/ligature mapping. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/WhitespaceCharSequenceNormalizer.java | Adds Unicode whitespace collapsing+trimming normalizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeWhitespace.java | Adds Unicode whitespace reference table + O(1) membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeDash.java | Adds Unicode dash reference table + O(1) membership and defaults. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java | Adds fluent builder for composing normalizer pipelines. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/QuoteCharSequenceNormalizer.java | Adds typographic quote folding to ASCII. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfkcCharSequenceNormalizer.java | Adds NFKC normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfcCharSequenceNormalizer.java | Adds NFC normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/InvisibleCharSequenceNormalizer.java | Adds stripping of invisible/bidi control characters. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizer.java | Adds DIN-style German umlaut/eszett expansion normalizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/EllipsisCharSequenceNormalizer.java | Adds ellipsis/leader expansion to ASCII dots. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Adds normalization “dimension” ladder with lazy defaults. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DigitCharSequenceNormalizer.java | Adds Unicode decimal digit folding to ASCII digits. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java | Adds Unicode dash folding to ASCII hyphen-minus. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/ConfusableSkeletonCharSequenceNormalizer.java | Adds UTS #39 confusable skeleton normalizer wrapper. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Confusables.java | Adds confusables resource loader and skeleton algorithm implementation. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CodePointSet.java | Adds immutable O(1) membership code point set + parser/fromFile. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CharClass.java | Adds cursor-based normalization/splitting/collapsing + offset mapping. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizer.java | Adds locale-aware lowercasing normalizer with ROOT singleton. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/BulletCharSequenceNormalizer.java | Adds bullet replacement normalizer (bullets -> space). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizer.java | Adds script-gated diacritic fold with optional stroke/ligature mapping. |
| opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java | Adds normalized↔original offset mapping structure and builder. |
| opennlp-api/src/main/java/opennlp/tools/util/normalizer/NormalizedText.java | Adds `NormalizedText` record bundling original/normalized + `OffsetMap`. |
| opennlp-api/pom.xml | Adds a test-scoped JUnit dependency. |
| NOTICE | Adds NOTICE entry for bundled Unicode UTS #39 data file. |
| LICENSE | Adds Unicode License V3 text for bundled confusables data file. |

rzo1 · 2026-06-21T17:02:52Z

Thx for the PR. Here are some suggestions:

Dimension javadoc forward-references Term/TermAnalyzer, which only arrive in #1104. Standalone javadoc on this branch emits unresolved-reference warnings. Either downgrade those {@link} to {@code}, or confirm the stack always merges as a unit.
Offset mapping isn't reachable through the builder. normalizeMapped/OffsetMap exist only directly on CharClass, while TextNormalizer.build() composes plain CharSequenceNormalizers and discards offsets, so the NormalizedText/OffsetMap capability can't be reached via TextNormalizer. Is that intentional for now?
OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2), which overflows to NegativeArraySizeException past ~2^30 chars instead of a clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
Confusables.load() has no per-line guard, so a malformed bundled line surfaces as ExceptionInInitializerError, whereas the user-facing CodePointSet.fromFile reports the offending line number. Minor inconsistency. A checksum or version assertion would also back the "unmodified upstream 17.0.0" NOTICE claim.
Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the mutable builder (a nested Builder would read cleaner).

krickert · 2026-06-21T18:01:30Z

You pointed out a gap in the API around NormalizedText/OffsetMap that I'm going to fix right now. I'm reworking it, and I think the new shape will be much better and will match how it works in other packages.

Addressing all your points now - thanks for the feedback - keep it coming.

krickert · 2026-06-21T19:18:13Z

Dimension javadoc forward-references Term/TermAnalyzer.
Dimension references Term/TermAnalyzer with {@code}, not {@link}, so standalone javadoc on this branch produces no unresolved-reference warnings.

Offset mapping isn't reachable through the builder.
You found an offset in my impl (pun intended), and the root cause was the missing composition primitive: there was no way to combine the per-stage offset maps. I got rid of OffsetMap and added Alignment.andThen, so an offset-carrying pipeline is now possible. Wiring it through TextNormalizer.build() for arbitrary CharSequenceNormalizers is a follow-up (only the CharClass-family transforms can emit an alignment cheaply; java.text.Normalizer-based stages would need ICU-style edit callbacks), but the primitive it depends on is in place.

OffsetMap buffer growth overflows past ~2^30.
OffsetMap is removed. Its replacement, Alignment.Builder, grows overflow-aware (length + (length >> 1), clamped to Integer.MAX_VALUE - 8), so it degrades to a clean OutOfMemoryError instead of NegativeArraySizeException. WordSegmenter.IntList got the same treatment (see #1104).
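
For readers following along, the growth rule described above can be sketched as follows. GrowableIntBuffer and nextCapacity are hypothetical names for illustration only and are not the PR's Alignment.Builder; the point is that the 1.5x step is computed in long arithmetic and clamped, so it cannot wrap to a negative size.

```java
import java.util.Arrays;

/** Hypothetical growable int buffer; not the PR's Alignment.Builder. */
final class GrowableIntBuffer {

    /** Conservative upper bound for an array length, as named in the comment above. */
    static final int MAX_CAPACITY = Integer.MAX_VALUE - 8;

    private int[] data = new int[16];
    private int size;

    /** Next capacity: 1.5x growth computed in long, clamped to MAX_CAPACITY. */
    static int nextCapacity(int current, int required) {
        if (required < 0 || required > MAX_CAPACITY) {
            // required < 0 means the caller's own size arithmetic already overflowed
            throw new OutOfMemoryError("Required capacity exceeds the array limit");
        }
        long grown = (long) current + (current >> 1);
        long target = Math.max(grown, required);
        return (int) Math.min(target, MAX_CAPACITY);
    }

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, nextCapacity(data.length, size + 1));
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```

By contrast, plain doubling of a 2^30-element buffer evaluates `(1 << 30) * 2`, which wraps to Integer.MIN_VALUE, and Arrays.copyOf then throws NegativeArraySizeException.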

Confusables.load() has no per-line guard.
Fixed. The per-line parse is wrapped and rethrows an IllegalStateException naming the offending line number, mirroring CodePointSet.fromFile, instead of surfacing a raw ExceptionInInitializerError. (A bundled-file checksum/version assertion is a reasonable follow-up but is left out here.)
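
A minimal sketch of the per-line guard pattern, assuming a "SOURCE ; TARGET... ; TYPE" hex layout with '#' comments. GuardedLineParser, parseLine, and parseAll are hypothetical names, and the exact message format is an assumption rather than what Confusables.load() emits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical line parser showing the guard; not the PR's Confusables.load(). */
final class GuardedLineParser {

    /** Returns {source, target...} code points, or null for blank and comment lines. */
    static int[] parseLine(String line) {
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            line = line.substring(1); // tolerate a leading byte order mark
        }
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return null;
        }
        int firstSemi = data.indexOf(';');
        if (firstSemi < 0) {
            throw new IllegalArgumentException("missing ';' separator");
        }
        int secondSemi = data.indexOf(';', firstSemi + 1);
        String targets = secondSemi < 0
            ? data.substring(firstSemi + 1)
            : data.substring(firstSemi + 1, secondSemi);

        List<Integer> codePoints = new ArrayList<>();
        codePoints.add(Integer.parseInt(data.substring(0, firstSemi).trim(), 16));
        // Cursor scan over whitespace-separated hex tokens, no regex involved.
        int i = 0;
        while (i < targets.length()) {
            while (i < targets.length() && Character.isWhitespace(targets.charAt(i))) i++;
            int start = i;
            while (i < targets.length() && !Character.isWhitespace(targets.charAt(i))) i++;
            if (i > start) {
                codePoints.add(Integer.parseInt(targets.substring(start, i), 16));
            }
        }
        return codePoints.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Wraps each line's parse so a failure names the 1-based line number. */
    static List<int[]> parseAll(List<String> lines) {
        List<int[]> entries = new ArrayList<>();
        for (int n = 0; n < lines.size(); n++) {
            try {
                int[] entry = parseLine(lines.get(n));
                if (entry != null) {
                    entries.add(entry);
                }
            } catch (RuntimeException e) {
                throw new IllegalStateException(
                    "Malformed entry at line " + (n + 1) + ": " + lines.get(n), e);
            }
        }
        return entries;
    }
}
```

Because NumberFormatException is a RuntimeException, a bad hex token and a missing separator both surface through the same line-numbered IllegalStateException, with the original exception kept as the cause.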

Nit: serialVersionUID = 1L vs random longs; builder() returns its own mutable builder.
Although I'm camp "1L" for various reasons, I don't mind either way. Changing that now.

krickert · 2026-06-22T02:54:57Z

Status since the last review. Offset-model items addressed; additive commits, so inline threads stay anchored.

buildAligned() + OffsetAwareNormalizer give the *Aligned API a real consumer: every per-code-point fold (whitespace, line-break-preserving whitespace, dashes, invisible-strip, quotes, digits, ellipsis, bullets, umlaut) composes into one AlignedText. Folds that route through java.text.Normalizer or JDK case mapping are rejected loudly, naming the rung.
Capital eszett U+1E9E folds to SS. The buildAligned() rejection message states the rule instead of a stale list. Confusables javadoc is scoped to the skeleton plus equality test (restriction-level, mixed-script, and bidi are out of scope). An empty aligned pipeline normalizes its input to one String.
Alignment.andThen leading-insertion is not a bug: Math.max(start, end) already yields the zero-width span. Added a test that proves it.
New tests: CharClass plain-vs-aligned parity battery, leading-insertion compose, capital-eszett offsets, buildAligned() rejection at every index and fold type, toNormalizedSpan no over-cover across deletions.
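
The "rejected loudly, naming the rung" behavior can be illustrated with a small capability check. Everything here (AlignedPipelineCheck, Rung, OffsetAwareRung, the exception type, and the message wording) is a hypothetical stand-in, not the PR's OffsetAwareNormalizer or buildAligned().

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical capability check; not the PR's buildAligned(). */
final class AlignedPipelineCheck {

    /** Stand-in for a normalizer rung. */
    interface Rung {
        String name();
        String apply(String text);
    }

    /** Marker for rungs that could also report their edits. */
    interface OffsetAwareRung extends Rung { }

    static Rung plain(String name, UnaryOperator<String> fn) {
        return new Rung() {
            @Override public String name() { return name; }
            @Override public String apply(String text) { return fn.apply(text); }
        };
    }

    static OffsetAwareRung offsetAware(String name, UnaryOperator<String> fn) {
        return new OffsetAwareRung() {
            @Override public String name() { return name; }
            @Override public String apply(String text) { return fn.apply(text); }
        };
    }

    /** Fails on the first rung lacking the capability, naming it and its index. */
    static void requireOffsetAware(List<Rung> rungs) {
        for (int i = 0; i < rungs.size(); i++) {
            Rung rung = rungs.get(i);
            if (!(rung instanceof OffsetAwareRung)) {
                throw new IllegalStateException(
                    "Rung '" + rung.name() + "' at index " + i + " cannot report an alignment");
            }
        }
    }
}
```

Checking every index rather than only the first rung is what lets a non-alignable stage be caught wherever it sits in the chain.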

Copilot

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated no new comments.

…s, Dimension)

Adds the dependency-free normalization layer that the tokenizer, Term model, and DL integration build on in later stacks. No segmentation, no behavioral changes.

- opennlp-api: the offset-preserving value types NormalizedText and OffsetMap alongside the existing CharSequenceNormalizer contract.
- opennlp-runtime util/normalizer: the CharClass / CodePointSet engine and the Unicode White_Space / Dash sets; the normalizer rungs (NFC, NFKC, whitespace, dash, case fold, accent fold, invisible, quote, digit, ellipsis, bullet, German umlaut, confusable skeleton); the Dimension enum that single-sources the char-level steps; and the TextNormalizer builder over them.
- Bundles the Unicode confusables.txt (UTS #39) data file with its Unicode License V3 attribution across NOTICE, NOTICE.template, LICENSE, and rat-excludes.

- OffsetMap.toNormalizedOffset: use floor semantics so an original offset that falls inside a collapsed run maps to that run's single normalized character instead of the following one; correct the javadoc to match.
- CharClass.collapse: document that each run collapses to one replacement with no trimming (an all-whitespace string becomes one space; compose with trim for emptiness).
- Confusables: parse the prototype hex tokens with a cursor scan instead of a regex split, honoring the stated no-regex contract and avoiding Pattern compilation for every line during static init.
- CaseFoldCharSequenceNormalizer: fail loud with a clear message on a null locale in both the factory and the constructor instead of defaulting; the locale-independent default is the no-arg getInstance().
- LICENSE: align the embedded Unicode License V3 copyright year with the bundled data file header and NOTICE (1991-2025).
- Tests: add OffsetMapTest (the api module's first test) and ConfusableSkeletonTest, extend CharClassTest with offset cases for collapsed runs of mixed Unicode whitespace, tabs, newlines, single-char/all-whitespace/empty inputs, and surrogate pairs, and cover null-locale rejection.
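
The floor semantics in the first item can be sketched like this. CollapsingOffsets is a hypothetical illustration, not the (since removed) OffsetMap, and it uses Character.isWhitespace as a stand-in for the Unicode White_Space set, which is an approximation.

```java
import java.util.Arrays;

/** Hypothetical whitespace collapse with an original-to-normalized floor lookup. */
final class CollapsingOffsets {

    final String normalized;
    private final int[] originalStart; // originalStart[i] = original offset where normalized char i begins
    private final int originalLength;

    CollapsingOffsets(String original) {
        StringBuilder out = new StringBuilder(original.length());
        int[] starts = new int[original.length()];
        int i = 0;
        while (i < original.length()) {
            starts[out.length()] = i;
            if (Character.isWhitespace(original.charAt(i))) {
                out.append(' '); // the whole run collapses to one space, no trimming
                while (i < original.length() && Character.isWhitespace(original.charAt(i))) i++;
            } else {
                out.append(original.charAt(i));
                i++;
            }
        }
        normalized = out.toString();
        originalStart = Arrays.copyOf(starts, out.length());
        originalLength = original.length();
    }

    /** Floor lookup: the normalized char whose source run contains the original offset. */
    int toNormalizedOffset(int originalOffset) {
        if (originalOffset < 0 || originalOffset > originalLength) {
            throw new IndexOutOfBoundsException("offset " + originalOffset);
        }
        if (originalOffset == originalLength) {
            return normalized.length(); // end-of-text maps to end-of-text
        }
        int pos = Arrays.binarySearch(originalStart, originalOffset);
        // Not found: binarySearch returns -(insertionPoint) - 1; the floor is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

For the input `"a \t\n b"`, which normalizes to `"a b"`, original offset 3 (the newline inside the collapsed run) maps to normalized offset 1, the single space, rather than to offset 2, the following `b`.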

…ennlp-api

Replaces the never-released OffsetMap/NormalizedText (introduced earlier in this same unmerged work, so there is nothing to keep compatible) with Alignment, a bidirectional edit-sequence model (equal/replace runs) that maps span to span (toOriginalSpan/toNormalizedSpan), composes with andThen, and represents deletions and length-growing folds correctly. AlignedText carries it.

The CharClass engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moves from opennlp-runtime to opennlp-api so opennlp-dl can reach it without depending on the whole tools runtime. CharClass exposes normalize/collapse/collapsePreserving/trim/removeAll Aligned variants; the per-character offset machinery and appendMapped helper are gone.
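
As a rough picture of an equal/replace edit-sequence model, the sketch below maps a normalized span back to an original span. EditAlignment and its methods are hypothetical; the real Alignment API, its andThen composition, and its exact edge-case rules are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical edit-run alignment; not the PR's Alignment class. */
final class EditAlignment {

    // Each run is {originalLength, normalizedLength, isEqual ? 1 : 0}.
    private final List<int[]> runs = new ArrayList<>();

    /** Text copied through unchanged. */
    EditAlignment equal(int length) {
        runs.add(new int[] {length, length, 1});
        return this;
    }

    /** Text rewritten; normalizedLength 0 models a deletion. */
    EditAlignment replace(int originalLength, int normalizedLength) {
        runs.add(new int[] {originalLength, normalizedLength, 0});
        return this;
    }

    /** Maps the half-open normalized span [start, end) to a half-open original span. */
    int[] toOriginalSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad span [" + start + ", " + end + ")");
        }
        int origStart = -1;
        int origEnd = -1;
        int o = 0; // original offset where the current run begins
        int n = 0; // normalized offset where the current run begins
        for (int[] run : runs) {
            int oLen = run[0];
            int nLen = run[1];
            boolean isEqual = run[2] == 1;
            if (origStart < 0 && start < n + nLen) {
                // Run holding the first covered char; a replace run is covered whole.
                origStart = isEqual ? o + (start - n) : o;
            }
            if (origEnd < 0 && end > n && end <= n + nLen) {
                // Run holding the last covered char (end - 1).
                origEnd = isEqual ? o + (end - n) : o + oLen;
            }
            o += oLen;
            n += nLen;
        }
        if (origStart < 0) origStart = o;        // span starts at or past the end
        if (start == end) origEnd = origStart;   // empty span stays zero-width
        else if (origEnd < 0) origEnd = o;       // end past the normalized text: clamp
        return new int[] {origStart, origEnd};
    }
}
```

For an original of `a…b` normalized to `a...b`, the runs would be `equal(1).replace(1, 3).equal(1)`, and any span that touches one of the three dots maps back to the single ellipsis character at original offsets [1, 2).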

Confusables.load parsed each line's hex inside the static initializer with no per-line guard, so a malformed bundled line surfaced as an opaque ExceptionInInitializerError. Wrap the per-line parse and rethrow with the line number, mirroring CodePointSet.fromFile.

Restore the coverage the removed *Mapped tests had, now against the Aligned variants, and extend it: mixed Unicode-whitespace / tab / newline collapse runs, empty/single/all-whitespace inputs, surrogate-pair offsets, normalize identity, supplementary pass-through and expansion to a supplementary replacement, leading/trailing deletions in removeAll, all-whitespace trim, and a collapsePreserving run without a kept line break. Alignment gains builder-growth, three-stage andThen composition, and reverse-span-across-expansion tests.

… TextNormalizer.Builder

Give the new CharSequenceNormalizer rungs generated-style serialVersionUID values to match the existing classes instead of 1L, and split TextNormalizer's mutable builder state into a nested TextNormalizer.Builder so builder() returns a dedicated Builder rather than a TextNormalizer that is its own builder. Fluent usage (builder().nfc()...build()) is unchanged.
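
The nested-builder shape can be sketched as below. Pipeline and its two steps are hypothetical and deliberately tiny; they only show how mutable builder state is separated from the immutable built object, not how TextNormalizer.Builder is actually implemented.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical immutable pipeline with a nested mutable builder. */
final class Pipeline {

    private final List<UnaryOperator<String>> steps;

    private Pipeline(List<UnaryOperator<String>> steps) {
        this.steps = List.copyOf(steps); // immutable snapshot, detached from the builder
    }

    static Builder builder() {
        return new Builder();
    }

    String normalize(String input) {
        String text = input;
        for (UnaryOperator<String> step : steps) {
            text = step.apply(text);
        }
        return text;
    }

    /** All mutable state lives here, so a built Pipeline never changes. */
    static final class Builder {

        private final List<UnaryOperator<String>> steps = new ArrayList<>();

        private Builder() { }

        Builder nfc() {
            steps.add(s -> Normalizer.normalize(s, Normalizer.Form.NFC));
            return this;
        }

        Builder lowerCase() {
            steps.add(s -> s.toLowerCase(Locale.ROOT));
            return this;
        }

        Pipeline build() {
            return new Pipeline(steps);
        }
    }
}
```

Fluent use reads the same as before the split, for example `Pipeline.builder().nfc().lowerCase().build()`, but continuing to call methods on the builder afterwards no longer affects a pipeline that was already built.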

Add OffsetAwareNormalizer, a capability interface for rungs that can report an Alignment from their normalized output back to the input, and TextNormalizer.Builder.buildAligned() to compose a chain of them into one AlignedText that maps a span found in the fully normalized text back to original character offsets. The cursor-based rungs implement it: whitespace (collapse then trim), dashes, and invisible-control stripping, plus a new line-break-preserving whitespace rung that collapses horizontal runs to a space while keeping line and paragraph breaks as a single newline. Rungs that delegate to java.text.Normalizer (NFC/NFKC) or a fold table cannot report their edits and are rejected loudly by buildAligned(). This gives the previously engine-only Alignment.andThen and the CharClass *Aligned operations (collapseAligned, trimAligned, removeAllAligned, collapsePreservingAligned) real production consumers.
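
A minimal sketch of the line-break-preserving collapse described above, without the alignment part. LineBreakPreservingCollapse is a hypothetical name; which code points count as whitespace and as line breaks is an assumption here (JDK predicates plus NEL), and the sketch does not trim.

```java
/** Hypothetical line-break-preserving whitespace collapse; not the PR's rung. */
final class LineBreakPreservingCollapse {

    /** Assumed line and paragraph breaks: LF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR. */
    static boolean isLineBreak(int cp) {
        return cp == '\n' || cp == '\r' || cp == 0x0085 || cp == 0x2028 || cp == 0x2029;
    }

    /** Approximates Unicode White_Space with JDK predicates plus NEL. */
    static boolean isWhitespace(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp) || cp == 0x0085;
    }

    static String collapse(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (!isWhitespace(cp)) {
                out.appendCodePoint(cp);
                i += Character.charCount(cp);
                continue;
            }
            // Consume the whole whitespace run, remembering whether it held a break.
            boolean sawBreak = false;
            while (i < input.length()) {
                int ws = input.codePointAt(i);
                if (!isWhitespace(ws)) {
                    break;
                }
                sawBreak |= isLineBreak(ws);
                i += Character.charCount(ws);
            }
            out.append(sawBreak ? '\n' : ' ');
        }
        return out.toString();
    }
}
```

A run such as `" \r\n\r\n "` therefore becomes a single newline, while a purely horizontal run such as `"  \t "` becomes a single space.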

Quote, Digit, Ellipsis, Bullet, and GermanUmlaut now implement OffsetAwareNormalizer, so an offset-aware TextNormalizer.buildAligned() pipeline can include them. Each is a plain per-code-point substitution (the expanding ones such as the ellipsis to three dots, the eszett to ss, and the contracting supplementary digit to one ASCII digit record a replace edit), so the alignment maps a match back to the exact source code point. Only the folds that route through java.text.Normalizer (NFC, NFKC, accent fold, confusable skeleton) or JDK case mapping (case fold) still cannot report their edits and remain rejected by buildAligned().
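
To illustrate how a per-code-point substitution can keep offsets, including the contracting case of a supplementary digit, here is a sketch that records one original offset per output char. DigitFoldWithOffsets is hypothetical and uses Character.isDigit/Character.digit, which may differ from how DigitCharSequenceNormalizer selects and folds digits; it also records plain offsets rather than the PR's replace edits.

```java
import java.util.Arrays;

/** Hypothetical digit fold that remembers where each output char came from. */
final class DigitFoldWithOffsets {

    final String normalized;
    /** originalOffsets[i] = UTF-16 offset in the input of the source of normalized char i. */
    final int[] originalOffsets;

    DigitFoldWithOffsets(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // Output is never longer than the input: a digit becomes one char, the rest is copied.
        int[] offsets = new int[input.length()];
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            int width = Character.charCount(cp);
            if (Character.isDigit(cp)) {
                // One ASCII digit replaces the code point, even if it was a surrogate pair.
                offsets[out.length()] = i;
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                // Pass-through: each UTF-16 unit maps to its own offset.
                for (int k = 0; k < width; k++) {
                    offsets[out.length()] = i + k;
                    out.append(input.charAt(i + k));
                }
            }
            i += width;
        }
        normalized = out.toString();
        originalOffsets = Arrays.copyOf(offsets, out.length());
    }
}
```

For an input of `x`, U+1D7CE MATHEMATICAL BOLD DIGIT ZERO (two UTF-16 units), then `y`, the normalized text is `x0y` and the `y` maps back to original offset 3, not 2.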

Add the capital eszett (U+1E9E) to the German umlaut transliteration so it expands to SS, matching the lowercase eszett to ss. Generalize the buildAligned() rejection message and javadoc so they describe the rule (per-code-point folds are offset-aware; folds that route through java.text.Normalizer or JDK case mapping are not) instead of an enumerated list that fell out of date as more folds became offset-aware. Scope the Confusables javadoc to the skeleton transform and skeleton-equality test it implements, noting that the restriction-level, mixed-script, and bidirectional mechanisms of UTS #39 are out of scope. Use a single String for both sides of the identity AlignedText produced by an empty aligned pipeline so the alignment lengths cannot diverge from the stored original for a CharSequence whose length() differs from its toString(). Tests: capital-eszett fold and its 1->2 offset mapping; a leading-insertion andThen composition proving the zero-width-at-origin branch maps correctly; a systematic plain-equals-aligned parity battery across every CharClass operation; and buildAligned() rejection when the non-alignable rung is not first and for each kind of non-alignable fold.

Soften the Confusables class-javadoc opening to say it follows the skeleton algorithm of UTS #39, consistent with the paragraph scoping the rest of the report out. Normalize the input of an aligned pipeline to a String once, so the stored original and the per-stage alignment lengths agree even for a CharSequence whose length() differs from its toString(); the empty pipeline already did this. Add a forward-mapping test that a toNormalizedSpan over a span ending inside a deleted run stops at the last kept character rather than over-covering the next.

The fold matches the precomposed umlaut and eszett code points; a base letter followed by a combining diaeresis (U+0308) is not a member and passes through unchanged. Note in the javadoc that the input should be NFC-composed first if it may contain decomposed forms, and add a test that pins both behaviors.
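
The precomposed-versus-decomposed behavior can be reproduced with a small sketch. UmlautFoldSketch is hypothetical and covers only the lowercase umlauts and the two eszett forms mentioned in this thread; the NFC step uses the JDK's java.text.Normalizer.

```java
import java.text.Normalizer;

/** Hypothetical subset of a German umlaut fold; not the PR's normalizer. */
final class UmlautFoldSketch {

    /** Matches precomposed code points only; anything else passes through. */
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00E4': out.append("ae"); break; // a with diaeresis
                case '\u00F6': out.append("oe"); break; // o with diaeresis
                case '\u00FC': out.append("ue"); break; // u with diaeresis
                case '\u00DF': out.append("ss"); break; // lowercase eszett
                case '\u1E9E': out.append("SS"); break; // capital eszett
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Composes to NFC first so base letter + U+0308 becomes the precomposed form. */
    static String composeThenFold(String input) {
        return fold(Normalizer.normalize(input, Normalizer.Form.NFC));
    }
}
```

With the decomposed input `u` + U+0308 followed by `ber`, `fold` leaves the text unchanged, while `composeThenFold` yields `ueber`.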

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105

Draft

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Draft

OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:42

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57

krickert requested review from atarora, jzonthemtn, mawiesne and rzo1 June 20, 2026 14:57

Copilot AI reviewed Jun 20, 2026

krickert requested a review from Copilot June 22, 2026 03:00

Copilot started reviewing on behalf of krickert June 22, 2026 03:00

Copilot AI reviewed Jun 22, 2026

krickert added 9 commits June 21, 2026 23:48

krickert added 2 commits June 21, 2026 23:48

krickert force-pushed the OPENNLP-1850-1-foundation branch from 090593f to 9f2622e Compare June 22, 2026 03:51

krickert wants to merge 11 commits into main from OPENNLP-1850-1-foundation
