Skip to content

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4)#1103

Draft
krickert wants to merge 11 commits into
mainfrom
OPENNLP-1850-1-foundation
Draft

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4)#1103
krickert wants to merge 11 commits into
mainfrom
OPENNLP-1850-1-foundation

Conversation

@krickert

Copy link
Copy Markdown
Contributor

Part 1/4 of OPENNLP-1850, splitting #1101 into reviewable stacked PRs.

Dependency-free normalization layer: opennlp-api value types (NormalizedText, OffsetMap) plus the opennlp-runtime util/normalizer engine — CharClass/CodePointSet, the Unicode White_Space/Dash sets, the normalizer rungs, the Dimension step ladder, the TextNormalizer builder, and the bundled Unicode confusables.txt (UTS #39) with its License V3 attribution. No tokenizer, Term, or DL changes.

Stack: foundation (this) <- tokenizer <- DL <- docs.

@krickert

Copy link
Copy Markdown
Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

@krickert

Copy link
Copy Markdown
Contributor Author

@rzo1 split it up like we discussed. OPENNLP-1850 is now four stacked PRs instead of the one big #1101 (closed):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103, normalization foundation (base main): the dependency-free layer, meaning the CharClass engine, the normalizer rungs, Dimension, TextNormalizer, and confusables. Lands first, lowest risk.
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104, UAX OPENNLP-910: Add checkstyle #29 tokenizer + Term model (on OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103).
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105, DL input normalization (on OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104): the behavioral change, isolated for the focused review you wanted. It only depends on the foundation, so once OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 lands it can re-target straight to main, no need to wait on the tokenizer.
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106, docs (on OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105).

All your licensing points are addressed, riding with whichever stack ships the data file: attribution in NOTICE.template so it survives regen, full Unicode License V3 text in LICENSE, the four paths in rat-excludes, and the ExtendedPictographic.txt wording fixed to "filtered subset of emoji-data.txt," not "unmodified."

Each stack builds and tests green on its own. Thanks again for steering this, the split is much cleaner to review.

@krickert krickert marked this pull request as draft June 20, 2026 14:42
@krickert krickert requested a review from Copilot June 20, 2026 14:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the “foundation” layer for OPENNLP-1850’s Unicode normalization stack: new value types in opennlp-api (normalized text + offset mapping) and a runtime normalization engine built around CodePointSet/CharClass, Unicode whitespace/dash reference tables, and a TextNormalizer builder, along with legal/NOTICE updates for bundling Unicode UTS #39 confusables.txt.

Changes:

  • Add core normalization primitives (CodePointSet, CharClass) and multiple CharSequenceNormalizer implementations (whitespace, dashes, quotes, digits, ellipsis, bullets, invisibles, case/accent folds, confusable skeleton).
  • Add API value types for normalization output and offset mapping (NormalizedText, OffsetMap).
  • Add extensive JUnit tests for the new normalization components and update legal/NOTICE/RAT configuration for the bundled Unicode data file.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/license/NOTICE.template Adds NOTICE template entry for bundled Unicode UTS #39 data file.
rat-excludes Adds RAT exclusion entry for confusables.txt.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeWhitespaceTest.java Tests Unicode whitespace reference data and membership behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeDashTest.java Tests Unicode dash reference data and normalization defaults.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeCharSequenceNormalizerTest.java Tests composition/behavior of multiple normalizers in pipelines.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TextNormalizerTest.java Tests TextNormalizer builder and default search pipeline.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/SetBasedNormalizerTest.java Tests quote/digit/invisible/ellipsis/bullet normalizers.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizerTest.java Tests German umlaut/eszett transliteration normalizer.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/DimensionTest.java Tests Dimension default normalizer wiring.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CodePointSetTest.java Tests CodePointSet creation, parsing, and set operations.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CharClassTest.java Tests CharClass operations including offset-mapped outputs.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizerTest.java Tests locale-safe and locale-specific case folding behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizerTest.java Tests script-gated accent folding and stroke/ligature mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/WhitespaceCharSequenceNormalizer.java Adds Unicode whitespace collapsing+trimming normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeWhitespace.java Adds Unicode whitespace reference table + O(1) membership.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeDash.java Adds Unicode dash reference table + O(1) membership and defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java Adds fluent builder for composing normalizer pipelines.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/QuoteCharSequenceNormalizer.java Adds typographic quote folding to ASCII.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfkcCharSequenceNormalizer.java Adds NFKC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfcCharSequenceNormalizer.java Adds NFC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/InvisibleCharSequenceNormalizer.java Adds stripping of invisible/bidi control characters.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizer.java Adds DIN-style German umlaut/eszett expansion normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/EllipsisCharSequenceNormalizer.java Adds ellipsis/leader expansion to ASCII dots.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java Adds normalization “dimension” ladder with lazy defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DigitCharSequenceNormalizer.java Adds Unicode decimal digit folding to ASCII digits.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java Adds Unicode dash folding to ASCII hyphen-minus.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/ConfusableSkeletonCharSequenceNormalizer.java Adds UTS #39 confusable skeleton normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Confusables.java Adds confusables resource loader and skeleton algorithm implementation.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CodePointSet.java Adds immutable O(1) membership code point set + parser/fromFile.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CharClass.java Adds cursor-based normalization/splitting/collapsing + offset mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizer.java Adds locale-aware lowercasing normalizer with ROOT singleton.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/BulletCharSequenceNormalizer.java Adds bullet replacement normalizer (bullets -> space).
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizer.java Adds script-gated diacritic fold with optional stroke/ligature mapping.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Adds normalized↔original offset mapping structure and builder.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/NormalizedText.java Adds NormalizedText record bundling original/normalized + OffsetMap.
opennlp-api/pom.xml Adds a test-scoped JUnit dependency.
NOTICE Adds NOTICE entry for bundled Unicode UTS #39 data file.
LICENSE Adds Unicode License V3 text for bundled confusables data file.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Outdated
Comment thread opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Outdated
Comment thread rat-excludes
Comment thread LICENSE Outdated
@rzo1

rzo1 commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Thx for the PR. Here are some suggestions:

  • Dimension javadoc forward-references Term/TermAnalyzer, which only arrive in OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104. Standalone javadoc on this branch emits unresolved-reference warnings. Either downgrade those {@link} to {@code}, or confirm the stack always merges as a unit.
  • Offset mapping isn't reachable through the builder. normalizeMapped/OffsetMap exist only directly on CharClass, while TextNormalizer.build() composes plain CharSequenceNormalizers and discards offsets, so the NormalizedText/OffsetMap capability can't be reached via TextNormalizer. Is that intentional for now?
  • OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2), which overflows to NegativeArraySizeException past ~2^30 chars instead of a clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
  • Confusables.load() has no per-line guard, so a malformed bundled line surfaces as ExceptionInInitializerError, whereas the user-facing CodePointSet.fromFile reports the offending line number. Minor inconsistency. A checksum or version assertion would also back the "unmodified upstream 17.0.0" NOTICE claim.
  • Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the mutable builder (a nested Builder would read cleaner).

@krickert

Copy link
Copy Markdown
Contributor Author

You pointed out a gap in the API that I'm going to fix right now re: NormalizedText/OffsetMap.. Reworking it now - and I think the new way would be a much better shape that matches how it works in other packages.

Addressing all your points now - thanks for the feedback - keep it coming.

@krickert

Copy link
Copy Markdown
Contributor Author

Dimension javadoc forward-references Term/TermAnalyzer.
Dimension references Term/TermAnalyzer with {@code}, not {@link}, so standalone javadoc on this branch produces no unresolved-reference warnings.

Offset mapping isn't reachable through the builder.
You found an offset in my impl (pun intended), and the root cause was the missing composition primitive: there was no way to combine the per-stage offset maps. I got rid of the OffsetMapping and added Alignment.andThen so an offset-carrying pipeline is now possible. Wiring it through TextNormalizer.build() for arbitrary
CharSequenceNormalizers is a follow-up (only the CharClass-family transforms can emit an alignment cheaply; java.text.Normalizer-based stages would need ICU-style edit callbacks), but the primitive it depends on is in place.

OffsetMap buffer growth overflows past ~2^30.
OffsetMap is removed. Its replacement, Alignment.Builder, grows overflow-aware
(length + (length >> 1), clamped to Integer.MAX_VALUE - 8), so it degrades to a clean OutOfMemoryError instead of NegativeArraySizeException. WordSegmenter.IntList got the same treatment (see #1104).

Confusables.load() has no per-line guard.
Fixed. The per-line parse is wrapped and rethrows an IllegalStateException naming the offending line number, mirroring CodePointSet.fromFile, instead of surfacing a raw ExceptionInInitializerError. (A bundled-file checksum/version assertion is a reasonable follow-up but is left out here.)

Nit: serialVersionUID = 1L vs random longs; builder() returns its own mutable builder.
Although I'm camp "1L" for various reasons, I don't mind either way. Changing that now.

@krickert

Copy link
Copy Markdown
Contributor Author

Status since the last review. Offset-model items addressed; additive commits, so inline threads stay anchored.

  • buildAligned() + OffsetAwareNormalizer give the *Aligned API a real consumer: every per-code-point fold (whitespace, line-break-preserving whitespace, dashes, invisible-strip, quotes, digits, ellipsis, bullets, umlaut) composes into one AlignedText. Folds that route through java.text.Normalizer or JDK case mapping are rejected loudly, naming the rung.
  • Capital eszett U+1E9E folds to SS. buildAligned() reject text states the rule instead of a stale list. Confusables javadoc scoped to the skeleton plus equality test (restriction-level, mixed-script, bidi out of scope). Empty aligned pipeline normalizes input to one String.
  • Alignment.andThen leading-insertion is not a bug: Math.max(start, end) already yields the zero-width span. Added a test that proves it.
  • New tests: CharClass plain-vs-aligned parity battery, leading-insertion compose, capital-eszett offsets, buildAligned() rejection at every index and fold type, toNormalizedSpan no over-cover across deletions.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated no new comments.

krickert added 9 commits June 21, 2026 23:48
…s, Dimension)

Adds the dependency-free normalization layer that the tokenizer, Term model, and
DL integration build on in later stacks. No segmentation, no behavioral changes.

- opennlp-api: the offset-preserving value types NormalizedText and OffsetMap
  alongside the existing CharSequenceNormalizer contract.
- opennlp-runtime util/normalizer: the CharClass / CodePointSet engine and the
  Unicode White_Space / Dash sets; the normalizer rungs (NFC, NFKC, whitespace,
  dash, case fold, accent fold, invisible, quote, digit, ellipsis, bullet,
  German umlaut, confusable skeleton); the Dimension enum that single-sources the
  char-level steps; and the TextNormalizer builder over them.
- Bundles the Unicode confusables.txt (UTS #39) data file with its Unicode
  License V3 attribution across NOTICE, NOTICE.template, LICENSE, and rat-excludes.
- OffsetMap.toNormalizedOffset: use floor semantics so an original offset that
  falls inside a collapsed run maps to that run's single normalized character
  instead of the following one; correct the javadoc to match.
- CharClass.collapse: document that each run collapses to one replacement with no
  trimming (an all-whitespace string becomes one space; compose with trim for
  emptiness).
- Confusables: parse the prototype hex tokens with a cursor scan instead of a
  regex split, honoring the stated no-regex contract and avoiding Pattern
  compilation for every line during static init.
- CaseFoldCharSequenceNormalizer: fail loud with a clear message on a null locale
  in both the factory and the constructor instead of defaulting; the
  locale-independent default is the no-arg getInstance().
- LICENSE: align the embedded Unicode License V3 copyright year with the bundled
  data file header and NOTICE (1991-2025).
- Tests: add OffsetMapTest (the api module's first test) and ConfusableSkeletonTest,
  extend CharClassTest with offset cases for collapsed runs of mixed Unicode
  whitespace, tabs, newlines, single-char/all-whitespace/empty inputs, and
  surrogate pairs, and cover null-locale rejection.
…ennlp-api

Replaces the never-released OffsetMap/NormalizedText (introduced earlier in this same
unmerged work, so there is nothing to keep compatible) with Alignment, a bidirectional
edit-sequence model (equal/replace runs) that maps span to span (toOriginalSpan/
toNormalizedSpan), composes with andThen, and represents deletions and length-growing
folds correctly. AlignedText carries it.

The CharClass engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moves from
opennlp-runtime to opennlp-api so opennlp-dl can reach it without depending on the whole
tools runtime. CharClass exposes normalize/collapse/collapsePreserving/trim/removeAll
Aligned variants; the per-character offset machinery and appendMapped helper are gone.
Confusables.load parsed each line's hex inside the static initializer with no per-line
guard, so a malformed bundled line surfaced as an opaque ExceptionInInitializerError.
Wrap the per-line parse and rethrow with the line number, mirroring CodePointSet.fromFile.
Restore the coverage the removed *Mapped tests had, now against the Aligned variants, and
extend it: mixed Unicode-whitespace / tab / newline collapse runs, empty/single/all-whitespace
inputs, surrogate-pair offsets, normalize identity, supplementary pass-through and expansion to a
supplementary replacement, leading/trailing deletions in removeAll, all-whitespace trim, and a
collapsePreserving run without a kept line break. Alignment gains builder-growth, three-stage
andThen composition, and reverse-span-across-expansion tests.
… TextNormalizer.Builder

Give the new CharSequenceNormalizer rungs generated-style serialVersionUID values to match the
existing classes instead of 1L, and split TextNormalizer's mutable builder state into a nested
TextNormalizer.Builder so builder() returns a dedicated Builder rather than a TextNormalizer that
is its own builder. Fluent usage (builder().nfc()...build()) is unchanged.
Add OffsetAwareNormalizer, a capability interface for rungs that can report an
Alignment from their normalized output back to the input, and
TextNormalizer.Builder.buildAligned() to compose a chain of them into one
AlignedText that maps a span found in the fully normalized text back to original
character offsets.

The cursor-based rungs implement it: whitespace (collapse then trim), dashes, and
invisible-control stripping, plus a new line-break-preserving whitespace rung that
collapses horizontal runs to a space while keeping line and paragraph breaks as a
single newline. Rungs that delegate to java.text.Normalizer (NFC/NFKC) or a fold
table cannot report their edits and are rejected loudly by buildAligned().

This gives the previously engine-only Alignment.andThen and the CharClass *Aligned
operations (collapseAligned, trimAligned, removeAllAligned, collapsePreservingAligned)
real production consumers.
Quote, Digit, Ellipsis, Bullet, and GermanUmlaut now implement OffsetAwareNormalizer,
so an offset-aware TextNormalizer.buildAligned() pipeline can include them. Each is a
plain per-code-point substitution (the expanding ones such as the ellipsis to three dots,
the eszett to ss, and the contracting supplementary digit to one ASCII digit record a
replace edit), so the alignment maps a match back to the exact source code point.

Only the folds that route through java.text.Normalizer (NFC, NFKC, accent fold,
confusable skeleton) or JDK case mapping (case fold) still cannot report their edits and
remain rejected by buildAligned().
Add the capital eszett (U+1E9E) to the German umlaut transliteration so it
expands to SS, matching the lowercase eszett to ss.

Generalize the buildAligned() rejection message and javadoc so they describe
the rule (per-code-point folds are offset-aware; folds that route through
java.text.Normalizer or JDK case mapping are not) instead of an enumerated
list that fell out of date as more folds became offset-aware.

Scope the Confusables javadoc to the skeleton transform and skeleton-equality
test it implements, noting that the restriction-level, mixed-script, and
bidirectional mechanisms of UTS #39 are out of scope.

Use a single String for both sides of the identity AlignedText produced by an
empty aligned pipeline so the alignment lengths cannot diverge from the stored
original for a CharSequence whose length() differs from its toString().

Tests: capital-eszett fold and its 1->2 offset mapping; a leading-insertion
andThen composition proving the zero-width-at-origin branch maps correctly;
a systematic plain-equals-aligned parity battery across every CharClass
operation; and buildAligned() rejection when the non-alignable rung is not
first and for each kind of non-alignable fold.
krickert added 2 commits June 21, 2026 23:48
Soften the Confusables class-javadoc opening to say it follows the skeleton
algorithm of UTS #39, consistent with the paragraph scoping the rest of the
report out.

Normalize the input of an aligned pipeline to a String once, so the stored
original and the per-stage alignment lengths agree even for a CharSequence whose
length() differs from its toString(); the empty pipeline already did this.

Add a forward-mapping test that a toNormalizedSpan over a span ending inside a
deleted run stops at the last kept character rather than over-covering the next.
The fold matches the precomposed umlaut and eszett code points; a base letter
followed by a combining diaeresis (U+0308) is not a member and passes through
unchanged. Note in the javadoc that the input should be NFC-composed first if it
may contain decomposed forms, and add a test that pins both behaviors.
@krickert krickert force-pushed the OPENNLP-1850-1-foundation branch from 090593f to 9f2622e Compare June 22, 2026 03:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants