OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) by krickert · Pull Request #1106 · apache/opennlp

krickert · 2026-06-20T12:36:52Z

Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer and the layered Term model
#1105 (3/4): Offset-safe input normalization in the DL components
#1106 (4/4): Document Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

File	Description
opennlp-docs/src/docbkx/tokenizer.xml	Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs.
opennlp-docs/src/docbkx/opennlp.xml	Includes the new normalizer chapter in the book build.
opennlp-docs/src/docbkx/normalizer.xml	New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data.
opennlp-docs/src/docbkx/namefinder.xml	Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options.
opennlp-docs/src/docbkx/introduction.xml	Links DL inference Unicode handling to the normalizer documentation.
opennlp-docs/src/docbkx/doccat.xml	Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates.

rzo1 · 2026-06-21T17:04:48Z

Thx for the PR. Here are some suggestions:

Declare xmlns:xlink explicitly on the chapter roots of normalizer.xml and tokenizer.xml. Both use xlink:href (normalizer.xml:457, tokenizer.xml:461) but rely on the DocBook 5.0 DTD's #FIXED default to bind the prefix. The Maven build resolves the DTD so it works, but it breaks under any non-validating namespace-aware tool (IDE linters, xmllint --nonet), and every other chapter declares it explicitly:
normalizer.xml: <chapter xml:id="tools.normalizer" xmlns:xlink="http://www.w3.org/1999/xlink">
tokenizer.xml: <chapter xml:id="tools.tokenizer" xmlns:xlink="http://www.w3.org/1999/xlink">
Note that tokenizer.xml newly introduces xlink usage, so this is the first chapter to add it there.
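
The failure mode described above can be reproduced with the JDK's own XML parser. This is a minimal sketch, not part of the PR: the class name, the two sample documents, and the link target are made up for illustration, and the samples carry no DOCTYPE, so nothing can supply a default binding for the xlink prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XlinkPrefixCheck {

    // Illustrative documents only; they mirror the chapter roots quoted above.
    public static final String UNDECLARED =
        "<chapter xml:id=\"tools.normalizer\">"
        + "<link xlink:href=\"https://example.org/\">ref</link></chapter>";

    public static final String DECLARED =
        "<chapter xml:id=\"tools.normalizer\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
        + "<link xlink:href=\"https://example.org/\">ref</link></chapter>";

    /** Parses namespace-aware and non-validating; returns true if the parse succeeds. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // every prefix must be bound to a namespace
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;  // an unbound prefix is reported as a fatal parse error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("undeclared prefix parses: " + parses(UNDECLARED));
        System.out.println("declared prefix parses:   " + parses(DECLARED));
    }
}
```

With the explicit xmlns:xlink declaration the document is namespace-well-formed on its own, independent of whether a tool loads the DTD.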

Otherwise the content is accurate.

krickert · 2026-06-22T02:58:20Z

Status since the last review. Normalizer chapter gains an "Offset-aware pipelines" section for buildAligned() and the capability interface with a worked dash-fold span example, the line-break-preserving rung in the fold table, and the supplementary-dash offset note in the DL fold options. Name-finder chapter names OffsetMappingNameFinder behind findInOriginal. docbkx HTML builds clean. Rebased onto the updated stack.
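
For readers following the supplementary-dash note, the offset shift can be sketched with plain JDK code. This is an illustration under stated assumptions, not the chapter's worked example and not the OpenNLP API: DashFoldOffsets, Folded, fold, and toOriginal are hypothetical names, and isDash covers only a three-entry subset of the Unicode Dash property (U+10EAD, YEZIDI HYPHENATION MARK, is the supplementary-plane member).

```java
import java.util.Arrays;

public class DashFoldOffsets {

    /** Hypothetical holder: the folded text plus, per folded char, its offset in the original. */
    public static final class Folded {
        public final String text;
        private final int[] originalOffset; // text.length() + 1 entries; the last is a sentinel

        Folded(String text, int[] originalOffset) {
            this.text = text;
            this.originalOffset = originalOffset;
        }

        /** Maps a [start, end) span in the folded text to a [start, end) span in the original. */
        public int[] toOriginal(int start, int end) {
            return new int[] {originalOffset[start], originalOffset[end]};
        }
    }

    /** Illustrative subset of the Dash property: en dash, em dash, and one supplementary dash. */
    static boolean isDash(int cp) {
        return cp == 0x2013 || cp == 0x2014 || cp == 0x10EAD;
    }

    /** Folds each dash to '-' and records where every output char started in the input. */
    public static Folded fold(String input) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[input.length() + 1];
        int n = 0;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int width = Character.charCount(cp); // 2 for supplementary code points
            if (isDash(cp)) {
                map[n++] = i;                    // one output char, even if the input used two
                out.append('-');
            } else {
                for (int k = 0; k < width; k++) {
                    map[n++] = i + k;
                }
                out.appendCodePoint(cp);
            }
            i += width;
        }
        map[n] = input.length();                 // sentinel so exclusive end offsets map too
        return new Folded(out.toString(), Arrays.copyOf(map, n + 1));
    }

    public static void main(String[] args) {
        String input = "A" + new String(Character.toChars(0x10EAD)) + "B C";
        Folded folded = fold(input);
        int start = folded.text.indexOf('B');    // offset 2 in the folded text
        int[] span = folded.toOriginal(start, start + 1);
        System.out.println(input.length() + " -> " + folded.text.length()); // prints "6 -> 5"
        System.out.println(input.substring(span[0], span[1]));              // prints "B"
    }
}
```

The match sits at offset 2 in the folded text but at offset 3 in the input, which is the gap that mapping back to the original closes.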

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

+					<emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
+					Unicode Character Database (for example the <code>White_Space</code> and
+					<code>Dash</code> properties), not from the JVM's locale-dependent or quirky
+					character predicates. The library never relies on
+					<code>Character.isWhitespace</code>, which disagrees with the Unicode standard.
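
The disagreement can be checked with the JDK alone. The sketch below is not from the PR and does not use the engine's membership tables; it assumes the JDK regex property \p{IsWhite_Space} as a stand-in for the Unicode White_Space property, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class WhitespaceMismatch {

    // JDK regex view of the Unicode White_Space binary property (a stand-in, see lead-in).
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    /** True if the JDK regex engine classifies the code point as White_Space. */
    public static boolean unicodeWhiteSpace(int codePoint) {
        return WHITE_SPACE.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+00A0 NO-BREAK SPACE: White_Space in Unicode, but excluded by Character.isWhitespace.
        // U+001C FILE SEPARATOR: accepted by Character.isWhitespace, but not White_Space.
        // U+0020 SPACE: both predicates agree.
        int[] samples = {0x00A0, 0x001C, 0x0020};
        for (int cp : samples) {
            System.out.printf("U+%04X Character.isWhitespace=%b White_Space=%b%n",
                cp, Character.isWhitespace(cp), unicodeWhiteSpace(cp));
        }
    }
}
```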


+			Each normalizer implements the existing
+			<code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
+			(<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
+			obtained through <code>getInstance()</code>. They can therefore be combined with the
+			existing <code>AggregateCharSequenceNormalizer</code>, or with the
+			<code>TextNormalizer</code> builder described below.
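
The composition pattern in this paragraph can be sketched without the library on the classpath. Everything below is a stand-in: Normalizer only mirrors the single-method shape quoted above, DashFold and SpaceCollapse are hypothetical singletons, and chain is written in the spirit of an aggregate normalizer rather than copied from one.

```java
public class NormalizerChainSketch {

    /** Stand-in with the same single-method shape as the interface quoted above. */
    public interface Normalizer {
        CharSequence normalize(CharSequence text);
    }

    /** Hypothetical stateless singleton: folds en and em dashes to '-'. */
    public static final class DashFold implements Normalizer {
        private static final DashFold INSTANCE = new DashFold();
        private DashFold() { }
        public static DashFold getInstance() { return INSTANCE; }
        @Override public CharSequence normalize(CharSequence text) {
            return text.toString().replace('\u2013', '-').replace('\u2014', '-');
        }
    }

    /** Hypothetical stateless singleton: collapses runs of space, tab, and NBSP to one space. */
    public static final class SpaceCollapse implements Normalizer {
        private static final SpaceCollapse INSTANCE = new SpaceCollapse();
        private SpaceCollapse() { }
        public static SpaceCollapse getInstance() { return INSTANCE; }
        @Override public CharSequence normalize(CharSequence text) {
            return text.toString().replaceAll("[ \\t\\u00A0]+", " ");
        }
    }

    /** Applies each step in order and feeds the result of one into the next. */
    public static Normalizer chain(Normalizer... steps) {
        return text -> {
            CharSequence current = text;
            for (Normalizer step : steps) {
                current = step.normalize(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        Normalizer pipeline = chain(DashFold.getInstance(), SpaceCollapse.getInstance());
        System.out.println(pipeline.normalize("a \u00A0 b\u2014c")); // prints "a b-c"
    }
}
```

Because the steps hold no state, the same instances can be shared across pipelines and threads, which is what makes the singleton accessor safe to reuse.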


…nd DL handling Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term model, and the Aligned offset variants that return an AlignedText carrying an Alignment), extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components' Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal. All embedded ONNX snippets are self-contained and compile.

…ligned) Add an "Offset-aware pipelines" section to the normalizer chapter covering TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability interface, mapping a match back to the source with AlignedText/Alignment, and the fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new line-break-preserving whitespace rung in the normalizer family table.

… the manual Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder capability interface, detectable with a plain instanceof check, so the name-finder chapter matches how the normalizer chapter presents OffsetAwareNormalizer.

…gits, ellipsis, bullets, umlaut)

…old options Note in the normalizer manual that, with dash folding enabled, a dash in the supplementary planes shrinks from two UTF-16 units to one and shifts later offsets, so find reports offsets into the normalized text in that case while findInOriginal maps them back to the original input. The one-for-one whitespace fold versus the run-collapsing whitespace rung is already covered in the same section.

Scope the "never relies on Character.isWhitespace" statement to the normalization engine rather than the whole library. Note that getInstance() gives the default shared instance and that case and accent folding also offer configured forms. Refer to the conformance file by its full name WordBreakTest.txt.

…enizer manual The Word Tokenizer section said it drops punctuation and keeps emoji without noting that emoji means any Extended_Pictographic code point, so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept. Match the WordTokenizer class javadoc.
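
The segment-then-filter behaviour described in the last commit message can be approximated with the JDK's word-boundary iterator. This is a rough sketch, not the WordTokenizer from the PR: java.text.BreakIterator is not claimed to be UAX #29 conformant, the class and method names are hypothetical, and the three hard-coded symbols stand in for the full Extended_Pictographic set.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {

    /** Illustrative subset only: copyright, trademark, and double-exclamation signs. */
    static boolean isKeptSymbol(int cp) {
        return cp == 0x00A9 || cp == 0x2122 || cp == 0x203C;
    }

    /** Keeps segments that contain a letter, a digit, or one of the kept symbols. */
    public static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            boolean keep = segment.codePoints()
                .anyMatch(cp -> Character.isLetterOrDigit(cp) || isKeptSymbol(cp));
            if (keep) {
                out.add(segment); // punctuation-only and whitespace-only segments are dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!")); // prints [Hello, world]
    }
}
```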

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Draft

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Draft

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57 View session

krickert requested review from mawiesne and rzo1 June 20, 2026 14:58

Copilot AI reviewed Jun 20, 2026

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-3-dl branch from 8534bb3 to 5154da4 Compare June 21, 2026 19:00

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from d7d316f to 0ff5d07 Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-3-dl branch from 5154da4 to c51f37d Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 667e850 to d71e472 Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-4-docs branch from b65c0de to 0022bc1 Compare June 22, 2026 00:19

krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 038e23d to bc401d3 Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-4-docs branch from 0022bc1 to 2fd9543 Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10

krickert force-pushed the OPENNLP-1850-4-docs branch from 2fd9543 to 743a955 Compare June 22, 2026 02:10

krickert requested a review from Copilot June 22, 2026 02:59

Copilot started reviewing on behalf of krickert June 22, 2026 02:59 View session

Copilot AI reviewed Jun 22, 2026

krickert force-pushed the OPENNLP-1850-3-dl branch from 4c12897 to 2006d1d Compare June 22, 2026 03:31

krickert force-pushed the OPENNLP-1850-4-docs branch from 743a955 to 213ab50 Compare June 22, 2026 03:31

krickert force-pushed the OPENNLP-1850-3-dl branch from 2006d1d to fdd329f Compare June 22, 2026 03:51

krickert force-pushed the OPENNLP-1850-4-docs branch from 213ab50 to 8475b41 Compare June 22, 2026 03:51

krickert force-pushed the OPENNLP-1850-3-dl branch from fdd329f to 47a39bf Compare June 22, 2026 03:59

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 76ec3b7 to def4b08 Compare June 22, 2026 05:34

krickert added 7 commits June 22, 2026 01:48

OPENNLP-1850 Document the offset-aware substitution folds (quotes, di…

dc44533

…gits, ellipsis, bullets, umlaut)

krickert force-pushed the OPENNLP-1850-4-docs branch from def4b08 to d7838a5 Compare June 22, 2026 05:49

krickert wants to merge 7 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs
