OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft
krickert wants to merge 7 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

Conversation

@krickert


Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.

@krickert


OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. #1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. #1104 (2/4): UAX #29 word tokenizer and the layered Term model
  3. #1105 (3/4): Offset-safe input normalization in the DL components
  4. #1106 (4/4): Documentation for Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot AI left a comment


Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

  • Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
  • Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
  • Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |


Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 Compare June 20, 2026 20:16
@rzo1

rzo1 commented Jun 21, 2026


Thx for the PR. Here are some suggestions:

  • Declare xmlns:xlink explicitly on the chapter roots of normalizer.xml and tokenizer.xml. Both use xlink:href (normalizer.xml:457, tokenizer.xml:461) but rely on the DocBook 5.0 DTD's #FIXED default to bind the prefix. The Maven build resolves the DTD so it works, but it breaks under any non-validating namespace-aware tool (IDE linters, xmllint --nonet), and every other chapter declares it explicitly:
    normalizer.xml: <chapter xml:id="tools.normalizer" xmlns:xlink="http://www.w3.org/1999/xlink">
    tokenizer.xml: <chapter xml:id="tools.tokenizer" xmlns:xlink="http://www.w3.org/1999/xlink">
    Note that tokenizer.xml did not use xlink before this PR, so the declaration has to be added there for the first time.
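
The failure mode described above can be reproduced with the JDK's built-in parser. This is a minimal sketch, not part of the PR: the two chapter fragments and the `https://example.org/` href are hypothetical placeholders, and the class name `XlinkPrefixCheck` is invented for illustration. It parses each fragment with a namespace-aware, non-validating parser and no DTD, so no `#FIXED` attribute default is available to bind the prefix.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XlinkPrefixCheck {

  // Hypothetical chapter fragments; the href is a placeholder, not taken from the PR.
  static final String UNBOUND =
      "<chapter xml:id=\"tools.tokenizer\">"
      + "<link xlink:href=\"https://example.org/\">UAX #29</link></chapter>";

  static final String BOUND =
      "<chapter xml:id=\"tools.tokenizer\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
      + "<link xlink:href=\"https://example.org/\">UAX #29</link></chapter>";

  /** Returns true if the XML is accepted by a namespace-aware, non-validating parser. */
  static boolean parses(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // every prefix must be bound in the document itself
      factory.setValidating(false);    // no DTD is read, so no attribute defaults apply
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors without printing to stderr
      builder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false; // for UNBOUND this is the unbound "xlink" prefix error
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("without xmlns:xlink: " + parses(UNBOUND)); // false
    System.out.println("with xmlns:xlink:    " + parses(BOUND));   // true
  }
}
```

Declaring the prefix on the chapter root makes the file self-describing, which is why it parses regardless of whether the tool can fetch the DTD.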

Otherwise the content is accurate.

@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 8534bb3 to 5154da4 Compare June 21, 2026 19:00
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from d7d316f to 0ff5d07 Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 5154da4 to c51f37d Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 667e850 to d71e472 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from b65c0de to 0022bc1 Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 038e23d to bc401d3 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 0022bc1 to 2fd9543 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 2fd9543 to 743a955 Compare June 22, 2026 02:10
@krickert


Status since the last review: the normalizer chapter gains an "Offset-aware pipelines" section covering buildAligned() and the capability interface, with a worked dash-fold span example; the line-break-preserving rung is listed in the fold table; and the DL fold options carry the supplementary-dash offset note. The name-finder chapter names OffsetMappingNameFinder as the interface behind findInOriginal. The docbkx HTML builds clean. Rebased onto the updated stack.

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +36 to +40
<emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
Unicode Character Database (for example the <code>White_Space</code> and
<code>Dash</code> properties), not from the JVM's locale-dependent or quirky
character predicates. The library never relies on
<code>Character.isWhitespace</code>, which disagrees with the Unicode standard.
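
The disagreement mentioned in the quoted passage can be observed with the JDK alone. The sketch below is an illustration rather than the PR's engine: it uses the `White_Space` binary property exposed by `java.util.regex.Pattern` as a stand-in for the library's own membership sets, and the class name `WhitespaceMismatch` is hypothetical. The JDK's regex property is itself an approximation of the Unicode Character Database, so treat the output as indicative.

```java
import java.util.regex.Pattern;

class WhitespaceMismatch {

  // Stand-in for a UCD-sourced membership set (assumption: not the PR's CharClass engine).
  private static final Pattern UNICODE_WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

  /** True if the code point matches the regex engine's White_Space property. */
  static boolean unicodeWhiteSpace(int codePoint) {
    return UNICODE_WHITE_SPACE.matcher(new String(Character.toChars(codePoint))).matches();
  }

  public static void main(String[] args) {
    // SPACE, NO-BREAK SPACE, NEXT LINE, NARROW NO-BREAK SPACE, FILE SEPARATOR
    int[] samples = {0x0020, 0x00A0, 0x0085, 0x202F, 0x001C};
    for (int cp : samples) {
      System.out.printf("U+%04X  Character.isWhitespace=%b  White_Space=%b%n",
          cp, Character.isWhitespace(cp), unicodeWhiteSpace(cp));
    }
  }
}
```

`Character.isWhitespace` rejects the no-break spaces and U+0085 while accepting the U+001C to U+001F separators, which is the opposite of the Unicode property in each of those cases.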
Comment on lines +78 to +83
Each normalizer implements the existing
<code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
(<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
obtained through <code>getInstance()</code>. They can therefore be combined with the
existing <code>AggregateCharSequenceNormalizer</code>, or with the
<code>TextNormalizer</code> builder described below.
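
The composition pattern this passage describes (single-method normalizers, shared stateless instances, applied in sequence) can be sketched without the library on the classpath. Everything below is a hypothetical stand-in: the local `Normalizer` interface only mirrors the `CharSequence normalize(CharSequence)` shape quoted above, and `LowerCase`, `CollapseSpaces`, and `Aggregate` are invented for illustration, not classes from the PR.

```java
import java.util.List;
import java.util.Locale;

class NormalizerChainSketch {

  /** Local stand-in mirroring the single-method shape quoted above. */
  interface Normalizer {
    CharSequence normalize(CharSequence text);
  }

  /** Hypothetical stateless rung exposed as a shared singleton. */
  static final class LowerCase implements Normalizer {
    private static final LowerCase INSTANCE = new LowerCase();
    private LowerCase() {}
    static LowerCase getInstance() { return INSTANCE; }
    @Override public CharSequence normalize(CharSequence text) {
      return text.toString().toLowerCase(Locale.ROOT);
    }
  }

  /** Hypothetical stateless rung that collapses runs of ASCII spaces. */
  static final class CollapseSpaces implements Normalizer {
    private static final CollapseSpaces INSTANCE = new CollapseSpaces();
    private CollapseSpaces() {}
    static CollapseSpaces getInstance() { return INSTANCE; }
    @Override public CharSequence normalize(CharSequence text) {
      return text.toString().replaceAll(" +", " ");
    }
  }

  /** Applies each rung in order, feeding one output into the next. */
  static final class Aggregate implements Normalizer {
    private final List<Normalizer> rungs;
    Aggregate(Normalizer... rungs) { this.rungs = List.of(rungs); }
    @Override public CharSequence normalize(CharSequence text) {
      CharSequence current = text;
      for (Normalizer rung : rungs) {
        current = rung.normalize(current);
      }
      return current;
    }
  }

  static String run(String input) {
    Normalizer chain = new Aggregate(LowerCase.getInstance(), CollapseSpaces.getInstance());
    return chain.normalize(input).toString();
  }

  public static void main(String[] args) {
    System.out.println(run("Hello   WORLD")); // hello world
  }
}
```

Because each rung holds no state, the same instance can be shared across chains and threads, which is what makes the singleton accessor safe in this pattern.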
Comment thread opennlp-docs/src/docbkx/tokenizer.xml Outdated
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 4c12897 to 2006d1d Compare June 22, 2026 03:31
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 743a955 to 213ab50 Compare June 22, 2026 03:31
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 2006d1d to fdd329f Compare June 22, 2026 03:51
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 213ab50 to 8475b41 Compare June 22, 2026 03:51
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from fdd329f to 47a39bf Compare June 22, 2026 03:59
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 76ec3b7 to def4b08 Compare June 22, 2026 05:34
krickert added 7 commits June 22, 2026 01:48
…nd DL handling

Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term
model, and the Aligned offset variants that return an AlignedText carrying an Alignment),
extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components'
Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal.
All embedded ONNX snippets are self-contained and compile.
…ligned)

Add an "Offset-aware pipelines" section to the normalizer chapter covering
TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability
interface, mapping a match back to the source with AlignedText/Alignment, and the
fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new
line-break-preserving whitespace rung in the normalizer family table.
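
One way to see why NFC and NFKC are hard to make offset-aware is to look at the JDK's `java.text.Normalizer`, which returns only the normalized string. The sketch below is illustrative; whether the PR's rungs delegate to this class is an assumption, and the class name `NormalizationLengthDemo` is hypothetical.

```java
import java.text.Normalizer;

class NormalizationLengthDemo {

  static String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
    String ligature = "\uFB01";    // LATIN SMALL LIGATURE FI

    // NFC composes two UTF-16 units into one; NFKC expands one unit into two.
    System.out.println(decomposed.length() + " -> " + nfc(decomposed).length()); // 2 -> 1
    System.out.println(ligature.length() + " -> " + nfkc(ligature).length());    // 1 -> 2
  }
}
```

The call reports the result but not which source units merged or expanded, so a pipeline that needs an alignment cannot recover one from it; rejecting such rungs up front avoids silently wrong offsets.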
… the manual

Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder
capability interface, detectable with a plain instanceof check, so the name-finder
chapter matches how the normalizer chapter presents OffsetAwareNormalizer.
…old options

Note in the normalizer manual that, with dash folding enabled, a dash in the
supplementary planes shrinks from two UTF-16 units to one and shifts later
offsets, so find reports offsets into the normalized text in that case while
findInOriginal maps them back to the original input. The one-for-one whitespace
fold versus the run-collapsing whitespace rung is already covered in the same
section.
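
The offset shift described above can be shown with plain string handling. This is a minimal sketch under stated assumptions: `foldWithMap` is a hypothetical stand-in for the dash fold, not the PR's implementation, and U+10EAD (YEZIDI HYPHENATION MARK) is used as the sample supplementary-plane dash; whether the PR's fold covers that code point is an assumption.

```java
import java.util.Arrays;

class SupplementaryDashOffsets {

  /**
   * Replaces every occurrence of dashCodePoint with '-' and appends the result to out.
   * Returns, for each UTF-16 unit of the output, the index of the source unit it came from.
   */
  static int[] foldWithMap(String source, int dashCodePoint, StringBuilder out) {
    int[] map = new int[source.length()]; // the output is never longer than the source
    int i = 0;
    while (i < source.length()) {
      int cp = source.codePointAt(i);
      int width = Character.charCount(cp); // 2 for supplementary code points
      if (cp == dashCodePoint) {
        map[out.length()] = i;             // one output unit stands for the whole dash
        out.append('-');
      } else {
        for (int k = 0; k < width; k++) {
          map[out.length()] = i + k;
          out.append(source.charAt(i + k));
        }
      }
      i += width;
    }
    return Arrays.copyOf(map, out.length());
  }

  public static void main(String[] args) {
    String source = new StringBuilder("A").appendCodePoint(0x10EAD).append("B").toString();
    StringBuilder folded = new StringBuilder();
    int[] toSource = foldWithMap(source, 0x10EAD, folded);

    int inFolded = folded.indexOf("B");
    System.out.println(source.length() + " -> " + folded.length());  // 4 -> 3
    System.out.println(inFolded + " maps to " + toSource[inFolded]); // 2 maps to 3
  }
}
```

An offset found in the folded text (2) is off by one against the original (3), which is the gap a mapping method such as findInOriginal has to close.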
Scope the "never relies on Character.isWhitespace" statement to the normalization
engine rather than the whole library. Note that getInstance() gives the default
shared instance and that case and accent folding also offer configured forms.
Refer to the conformance file by its full name WordBreakTest.txt.
…enizer manual

The Word Tokenizer section said it drops punctuation and keeps emoji without
noting that emoji means any Extended_Pictographic code point, so symbol-like
characters such as the copyright, trademark, and double-exclamation signs are
kept. Match the WordTokenizer class javadoc.
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from def4b08 to d7838a5 Compare June 22, 2026 05:49