OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106
krickert wants to merge 7 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
Pull request overview
Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.
Changes:
- Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
- Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
- Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |
Thx for the PR. Here are some suggestions:
Otherwise the content is accurate.
Status since the last review. Normalizer chapter gains an "Offset-aware pipelines" section for
<emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
Unicode Character Database (for example the <code>White_Space</code> and
<code>Dash</code> properties), not from the JVM's locale-dependent or quirky
character predicates. The library never relies on
<code>Character.isWhitespace</code>, which disagrees with the Unicode standard.
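
The disagreement this paragraph refers to can be checked from plain Java. The following is a minimal sketch, assuming the JDK regex binary property `\p{IsWhite_Space}` as a stand-in for the UCD `White_Space` set; it is not the normalization engine's own code, and the class name `WhitespaceCheck` is hypothetical.

```java
import java.util.regex.Pattern;

public class WhitespaceCheck {
    // Assumption: the JDK regex White_Space binary property stands in for the UCD set.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    /** True if the code point matches the JDK regex view of Unicode White_Space. */
    static boolean isUnicodeWhiteSpace(int codePoint) {
        return WHITE_SPACE.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, NEXT LINE, and the FILE SEPARATOR control character.
        int[] samples = {0x0020, 0x00A0, 0x0085, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X  Character.isWhitespace=%b  White_Space=%b%n",
                cp, Character.isWhitespace(cp), isUnicodeWhiteSpace(cp));
        }
    }
}
```

U+00A0 and U+0085 carry `White_Space` but are rejected by `Character.isWhitespace`, while U+001C is accepted by `Character.isWhitespace` although it does not carry the property.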
Each normalizer implements the existing
<code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
(<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
obtained through <code>getInstance()</code>. They can therefore be combined with the
existing <code>AggregateCharSequenceNormalizer</code>, or with the
<code>TextNormalizer</code> builder described below.
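
To illustrate why stateless singletons compose cleanly, here is a self-contained sketch of the pattern. Everything in it is a local stand-in: the `Normalizer` interface only mirrors the `CharSequence normalize(CharSequence)` shape quoted above, and `DashFold`, `SpaceCollapse`, and `Aggregate` are hypothetical names, not classes from the PR or from OpenNLP.

```java
import java.util.List;

public class NormalizerChainSketch {

    /** Local stand-in with the same method shape as CharSequenceNormalizer. */
    interface Normalizer {
        CharSequence normalize(CharSequence text);
    }

    /** Hypothetical stateless rung: folds en and em dashes to '-'. */
    static final class DashFold implements Normalizer {
        private static final DashFold INSTANCE = new DashFold();
        private DashFold() {}
        static DashFold getInstance() { return INSTANCE; }
        @Override public CharSequence normalize(CharSequence text) {
            return text.toString().replace('\u2013', '-').replace('\u2014', '-');
        }
    }

    /** Hypothetical stateless rung: collapses runs of SPACE and NO-BREAK SPACE. */
    static final class SpaceCollapse implements Normalizer {
        private static final SpaceCollapse INSTANCE = new SpaceCollapse();
        private SpaceCollapse() {}
        static SpaceCollapse getInstance() { return INSTANCE; }
        @Override public CharSequence normalize(CharSequence text) {
            return text.toString().replaceAll("[ \\u00A0]+", " ");
        }
    }

    /** Applies each step in order, feeding the output of one into the next. */
    static final class Aggregate implements Normalizer {
        private final List<Normalizer> steps;
        Aggregate(Normalizer... steps) { this.steps = List.of(steps); }
        @Override public CharSequence normalize(CharSequence text) {
            CharSequence current = text;
            for (Normalizer step : steps) {
                current = step.normalize(current);
            }
            return current;
        }
    }

    static String run(String input) {
        Normalizer chain = new Aggregate(DashFold.getInstance(), SpaceCollapse.getInstance());
        return chain.normalize(input).toString();
    }

    public static void main(String[] args) {
        // En dash, then two spaces and a no-break space before "report".
        System.out.println(run("2019\u20132024  \u00A0report")); // prints "2019-2024 report"
    }
}
```

Because no step holds per-call state, the same singleton instances can be shared across chains and threads.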
…nd DL handling Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term model, and the Aligned offset variants that return an AlignedText carrying an Alignment), extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components' Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal. All embedded ONNX snippets are self-contained and compile.
…ligned) Add an "Offset-aware pipelines" section to the normalizer chapter covering TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability interface, mapping a match back to the source with AlignedText/Alignment, and the fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new line-break-preserving whitespace rung in the normalizer family table.
… the manual Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder capability interface, detectable with a plain instanceof check, so the name-finder chapter matches how the normalizer chapter presents OffsetAwareNormalizer.
…gits, ellipsis, bullets, umlaut)
…old options Note in the normalizer manual that, with dash folding enabled, a dash in the supplementary planes shrinks from two UTF-16 units to one and shifts later offsets, so find reports offsets into the normalized text in that case while findInOriginal maps them back to the original input. The one-for-one whitespace fold versus the run-collapsing whitespace rung is already covered in the same section.
Scope the "never relies on Character.isWhitespace" statement to the normalization engine rather than the whole library. Note that getInstance() gives the default shared instance and that case and accent folding also offer configured forms. Refer to the conformance file by its full name WordBreakTest.txt.
…enizer manual The Word Tokenizer section said it drops punctuation and keeps emoji without noting that emoji means any Extended_Pictographic code point, so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept. Match the WordTokenizer class javadoc.
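
The offset shift described in the dash-folding commit note above can be reproduced with plain Java strings. This is a sketch only: `foldToHyphen` is a hypothetical helper, not the library's folding code, and U+10EAD (YEZIDI HYPHENATION MARK) is assumed here as an example of a supplementary-plane code point with the `Dash` property.

```java
public class SupplementaryDashOffsets {

    /** Hypothetical fold: replaces one given code point with '-' (U+002D). */
    static String foldToHyphen(String text, int dashCodePoint) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(cp == dashCodePoint ? '-' : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        int dash = 0x10EAD; // assumed supplementary-plane dash, stored as a surrogate pair
        String original = "a" + new String(Character.toChars(dash)) + "b";
        String folded = foldToHyphen(original, dash);

        System.out.println(original.length());     // 4 UTF-16 units
        System.out.println(folded.length());       // 3 UTF-16 units
        System.out.println(original.indexOf('b')); // 3
        System.out.println(folded.indexOf('b'));   // 2
    }
}
```

Every offset after the folded dash moves down by one UTF-16 unit, which is why offsets into the normalized text need to be mapped back before they are applied to the original input.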
Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.