Skip to content

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105

Draft
krickert wants to merge 7 commits into
OPENNLP-1850-2-tokenizerfrom
OPENNLP-1850-3-dl
Draft

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105
krickert wants to merge 7 commits into
OPENNLP-1850-2-tokenizerfrom
OPENNLP-1850-3-dl

Conversation

@krickert

Copy link
Copy Markdown
Contributor

Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.

opennlp-dl compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash); InferenceOptions opt-ins; AbstractDL applies them offset-safely so NameFinderDL/DocumentCategorizerDL decode spans back to the original text.

Note: this only depends on the foundation (#1103), not the tokenizer — once #1103 merges, this PR can be re-targeted to main and merged independently of #2.

@krickert

Copy link
Copy Markdown
Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.

Changes:

  • Add Unicode-aware whitespace chunking in AbstractDL and use it in NameFinderDL / DocumentCategorizerDL instead of text.split("\\s+").
  • Replace regex-based span localization in NameFinderDL with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
  • Introduce opt-in InferenceOptions toggles for whitespace/dash folding and document the behavior; add targeted regression tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java Adds opt-in flags to normalize whitespace and dashes before inference.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java Applies optional normalization and switches chunking to Unicode whitespace.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher.
opennlp-core/opennlp-ml/opennlp-dl/README.md Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options.
opennlp-core/opennlp-ml/opennlp-dl/pom.xml Makes opennlp-runtime a compile dependency to use CharClass at runtime.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a Compare June 20, 2026 20:16
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16
@rzo1

rzo1 commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Thx for the PR. Here are some suggestions:

  • opennlp-dl now compile-depends on all of opennlp-runtime (the pom promotes it from test to compile) solely to reach CharClass. Before this, the module compiled against opennlp-api plus onnxruntime only; it now transitively bundles the whole tools runtime.
  • @deprecated on NameFinderDL.find(String[]) is too broad. It's the canonical TokenNameFinder entry point used by every existing caller, but it only returns wrong offsets when normalizeDashes=true and the input contains a supplementary-plane dash (whitespace folding is length-preserving, so find() stays correct there). As written, every caller gets a deprecation warning for an edge case that can't affect them. Suggest dropping @deprecated, documenting the caveat in javadoc, and delegating the mapping (or logging once) only when a length-changing fold is active.
  • findInOriginal is only on the concrete type, not on TokenNameFinder, so interface-typed callers (UIMA, eval harness) can't reach the offset-correct path while the interface path is deprecated. Consider exposing an offset-mapped variant on the interface, or document the asymmetry.
  • Missing end-to-end test. The full normalize to decode to map-back path with a length-changing fold active is only covered piecemeal. Add one test asserting a span from findInOriginal covers the correct original text when normalizeDashes is on with a non-BMP dash. Also, NameFinderDLEval still calls the now-deprecated find(...).
  • Nit: README and javadoc say dash folding is "offset preserving for BMP". Worth stating the stronger guarantee that findInOriginal stays correct even for non-BMP dashes via the offset map.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 67c922a to 81aa6c5 Compare June 21, 2026 19:00
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 5154da4 to c51f37d Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 81aa6c5 to 36de08f Compare June 21, 2026 19:21
@krickert

Copy link
Copy Markdown
Contributor Author

opennlp-dl now compile-depends on all of opennlp-runtime.
Fixed at the root. The minimal normalization engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moved to opennlp-api, which already ships small concrete implementations (e.g. BertTokenizer, WhitespaceTokenizer). opennlp-dl now compiles against opennlp-api plus onnxruntime only; opennlp-runtime is back to a test-only dependency.

@deprecated on find(String[]) is too broad.
Fixed. The @Deprecated is removed. find's javadoc instead documents the single case it diverges: whitespace folding is length-preserving, so find is only affected when normalizeDashes is enabled and a non-BMP dash is present, in which case findInOriginal gives the exact original mapping. Every other caller is correct and no longer sees a warning. (This also resolves NameFinderDLEval calling a deprecated method.)

findInOriginal is only on the concrete type, not on TokenNameFinder.
Added it as a capability interface: OffsetMappingNameFinder extends TokenNameFinder in opennlp-api declares findInOriginal, and NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) reaches the offset-correct path with finder instanceof OffsetMappingNameFinder — a plain type check, no reflection. I went with a separate interface rather than a default method on TokenNameFinder for a concrete reason: TokenNameFinder.find is documented to return token spans (indices into the token array, as NameFinderME does), while findInOriginal returns character offsets in the original input, so a default findInOriginal(t) { return find(t); } would hand back token indices where the method promises character offsets. The only mismatch-free way to put it directly on TokenNameFinder is a default that throws UnsupportedOperationException, which callers have to guard against anyway, so it buys nothing over the capability check. (For full honesty, NameFinderDL.find itself already returns character offsets rather than token spans, so the DL coordinate story is a pre-existing deviation from the interface contract.) It's your interface, so if you'd still rather it live on TokenNameFinder in some form, say the word and I'll switch it.

Missing end-to-end test.
Added a real one. NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the actual dslim/bert-base-NER ONNX model with normalizeDashes enabled and a non-BMP dash (Yezidi hyphen, U+10EAD) before the entity, and asserts the findInOriginal span covers "George Washington" at its true original offset rather than the one-unit-shorter normalized offset. It exercises the full normalize/tokenize/infer/decode/map-back path with no production test seam, and it passes. The offset-mapping math is additionally unit-tested model-free in AbstractDLChunkingTest. This also moots the "NameFinderDLEval still calls deprecated find()" point, since find() is no longer deprecated.

Nit: README/javadoc say offset-preserving "for BMP".
Fixed. The README now states the stronger guarantee: whitespace folding never moves offsets, and findInOriginal maps spans back through the Alignment so they stay correct in the original input even for non-BMP dashes.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 36de08f to d76a28b Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from d76a28b to 40cd729 Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 001ac01 to 038e23d Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 40cd729 to 8c2451a Compare June 22, 2026 01:52
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 038e23d to bc401d3 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 8c2451a to 3f06095 Compare June 22, 2026 02:09
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10
@krickert

krickert commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Status since the last review. Public API unchanged: find/findInOriginal signatures, OffsetMappingNameFinder, opennlp-dl compiles against opennlp-api only.

  • Chunk-boundary fix: decode each chunk bounded to its own character region, then mergeOverlappingSpans keeps the longer span (probability tie-break). The earlier "duplicate spans" framing did not hold; the forward cursor already serializes output. The real bug was a fuller boundary entity being dropped, now fixed. whitespaceChunkSpans carries the char span; whitespaceChunks delegates to it.
  • normalizeInputAligned composes whitespace and dash via Alignment.andThen, dropping the "whitespace is length-preserving" assumption. InferenceOptions javadoc documents 1:1 whitespace vs the runtime collapse-and-trim fold, and the supplementary-dash offset shift to findInOriginal().
  • Removed dead public I_PER/B_PER. mergeOverlappingSpans documented as length-dominant, not type-aware.
  • New tests: merge unit cases (containment, partial, tie, disjoint), two-chunk assembly, whitespaceChunkSpans offsets, composed whitespace+dash mapping, and two real-model NameFinderDLEval tests (boundary entity survives in full; overlap entity reported once). NameFinderDLEval is 13/13 on dslim/bert-base-NER.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment on lines +43 to 45
import opennlp.tools.namefind.OffsetMappingNameFinder;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.sentdetect.SentenceDetector;
Comment on lines +21 to +33
/**
* A {@link TokenNameFinder} that can additionally report detected spans in the character coordinates
* of the original input, mapping back through any text normalization applied before detection.
*
* <p>An implementation that normalizes input before detection (for example an ONNX model that folds
* Unicode whitespace or dashes) returns spans from {@link #find(String[])} in the coordinates of the
* normalized text, which no longer line up with the caller's input when a fold changes the length.
* {@link #findInOriginal(String[])} maps those spans back to original-input coordinates. This is a
* separate capability interface rather than a method on {@link TokenNameFinder} because the classic
* contract reports token-index spans, for which an original-character mapping is not meaningful; an
* interface-typed caller tests for the capability ({@code finder instanceof OffsetMappingNameFinder})
* instead of depending on a concrete implementation.</p>
*/
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 3f06095 to 0ec5a36 Compare June 22, 2026 03:31
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 4c12897 to 2006d1d Compare June 22, 2026 03:31
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 0ec5a36 to b150056 Compare June 22, 2026 03:51
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 2006d1d to fdd329f Compare June 22, 2026 03:51
krickert added 5 commits June 21, 2026 23:57
… components

Chunk on Unicode whitespace (not text.split), localize decoded entity spans with a
cursor-based matcher that treats span spaces as flexible Unicode whitespace, and add opt-in
InferenceOptions toggles to fold whitespace and dashes before inference.

Entity spans stay offset-safe under those folds: NameFinderDL.findInOriginal maps each decoded
span back to original-input coordinates through the normalization Alignment
(Alignment.toOriginalSpan), so a length-changing fold (a supplementary dash shrinking) does not
shift reported offsets. find() is the classic TokenNameFinder entry point and stays correct
except when normalizeDashes is on and a non-BMP dash is present, which its javadoc documents.

opennlp-dl compiles against opennlp-api plus onnxruntime only; CharClass lives in opennlp-api,
so opennlp-runtime is a test-only dependency.
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
…ignment

Decode each whitespace chunk bounded to its own character region instead of
threading one forward cursor across a sentence's overlapping chunks. The
forward cursor kept whichever decode it reached first and silently dropped the
other, so a boundary entity that one chunk truncates and the overlapping chunk
covers in full could lose its fuller span. Now both chunks emit a candidate and
mergeOverlappingSpans reconciles overlaps by keeping the longer span (the more
complete decode), breaking ties toward the higher probability; adjacent but
disjoint spans and repeated surface forms at distinct offsets are preserved.

whitespaceChunkSpans carries each chunk's character span (whitespaceChunks now
delegates to it, single-sourcing the chunking), so a chunk's decoded spans can
be located within the region it covers.

Compose the optional whitespace and dash folds in normalizeInputAligned through
Alignment.andThen rather than relying on whitespace folding staying
length-preserving, so findInOriginal() stays correct regardless of the
whitespace fold's behavior. Document on InferenceOptions that whitespace folding
is a one-for-one replacement (not the runtime collapse-and-trim fold) and that a
supplementary-plane dash shifts find() offsets, with findInOriginal() as the
exact-offset path.

Tests: mergeOverlappingSpans containment, partial overlap, probability tie, and
disjoint cases; a two-overlapping-chunk assembly that keeps the fuller boundary
span; whitespaceChunkSpans character offsets; and a composed whitespace+dash
aligned mapping back to the original.
…l constants

Add two NameFinderDLEval tests that exercise the chunk-boundary span resolution
end to end with the real ONNX model: one where "United States" straddles a chunk
boundary and must survive in full (not a truncated copy), and one where it sits
inside the chunk overlap and is decoded by both chunks but must be reported once.
Both also assert the output has no overlapping spans.

Remove the unused public I_PER and B_PER label constants (decoding handles any
B-/I- type, and they had no callers). Document that mergeOverlappingSpans is
length-dominant rather than type-aware.
DocumentCategorizerDL.categorize now rejects a tokenless document (empty or only
whitespace, including Unicode whitespace) with IllegalArgumentException, instead
of building an empty score list and letting the scoring strategy fail with an
IndexOutOfBoundsException. infer() reports a null model output as "null" rather
than dereferencing it for getClass(). Both DocumentCategorizerDL and NameFinderDL
constructors now reject a null InferenceOptions explicitly, alongside the other
collaborators, for a clear early failure.

NameFinderDL.buildSpanText skips RoBERTa <s>/</s> markers as well as BERT
[CLS]/[SEP], so a special token can never leak into a reconstructed span when a
RoBERTa-style tokenizer is used.

Tests: tokenless content rejection (empty, ASCII whitespace, no-break space);
null InferenceOptions rejection at construction; and buildSpanText skipping the
RoBERTa special tokens.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from fdd329f to 47a39bf Compare June 22, 2026 03:59
krickert added 2 commits June 22, 2026 01:33
…tions

NameFinderDL.find and findInOriginal now reject a null token array up front via
locate(), instead of failing deeper inside String.join. Add an eval test that
both entry points throw on null.

The disabled categorizeWithGpu eval test built the categorizer with a fresh
InferenceOptions instead of the one it had configured with the GPU flags, so it
would not have exercised the GPU path if enabled; pass the configured options.
DocumentCategorizerDL.categorize now rejects a model whose output length does not
match the configured category count, instead of letting getBestCategory or
scoreMap index past the score array. softmax rejects a NaN logit rather than
silently returning an all-NaN distribution. Add a softmax NaN test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants