OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105
OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105krickert wants to merge 7 commits into
Conversation
|
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101. |
There was a problem hiding this comment.
Pull request overview
This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.
Changes:
- Add Unicode-aware whitespace chunking in
AbstractDLand use it inNameFinderDL/DocumentCategorizerDLinstead oftext.split("\\s+"). - Replace regex-based span localization in
NameFinderDLwith a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively. - Introduce opt-in
InferenceOptionstoggles for whitespace/dash folding and document the behavior; add targeted regression tests.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java | Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java | Adds opt-in flags to normalize whitespace and dashes before inference. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java | Applies optional normalization and switches chunking to Unicode whitespace. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java | New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior. |
| opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java | Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher. |
| opennlp-core/opennlp-ml/opennlp-dl/README.md | Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options. |
| opennlp-core/opennlp-ml/opennlp-dl/pom.xml | Makes opennlp-runtime a compile dependency to use CharClass at runtime. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
dab5605 to
67c922a
Compare
1c17110 to
8534bb3
Compare
|
Thx for the PR. Here are some suggestions:
|
67c922a to
81aa6c5
Compare
5154da4 to
c51f37d
Compare
81aa6c5 to
36de08f
Compare
|
opennlp-dl now compile-depends on all of opennlp-runtime. @deprecated on find(String[]) is too broad. findInOriginal is only on the concrete type, not on TokenNameFinder. Missing end-to-end test. Nit: README/javadoc say offset-preserving "for BMP". |
36de08f to
d76a28b
Compare
40698dc to
001ac01
Compare
d76a28b to
40cd729
Compare
001ac01 to
038e23d
Compare
40cd729 to
8c2451a
Compare
…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
038e23d to
bc401d3
Compare
8c2451a to
3f06095
Compare
…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
bc401d3 to
4c12897
Compare
|
Status since the last review. Public API unchanged:
|
| import opennlp.tools.namefind.OffsetMappingNameFinder; | ||
| import opennlp.tools.namefind.TokenNameFinder; | ||
| import opennlp.tools.sentdetect.SentenceDetector; |
| /** | ||
| * A {@link TokenNameFinder} that can additionally report detected spans in the character coordinates | ||
| * of the original input, mapping back through any text normalization applied before detection. | ||
| * | ||
| * <p>An implementation that normalizes input before detection (for example an ONNX model that folds | ||
| * Unicode whitespace or dashes) returns spans from {@link #find(String[])} in the coordinates of the | ||
| * normalized text, which no longer line up with the caller's input when a fold changes the length. | ||
| * {@link #findInOriginal(String[])} maps those spans back to original-input coordinates. This is a | ||
| * separate capability interface rather than a method on {@link TokenNameFinder} because the classic | ||
| * contract reports token-index spans, for which an original-character mapping is not meaningful; an | ||
| * interface-typed caller tests for the capability ({@code finder instanceof OffsetMappingNameFinder}) | ||
| * instead of depending on a concrete implementation.</p> | ||
| */ |
3f06095 to
0ec5a36
Compare
…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
4c12897 to
2006d1d
Compare
…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
0ec5a36 to
b150056
Compare
2006d1d to
fdd329f
Compare
… components Chunk on Unicode whitespace (not text.split), localize decoded entity spans with a cursor-based matcher that treats span spaces as flexible Unicode whitespace, and add opt-in InferenceOptions toggles to fold whitespace and dashes before inference. Entity spans stay offset-safe under those folds: NameFinderDL.findInOriginal maps each decoded span back to original-input coordinates through the normalization Alignment (Alignment.toOriginalSpan), so a length-changing fold (a supplementary dash shrinking) does not shift reported offsets. find() is the classic TokenNameFinder entry point and stays correct except when normalizeDashes is on and a non-BMP dash is present, which its javadoc documents. opennlp-dl compiles against opennlp-api plus onnxruntime only; CharClass lives in opennlp-api, so opennlp-runtime is a test-only dependency.
…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
…ignment Decode each whitespace chunk bounded to its own character region instead of threading one forward cursor across a sentence's overlapping chunks. The forward cursor kept whichever decode it reached first and silently dropped the other, so a boundary entity that one chunk truncates and the overlapping chunk covers in full could lose its fuller span. Now both chunks emit a candidate and mergeOverlappingSpans reconciles overlaps by keeping the longer span (the more complete decode), breaking ties toward the higher probability; adjacent but disjoint spans and repeated surface forms at distinct offsets are preserved. whitespaceChunkSpans carries each chunk's character span (whitespaceChunks now delegates to it, single-sourcing the chunking), so a chunk's decoded spans can be located within the region it covers. Compose the optional whitespace and dash folds in normalizeInputAligned through Alignment.andThen rather than relying on whitespace folding staying length-preserving, so findInOriginal() stays correct regardless of the whitespace fold's behavior. Document on InferenceOptions that whitespace folding is a one-for-one replacement (not the runtime collapse-and-trim fold) and that a supplementary-plane dash shifts find() offsets, with findInOriginal() as the exact-offset path. Tests: mergeOverlappingSpans containment, partial overlap, probability tie, and disjoint cases; a two-overlapping-chunk assembly that keeps the fuller boundary span; whitespaceChunkSpans character offsets; and a composed whitespace+dash aligned mapping back to the original.
…l constants Add two NameFinderDLEval tests that exercise the chunk-boundary span resolution end to end with the real ONNX model: one where "United States" straddles a chunk boundary and must survive in full (not a truncated copy), and one where it sits inside the chunk overlap and is decoded by both chunks but must be reported once. Both also assert the output has no overlapping spans. Remove the unused public I_PER and B_PER label constants (decoding handles any B-/I- type, and they had no callers). Document that mergeOverlappingSpans is length-dominant rather than type-aware.
DocumentCategorizerDL.categorize now rejects a tokenless document (empty or only whitespace, including Unicode whitespace) with IllegalArgumentException, instead of building an empty score list and letting the scoring strategy fail with an IndexOutOfBoundsException. infer() reports a null model output as "null" rather than dereferencing it for getClass(). Both DocumentCategorizerDL and NameFinderDL constructors now reject a null InferenceOptions explicitly, alongside the other collaborators, for a clear early failure. NameFinderDL.buildSpanText skips RoBERTa <s>/</s> markers as well as BERT [CLS]/[SEP], so a special token can never leak into a reconstructed span when a RoBERTa-style tokenizer is used. Tests: tokenless content rejection (empty, ASCII whitespace, no-break space); null InferenceOptions rejection at construction; and buildSpanText skipping the RoBERTa special tokens.
fdd329f to
47a39bf
Compare
…tions NameFinderDL.find and findInOriginal now reject a null token array up front via locate(), instead of failing deeper inside String.join. Add an eval test that both entry points throw on null. The disabled categorizeWithGpu eval test built the categorizer with a fresh InferenceOptions instead of the one it had configured with the GPU flags, so it would not have exercised the GPU path if enabled; pass the configured options.
DocumentCategorizerDL.categorize now rejects a model whose output length does not match the configured category count, instead of letting getBestCategory or scoreMap index past the score array. softmax rejects a NaN logit rather than silently returning an all-NaN distribution. Add a softmax NaN test.
Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.
opennlp-dl compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash); InferenceOptions opt-ins; AbstractDL applies them offset-safely so NameFinderDL/DocumentCategorizerDL decode spans back to the original text.
Note: this only depends on the foundation (#1103), not the tokenizer — once #1103 merges, this PR can be re-targeted to main and merged independently of #2.