OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4-docs/7) by krickert · Pull Request #1106 · apache/opennlp

krickert · 2026-06-20T12:36:52Z

Part 4-docs (7 of 7) of the OPENNLP-1850 stack: Developer Manual coverage of Unicode normalization and the UAX #29 tokenizer — a new "Text Normalization" chapter plus tokenizer / doccat / namefinder / introduction updates and the master opennlp.xml.

Base: OPENNLP-1850-3-dl (#1105).

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer + layered Term model
#1105 (3-dl/7): Offset-safe input normalization in the DL components
#1106 (4-docs/7): Documentation of Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

File	Description
opennlp-docs/src/docbkx/tokenizer.xml	Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs.
opennlp-docs/src/docbkx/opennlp.xml	Includes the new normalizer chapter in the book build.
opennlp-docs/src/docbkx/normalizer.xml	New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data.
opennlp-docs/src/docbkx/namefinder.xml	Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options.
opennlp-docs/src/docbkx/introduction.xml	Links DL inference Unicode handling to the normalizer documentation.
opennlp-docs/src/docbkx/doccat.xml	Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates.

rzo1 · 2026-06-21T17:04:48Z

Thx for the PR. Here are some suggestions:

Declare xmlns:xlink explicitly on the chapter roots of normalizer.xml and tokenizer.xml. Both use xlink:href (normalizer.xml:457, tokenizer.xml:461) but rely on the DocBook 5.0 DTD's #FIXED default to bind the prefix. The Maven build resolves the DTD so it works, but it breaks under any non-validating namespace-aware tool (IDE linters, xmllint --nonet), and every other chapter declares it explicitly:
normalizer.xml: <chapter xml:id="tools.normalizer" xmlns:xlink="http://www.w3.org/1999/xlink">
tokenizer.xml: <chapter xml:id="tools.tokenizer" xmlns:xlink="http://www.w3.org/1999/xlink">
Note that tokenizer.xml did not use xlink before this PR, so the declaration has to be added there for the first time.
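
The failure mode described above can be reproduced outside the Maven build. The following is a minimal sketch using the JDK's built-in parser, not the project's tooling; the class name XlinkPrefixCheck and the two one-line sample documents are made up for illustration. Because the samples carry no DOCTYPE, no DTD default can bind the prefix, which approximates what a non-validating, namespace-aware tool sees.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class XlinkPrefixCheck {

    // Simplified stand-ins for a chapter root; not copied from the real files.
    static final String UNBOUND =
        "<chapter xml:id='tools.tokenizer'>"
        + "<link xlink:href='https://www.unicode.org/reports/tr29/'>UAX #29</link>"
        + "</chapter>";

    static final String BOUND =
        "<chapter xml:id='tools.tokenizer' xmlns:xlink='http://www.w3.org/1999/xlink'>"
        + "<link xlink:href='https://www.unicode.org/reports/tr29/'>UAX #29</link>"
        + "</chapter>";

    /** Returns true if the XML parses with a namespace-aware, non-validating parser. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // every prefix must be bound
            factory.setValidating(false);     // no DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow instead of printing parser errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;  // e.g. the xlink prefix is not bound
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without xmlns:xlink: " + parses(UNBOUND));
        System.out.println("with xmlns:xlink:    " + parses(BOUND));
    }
}
```

Only the document with the explicit declaration is expected to parse, which is the reason for declaring the namespace on the chapter root instead of relying on the DTD default.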

Otherwise the content is accurate.

krickert · 2026-06-22T02:58:20Z

Status since the last review. Normalizer chapter gains an "Offset-aware pipelines" section for buildAligned() and the capability interface with a worked dash-fold span example, the line-break-preserving rung in the fold table, and the supplementary-dash offset note in the DL fold options. Name-finder chapter names OffsetMappingNameFinder behind findInOriginal. docbkx HTML builds clean. Rebased onto the updated stack.

rzo1 · 2026-06-25T12:08:15Z

normalizer.xml and the modified tokenizer.xml use xlink:href without declaring xmlns:xlink on the chapter root, unlike the other xlink-using chapters (e.g. chunker.xml). It still builds (the DocBook DTD declares it #FIXED), but add xmlns:xlink for convention.
The second namefinder.xml ONNX snippet uses an empty ids2Labels map, which would throw IllegalStateException at runtime and contradicts the surrounding warning that the map must be exhaustive. Mirror the populated BIO map from the first snippet.

krickert · 2026-06-25T18:46:04Z

@rzo1 Both fixed (tip 121c2588).

xmlns:xlink declaration. Declared xmlns:xlink="http://www.w3.org/1999/xlink" on the normalizer.xml and tokenizer.xml chapter roots, matching the other xlink-using chapters — both use <link xlink:href> (the UTS #39 and UAX #29 references) and previously leaned on the DocBook DTD's #FIXED default, which non-validating namespace-aware tools don't honor.

Empty ids2Labels in the second snippet. Populated it with the same BIO map as the first snippet (O, B-PER/I-PER, B-ORG/I-ORG, B-LOC/I-LOC, B-MISC/I-MISC), so the findInOriginal example is runnable and consistent with the "map must be exhaustive" warning.
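
For readers without the diff open, the shape of the fix can be sketched in plain Java. This is an illustration only: the class BioLabelMap, the helper label(), and the index order 0..8 are assumptions, since the real index order comes from the model's own configuration and the actual lookup lives inside NameFinderDL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BioLabelMap {

    /** Builds an exhaustive BIO map. The index assignment is assumed, not taken from a model. */
    static Map<Integer, String> ids2Labels() {
        Map<Integer, String> map = new LinkedHashMap<>();
        map.put(0, "O");
        map.put(1, "B-PER");
        map.put(2, "I-PER");
        map.put(3, "B-ORG");
        map.put(4, "I-ORG");
        map.put(5, "B-LOC");
        map.put(6, "I-LOC");
        map.put(7, "B-MISC");
        map.put(8, "I-MISC");
        return map;
    }

    /** Hypothetical lookup that fails loudly on an unmapped index, as the review describes. */
    static String label(Map<Integer, String> ids2Labels, int predictedIndex) {
        String label = ids2Labels.get(predictedIndex);
        if (label == null) {
            throw new IllegalStateException("No label mapped for predicted index " + predictedIndex);
        }
        return label;
    }

    public static void main(String[] args) {
        System.out.println(label(ids2Labels(), 3));  // a mapped index resolves to its label
        try {
            label(new LinkedHashMap<>(), 3);          // an empty map cannot resolve anything
        } catch (IllegalStateException e) {
            System.out.println("empty map: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the exhaustiveness requirement: every index the model can predict needs an entry, so an empty map fails on the first prediction.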

…nd DL handling Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term model, and the Aligned offset variants that return an AlignedText carrying an Alignment), extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components' Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal. All embedded ONNX snippets are self-contained and compile.

…ligned) Add an "Offset-aware pipelines" section to the normalizer chapter covering TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability interface, mapping a match back to the source with AlignedText/Alignment, and the fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new line-break-preserving whitespace rung in the normalizer family table.

… the manual Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder capability interface, detectable with a plain instanceof check, so the name-finder chapter matches how the normalizer chapter presents OffsetAwareNormalizer.

…gits, ellipsis, bullets, umlaut)

…old options Note in the normalizer manual that, with dash folding enabled, a dash in the supplementary planes shrinks from two UTF-16 units to one and shifts later offsets, so find reports offsets into the normalized text in that case while findInOriginal maps them back to the original input. The one-for-one whitespace fold versus the run-collapsing whitespace rung is already covered in the same section.
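
The offset shift mentioned in this commit message can be illustrated with plain JDK string handling. This is a sketch under assumptions, not the OpenNLP implementation: the class SupplementaryDashFold and the method foldWithOffsets are hypothetical, and U+10EAD (YEZIDI HYPHENATION MARK) is chosen only as an example of a dash outside the BMP; whether the actual fold covers that code point is not stated here.

```java
import java.util.Arrays;

class SupplementaryDashFold {

    // Example dash-like code point in the supplementary planes (two UTF-16 units).
    static final int SUPPLEMENTARY_DASH = 0x10EAD;

    /**
     * Folds SUPPLEMENTARY_DASH to '-' while appending to out, and returns a map
     * from each offset in the folded text to the matching offset in the input.
     * The last entry maps the end of the output to the end of the input.
     */
    static int[] foldWithOffsets(String input, StringBuilder out) {
        int[] map = new int[input.length() + 1];
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            int start = out.length();
            if (cp == SUPPLEMENTARY_DASH) {
                out.append('-');           // two UTF-16 units become one
            } else {
                out.appendCodePoint(cp);   // copied unchanged
            }
            for (int k = start; k < out.length(); k++) {
                map[k] = i + (k - start);
            }
            i += Character.charCount(cp);
        }
        map[out.length()] = input.length();
        return Arrays.copyOf(map, out.length() + 1);
    }

    public static void main(String[] args) {
        String input = "a" + new String(Character.toChars(SUPPLEMENTARY_DASH)) + "b";
        StringBuilder folded = new StringBuilder();
        int[] map = foldWithOffsets(input, folded);
        System.out.println(input.length() + " -> " + folded.length());
        System.out.println("offset of 'b' in folded text: " + folded.indexOf("b"));
        System.out.println("mapped back to input:         " + map[folded.indexOf("b")]);
    }
}
```

In this sketch the character after the dash sits at offset 2 in the folded text but at offset 3 in the input, which is the gap that a mapping such as findInOriginal has to close.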

Scope the "never relies on Character.isWhitespace" statement to the normalization engine rather than the whole library. Note that getInstance() gives the default shared instance and that case and accent folding also offer configured forms. Refer to the conformance file by its full name WordBreakTest.txt.
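
As background for why a normalization engine might avoid Character.isWhitespace, the JDK method and the Unicode White_Space property disagree on several code points. The sketch below hard-codes the White_Space set from the Unicode Character Database; the class WhiteSpaceProperty and its method are hypothetical and are not the engine's CharClass implementation.

```java
class WhiteSpaceProperty {

    /** True for code points that carry the Unicode White_Space property. */
    static boolean isUnicodeWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)
            || cp == 0x0020
            || cp == 0x0085                      // NEXT LINE
            || cp == 0x00A0                      // NO-BREAK SPACE
            || cp == 0x1680
            || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028
            || cp == 0x2029
            || cp == 0x202F                      // NARROW NO-BREAK SPACE
            || cp == 0x205F
            || cp == 0x3000;
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x00A0, 0x0085, 0x202F, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X  White_Space=%b  Character.isWhitespace=%b%n",
                cp, isUnicodeWhiteSpace(cp), Character.isWhitespace(cp));
        }
    }
}
```

Character.isWhitespace excludes the no-break spaces and NEXT LINE, and it accepts the separator controls U+001C to U+001F, which do not have the White_Space property.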

…enizer manual The Word Tokenizer section said it drops punctuation and keeps emoji without noting that emoji means any Extended_Pictographic code point, so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept. Match the WordTokenizer class javadoc.

…d hyphenation Show a concrete, exhaustive ids2Labels BIO mapping in the ONNX name-finder example instead of an empty map (an unmapped predicted index raises IllegalStateException at runtime), and note the exhaustiveness requirement. Hyphenate 'rule-based' and split the comma splice in the UAX #29 tokenizer section.

…ds2Labels example Declare the xlink namespace on the normalizer and tokenizer chapter roots -- both use <link xlink:href=...> (UAX #29 and UTS #39 references) but did not bind the prefix. Populate the ids2Labels map in the second NameFinderDL example (the InferenceOptions/findInOriginal one), which previously left it empty so the example would have located nothing.

…(); drop BM25/search framing Update the Text Normalization chapter examples for the searchDefault()->defaultChain() and searchAnalyzer()->matchingAnalyzer() renames, and drop the 'BM25-style search' phrasing.

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Closed

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

OPENNLP-1850: Offset-safe input normalization in the DL components (3-dl/7) #1105

Open

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57

krickert requested review from mawiesne and rzo1 June 20, 2026 14:58

Copilot AI reviewed Jun 20, 2026

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-3-dl branch from 8534bb3 to 5154da4 June 21, 2026 19:00

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from d7d316f to 0ff5d07 June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-3-dl branch from 5154da4 to c51f37d June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 667e850 to d71e472 June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-4-docs branch from b65c0de to 0022bc1 June 22, 2026 00:19

krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 038e23d to bc401d3 June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-4-docs branch from 0022bc1 to 2fd9543 June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 June 22, 2026 02:10

krickert force-pushed the OPENNLP-1850-4-docs branch from 2fd9543 to 743a955 June 22, 2026 02:10

krickert requested a review from Copilot June 22, 2026 02:59

Copilot started reviewing on behalf of krickert June 22, 2026 02:59

krickert force-pushed the OPENNLP-1850-3-dl branch from e7f3c59 to b6dc241 June 23, 2026 15:17

krickert force-pushed the OPENNLP-1850-4-docs branch from c83701e to e0011e2 June 23, 2026 15:17

krickert force-pushed the OPENNLP-1850-3-dl branch from b6dc241 to be98bd6 June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-4-docs branch from e0011e2 to 4a48643 June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-3-dl branch from be98bd6 to 5a1114c June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-4-docs branch from 4a48643 to 0bb8b2d June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-3-dl branch from 5a1114c to 57ceefd June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-4-docs branch from 0bb8b2d to c0fb2eb June 25, 2026 08:26

krickert changed the title from "OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4)" to "OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4-docs/7)" Jun 25, 2026

krickert marked this pull request as ready for review June 25, 2026 11:27

krickert force-pushed the OPENNLP-1850-3-dl branch from 57ceefd to 2f4a5ab June 25, 2026 17:25

krickert force-pushed the OPENNLP-1850-4-docs branch from c0fb2eb to 58bbde6 June 25, 2026 17:25

krickert force-pushed the OPENNLP-1850-3-dl branch from 2f4a5ab to ed5c777 June 25, 2026 18:31

krickert force-pushed the OPENNLP-1850-4-docs branch from 58bbde6 to 121c258 June 25, 2026 18:32

rzo1 requested review from atarora and jzonthemtn June 26, 2026 13:08

krickert added 10 commits June 27, 2026 08:18

OPENNLP-1850 Document the offset-aware substitution folds (quotes, digits, ellipsis, bullets, umlaut)
4911806

OPENNLP-1850 Review nits: manual uses defaultChain()/matchingAnalyzer(); drop BM25/search framing
af56b24

krickert force-pushed the OPENNLP-1850-3-dl branch from ed5c777 to 105d8a1 June 27, 2026 12:29

krickert force-pushed the OPENNLP-1850-4-docs branch from 121c258 to af56b24 June 27, 2026 12:29

krickert wants to merge 10 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

3 participants
