OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4-docs/7) #1106

Open
krickert wants to merge 10 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

@krickert

@krickert krickert commented Jun 20, 2026

Contributor

Part 4-docs (7 of 7) of the OPENNLP-1850 stack: Developer Manual coverage of Unicode normalization and the UAX #29 tokenizer — a new "Text Normalization" chapter plus tokenizer / doccat / namefinder / introduction updates and the master opennlp.xml.

Base: OPENNLP-1850-3-dl (#1105).

@krickert

Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX #29 word tokenizer + layered Term model
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3-dl/7) #1105 — Offset-safe input normalization in the DL components
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4-docs/7) #1106 — Documentation

Supersedes #1101.

Copilot AI left a comment

Contributor

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

  • Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
  • Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
  • Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated
Comment thread opennlp-docs/src/docbkx/doccat.xml
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 Compare June 20, 2026 20:16
@rzo1

rzo1 commented Jun 21, 2026

Contributor

Thx for the PR. Here are some suggestions:

  • Declare xmlns:xlink explicitly on the chapter roots of normalizer.xml and tokenizer.xml. Both use xlink:href (normalizer.xml:457, tokenizer.xml:461) but rely on the DocBook 5.0 DTD's #FIXED default to bind the prefix. The Maven build resolves the DTD so it works, but it breaks under any non-validating namespace-aware tool (IDE linters, xmllint --nonet), and every other chapter declares it explicitly:
    normalizer.xml: <chapter xml:id="tools.normalizer" xmlns:xlink="http://www.w3.org/1999/xlink">
    tokenizer.xml: <chapter xml:id="tools.tokenizer" xmlns:xlink="http://www.w3.org/1999/xlink">
    Note that this PR introduces the first xlink usage in tokenizer.xml, so the declaration has to be added to that chapter root for the first time.

Otherwise the content is accurate.
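
To illustrate the point about non-validating, namespace-aware tools, here is a minimal sketch using only the JDK's built-in XML parser. It is not part of the OpenNLP build; the class name `XlinkPrefixCheck`, the trimmed-down chapter markup, and the `https://example.org/` link target are placeholders chosen for the example. With no DTD to supply a default, the parser should reject the chapter whose root does not bind the `xlink` prefix and accept the one that declares it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XlinkPrefixCheck {

    // Returns true if the XML is accepted by a namespace-aware,
    // non-validating parser (no DTD is available to bind prefixes).
    static boolean parses(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler keeps stderr quiet; fatal errors are still thrown.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder markup: the chapter root does not declare xmlns:xlink.
        String unbound = "<chapter xml:id=\"tools.tokenizer\">"
            + "<link xlink:href=\"https://example.org/\">ref</link></chapter>";
        // Same markup with the prefix declared on the chapter root.
        String bound = "<chapter xml:id=\"tools.tokenizer\""
            + " xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
            + "<link xlink:href=\"https://example.org/\">ref</link></chapter>";

        System.out.println("undeclared prefix parses: " + parses(unbound));
        System.out.println("declared prefix parses:   " + parses(bound));
    }
}
```

The explicit declaration makes the file self-describing, so it no longer depends on whether a given tool fetches and applies the DTD.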

@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 8534bb3 to 5154da4 Compare June 21, 2026 19:00
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from d7d316f to 0ff5d07 Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 5154da4 to c51f37d Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 667e850 to d71e472 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from b65c0de to 0022bc1 Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 038e23d to bc401d3 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 0022bc1 to 2fd9543 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 2fd9543 to 743a955 Compare June 22, 2026 02:10
@krickert

Contributor Author

Status since the last review. Normalizer chapter gains an "Offset-aware pipelines" section for buildAligned() and the capability interface with a worked dash-fold span example, the line-break-preserving rung in the fold table, and the supplementary-dash offset note in the DL fold options. Name-finder chapter names OffsetMappingNameFinder behind findInOriginal. docbkx HTML builds clean. Rebased onto the updated stack.

@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from e7f3c59 to b6dc241 Compare June 23, 2026 15:17
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from c83701e to e0011e2 Compare June 23, 2026 15:17
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from b6dc241 to be98bd6 Compare June 24, 2026 11:20
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from e0011e2 to 4a48643 Compare June 24, 2026 11:20
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from be98bd6 to 5a1114c Compare June 24, 2026 11:54
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 4a48643 to 0bb8b2d Compare June 24, 2026 11:54
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 5a1114c to 57ceefd Compare June 25, 2026 08:26
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 0bb8b2d to c0fb2eb Compare June 25, 2026 08:26
@krickert krickert changed the title OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4-docs/7) Jun 25, 2026
@krickert krickert marked this pull request as ready for review June 25, 2026 11:27
@rzo1

rzo1 commented Jun 25, 2026

Contributor
  • normalizer.xml and the modified tokenizer.xml use xlink:href without declaring xmlns:xlink on the chapter root, unlike the other xlink-using chapters (e.g. chunker.xml). It still builds (the DocBook DTD declares it #FIXED), but add xmlns:xlink for convention.
  • The second namefinder.xml ONNX snippet uses an empty ids2Labels map, which would throw IllegalStateException at runtime and contradicts the surrounding warning that the map must be exhaustive. Mirror the populated BIO map from the first snippet.

@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 57ceefd to 2f4a5ab Compare June 25, 2026 17:25
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from c0fb2eb to 58bbde6 Compare June 25, 2026 17:25
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 2f4a5ab to ed5c777 Compare June 25, 2026 18:31
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 58bbde6 to 121c258 Compare June 25, 2026 18:32
@krickert

Contributor Author

@rzo1 Both fixed (tip 121c2588).

xmlns:xlink declaration. Declared xmlns:xlink="http://www.w3.org/1999/xlink" on the normalizer.xml and tokenizer.xml chapter roots, matching the other xlink-using chapters — both use <link xlink:href> (the UTS #39 and UAX #29 references) and previously leaned on the DocBook DTD's #FIXED default, which non-validating namespace-aware tools don't honor.

Empty ids2Labels in the second snippet. Populated it with the same BIO map as the first snippet (O, B-PER/I-PER, B-ORG/I-ORG, B-LOC/I-LOC, B-MISC/I-MISC), so the findInOriginal example is runnable and consistent with the "map must be exhaustive" warning.
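
For readers following along, a rough sketch of why the map needs to be exhaustive. This is plain Java and not the NameFinderDL code: the class `Ids2LabelsSketch`, the helper `labelFor`, and the index-to-label order are all assumptions made for illustration (a real model defines its own index assignment). Only the label set and the IllegalStateException behaviour come from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;

public class Ids2LabelsSketch {

    // Assumed index order for illustration; check the model's own
    // configuration for the real id-to-label assignment.
    static Map<Integer, String> bioLabels() {
        Map<Integer, String> ids2Labels = new HashMap<>();
        ids2Labels.put(0, "O");
        ids2Labels.put(1, "B-PER");
        ids2Labels.put(2, "I-PER");
        ids2Labels.put(3, "B-ORG");
        ids2Labels.put(4, "I-ORG");
        ids2Labels.put(5, "B-LOC");
        ids2Labels.put(6, "I-LOC");
        ids2Labels.put(7, "B-MISC");
        ids2Labels.put(8, "I-MISC");
        return ids2Labels;
    }

    // Hypothetical lookup that fails loudly on an unmapped index,
    // mirroring the behaviour described in the review comment.
    static String labelFor(Map<Integer, String> ids2Labels, int predictedIndex) {
        String label = ids2Labels.get(predictedIndex);
        if (label == null) {
            throw new IllegalStateException(
                "No label mapped for predicted index " + predictedIndex);
        }
        return label;
    }

    public static void main(String[] args) {
        System.out.println(labelFor(bioLabels(), 3));
        try {
            // An empty map cannot resolve any predicted index.
            labelFor(new HashMap<>(), 3);
        } catch (IllegalStateException e) {
            System.out.println("empty map: " + e.getMessage());
        }
    }
}
```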

@rzo1 rzo1 requested review from atarora and jzonthemtn June 26, 2026 13:08
krickert added 10 commits June 27, 2026 08:18
…nd DL handling

Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term
model, and the Aligned offset variants that return an AlignedText carrying an Alignment),
extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components'
Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal.
All embedded ONNX snippets are self-contained and compile.
…ligned)

Add an "Offset-aware pipelines" section to the normalizer chapter covering
TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability
interface, mapping a match back to the source with AlignedText/Alignment, and the
fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new
line-break-preserving whitespace rung in the normalizer family table.
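
As a rough illustration of what mapping a match back to the source involves, the sketch below collapses whitespace runs while recording, for each normalized character, the offset it came from. It is a standalone approximation, not the AlignedText/Alignment API from this stack: the class `AlignmentSketch`, its nested `Aligned` holder, and both methods are hypothetical names, and the one-offset-per-character array is an assumed representation.

```java
import java.util.Arrays;

public class AlignmentSketch {

    // Hypothetical holder: normalized text plus, for each normalized
    // index, the offset of the original character it came from. The
    // final entry maps the end of the text to the original length.
    static final class Aligned {
        final String text;
        final int[] toOriginal;

        Aligned(String text, int[] toOriginal) {
            this.text = text;
            this.toOriginal = toOriginal;
        }
    }

    // Collapses each run of space/tab/newline to a single space and
    // records where every surviving character started in the input.
    static Aligned collapseWhitespace(String original) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[original.length() + 1];
        boolean inRun = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            boolean ws = c == ' ' || c == '\t' || c == '\n';
            if (ws && inRun) {
                continue; // later characters of a run are dropped
            }
            map[out.length()] = i;
            out.append(ws ? ' ' : c);
            inRun = ws;
        }
        map[out.length()] = original.length();
        return new Aligned(out.toString(), Arrays.copyOf(map, out.length() + 1));
    }

    // Maps a half-open span [start, end) in the normalized text back
    // to a half-open span in the original text.
    static int[] originalSpan(Aligned aligned, int start, int end) {
        return new int[] { aligned.toOriginal[start], aligned.toOriginal[end] };
    }

    public static void main(String[] args) {
        String original = "New   York  City";
        Aligned aligned = collapseWhitespace(original);
        int start = aligned.text.indexOf("York");
        int[] span = originalSpan(aligned, start, start + "York".length());
        System.out.println(original.substring(span[0], span[1]));
    }
}
```

A transformation that cannot say which input characters produced which output characters has nothing to put in such an array, which is presumably why rungs like NFC/NFKC are rejected rather than silently producing wrong offsets.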
… the manual

Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder
capability interface, detectable with a plain instanceof check, so the name-finder
chapter matches how the normalizer chapter presents OffsetAwareNormalizer.
…old options

Note in the normalizer manual that, with dash folding enabled, a dash in the
supplementary planes shrinks from two UTF-16 units to one and shifts later
offsets, so find reports offsets into the normalized text in that case while
findInOriginal maps them back to the original input. The one-for-one whitespace
fold versus the run-collapsing whitespace rung is already covered in the same
section.
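
The offset shift can be reproduced with the JDK alone. The sketch below is not the DL components' fold; `SupplementaryDashOffsets` and `foldDash` are hypothetical names, and the choice of U+10EAD (YEZIDI HYPHENATION MARK, a dash-punctuation character outside the BMP) is just one convenient example. Replacing it with an ASCII hyphen shortens the string by one UTF-16 unit, so every later offset moves left by one.

```java
public class SupplementaryDashOffsets {

    // Hypothetical fold: replaces one specific code point with '-'.
    static String foldDash(String text, int dashCodePoint) {
        StringBuilder out = new StringBuilder();
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(cp == dashCodePoint ? '-' : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        int supplementaryDash = 0x10EAD; // encoded as a surrogate pair in UTF-16
        String original = "a" + new String(Character.toChars(supplementaryDash)) + "b";
        String folded = foldDash(original, supplementaryDash);

        // The dash occupies two UTF-16 units before folding and one after.
        System.out.println("original length: " + original.length());
        System.out.println("folded length:   " + folded.length());
        // The offset of the trailing 'b' differs between the two strings.
        System.out.println("'b' in original: " + original.indexOf('b'));
        System.out.println("'b' in folded:   " + folded.indexOf('b'));
    }
}
```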
Scope the "never relies on Character.isWhitespace" statement to the normalization
engine rather than the whole library. Note that getInstance() gives the default
shared instance and that case and accent folding also offer configured forms.
Refer to the conformance file by its full name WordBreakTest.txt.
…enizer manual

The Word Tokenizer section said it drops punctuation and keeps emoji without
noting that emoji means any Extended_Pictographic code point, so symbol-like
characters such as the copyright, trademark, and double-exclamation signs are
kept. Match the WordTokenizer class javadoc.
…d hyphenation

Show a concrete, exhaustive ids2Labels BIO mapping in the ONNX name-finder example instead of
an empty map (an unmapped predicted index raises IllegalStateException at runtime), and note the
exhaustiveness requirement. Hyphenate 'rule-based' and split the comma splice in the UAX #29
tokenizer section.
…ds2Labels example

Declare the xlink namespace on the normalizer and tokenizer chapter roots -- both use
<link xlink:href=...> (UAX #29 and UTS #39 references) but did not bind the prefix. Populate the
ids2Labels map in the second NameFinderDL example (the InferenceOptions/findInOriginal one), which
previously left it empty so the example would have located nothing.
…(); drop BM25/search framing

Update the Text Normalization chapter examples for the searchDefault()->defaultChain() and
searchAnalyzer()->matchingAnalyzer() renames, and drop the 'BM25-style search' phrasing.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from ed5c777 to 105d8a1 Compare June 27, 2026 12:29
@krickert krickert force-pushed the OPENNLP-1850-4-docs branch from 121c258 to af56b24 Compare June 27, 2026 12:29