OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7)#1110
OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7)#1110krickert wants to merge 3 commits into
Conversation
08de0d3 to
9af6d92
Compare
a450069 to
dc02b9e
Compare
9af6d92 to
9dc7d51
Compare
dc02b9e to
dd1906d
Compare
9dc7d51 to
b24c9ee
Compare
dd1906d to
3fae8aa
Compare
|
@rzo1 Both points on the tokenizer PR are addressed. Resource loading (done). Split into 2a / 2b / 2c (done). Along the three concepts you identified:
|
|
b24c9ee to
702acc5
Compare
3fae8aa to
f2d1d8c
Compare
|
@rzo1 Both addressed (tip Loader symmetry. I kept the two loaders deliberately different but documented why, and closed the test gap. The difference is real rather than an oversight:
|
…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.
WordBreakProperty.parse threw an opaque StringIndexOutOfBoundsException on a non-comment line with no ';' (substring(0, -1)), unlike the sibling ExtendedPictographic.parse which guards it. It now throws IllegalStateException naming the offending line. Exposed parse() package-visibly; WordBreakPropertyTest proves the red->green. Real Word_Break data still loads (conformance suite).
…; WordType heuristic note (tokenizer) ExtendedPictographic.parse now fails loud (IllegalArgumentException naming the line) on malformed hex, matching the sibling loaders, with a comment explaining why its value column is optional (unlike WordBreakProperty); added a malformed-data test. Noted in WordType.of's comment that the script category comes from the first script code point (single-script UAX #29 segments; leading-script for mixed runs).
702acc5 to
2bed555
Compare
f2d1d8c to
9c8e3fc
Compare
Part 2a of the OPENNLP-1850 stack. Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), as requested in review.
Self-contained: the Unicode-conformant
WordSegmenter/WordTokenizer/WordType/WordTokenover the bundledWord_BreakandExtended_Pictographicdata, the officialWordBreakTest.txtconformance suite (1944/1944), and the Unicode dataLICENSE/NOTICE/rat-excludes.WordBreakPropertyandExtendedPictographicload their data lazily and recoverably (double-checked accessor, no classpath resource I/O in astatic {}block), per the same review point as on the foundation — so a resource the loader cannot see is a catchable exception at call time, not a class-poisoningExceptionInInitializerError.Base:
OPENNLP-1850-1b-alignment(#1109). Stack: 1a → 1b → 2a (this) → 2b → 2c → DL → docs.