Fix char round-trip above U+FFFF; cap \x escape at two digits #4
show-char emitted \u with five or six hex digits for codepoints above the BMP, which the four-digit \u reader could not re-parse. Emit \U + eight digits for those codepoints and teach the character reader to accept \U escapes so they round-trip. Also cap the \x string escape at two hex digits (one byte) to match C semantics; extra digits were previously folded silently into the low byte. Adds round-trip and edge-case tests for \U and \x.
Run the formatter over the file so it passes the CI format check; no behaviour change.
Build & Tests
Checked out claude/char-astral-roundtrip, built and ran the suite locally with carp -x test/carp-reader.carp: 67 passed, 0 failed — including the four new char round-trip tests and the three new string-escape tests. carp -x gendocs.carp also succeeds. CI is green on both ubuntu-latest and macos-latest.
Findings
I went through the three real logic changes carefully and tried to break each:
show-char (carp-reader.carp:117): the new (Int.> (Char.to-int c) 65535) → \U%08x clause is correctly ordered before the > 127 → \u%04x clause, so astral codepoints take the 8-digit form and U+FFFF (= 65535, not > 65535) correctly stays on \u. Matches the description and the BMP-boundary test.
\x cap (carp-reader.carp:560): max-end = (Int.min len (Int.+ digits-start 2)) bounds the scan to at most two hex digits; dropping the & 0xFF mask is safe because ≤2 hex digits is always ≤ 255. The short-circuit and keeps String.char-at in bounds (end < max-end ≤ len), so a bare \x at end-of-input errors instead of reading past the buffer.
unicode-char-long (carp-reader.carp:727): mirrors the existing unicode-char exactly (just \U / count 8 / read-hex 8) and is tried first in the choice. \U / \u don't collide because Parser.byte is case-sensitive.
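The clause ordering and the fixed-width readers described in these findings can be illustrated with a small sketch. It is written in Python rather than Carp, and show_char and read_char_escape are hypothetical stand-ins, not the functions in carp-reader.carp; it only models the escape selection (> 0xFFFF checked before > 127) and the case-sensitive four-digit and eight-digit readers.

```python
import re

def show_char(cp: int) -> str:
    """Pick an escape form for a codepoint (hypothetical model of show-char)."""
    if cp > 0xFFFF:              # must be tested before the > 127 clause
        return "\\U%08x" % cp
    if cp > 127:
        return "\\u%04x" % cp
    return chr(cp)

# \U takes exactly eight hex digits, \u exactly four; matching is case-sensitive.
_ESCAPE = re.compile(r"\\U([0-9a-fA-F]{8})|\\u([0-9a-fA-F]{4})")

def read_char_escape(text: str) -> int:
    """Parse one char form; reject anything with leftover characters."""
    m = _ESCAPE.fullmatch(text)
    if m is None:
        if len(text) == 1:       # plain, unescaped character
            return ord(text)
        raise ValueError(f"not a single char escape: {text!r}")
    return int(m.group(1) or m.group(2), 16)

# Every codepoint, including the BMP boundary and the maximum, round-trips.
for cp in (0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    assert read_char_escape(show_char(cp)) == cp
```

Under this model the old five-digit output for U+1F600 (\u1f600) fails to parse as a single form, which is the round-trip failure the PR fixes, and a seven-digit \U is rejected as well.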
Beyond the suite I ran 10 extra edge-case probes, including uppercase-hex astral round-trip (→ lowercase), U+10FFFF (max valid codepoint), U+10001, two consecutive \x in one string, \x with uppercase hex, \x stopping at the first non-hex char, a bare \x (graceful parse error, no crash), single-digit \x, and a 7-digit \U (correctly rejected). All passed.
Verified the second commit (Format carp-reader.carp with carp-fmt) is whitespace-only via git diff -w — it only splits cond condition/consequent onto separate lines, no token changes. The bulk of the diff is that reformatting of the escape cond; the actual behavioral change is small. CHANGELOG is updated with user-facing wording.
No issues found. Nothing already raised in prior rounds (this is the first review).
Verdict: merge
Correct, well-tested, fits the existing parser conventions, and round-trips cleanly in every case I could construct.
What
Two related escape-handling fixes in Form.str / the reader:
Characters above U+FFFF now round-trip. show-char emitted \u followed by five or six hex digits for astral codepoints (e.g. for 😀). The character reader only consumes four hex digits after \u, so that output re-parsed as a BMP char plus a stray digit: two forms instead of one. Astral codepoints are now rendered as \U + eight hex digits, and the character reader learned to accept \U escapes, so they round-trip cleanly.
\x caps at two hex digits. Inside string literals \x consumed all following hex digits and folded the result into the low byte (\x413 → 0x413 & 0xFF = 0x13, silently dropping a digit). It now reads at most two hex digits (one byte), matching C: \x413 is A followed by a literal 3. The now-redundant & 0xFF mask is gone.
U+FFFF itself stays on the four-digit \u form; only codepoints strictly above the BMP use \U.
Why
Codepoints above U+FFFF could not survive a Form.str → re-parse cycle, and multi-digit \x sequences were silently truncated rather than reported or handled per C semantics.
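The difference between the old and new \x handling can be shown with a minimal Python sketch. The function names are hypothetical and this is a model of the behaviour described in this PR, not the Carp implementation; in particular, raising an error on zero digits in the old model is an assumption.

```python
from typing import Tuple

HEX = "0123456789abcdefABCDEF"

def read_x_old(s: str, digits_start: int) -> Tuple[int, int]:
    """Old behaviour as described: consume every hex digit, keep the low byte."""
    end = digits_start
    while end < len(s) and s[end] in HEX:
        end += 1
    if end == digits_start:      # assumption: zero digits was already an error
        raise ValueError("\\x needs at least one hex digit")
    return int(s[digits_start:end], 16) & 0xFF, end

def read_x_capped(s: str, digits_start: int) -> Tuple[int, int]:
    """New behaviour as described: scan at most two hex digits."""
    max_end = min(len(s), digits_start + 2)
    end = digits_start
    # The bounds check runs first, so s[end] is never read at or past max_end.
    while end < max_end and s[end] in HEX:
        end += 1
    if end == digits_start:
        raise ValueError("\\x needs at least one hex digit")
    return int(s[digits_start:end], 16), end   # at most 0xFF, no mask needed

print(read_x_old("413", 0))     # (19, 3): 0x413 & 0xFF == 0x13, the 4 is lost
print(read_x_capped("413", 0))  # (65, 2): 0x41 is 'A', the 3 stays a literal
```

Each function returns the decoded byte and the index where scanning stopped, so the caller of the capped version resumes at the literal 3.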
Tests
Added round-trip and edge-case coverage and ran the full suite (67 passed, 0 failed):
\U (\U00010000) round-trips
\u at the BMP boundary
\x4a → J; \x413 → A3 (cap proof)
\U0001F600 decodes to its UTF-8 bytes
angler is clean on the changed files.
Opened by the carpentry-org heartbeat agent (Claude). Veit has not reviewed this yet.