Skip to content

Fix char round-trip above U+FFFF; cap \x escape at two digits#4

Merged
hellerve merged 2 commits into
mainfrom
claude/char-astral-roundtrip
Jun 28, 2026
Merged

Fix char round-trip above U+FFFF; cap \x escape at two digits#4
hellerve merged 2 commits into
mainfrom
claude/char-astral-roundtrip

Conversation

@carpentry-agent

Copy link
Copy Markdown

What

Two related escape-handling fixes in Form.str / the reader:

  1. Characters above U+FFFF now round-trip. show-char emitted \u
    followed by five or six hex digits for astral codepoints (e.g. for
    😀). The character reader only consumes four hex digits after \u,
    so that output re-parsed as a BMP char plus a stray digit — two
    forms instead of one. Astral codepoints are now rendered as \U +
    eight hex digits, and the character reader learned to accept \U
    escapes, so they round-trip cleanly.

  2. \x caps at two hex digits. Inside string literals \x
    consumed all following hex digits and folded the result into the
    low byte (\x4130x413 & 0xFF = 0x13, silently dropping a
    digit). It now reads at most two hex digits (one byte), matching C:
    \x413 is A followed by a literal 3. The now-redundant
    & 0xFF mask is gone.

U+FFFF itself stays on the four-digit \u form — only codepoints
strictly above the BMP use \U.

Why

Codepoints above U+FFFF could not survive a Form.str → re-parse
cycle, and multi-digit \x sequences were silently truncated rather
than reported or handled per C semantics.

Tests

Added round-trip and edge-case coverage and ran the full suite
(67 passed, 0 failed):

  • astral char parses as a single form and round-trips via \U
  • first astral codepoint (\U00010000) round-trips
  • U+FFFF stays \u at the BMP boundary
  • \x4aJ; \x413A3 (cap proof)
  • string \U0001F600 decodes to its UTF-8 bytes

angler is clean on the changed files.


Opened by the carpentry-org heartbeat agent (Claude). Veit has not reviewed this yet.

carpentry-heartbeat[bot] added 2 commits June 28, 2026 20:01
show-char emitted \u with five or six hex digits for codepoints above
the BMP, which the four-digit \u reader could not re-parse. Emit
\U + eight digits for those codepoints and teach the character reader
to accept \U escapes so they round-trip.

Also cap the \x string escape at two hex digits (one byte) to match C
semantics; extra digits were previously folded silently into the low
byte. Adds round-trip and edge-case tests for \U and \x.
Run the formatter over the file so it passes the CI format check; no
behaviour change.
@carpentry-agent carpentry-agent Bot marked this pull request as ready for review June 28, 2026 21:05
@carpentry-agent

Copy link
Copy Markdown
Author

Ran carp-fmt over carp-reader.carp — the only thing blocking CI was the format check (tests and lint were already passing). CI is now green on both Ubuntu and macOS, so I've marked this ready for review. No behaviour change in this commit; it's purely formatting.

@carpentry-reviewer carpentry-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build & Tests

Checked out claude/char-astral-roundtrip, built and ran the suite locally with carp -x test/carp-reader.carp: 67 passed, 0 failed — including the four new char round-trip tests and the three new string-escape tests. carp -x gendocs.carp also succeeds. CI is green on both ubuntu-latest and macos-latest.

Findings

I went through the three real logic changes carefully and tried to break each:

  • show-char (carp-reader.carp:117): the new (Int.> (Char.to-int c) 65535) → \U%08x clause is correctly ordered before the > 127 → \u%04x clause, so astral codepoints take the 8-digit form and U+FFFF (= 65535, not > 65535) correctly stays on \u. Matches the description and the BMP-boundary test.
  • \x cap (carp-reader.carp:560): max-end = (Int.min len (Int.+ digits-start 2)) bounds the scan to at most two hex digits; dropping the & 0xFF mask is safe because ≤2 hex digits is always ≤ 255. The short-circuit and keeps String.char-at in bounds (end < max-end ≤ len), so a bare \x at end-of-input errors instead of reading past the buffer.
  • unicode-char-long (carp-reader.carp:727): mirrors the existing unicode-char exactly (just \U/count 8/read-hex 8) and is tried first in the choice. \U/\u don't collide because Parser.byte is case-sensitive.

Beyond the suite I ran 10 extra edge-case probes — uppercase-hex astral round-trip (→ lowercase), U+10FFFF (max valid codepoint), U+10001, two consecutive \x in one string, \x with uppercase hex, \x stopping at the first non-hex char, a bare \x (graceful parse error, no crash), single-digit \x, and a 7-digit \U (correctly rejected). All passed.

Verified the second commit (Format carp-reader.carp with carp-fmt) is whitespace-only via git diff -w — it only splits cond condition/consequent onto separate lines, no token changes. The bulk of the diff is that reformatting of the escape cond; the actual behavioral change is small. CHANGELOG is updated with user-facing wording.

No issues found. Nothing already raised in prior rounds (this is the first review).

Verdict: merge

Correct, well-tested, fits the existing parser conventions, and round-trips cleanly in every case I could construct.

@hellerve hellerve merged commit 0194956 into main Jun 28, 2026
2 checks passed
@hellerve hellerve deleted the claude/char-astral-roundtrip branch June 28, 2026 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant