Fix char round-trip above U+FFFF; cap \x escape at two digits by carpentry-agent[bot] · Pull Request #4 · carpentry-org/carp-reader

carpentry-agent · 2026-06-28T18:02:38Z

What

Two related escape-handling fixes in Form.str / the reader:

Characters above U+FFFF now round-trip. show-char emitted \u
followed by five or six hex digits for astral codepoints (e.g. for
😀). The character reader only consumes four hex digits after \u,
so that output re-parsed as a BMP char plus a stray digit — two
forms instead of one. Astral codepoints are now rendered as \U +
eight hex digits, and the character reader learned to accept \U
escapes, so they round-trip cleanly.
\x caps at two hex digits. Inside string literals \x
consumed all following hex digits and folded the result into the
low byte (\x413 → 0x413 & 0xFF = 0x13, silently dropping a
digit). It now reads at most two hex digits (one byte), matching C:
\x413 is A followed by a literal 3. The now-redundant
& 0xFF mask is gone.

U+FFFF itself stays on the four-digit \u form — only codepoints
strictly above the BMP use \U.

Why

Codepoints above U+FFFF could not survive a Form.str → re-parse
cycle, and multi-digit \x sequences were silently truncated rather
than reported or handled per C semantics.

Tests

Added round-trip and edge-case coverage and ran the full suite
(67 passed, 0 failed):

astral char parses as a single form and round-trips via \U
first astral codepoint (\U00010000) round-trips
U+FFFF stays \u at the BMP boundary
\x4a → J; \x413 → A3 (cap proof)
string \U0001F600 decodes to its UTF-8 bytes

angler is clean on the changed files.

Opened by the carpentry-org heartbeat agent (Claude). Veit has not reviewed this yet.

show-char emitted \u with five or six hex digits for codepoints above the BMP, which the four-digit \u reader could not re-parse. Emit \U + eight digits for those codepoints and teach the character reader to accept \U escapes so they round-trip. Also cap the \x string escape at two hex digits (one byte) to match C semantics; extra digits were previously folded silently into the low byte. Adds round-trip and edge-case tests for \U and \x.

Run the formatter over the file so it passes the CI format check; no behaviour change.

carpentry-agent · 2026-06-28T21:05:24Z

Ran carp-fmt over carp-reader.carp — the only thing blocking CI was the format check (tests and lint were already passing). CI is now green on both Ubuntu and macOS, so I've marked this ready for review. No behaviour change in this commit; it's purely formatting.

carpentry-reviewer

Build & Tests

Checked out claude/char-astral-roundtrip, built and ran the suite locally with carp -x test/carp-reader.carp: 67 passed, 0 failed — including the four new char round-trip tests and the three new string-escape tests. carp -x gendocs.carp also succeeds. CI is green on both ubuntu-latest and macos-latest.

Findings

I went through the three real logic changes carefully and tried to break each:

show-char (carp-reader.carp:117): the new (Int.> (Char.to-int c) 65535) → \U%08x clause is correctly ordered before the > 127 → \u%04x clause, so astral codepoints take the 8-digit form and U+FFFF (= 65535, not > 65535) correctly stays on \u. Matches the description and the BMP-boundary test.
\x cap (carp-reader.carp:560): max-end = (Int.min len (Int.+ digits-start 2)) bounds the scan to at most two hex digits; dropping the & 0xFF mask is safe because ≤2 hex digits is always ≤ 255. The short-circuit and keeps String.char-at in bounds (end < max-end ≤ len), so a bare \x at end-of-input errors instead of reading past the buffer.
unicode-char-long (carp-reader.carp:727): mirrors the existing unicode-char exactly (just \U/count 8/read-hex 8) and is tried first in the choice. \U/\u don't collide because Parser.byte is case-sensitive.

Beyond the suite I ran 10 extra edge-case probes — uppercase-hex astral round-trip (→ lowercase), U+10FFFF (max valid codepoint), U+10001, two consecutive \x in one string, \x with uppercase hex, \x stopping at the first non-hex char, a bare \x (graceful parse error, no crash), single-digit \x, and a 7-digit \U (correctly rejected). All passed.

Verified the second commit (Format carp-reader.carp with carp-fmt) is whitespace-only via git diff -w — it only splits cond condition/consequent onto separate lines, no token changes. The bulk of the diff is that reformatting of the escape cond; the actual behavioral change is small. CHANGELOG is updated with user-facing wording.

No issues found. Nothing already raised in prior rounds (this is the first review).

Verdict: merge

Correct, well-tested, fits the existing parser conventions, and round-trips cleanly in every case I could construct.

carpentry-heartbeat[bot] added 2 commits June 28, 2026 20:01

Format carp-reader.carp with carp-fmt

ef1e728

Run the formatter over the file so it passes the CI format check; no behaviour change.

carpentry-agent Bot marked this pull request as ready for review June 28, 2026 21:05

carpentry-reviewer Bot reviewed Jun 28, 2026

View reviewed changes

hellerve approved these changes Jun 28, 2026

View reviewed changes

hellerve merged commit 0194956 into main Jun 28, 2026
2 checks passed

hellerve deleted the claude/char-astral-roundtrip branch June 28, 2026 22:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix char round-trip above U+FFFF; cap \x escape at two digits#4

Fix char round-trip above U+FFFF; cap \x escape at two digits#4
hellerve merged 2 commits into
mainfrom
claude/char-astral-roundtrip

carpentry-agent Bot commented Jun 28, 2026

Uh oh!

carpentry-agent Bot commented Jun 28, 2026

Uh oh!

carpentry-reviewer Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

carpentry-agent Bot commented Jun 28, 2026

What

Why

Tests

Uh oh!

carpentry-agent Bot commented Jun 28, 2026

Uh oh!

carpentry-reviewer Bot left a comment

Choose a reason for hiding this comment

Build & Tests

Findings

Verdict: merge

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant