Fix silent acceptance of invalid escape sequences in string parser#2
Conversation
The string parser silently accepted unknown escape sequences (e.g. \z, \0) by emitting raw backslash + character bytes. It also accepted non-hex characters in \u and \U escapes, silently producing wrong codepoints via read-hex. Now all three cases return a parse error with a descriptive message: - Unknown escapes: "expected: valid escape sequence" - \x without hex digits: "expected: hex digits after \x" - \u/\U with non-hex chars: "expected: hex digits in \u/\U escape" Changed err from Bool to String to carry per-error messages.
There was a problem hiding this comment.
Build & Tests
Build passes. All 60 tests pass (macOS + Ubuntu CI green, verified locally). The 9 new tests cover valid escapes (\t, \x41, \u0041), invalid unknowns (\z, \q, \0), and hex-digit validation for \x, \u, and \U.
Findings
The changes are correct across all three escape categories:
-
Unknown escapes (
\z,\q,\0): thecondfallthrough now setserr @"valid escape sequence"instead of silently emitting raw bytes. The error message reads as "expected valid escape sequence" viaParseErr.expected-at— correct for the Parsec-style format. -
\u/\Uhex validation: the newall-hexloop correctly validates all required digits before callingread-hex. Previously\uGHIJwould silently produce a bogus codepoint. -
\xhandler: already had hex-digit validation via thewhileloop withhex-digit?— the PR only changes its error message from(set! err true)to(set! err @"hex digits after \\x"). No logic change needed here. -
errtype change fromBooltoString: the check(Int.= 0 (String.length &err))is equivalent to the old(not err), and(Int.> (String.length &err) 0)is equivalent toerr. Correct.
One minor note (not blocking): invalid Unicode codepoints beyond U+10FFFF (e.g. \U00110000) pass hex validation and reach push-utf8-bytes, which would produce invalid UTF-8. This is pre-existing behavior and out of scope for this PR.
CHANGELOG is updated.
Verdict: merge
Solid bug fix with good test coverage. The silent acceptance of invalid escapes was a real problem — \0 producing literal \0 bytes instead of erroring is the kind of thing that hides bugs for months.
Summary
The string parser silently accepted invalid escape sequences instead of returning parse errors:
\z,\q,\0emitted raw backslash + character bytes, masking bugs in user code\uand\Uescapes only checked that enough bytes were available but didn't validate that those bytes were hex digits —\uGHIJsilently produced a wrong codepoint viaread-hexChanges
\uand\Uhandlers now validate all required digits are hex before callingread-hexerrfromBooltoStringso each error site carries a descriptive message (e.g. "valid escape sequence", "hex digits in \u escape")\z,\q,\0,\xwithout hex,\xZZ,\uGHIJ,\U0000GHIJ) and valid escapes (\t,\x41,\u0041)Opened by the carpentry-org heartbeat agent (Claude). hellerve has not reviewed this yet.