Parser performance: 2-3x speedup via grammar preprocessing and interpreter optimisations #373

Closed

JanJakes wants to merge 20 commits into trunk from parser-performance

Conversation

@JanJakes
Member

@JanJakes JanJakes commented Apr 25, 2026

Summary

This is a draft PR to measure the CI runtime improvement from a series of performance optimisations to the recursive-descent parser. It is aimed primarily at lowering the time tests spend in PHPUnit, which is dominated by the 70-batch MySQL server parser test suite (~70k queries parsed per run).

End-to-end (lex+parse, 69,577 queries, local benchmark):

  • Trunk, no opcache: ~10,500 QPS (6.6 s)
  • Trunk, tracing JIT: ~23,700 QPS (2.9 s)
  • This branch, no opcache: ~25,100 QPS (2.8 s) — 2.4× faster
  • This branch, tracing JIT: ~43,500 QPS (1.6 s) — 1.8× faster end-to-end / 2.7× parse-only

CI runtime comparison

Run PHPUnit tests step duration on this branch vs the average of the last 3 trunk-merged feature PRs (#365 / #367 / #369):

| PHP | Trunk avg | This PR | Speedup |
|-----|-----------|---------|---------|
| 7.2 | 165 s | 97 s | 1.70× |
| 7.3 | 126 s | 73 s | 1.73× |
| 7.4 | 120 s | 57 s | 2.11× |
| 8.0 | 132 s | 68 s | 1.94× |
| 8.1 | 141 s | 75 s | 1.88× |
| 8.2 | 128 s | 70 s | 1.83× |
| 8.3 | 134 s | 74 s | 1.81× |
| 8.4 | 138 s | 72 s | 1.92× |
| 8.5 | 141 s | 79 s | 1.78× |

Total: 20.4 min → 11.1 min of cumulative phpunit-step time across the matrix per run (~9 min saved per CI run).

The 70-batch server-suite parser sub-suite dominates the step, and the per-PHPUnit-step speedup is consistent with the local end-to-end benchmark.

Sequence of commits

  1. f935d03 Inline terminal matching and defer node allocation
  2. 0e34f25 Per-branch FIRST sets to skip unreachable branches
  3. 6df347a Short-circuit nullable fallback + inline single-branch fragments
  4. 9eea849 Strip epsilon markers + cache grammar refs
  5. ddfe4b6 Return fragment results as children arrays
  6. 77756bf End-of-input sentinel token (drop range checks)
  7. daec1cb Embed branch sequences in selector
  8. 2609898 Compare selectStatement by rule id
  9. c4e7f8f Add bench-parser-split.php tooling
  10. cd0b609 Deduplicate selector entries (memory: 40 MB → 10 MB)
  11. e8b006f Cache branch sequences in deduped selector
  12. b5959e8 Grammar compilation experiment + bench tools (kept as documentation, not used at runtime)
  13. bf1b1fe Pack switch case labels (compiled output cleanup)
  14. fa2fd65 Single-candidate rule fast path
  15. 1e7e3cf Direct-return refactor for single-candidate path
  16. 9fcfb27 Mark WP_Parser_Node as final
  17. b9b64a6 bench-final.php helper

Tests: 667/667 phpunit tests pass with the same ~1.43M assertions as trunk.
Grammar memory: ~10 MB (from ~1.5 MB on trunk; cost of the new per-token branch selector + FIRST/NULLABLE precomputation).
PHPCS: clean.

Test plan

  • composer run test (mysql-on-sqlite) - green locally
  • composer run check-cs - green
  • CI matrix on PHP 7.2 → 8.5 - all green, ~1.85× faster
  • Confirm WordPress PHPUnit / E2E Tests pass

Notes

This PR also includes a grammar→PHP compiler experiment under tests/tools/compile-grammar.php. It was kept as documentation: it's faster than the interpreter without JIT (+14-19%) but slightly slower under tracing JIT, so the interpreter remains the production path.

Draft because the goal here is to compare CI runtimes against recent merges before deciding what to ship.

JanJakes added 17 commits April 24, 2026 15:28
Hot-path changes in WP_Parser::parse_recursive():

- Inline the terminal match in the branch loop instead of recursing into
  parse_recursive() for every token. Over the full MySQL test suite this
  eliminates ~1.6M function calls.
- Hoist grammar, rules, fragment_ids, rule_names, tokens, and token_count
  into local variables so the inner loops avoid repeated property lookups
  on $this->grammar.
- Cache the token count on the instance to avoid a count() per call.
- Build branch children in a local array and only instantiate the
  WP_Parser_Node once the branch has matched; on the MySQL corpus ~75% of
  speculative nodes were previously created and thrown away.
- Drop a dead is_array($subnode) check that never fires in practice
  (subnodes are false, true, tokens, or nodes - never arrays).
- Inline fragment inlining: read the fragment's children directly instead
  of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
  Before: ~11,500 QPS   After: ~14,900 QPS  (+29%)
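The deferred-allocation pattern from this commit can be sketched independently of the parser. A minimal Python sketch, where `match_symbol` is a hypothetical callback standing in for the real terminal/rule matching:

```python
def try_branch(symbols, match_symbol):
    # Build children in a plain local list and only allocate the node
    # once the whole branch has matched. Failed speculative branches
    # allocate nothing (the PR found ~75% of speculative nodes were
    # previously created and thrown away).
    children = []
    for sym in symbols:
        m = match_symbol(sym)
        if m is None:
            return None  # branch failed: no node was ever created
        children.append(m)
    return ("node", children)  # allocate only on success
```

The tuple stands in for a `WP_Parser_Node`; the point is purely that construction happens after the match, not before.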
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes
each rule's branches by the tokens that can start them. At parse time the
parser jumps straight to the candidate branches for the current token
instead of iterating every branch and letting most fail.

On the full MySQL test suite, 59% of branch attempts previously failed
because the first token could never match the branch's FIRST set; with
per-branch lookahead those attempts are eliminated.

End-to-end parser benchmark:
  Before: ~14,900 QPS   After: ~22,400 QPS  (+50%)
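The FIRST/NULLABLE fixpoint described above is a standard construction; a minimal language-neutral sketch in Python, over a hypothetical toy grammar rather than the PR's actual data structures (terminals are uppercase strings, non-terminals are rule names, `[]` is an epsilon branch):

```python
grammar = {
    "stmt": [["SELECT", "cols"], ["INSERT", "cols"]],
    "cols": [["IDENT", "more"]],
    "more": [[",", "IDENT", "more"], []],  # [] = epsilon branch
}

def compute_nullable(grammar):
    nullable = set()
    changed = True
    while changed:  # fixpoint: iterate until no rule flips to nullable
        changed = False
        for rule, branches in grammar.items():
            if rule in nullable:
                continue
            for branch in branches:
                if all(sym in nullable for sym in branch):
                    nullable.add(rule)
                    changed = True
                    break
    return nullable

def compute_first(grammar, nullable):
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:  # fixpoint: propagate FIRST sets until stable
        changed = False
        for rule, branches in grammar.items():
            for branch in branches:
                for sym in branch:
                    new = {sym} if sym not in grammar else first[sym]
                    if not new <= first[rule]:
                        first[rule] |= new
                        changed = True
                    if sym in grammar and sym in nullable:
                        continue  # symbol can vanish; next one also starts the branch
                    break
    return first
```

With per-branch FIRST sets derived the same way, the selector maps each (rule, token) pair straight to the branches that can possibly start with that token.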
Two grammar/parser refinements that both reduce recursive calls:

* In parse_recursive(): when the rule has a per-token branch selector but
  the current token is not in any branch's FIRST and the rule itself is
  nullable, return 'matched empty' immediately instead of descending into
  nullable branches that would recursively do the same thing. This alone
  eliminates ~460k recursive calls on the MySQL corpus.

* At grammar build time, expand every single-branch fragment rule into
  its call sites. Fragments exist only to factor shared sub-sequences and
  their children are already flattened into the parent AST node, so
  splicing them directly into parent branches is a no-op for the
  resulting tree but removes an entire recursive call per use. 480 of the
  grammar's fragments qualify.

Also drops the dead terminal branch at the top of parse_recursive() (the
branch loop inlines terminal matching, so parse_recursive is only ever
called with non-terminal rule ids) and the always-false empty-branches
guard.

End-to-end parser benchmark:
  Before: ~22,400 QPS   After: ~27,500 QPS  (+23%)
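The single-branch fragment expansion amounts to a build-time rewrite of the rules table. A Python sketch over a hypothetical `rules` mapping, assuming (as here) that the fragments are non-recursive:

```python
def inline_single_branch_fragments(rules, fragment_ids):
    # Splice each single-branch fragment's symbol sequence into its call
    # sites. Fragment children are flattened into the parent AST node
    # anyway, so the resulting tree is identical; only the recursive
    # call per use disappears.
    single = {fid: rules[fid][0] for fid in fragment_ids if len(rules[fid]) == 1}
    changed = True
    while changed:  # repeat so fragments that reference fragments fully collapse
        changed = False
        for branches in rules.values():
            for i, branch in enumerate(branches):
                if any(sym in single for sym in branch):
                    new_branch = []
                    for sym in branch:
                        new_branch.extend(single[sym] if sym in single else [sym])
                    branches[i] = new_branch
                    changed = True
    return rules
```

The outer `while` loop is what makes chained fragments (a fragment whose branch mentions another fragment) collapse completely.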
Three minor reductions in per-call work:

* Strip explicit EMPTY_RULE_ID symbols out of rule branches at grammar
  build time. The parser loop would have 'continue'd over them anyway, so
  removing them ahead of time lets the hot symbol loop drop the epsilon
  check. Pure-epsilon branches become empty branches and still match
  empty via the existing empty-children fast path.

* Cache the grammar's rules, fragment_ids, rule_names, branches_for_token,
  nullable_branches, and highest_terminal_id as direct parser instance
  fields so parse_recursive() no longer pays for a $this->grammar->...
  double hop on every call.

* Collapse the two-step node construction (new + set_children) into a
  single constructor call that takes the children array directly. This
  saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but
their runtime role is still trivial: match a sequence of symbols and have
the caller splice the resulting children into its own node. The old code
allocated a full WP_Parser_Node for each fragment match just to have the
caller immediately copy its children out.

Return the children array directly from fragments instead. The caller
distinguishes via is_array($subnode) and splices in-place, saving a
Parser_Node allocation per fragment match (~253k per 10k queries).

End-to-end parser benchmark:
  Before: ~27,000 QPS (avg)   After: ~28,700 QPS (+6%).
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of
the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only
token with id 0, is stripped by the lexer before tokens reach the
parser), so the sentinel cannot match any real terminal.

This lets the hot path drop the 'position < token_count' range check
everywhere it reads the current token id: the selector lookup at method
entry, the inline terminal match inside the branch loop, and the
post-branch INTO negative lookahead for selectStatement. Any read past
the last real token falls naturally into the nullable-fallback or
branch-miss handling.

Also drop a few dead locals ($token_count, $fragment_ids) that no
longer appear in the hot path after the change.

End-to-end parser benchmark:
  Before: ~28,700 QPS (avg)   After: ~29,800 QPS (+4%).
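The sentinel trick generalises beyond this parser. A minimal Python illustration with a hypothetical matcher, relying on the same invariant the commit states (no real token ever carries id 0):

```python
SENTINEL_ID = 0  # real tokens never carry id 0 once whitespace is stripped

def match_sequence(token_ids, expected):
    # Append one sentinel so any read past the last real token hits an
    # id that can never equal a real terminal: the match simply fails,
    # and no 'position < token_count' range check is needed in the loop.
    toks = token_ids + [SENTINEL_ID]
    pos = 0
    for want in expected:
        if toks[pos] != want:
            return False  # covers the "ran off the end" case via the sentinel
        pos += 1
    return True
```

One sentinel suffices because the first out-of-range read fails the match before a second read can happen.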
Previously the per-(rule, token) selector stored a list of branch
indexes that the parser then had to look up in $rules[$rule_id] on
every branch attempt. Store the branch symbol sequences themselves so
the hot loop can iterate candidate branches directly.

PHP arrays are copy-on-write, so sharing the same branch sequence
across selector entries for many tokens costs negligible extra memory.
The nullable_branches map shrinks to a bool marker since the parser
only uses it for existence checks.

Also cache the start rule id on the grammar so parse() skips its
array_search() across rule_names on every call.

End-to-end parser benchmark:
  Before: ~29,800 QPS (avg)   After: ~31,700 QPS (+6%).
Minor cleanup in parse_recursive(): cache the selectStatement rule id
once and compare integers on every call instead of re-comparing the
'selectStatement' string against every rule's name. Also drops the
$rules instance cache from the parser, which the hot path no longer
touches now that branch sequences are embedded in the selector.
bench-parser-split.php pre-tokenizes the MySQL test suite once and then
times only the parser across multiple runs, so parser-specific changes
can be measured without lexer noise. The script accepts --runs=N and
--limit=M for reproducible comparisons.

Also adopts phpcbf's trivial whitespace alignment fixes in the grammar
and parser source to keep 'composer run check-cs' clean.
The per-(rule, token) branch selector stored a separate inner array per
token, even when many tokens within the same rule mapped to identical
branch-id lists (a single branch's FIRST set covers many tokens, for
example). Loading the MySQL grammar used ~40 MB of PHP memory, most of
which was duplicated inner arrays.

Deduplicate by signature during grammar build so all tokens that land
on the same branch-id list share one inner array via copy-on-write. The
selector is now stored as branch indexes again (instead of the cached
symbol sequences from the previous commit) - the one extra
$branches[$idx] lookup per branch attempt costs < 1% at runtime but
allows the inner arrays to be tiny and to share aggressively.

Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB.
PHPUnit peak memory drops from 198 MB to 110 MB.
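The dedup-by-signature step can be sketched as follows (Python; in PHP the sharing happens via copy-on-write arrays, here the tokens literally share one list object):

```python
def dedupe_selector(selector):
    # selector: rule_id -> {token_id: [branch_index, ...]}.
    # Within each rule, all tokens that map to an identical branch-index
    # list end up sharing a single list object, keyed by its signature.
    for by_token in selector.values():
        pool = {}
        for token_id, branch_ids in by_token.items():
            sig = tuple(branch_ids)
            by_token[token_id] = pool.setdefault(sig, branch_ids)
    return selector
```

A single branch's FIRST set covers many tokens, so most inner lists within a rule are duplicates; after this pass they collapse to one shared instance per distinct list.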
Re-embed branch symbol sequences directly in the selector entries so the
hot loop can foreach over them without an extra $rules[$rule_id][$idx]
indirection per branch attempt. Per-rule dedup pairs tokens that land on
the same branch list to a single sequences array via copy-on-write, so
grammar memory stays at ~10 MB instead of the ~40 MB the naive form
needed.

Recovers the ~3% parse speedup lost in the memory-reduction commit while
keeping the lower footprint.

Parser benchmark:
  Best: ~32,400 QPS   Avg: ~31,300 QPS
(compared to ~30,700 avg in the indexes-only variant)
This commit preserves the exploration of 'full grammar compilation'
as a follow-up to the parser performance work. It is intentionally kept
separate from the main parser because the compilation trade-off is not
a clear win.

Tools added:
 - compile-grammar.php: walks the grammar and emits a self-contained
   class (WP_MySQL_Compiled_Parser) with one method per reachable rule.
   Single-branch fragments are inlined, single-use multi-branch
   fragments are kept as methods. Dispatch uses switch-on-tid for
   multi-group rules; a compact isset() lookup table for single-group
   rules; and a 'one-of-N-terminals' fast path for the many
   identifier-like rules with hundreds of single-terminal branches.
 - bench-compiled-parser.php: side-by-side run of interpreter vs
   compiled on the MySQL test corpus.
 - compare-asts.php: verifies the compiled parser produces the same
   AST as the interpreter on every query.
 - dump-inflated-grammar.php: dumps the post-inflation grammar data
   so the effect of skipping the FIRST/NULLABLE fixpoint at runtime
   could be measured.
 - bench-hot-rules.php: distribution of per-rule call counts for
   deciding which rules are worth specialising.

Empirical findings on the full MySQL test corpus (69,576 queries):

Interpreter (current parser on this branch):
 - no opcache:        ~32,500 QPS, ~12 MB grammar
 - opcache, no JIT:   ~35,100 QPS
 - opcache + JIT:     ~52,600 QPS (tracing)

Compiled parser:
 - no opcache:        ~38,300 QPS (+18% over interpreter)
 - opcache, no JIT:   ~41,800 QPS (+19%)
 - opcache + JIT:     ~49,700 QPS (slightly below interpreter)

Compiled file size:   ~2.6 MB (99k lines, 1,427 methods)
Compiled class load:  ~22 ms / ~17 MB RAM (vs 38 ms / 12 MB to inflate
                      the compressed grammar).

Why compilation helps without JIT:
 - eliminates the generic dispatch (branches_for_token, fragment_ids,
   nullable_branches isset checks) baked into every interpreter call;
 - resolves fragment vs non-fragment and nullable vs non-nullable at
   compile time so the emitted code has no runtime type checks;
 - collapses 'accepts any of N terminals' rules (251 of them) into an
   8-line isset() + consume instead of ~2.8k-line switches.

Why compilation does not help with tracing JIT:
 - the interpreter's hot loop is small and regular, which tracing JIT
   optimises aggressively (9-10 ns per recursive call);
 - the compiled parser generates a handful of very large methods (the
   biggest is ~2.8k lines of a 406-branch fragment) that tracing JIT
   struggles to optimise, and incurs a substantial JIT-compile penalty
   on first use (~0.7 s of the first-run time).

Conclusion: keep the interpreter as the primary parser. The compiler
is preserved here as documentation and as a fallback path for
environments without JIT.
Previously the compiler emitted each case label on its own line
(`\t\t\tcase 5:\n`), and case labels were 56% of all generated code.
Group multiple labels per line instead (up to 10) so the switch
dispatch is still readable but the file shrinks from ~99k lines
(~2.63 MB) to ~51k lines (~2.48 MB) with no behaviour change.

No runtime impact: verified 0 AST mismatches across the 69k-query
corpus and identical QPS to the previous output under all opcache/JIT
configurations.
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every
(rule, token) entry points to exactly one branch. Those rules account
for ~55% of parse_recursive calls on the test corpus (722k of 1.3M per
10k queries).

Flag those rules at grammar build time. In parse_recursive, detect the
flag and skip the outer 'foreach ($candidate_branches as ...)' by
taking $candidate_branches[0] directly. The branch-match body is
otherwise identical to the multi-candidate path.

End-to-end parser benchmark:
  no JIT:     ~31.6K -> ~32.6K QPS avg   (+3%)
  tracing JIT: ~52.6K -> ~55.7K QPS avg  (+6%)
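Flagging single-candidate rules at build time amounts to one pass over the selector. A Python sketch over a hypothetical selector shape (rule id to token-to-branch-list map):

```python
def flag_single_candidate(selector):
    # A rule qualifies when every (rule, token) selector entry points at
    # exactly one branch; the parser can then take candidates[0] directly
    # and skip the outer branch loop.
    return {rule_id
            for rule_id, by_token in selector.items()
            if by_token and all(len(b) == 1 for b in by_token.values())}
```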
Replace the $branch_matches flag + break+reset sequence with direct
'$this->position = $position; return false;' exits on each failure
path. Removes one local variable and a pair of conditional branches
from the hot inner loop. Minor but measurable improvement; the code is
also simpler.
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache
and tracing JIT specialize property access and method dispatch since
the class layout is now fixed. Small but consistent improvement
measured across multiple runs under tracing JIT (~+2% avg, ~+2% best).

End-to-end parser benchmark:
  tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best
  no JIT:      ~33K -> ~34K QPS avg, 35K QPS best
Reports best/median/average QPS over N runs with the currently-loaded
PHP interpreter configuration. Used to measure the effect of the
interpreter changes on top of opcache and tracing JIT configurations.
@JanJakes
Member Author

CI runtime comparison — full matrix

Compared this PR (parser-performance) against the average of three recent merged feature PRs (#365, #367, #369). All values are full-job duration (includes SQLite build + composer + tests).

| Workflow | Parser-perf | Trunk avg | Δ |
|----------|-------------|-----------|---|
| PHP 7.2 / SQLite 3.27.0 | 132 s | 201 s | −34% |
| PHP 7.3 / SQLite 3.31.1 | 116 s | 172 s | −32% |
| PHP 7.4 / SQLite 3.34.1 | 94 s | 158 s | −40% |
| PHP 8.0 / SQLite 3.37.0 | 111 s | 169 s | −34% |
| PHP 8.1 / SQLite 3.40.1 | 116 s | 183 s | −36% |
| PHP 8.2 / SQLite 3.45.1 | 115 s | 176 s | −35% |
| PHP 8.3 / SQLite 3.46.1 | 114 s | 175 s | −35% |
| PHP 8.4 / SQLite 3.51.2 | 103 s | 175 s | −41% |
| PHP 8.5 / SQLite latest | 119 s | 183 s | −35% |
| PHPUnit Tests (sum) | 1020 s | 1592 s | −36%, ~9.5 min saved |
| WordPress PHPUnit Tests | 587 s | 713 s | −18% |
| MySQL Proxy Tests | 18 s | 20 s | −10% |
| End-to-end Tests | 340 s | 315 s | +8% (noise) |
| WordPress End-to-end Tests | 441 s | 360 s | +22% (noise) |

The PHPUnit numbers above include ~30-40 s of SQLite build and ~10 s of composer install which are unchanged. Looking at just the Run PHPUnit tests step:

| PHP | Trunk avg | This PR | Speedup |
|-----|-----------|---------|---------|
| 7.2 | 165 s | 97 s | 1.70× |
| 7.3 | 126 s | 73 s | 1.73× |
| 7.4 | 120 s | 57 s | 2.11× |
| 8.0 | 132 s | 68 s | 1.94× |
| 8.1 | 141 s | 75 s | 1.88× |
| 8.2 | 128 s | 70 s | 1.83× |
| 8.3 | 134 s | 74 s | 1.81× |
| 8.4 | 138 s | 72 s | 1.92× |
| 8.5 | 141 s | 79 s | 1.78× |

~1.85× faster phpunit across all PHP versions on real CI, matching the local end-to-end benchmark.

E2E variance is browser/wp-env startup dominated and within normal run-to-run noise; not parser-related.

All tests pass. Total saved per CI run on PHPUnit + WordPress PHPUnit alone: ~11 minutes.

Experiment: compile the grammar to a single PCRE2 pattern using:
- each token id encoded as a Unicode codepoint at offset 0x4000
- each rule emitted as (?<rN>...) named subroutine
- (*THEN) on each branch's first symbol of *single-candidate* rules
  (where sibling-branch FIRST sets are disjoint, so committing is safe)
- aggressive transitive inlining of single-use non-recursive rules to
  shrink the bytecode below PCRE2's compiled-pattern size limit

Result on the 69,576-query MySQL test corpus (PCRE2 JIT enabled):
- Pattern: ~76 KB source, 1127 named subroutines after 789 rules inlined
- Match throughput: ~97,600 QPS, vs the optimised interpreter's ~62k.
- 99.82% accuracy: ~120 spurious failures, mostly the 'SELECT ... INTO'
  ambiguity that the interpreter handles via a runtime negative
  lookahead the regex doesn't model.

Trade-offs:
1. Match-only - the regex doesn't build an AST, so it's not a drop-in
   replacement for the recursive-descent parser the SQLite driver
   needs.
2. Without (*THEN) the matcher backtracks catastrophically on nested
   compound statements (CREATE TRIGGER ... BEGIN ... IF ...).
3. With (*THEN) on every branch (not just single-candidate) the regex
   gives spurious failures because PCRE commits to the first
   first-symbol match and can't try a sibling alternative.
4. Pattern size is constrained by PCRE2's default LINK_SIZE=2 bytecode
   limit; aggressive rule inlining is needed to fit a non-trivial
   grammar.

Kept as documentation: an interesting upper bound on PHP-side parsing
speed when the AST shape is not required.
@JanJakes
Member Author

Bonus experiment: compile the grammar to a single PCRE2 regex

Tried the ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ) pattern from PCRE2's "Backtracking control verbs" section. Findings:

Approach

  • Each token id encoded as a Unicode codepoint (token_id + 0x4000), so the lex output becomes a UTF-8 string the regex matches against.
  • Each grammar rule emitted as (?<rN>...) named subroutine.
  • Each branch's first symbol followed by (*THEN) so PCRE commits to the alternative once that token matches.
  • Aggressive transitive inlining of single-use, non-recursive rules to fit under PCRE2's compiled-pattern size limit (LINK_SIZE=2 default).
  • (*THEN) only emitted on rules whose sibling-branch FIRST sets are disjoint (the same single_candidate_rules set the interpreter uses), so committing is correctness-safe.
  • Mirrors the parser's SELECT ... INTO ... negative lookahead via (?!INTO_TOKEN) after selectStatement.
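The codepoint encoding itself is easy to sketch. A Python illustration using the stdlib `re` module just to show the encoding round-trips; the real experiment depends on PCRE2 features (`(?P>name)` recursion, `(*THEN)`) that `re` does not have:

```python
import re

TOKEN_BASE = 0x4000  # offset keeps token codepoints clear of ordinary text

def encode_tokens(token_ids):
    # One Unicode codepoint per token id: a lexed query becomes a short
    # string that a compiled grammar regex can match directly.
    return "".join(chr(TOKEN_BASE + t) for t in token_ids)

def terminal_atom(token_id):
    # Regex atom matching exactly one encoded terminal.
    return re.escape(chr(TOKEN_BASE + token_id))
```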

Results on the 69,576-query MySQL test corpus

  • Pattern: 76 KB source / 1,127 named subroutines (789 of the original 1,916 rules inlined transitively).
  • PCRE2 compile time: ~22 ms.
  • Match throughput with PCRE2 JIT: ~101,400 QPS
  • vs the optimised interpreter on this branch: ~62,500 QPS (best with tracing JIT)
  • vs the original trunk parser: ~23,700 QPS
  • ~62% faster than the optimised interpreter, ~4.3× the trunk parser

Caveats

  1. Match-only: the regex returns true/false for "is this a valid query?". It does not produce an AST, so it's not a drop-in replacement for the parser — the SQLite driver consumes the AST.
  2. ~0.15% spurious failures (102 out of 69,576): the grammar has ambiguities the regex doesn't model perfectly, mostly around SELECT ... INTO ... FOR UPDATE, weird identifiers, and nested versioned comments.
  3. Without (*THEN) it is catastrophically slow: a few CREATE TRIGGER queries with nested BEGIN ... IF ... END blocks individually took 3–4 seconds before the verb was added. With it, the worst-case queries drop to ~8 ms.
  4. PCRE2's bytecode size limit is the binding constraint. The full grammar without inlining is ~1.2 MB of regex source and PCRE2 refuses to compile it. Aggressive single-use rule inlining was needed to bring it down to a compilable form.

Conclusion

PCRE2 + (*THEN) can recognise this MySQL grammar at ~100 K QPS, ~60% faster than our pure-PHP interpreter. But it's a recogniser, not a parser — useful as a fast "does this parse?" gate, not as a replacement for the AST-producing parser the driver needs.

Committed in tests/tools/exp-regex-v3.php for documentation.

@JanJakes
Member Author

Follow-up: can the regex return / reconstruct an AST?

Short answer: no, and even using it as a pre-validator costs more than it saves.

What PCRE2 in PHP exposes

  • preg_match returns binary match / no-match, plus offsets and the LAST match per named capture group. A recursive (?<name>...) group, like our rules, only retains the outermost or final instance — useless for an AST.
  • (?J) (duplicate names) doesn't change this in PHP's preg API.
  • (?C) callouts compile cleanly but PHP has no pcre_callout callback exposed, so we can't intercept matching to record structural events.
  • (*MARK:NAME) only retains the last MARK along the successful path.

So we can't structurally extract a parse tree from a single PCRE2 match in PHP without writing an extension or using FFI.

Hybrid: regex pre-validate, then parser builds AST

Tested on the full 69,576-query MySQL test corpus, tracing JIT enabled:

| Approach | Time | QPS | Notes |
|----------|------|-----|-------|
| regex only (no AST) | 0.752 s | 92,519 | 102 spurious failures |
| parser only (AST) | 1.136 s | 61,240 | 9 baseline failures |
| regex + parser | 1.480 s | 47,008 | regex pre-validates; parser builds AST on success |

The hybrid is slower than the parser alone — the regex pre-check is pure overhead because the parser still has to run on every valid query (99.99% of the corpus), and valid is the common case. Hybrid only pays off when many inputs are invalid, where the regex can reject them faster than the parser; that's not our workload.
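The break-even point is simple arithmetic: the hybrid always pays the regex cost and pays the parser cost only on valid input, so it wins only when the invalid fraction exceeds t_regex / t_parser. A Python check using the measured per-query times from the table above:

```python
def hybrid_wins(p_invalid, t_regex, t_parser):
    # Expected per-query cost of the hybrid vs. parser-only:
    #   hybrid: t_regex + (1 - p_invalid) * t_parser
    #   parser: t_parser
    # so the hybrid wins iff p_invalid > t_regex / t_parser.
    return t_regex + (1 - p_invalid) * t_parser < t_parser

t_regex, t_parser = 1 / 92_519, 1 / 61_240  # seconds/query, from the table above
break_even = t_regex / t_parser  # ≈ 0.66: roughly two thirds of input must be invalid
```

For this corpus (~99.99% valid) the hybrid can never pay off, which matches the measurement.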

What it's actually good for

  • A fast "does this query parse?" gate: ~92K QPS with no AST, useful in defensive contexts where you want to reject malformed input cheaply before doing serious work.
  • An upper bound on PHP-side parsing speed for this grammar (~100K QPS) when AST shape is not required.
  • As-is, not a replacement for the production parser, because the SQLite driver consumes the AST.

Bottom line for the parser-performance PR

The optimised interpreter on this branch (~62K QPS with JIT, ~24K without) is the production path because it's the only approach that returns an AST. The regex experiment is committed in tests/tools/exp-regex-v3.php and the hybrid in tests/tools/exp-regex-hybrid.php purely as documentation of where PCRE2 can and can't help.

Tests whether running the regex match as a pre-validator before the
AST-building parser is faster than the parser alone.

Result on the 69,576-query MySQL corpus, tracing JIT enabled:
  regex only (no AST):  0.752 s, 92,519 QPS
  parser only (AST):    1.136 s, 61,240 QPS
  regex + parser:       1.480 s, 47,008 QPS

The hybrid is *slower* than the parser alone because the regex is pure
overhead - 99.99% of corpus queries are valid SQL, so the parser still
has to run on each query to build the AST. The pre-check only pays off
when many inputs are invalid; that is not our workload.

Confirms the regex experiment is a recogniser, not a parser
replacement: PCRE2 in PHP cannot return a structured tree from a
recursive named-group match (last-match-wins semantics) and PHP does
not expose user PCRE callouts that could intercept the match to record
structural events. Useful as a fast 'does this query parse?' gate; not
useful in workloads that need the AST.
… AST)

Tested whether FFI to libpcre2-8 could supply a callout callback so a
match could record (rule, offset) tuples. It cannot:

- pcre2_set_callout_8 takes a function pointer.
- PHP FFI does not allow PHP closures to be cast to C function
  pointers; libffi closure support is intentionally not enabled in
  PHP's FFI build.

So pure-PHP code can call pcre2_compile_8 / pcre2_match_8 via FFI but
cannot supply a callout function. The (?C) callouts in the pattern
have no observable effect.

Documents the surveyed paths to building a PCRE2-driven AST in PHP,
all of which are blocked or worse than the existing parser:

  1. Stock preg_*: ovector is last-match-wins per numbered group, even
     with (?J) duplicate names (each (?<name>...) occurrence has its
     own slot but each slot only retains the last match). Recursive
     named groups expose nothing about intermediate matches. (*MARK)
     only retains the last mark. PHP exposes no callout callback.
  2. FFI to libpcre2: blocked as described above.
  3. Multi-pass extraction with preg_match_all on simpler flat
     patterns: re-implements parsing with regex per layer; not faster
     than the recursive-descent interpreter.
  4. preg_match validate + parser builds AST (exp-regex-hybrid.php):
     net loss because the parser still has to run on every valid
     query, and valid is the common case.
  5. Custom PHP extension wrapping pcre2_set_callout: significant C
     work, out of scope.

Conclusion: in stock PHP the regex match is a fast yes/no validator
(~92K QPS) and an upper bound on PHP-side parsing speed when an AST
is not required (~100K QPS). It cannot replace the AST-producing
parser the SQLite driver consumes.
@JanJakes
Member Author

Deep dive: PCRE2 / PHP options for AST extraction

Surveyed every PCRE2 feature exposed in PHP and a few that aren't:

What PHP's preg_* exposes

| Feature | Behavior with recursive named groups | Useful? |
|---------|--------------------------------------|---------|
| `PREG_OFFSET_CAPTURE` | One offset pair per numbered group, last-match-wins | partial |
| `PREG_UNMATCHED_AS_NULL` | `NULL` vs empty string for unmatched | partial |
| `(?<name>...)` | Last match per name | no |
| `(?J)` duplicate named groups | `[name]` returns the first non-empty; numbered slots `[N]` each store the last match of that occurrence in the static pattern, not all match instances during execution | no |
| `(*MARK:NAME)` | Only the last MARK along the successful path is in `$matches['MARK']` | no |
| `\K` | Resets match start; cannot record multiple positions | no |
| `(?C)` callouts | Compile cleanly but PHP exposes no callback registration; they are silently no-ops | blocked |
| `preg_match_all` | Multiple top-level matches; for a recursive whole-input pattern, only one outer match | no |
| `preg_replace_callback` | Callback fires per outer match, not on sub-pattern entries | no |

Verified all of the above with concrete tests; results pasted in exp-regex-* tools.

What's NOT exposed

PCRE2's C API has callouts (pcre2_set_callout) where a callback fires at every (?C) site with full pcre2_callout_block info: current_position, offset_vector, capture_top, capture_last, mark, callout_flags (incl. PCRE2_CALLOUT_BACKTRACK). That's exactly what's needed to record entry/exit events for every rule and reconstruct an AST.

PHP's php_pcre.c does not expose this API. There is no pcre_callout() user function, no class, no INI-driven hook, and no other documented mechanism in PHP 8.5.

Tried FFI as an escape hatch

PHP 7.4+ has FFI. In principle we can call pcre2_compile_8, pcre2_match_8, pcre2_set_callout_8 directly. But PHP FFI deliberately does not support binding PHP closures to C function pointers — libffi closure support is not enabled. So we cannot supply a callout function from PHP. Source pasted in tests/tools/exp-pcre-ffi.php.

That leaves a custom PHP extension wrapping the callout API as the only path to PCRE2-driven AST extraction. Out of scope here.

Multi-pass extraction is just a parser

The remaining alternative — match the full pattern to validate, then preg_match_all repeatedly with simpler flat patterns to extract structure — is structurally equivalent to writing a parser in PHP that uses preg_match per rule. Per-call overhead (~55 ns for cached patterns) is fine, but the algorithmic structure is the same as the existing recursive-descent interpreter, with no speedup expected.

Summary

In stock PHP the compiled regex is a fast yes/no validator only:

  • ~92K QPS on the full corpus, no AST.
  • ~100K QPS upper bound on parse-recognition speed for this grammar.
  • Regex + parser hybrid is slower than the parser alone (~47K QPS vs ~62K), because the regex is pure overhead on valid input and ~99.99% of input is valid.

The existing recursive-descent interpreter (~62K QPS with tracing JIT, ~34K without; both delivering an AST) remains the production path. Three exp- scripts in tests/tools/ document the regex experiments for future reference:

  • exp-regex-v3.php — match-only, the fast version
  • exp-regex-hybrid.php — regex pre-validate + parser
  • exp-pcre-ffi.php — explains why FFI can't supply callouts

@JanJakes
Member Author

Re-checked the SO technique — partial correction

The SO answer at https://stackoverflow.com/a/24279157 uses preg_match_all with a flat pattern + a lookbehind anchor to get every instance of a repeated sub-pattern. That technique does work and I should have tried it more thoroughly before declaring "no AST".

Here's what I verified:

What preg_match_all actually does for our case

  • `(?<=\d),\d+` (flat + lookbehind): returns each match as a separate element; works fine for flat repeating data.
  • `(?P<paren>\((?:[^()]+|(?P>paren))*\))` (recursive): returns only the outermost match at each scope; nested instances have to be found by recursing into each match's body.

So the recursive named-group + preg_match_all combination can walk a tree, level by level. Each call returns the outermost matches at the current scope; you recurse into each match's body to find inner ones.

Speed for the tree-walking approach (10k iterations on SELECT a, (SELECT b, (SELECT c FROM t3) FROM t2), d FROM t1):

  • 1 outer preg_match_all (recursive paren pattern): ~12.5M QPS
  • 1 anchored extraction (/FROM\s+\K\w+/ for tables): bundled together ~3.8M QPS

That's two orders of magnitude faster than the parser per call. The economics make sense for focused extraction (e.g., "find all table refs", "find all subquery boundaries").

Where this stops being a path to a full AST

The MySQL grammar has ~1,900 rules. To produce a full AST via this technique you'd need a regex per rule applied at every level of the tree. For our 17-token-avg corpus with depth ~10 levels:

  • Partial AST (top-level statement structure + a handful of rules like tableRef / columnRef / whereClause): a few preg_match_all calls per query — easily faster than the parser, useful for queries that only need certain shapes.
  • Full AST (every rule, every nesting): would require ~50+ regex calls per query for typical queries, ~3–5M total over the corpus. At ~55 ns/cached call that's ~200 ms minimum, but each call's actual regex work is the dominant cost — and many of those patterns must include the full grammar definitions to disambiguate context. Estimated worse than the existing parser.

Why the answer to "AST from a single match?" is still no

preg_match_all works because it runs the pattern multiple times under the hood, advancing the cursor between matches. Each match is a fresh pattern execution. So you do not get "all captures from one match" — you get "many top-level matches, each with last-match-wins captures inside".

For the recursive grammar where the same rule is matched in nested positions during a single execution, those nested instances are not exposed because PCRE2 only stores last-match-wins per group at the C level too. PHP would need PCRE2 callouts — and as documented in exp-pcre-ffi.php, PHP doesn't expose them and FFI can't bridge the gap because PHP closures aren't bindable to C function pointers.

Practical takeaway

The SO technique gives a real win for partial / flat / repeated extraction:

// Anchor + capture: every table that follows FROM
preg_match_all('/FROM\s+\K\w+/', $sql, $m);

// Recursive: every top-level subquery span (non-overlapping)
preg_match_all('/(?P<paren>\((?:[^()]+|(?P>paren))*\))/', $sql, $m, PREG_OFFSET_CAPTURE);

Both run at multi-million QPS. Useful as a query-shape introspection layer alongside the parser — e.g. for fast "list referenced tables" or "is there a subquery here" without touching the AST.

For producing the AST the SQLite driver actually consumes, the recursive-descent interpreter on this branch (~62K QPS with JIT) remains the right tool.

Updated mental model: PCRE in PHP gives you (1) a fast yes/no recogniser, (2) cheap preg_match_all-driven flat/partial extraction, but not (3) a one-shot full nested parse tree.

@JanJakes
Member Author

Closing in favor of #378.

@JanJakes JanJakes closed this Apr 29, 2026