Parser performance: 2-3x speedup via grammar preprocessing and interpreter optimisations #373

Closed

JanJakes wants to merge 20 commits into trunk from parser-performance

Conversation

@JanJakes
Member

@JanJakes JanJakes commented Apr 25, 2026

Summary

This is a draft PR to measure the CI runtime improvement from a series of performance optimisations to the recursive-descent parser. It is aimed primarily at lowering the time tests spend in PHPUnit, which is dominated by the 70-batch MySQL server parser test suite (~70k queries parsed per run).

End-to-end (lex+parse, 69,577 queries, local benchmark):

  • Trunk, no opcache: ~10,500 QPS (6.6 s)
  • Trunk, tracing JIT: ~23,700 QPS (2.9 s)
  • This branch, no opcache: ~25,100 QPS (2.8 s) — 2.4× faster
  • This branch, tracing JIT: ~43,500 QPS (1.6 s) — 1.8× faster end-to-end / 2.7× parse-only

CI runtime comparison

Run PHPUnit tests step duration on this branch vs the average of the last 3 trunk-merged feature PRs (#365 / #367 / #369):

| PHP | Trunk avg | This PR | Speedup |
|-----|-----------|---------|---------|
| 7.2 | 165 s | 97 s | 1.70× |
| 7.3 | 126 s | 73 s | 1.73× |
| 7.4 | 120 s | 57 s | 2.11× |
| 8.0 | 132 s | 68 s | 1.94× |
| 8.1 | 141 s | 75 s | 1.88× |
| 8.2 | 128 s | 70 s | 1.83× |
| 8.3 | 134 s | 74 s | 1.81× |
| 8.4 | 138 s | 72 s | 1.92× |
| 8.5 | 141 s | 79 s | 1.78× |

Total: 20.4 min → 11.1 min of cumulative phpunit-step time across the matrix per run (~9 min saved per CI run).

The 70-batch server-suite parser sub-suite dominates the step, and the per-PHPUnit-step speedup is consistent with the local end-to-end benchmark.

Sequence of commits

  1. f935d03 Inline terminal matching and defer node allocation
  2. 0e34f25 Per-branch FIRST sets to skip unreachable branches
  3. 6df347a Short-circuit nullable fallback + inline single-branch fragments
  4. 9eea849 Strip epsilon markers + cache grammar refs
  5. ddfe4b6 Return fragment results as children arrays
  6. 77756bf End-of-input sentinel token (drop range checks)
  7. daec1cb Embed branch sequences in selector
  8. 2609898 Compare selectStatement by rule id
  9. c4e7f8f Add bench-parser-split.php tooling
  10. cd0b609 Deduplicate selector entries (memory: 40 MB → 10 MB)
  11. e8b006f Cache branch sequences in deduped selector
  12. b5959e8 Grammar compilation experiment + bench tools (kept as documentation, not used at runtime)
  13. bf1b1fe Pack switch case labels (compiled output cleanup)
  14. fa2fd65 Single-candidate rule fast path
  15. 1e7e3cf Direct-return refactor for single-candidate path
  16. 9fcfb27 Mark WP_Parser_Node as final
  17. b9b64a6 bench-final.php helper

Tests: 667/667 phpunit tests pass with the same ~1.43M assertions as trunk.
Grammar memory: ~10 MB (from ~1.5 MB on trunk; cost of the new per-token branch selector + FIRST/NULLABLE precomputation).
PHPCS: clean.

Test plan

  • composer run test (mysql-on-sqlite) - green locally
  • composer run check-cs - green
  • CI matrix on PHP 7.2 → 8.5 - all green, ~1.85× faster
  • Confirm WordPress PHPUnit / E2E Tests pass

Notes

This PR also includes a grammar→PHP compiler experiment under tests/tools/compile-grammar.php. It was kept as documentation: it's faster than the interpreter without JIT (+14-19%) but slightly slower under tracing JIT, so the interpreter remains the production path.

Draft because the goal here is to compare CI runtimes against recent merges before deciding what to ship.

JanJakes added 17 commits April 24, 2026 15:28
Hot-path changes in WP_Parser::parse_recursive():

- Inline the terminal match in the branch loop instead of recursing into
  parse_recursive() for every token. Over the full MySQL test suite this
  eliminates ~1.6M function calls.
- Hoist grammar, rules, fragment_ids, rule_names, tokens, and token_count
  into local variables so the inner loops avoid repeated property lookups
  on $this->grammar.
- Cache the token count on the instance to avoid a count() per call.
- Build branch children in a local array and only instantiate the
  WP_Parser_Node once the branch has matched; on the MySQL corpus ~75% of
  speculative nodes were previously created and thrown away.
- Drop a dead is_array($subnode) check that never fires in practice
  (subnodes are false, true, tokens, or nodes - never arrays).
- Inline fragment inlining: read the fragment's children directly instead
  of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
  Before: ~11,500 QPS   After: ~14,900 QPS  (+29%)
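The deferred-allocation pattern from this commit can be sketched independently of the parser. A minimal Python sketch, where `match_symbol` is a hypothetical callback standing in for the real terminal/rule matching:

```python
def try_branch(symbols, match_symbol):
    # Build children in a plain local list and only allocate the node
    # once the whole branch has matched. Failed speculative branches
    # allocate nothing (the PR found ~75% of speculative nodes were
    # previously created and thrown away).
    children = []
    for sym in symbols:
        m = match_symbol(sym)
        if m is None:
            return None  # branch failed: no node was ever created
        children.append(m)
    return ("node", children)  # allocate only on success
```

The tuple stands in for a `WP_Parser_Node`; the point is purely that construction happens after the match, not before.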
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes
each rule's branches by the tokens that can start them. At parse time the
parser jumps straight to the candidate branches for the current token
instead of iterating every branch and letting most fail.

On the full MySQL test suite, 59% of branch attempts previously failed
because the first token could never match the branch's FIRST set; with
per-branch lookahead those attempts are eliminated.

End-to-end parser benchmark:
  Before: ~14,900 QPS   After: ~22,400 QPS  (+50%)
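The FIRST/NULLABLE fixpoint described above is a standard construction; a minimal language-neutral sketch in Python, over a hypothetical toy grammar rather than the PR's actual data structures (terminals are uppercase strings, non-terminals are rule names, `[]` is an epsilon branch):

```python
grammar = {
    "stmt": [["SELECT", "cols"], ["INSERT", "cols"]],
    "cols": [["IDENT", "more"]],
    "more": [[",", "IDENT", "more"], []],  # [] = epsilon branch
}

def compute_nullable(grammar):
    nullable = set()
    changed = True
    while changed:  # fixpoint: iterate until no rule flips to nullable
        changed = False
        for rule, branches in grammar.items():
            if rule in nullable:
                continue
            for branch in branches:
                if all(sym in nullable for sym in branch):
                    nullable.add(rule)
                    changed = True
                    break
    return nullable

def compute_first(grammar, nullable):
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:  # fixpoint: propagate FIRST sets until stable
        changed = False
        for rule, branches in grammar.items():
            for branch in branches:
                for sym in branch:
                    new = {sym} if sym not in grammar else first[sym]
                    if not new <= first[rule]:
                        first[rule] |= new
                        changed = True
                    if sym in grammar and sym in nullable:
                        continue  # symbol can vanish; next one also starts the branch
                    break
    return first
```

With per-branch FIRST sets derived the same way, the selector maps each (rule, token) pair straight to the branches that can possibly start with that token.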
Two grammar/parser refinements that both reduce recursive calls:

* In parse_recursive(): when the rule has a per-token branch selector but
  the current token is not in any branch's FIRST and the rule itself is
  nullable, return 'matched empty' immediately instead of descending into
  nullable branches that would recursively do the same thing. This alone
  eliminates ~460k recursive calls on the MySQL corpus.

* At grammar build time, expand every single-branch fragment rule into
  its call sites. Fragments exist only to factor shared sub-sequences and
  their children are already flattened into the parent AST node, so
  splicing them directly into parent branches is a no-op for the
  resulting tree but removes an entire recursive call per use. 480 of the
  grammar's fragments qualify.

Also drops the dead terminal branch at the top of parse_recursive() (the
branch loop inlines terminal matching, so parse_recursive is only ever
called with non-terminal rule ids) and the always-false empty-branches
guard.

End-to-end parser benchmark:
  Before: ~22,400 QPS   After: ~27,500 QPS  (+23%)
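The single-branch fragment expansion amounts to a build-time rewrite of the rules table. A Python sketch over a hypothetical `rules` mapping, assuming (as here) that the fragments are non-recursive:

```python
def inline_single_branch_fragments(rules, fragment_ids):
    # Splice each single-branch fragment's symbol sequence into its call
    # sites. Fragment children are flattened into the parent AST node
    # anyway, so the resulting tree is identical; only the recursive
    # call per use disappears.
    single = {fid: rules[fid][0] for fid in fragment_ids if len(rules[fid]) == 1}
    changed = True
    while changed:  # repeat so fragments that reference fragments fully collapse
        changed = False
        for branches in rules.values():
            for i, branch in enumerate(branches):
                if any(sym in single for sym in branch):
                    new_branch = []
                    for sym in branch:
                        new_branch.extend(single[sym] if sym in single else [sym])
                    branches[i] = new_branch
                    changed = True
    return rules
```

The outer `while` loop is what makes chained fragments (a fragment whose branch mentions another fragment) collapse completely.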
Three minor reductions in per-call work:

* Strip explicit EMPTY_RULE_ID symbols out of rule branches at grammar
  build time. The parser loop would have 'continue'd over them anyway, so
  removing them ahead of time lets the hot symbol loop drop the epsilon
  check. Pure-epsilon branches become empty branches and still match
  empty via the existing empty-children fast path.

* Cache the grammar's rules, fragment_ids, rule_names, branches_for_token,
  nullable_branches, and highest_terminal_id as direct parser instance
  fields so parse_recursive() no longer pays for a $this->grammar->...
  double hop on every call.

* Collapse the two-step node construction (new + set_children) into a
  single constructor call that takes the children array directly. This
  saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but
their runtime role is still trivial: match a sequence of symbols and have
the caller splice the resulting children into its own node. The old code
allocated a full WP_Parser_Node for each fragment match just to have the
caller immediately copy its children out.

Return the children array directly from fragments instead. The caller
distinguishes via is_array($subnode) and splices in-place, saving a
Parser_Node allocation per fragment match (~253k per 10k queries).

End-to-end parser benchmark:
  Before: ~27,000 QPS (avg)   After: ~28,700 QPS (+6%).
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of
the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only
token with id 0, is stripped by the lexer before tokens reach the
parser), so the sentinel cannot match any real terminal.

This lets the hot path drop the 'position < token_count' range check
everywhere it reads the current token id: the selector lookup at method
entry, the inline terminal match inside the branch loop, and the
post-branch INTO negative lookahead for selectStatement. Any read past
the last real token falls naturally into the nullable-fallback or
branch-miss handling.

Also drop a few dead locals ($token_count, $fragment_ids) that no
longer appear in the hot path after the change.

End-to-end parser benchmark:
  Before: ~28,700 QPS (avg)   After: ~29,800 QPS (+4%).
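The sentinel trick generalises beyond this parser. A minimal Python illustration with a hypothetical matcher, relying on the same invariant the commit states (no real token ever carries id 0):

```python
SENTINEL_ID = 0  # real tokens never carry id 0 once whitespace is stripped

def match_sequence(token_ids, expected):
    # Append one sentinel so any read past the last real token hits an
    # id that can never equal a real terminal: the match simply fails,
    # and no 'position < token_count' range check is needed in the loop.
    toks = token_ids + [SENTINEL_ID]
    pos = 0
    for want in expected:
        if toks[pos] != want:
            return False  # covers the "ran off the end" case via the sentinel
        pos += 1
    return True
```

One sentinel suffices because the first out-of-range read fails the match before a second read can happen.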
Previously the per-(rule, token) selector stored a list of branch
indexes that the parser then had to look up in $rules[$rule_id] on
every branch attempt. Store the branch symbol sequences themselves so
the hot loop can iterate candidate branches directly.

PHP arrays are copy-on-write, so sharing the same branch sequence
across selector entries for many tokens costs negligible extra memory.
The nullable_branches map shrinks to a bool marker since the parser
only uses it for existence checks.

Also cache the start rule id on the grammar so parse() skips its
array_search() across rule_names on every call.

End-to-end parser benchmark:
  Before: ~29,800 QPS (avg)   After: ~31,700 QPS (+6%).
Minor cleanup in parse_recursive(): cache the selectStatement rule id
once and compare integers on every call instead of re-comparing the
'selectStatement' string against every rule's name. Also drops the
$rules instance cache from the parser, which the hot path no longer
touches now that branch sequences are embedded in the selector.
bench-parser-split.php pre-tokenizes the MySQL test suite once and then
times only the parser across multiple runs, so parser-specific changes
can be measured without lexer noise. The script accepts --runs=N and
--limit=M for reproducible comparisons.

Also adopts phpcbf's trivial whitespace alignment fixes in the grammar
and parser source to keep 'composer run check-cs' clean.
The per-(rule, token) branch selector stored a separate inner array per
token, even when many tokens within the same rule mapped to identical
branch-id lists (a single branch's FIRST set covers many tokens, for
example). Loading the MySQL grammar used ~40 MB of PHP memory, most of
which was duplicated inner arrays.

Deduplicate by signature during grammar build so all tokens that land
on the same branch-id list share one inner array via copy-on-write. The
selector is now stored as branch indexes again (instead of the cached
symbol sequences from the previous commit) - the one extra
$branches[$idx] lookup per branch attempt costs < 1% at runtime but
allows the inner arrays to be tiny and to share aggressively.

Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB.
PHPUnit peak memory drops from 198 MB to 110 MB.
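The dedup-by-signature step can be sketched as follows (Python; in PHP the sharing happens via copy-on-write arrays, here the tokens literally share one list object):

```python
def dedupe_selector(selector):
    # selector: rule_id -> {token_id: [branch_index, ...]}.
    # Within each rule, all tokens that map to an identical branch-index
    # list end up sharing a single list object, keyed by its signature.
    for by_token in selector.values():
        pool = {}
        for token_id, branch_ids in by_token.items():
            sig = tuple(branch_ids)
            by_token[token_id] = pool.setdefault(sig, branch_ids)
    return selector
```

A single branch's FIRST set covers many tokens, so most inner lists within a rule are duplicates; after this pass they collapse to one shared instance per distinct list.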
Re-embed branch symbol sequences directly in the selector entries so the
hot loop can foreach over them without an extra $rules[$rule_id][$idx]
indirection per branch attempt. Per-rule dedup pairs tokens that land on
the same branch list to a single sequences array via copy-on-write, so
grammar memory stays at ~10 MB instead of the ~40 MB the naive form
needed.

Recovers the ~3% parse speedup lost in the memory-reduction commit while
keeping the lower footprint.

Parser benchmark:
  Best: ~32,400 QPS   Avg: ~31,300 QPS
(compared to ~30,700 avg in the indexes-only variant)
This commit preserves the exploration of 'full grammar compilation'
as a follow-up to the parser performance work. It is intentionally kept
separate from the main parser because the compilation trade-off is not
a clear win.

Tools added:
 - compile-grammar.php: walks the grammar and emits a self-contained
   class (WP_MySQL_Compiled_Parser) with one method per reachable rule.
   Single-branch fragments are inlined, single-use multi-branch
   fragments are kept as methods. Dispatch uses switch-on-tid for
   multi-group rules; a compact isset() lookup table for single-group
   rules; and a 'one-of-N-terminals' fast path for the many
   identifier-like rules with hundreds of single-terminal branches.
 - bench-compiled-parser.php: side-by-side run of interpreter vs
   compiled on the MySQL test corpus.
 - compare-asts.php: verifies the compiled parser produces the same
   AST as the interpreter on every query.
 - dump-inflated-grammar.php: dumps the post-inflation grammar data
   so the effect of skipping the FIRST/NULLABLE fixpoint at runtime
   could be measured.
 - bench-hot-rules.php: distribution of per-rule call counts for
   deciding which rules are worth specialising.

Empirical findings on the full MySQL test corpus (69,576 queries):

Interpreter (current parser on this branch):
 - no opcache:        ~32,500 QPS, ~12 MB grammar
 - opcache, no JIT:   ~35,100 QPS
 - opcache + JIT:     ~52,600 QPS (tracing)

Compiled parser:
 - no opcache:        ~38,300 QPS (+18% over interpreter)
 - opcache, no JIT:   ~41,800 QPS (+19%)
 - opcache + JIT:     ~49,700 QPS (slightly below interpreter)

Compiled file size:   ~2.6 MB (99k lines, 1,427 methods)
Compiled class load:  ~22 ms / ~17 MB RAM (vs 38 ms / 12 MB to inflate
                      the compressed grammar).

Why compilation helps without JIT:
 - eliminates the generic dispatch (branches_for_token, fragment_ids,
   nullable_branches isset checks) baked into every interpreter call;
 - resolves fragment vs non-fragment and nullable vs non-nullable at
   compile time so the emitted code has no runtime type checks;
 - collapses 'accepts any of N terminals' rules (251 of them) into an
   8-line isset() + consume instead of ~2.8k-line switches.

Why compilation does not help with tracing JIT:
 - the interpreter's hot loop is small and regular, which tracing JIT
   optimises aggressively (9-10 ns per recursive call);
 - the compiled parser generates a handful of very large methods (the
   biggest is ~2.8k lines of a 406-branch fragment) that tracing JIT
   struggles to optimise, and incurs a substantial JIT-compile penalty
   on first use (~0.7 s of the first-run time).

Conclusion: keep the interpreter as the primary parser. The compiler
is preserved here as documentation and as a fallback path for
environments without JIT.
Previously the compiler emitted each case label on its own line
(`\t\t\tcase 5:\n`), and case labels were 56% of all generated code.
Group multiple labels per line instead (up to 10) so the switch
dispatch is still readable but the file shrinks from ~99k lines
(~2.63 MB) to ~51k lines (~2.48 MB) with no behaviour change.

No runtime impact: verified 0 AST mismatches across the 69k-query
corpus and identical QPS to the previous output under all opcache/JIT
configurations.
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every
(rule, token) entry points to exactly one branch. Those rules account
for ~55% of parse_recursive calls on the test corpus (722k of 1.3M per
10k queries).

Flag those rules at grammar build time. In parse_recursive, detect the
flag and skip the outer 'foreach ($candidate_branches as ...)' by
taking $candidate_branches[0] directly. The branch-match body is
otherwise identical to the multi-candidate path.

End-to-end parser benchmark:
  no JIT:     ~31.6K -> ~32.6K QPS avg   (+3%)
  tracing JIT: ~52.6K -> ~55.7K QPS avg  (+6%)
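Flagging single-candidate rules at build time amounts to one pass over the selector. A Python sketch over a hypothetical selector shape (rule id to token-to-branch-list map):

```python
def flag_single_candidate(selector):
    # A rule qualifies when every (rule, token) selector entry points at
    # exactly one branch; the parser can then take candidates[0] directly
    # and skip the outer branch loop.
    return {rule_id
            for rule_id, by_token in selector.items()
            if by_token and all(len(b) == 1 for b in by_token.values())}
```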
Replace the $branch_matches flag + break+reset sequence with direct
'$this->position = $position; return false;' exits on each failure
path. Removes one local variable and a pair of conditional branches
from the hot inner loop. Minor but measurable improvement; the code is
also simpler.
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache
and tracing JIT specialize property access and method dispatch since
the class layout is now fixed. Small but consistent improvement
measured across multiple runs under tracing JIT (~+2% avg, ~+2% best).

End-to-end parser benchmark:
  tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best
  no JIT:      ~33K -> ~34K QPS avg, 35K QPS best
Reports best/median/average QPS over N runs with the currently-loaded
PHP interpreter configuration. Used to measure the effect of the
interpreter changes on top of opcache and tracing JIT configurations.
@JanJakes
Member Author

CI runtime comparison — full matrix

Compared this PR (parser-performance) against the average of three recent merged feature PRs (#365, #367, #369). All values are full-job duration (includes SQLite build + composer + tests).

| Workflow | Parser-perf | Trunk avg | Δ |
|----------|-------------|-----------|---|
| PHP 7.2 / SQLite 3.27.0 | 132 s | 201 s | −34% |
| PHP 7.3 / SQLite 3.31.1 | 116 s | 172 s | −32% |
| PHP 7.4 / SQLite 3.34.1 | 94 s | 158 s | −40% |
| PHP 8.0 / SQLite 3.37.0 | 111 s | 169 s | −34% |
| PHP 8.1 / SQLite 3.40.1 | 116 s | 183 s | −36% |
| PHP 8.2 / SQLite 3.45.1 | 115 s | 176 s | −35% |
| PHP 8.3 / SQLite 3.46.1 | 114 s | 175 s | −35% |
| PHP 8.4 / SQLite 3.51.2 | 103 s | 175 s | −41% |
| PHP 8.5 / SQLite latest | 119 s | 183 s | −35% |
| PHPUnit Tests (sum) | 1020 s | 1592 s | −36%, ~9.5 min saved |
| WordPress PHPUnit Tests | 587 s | 713 s | −18% |
| MySQL Proxy Tests | 18 s | 20 s | −10% |
| End-to-end Tests | 340 s | 315 s | +8% (noise) |
| WordPress End-to-end Tests | 441 s | 360 s | +22% (noise) |

The PHPUnit numbers above include ~30-40 s of SQLite build and ~10 s of composer install which are unchanged. Looking at just the Run PHPUnit tests step:

| PHP | Trunk avg | This PR | Speedup |
|-----|-----------|---------|---------|
| 7.2 | 165 s | 97 s | 1.70× |
| 7.3 | 126 s | 73 s | 1.73× |
| 7.4 | 120 s | 57 s | 2.11× |
| 8.0 | 132 s | 68 s | 1.94× |
| 8.1 | 141 s | 75 s | 1.88× |
| 8.2 | 128 s | 70 s | 1.83× |
| 8.3 | 134 s | 74 s | 1.81× |
| 8.4 | 138 s | 72 s | 1.92× |
| 8.5 | 141 s | 79 s | 1.78× |

~1.85× faster phpunit across all PHP versions on real CI, matching the local end-to-end benchmark.

E2E variance is browser/wp-env startup dominated and within normal run-to-run noise; not parser-related.

All tests pass. Total saved per CI run on PHPUnit + WordPress PHPUnit alone: ~11 minutes.

Experiment: compile the grammar to a single PCRE2 pattern using:
- each token id encoded as a Unicode codepoint at offset 0x4000
- each rule emitted as (?<rN>...) named subroutine
- (*THEN) on each branch's first symbol of *single-candidate* rules
  (where sibling-branch FIRST sets are disjoint, so committing is safe)
- aggressive transitive inlining of single-use non-recursive rules to
  shrink the bytecode below PCRE2's compiled-pattern size limit

Result on the 69,576-query MySQL test corpus (PCRE2 JIT enabled):
- Pattern: ~76 KB source, 1127 named subroutines after 789 rules inlined
- Match throughput: ~97,600 QPS, vs the optimised interpreter's ~62k.
- 99.82% accuracy: ~120 spurious failures, mostly the 'SELECT ... INTO'
  ambiguity that the interpreter handles via a runtime negative
  lookahead the regex doesn't model.

Trade-offs:
1. Match-only - the regex doesn't build an AST, so it's not a drop-in
   replacement for the recursive-descent parser the SQLite driver
   needs.
2. Without (*THEN) the matcher backtracks catastrophically on nested
   compound statements (CREATE TRIGGER ... BEGIN ... IF ...).
3. With (*THEN) on every branch (not just single-candidate) the regex
   gives spurious failures because PCRE commits to the first
   first-symbol match and can't try a sibling alternative.
4. Pattern size is constrained by PCRE2's default LINK_SIZE=2 bytecode
   limit; aggressive rule inlining is needed to fit a non-trivial
   grammar.

Kept as documentation: an interesting upper bound on PHP-side parsing
speed when the AST shape is not required.
@JanJakes
Member Author

Bonus experiment: compile the grammar to a single PCRE2 regex

Tried the ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ) pattern from PCRE2's "Backtracking control verbs" section. Findings:

Approach

  • Each token id encoded as a Unicode codepoint (token_id + 0x4000), so the lex output becomes a UTF-8 string the regex matches against.
  • Each grammar rule emitted as (?<rN>...) named subroutine.
  • Each branch's first symbol followed by (*THEN) so PCRE commits to the alternative once that token matches.
  • Aggressive transitive inlining of single-use, non-recursive rules to fit under PCRE2's compiled-pattern size limit (LINK_SIZE=2 default).
  • (*THEN) only emitted on rules whose sibling-branch FIRST sets are disjoint (the same single_candidate_rules set the interpreter uses), so committing is correctness-safe.
  • Mirrors the parser's SELECT ... INTO ... negative lookahead via (?!INTO_TOKEN) after selectStatement.
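The codepoint encoding itself is easy to sketch. A Python illustration using the stdlib `re` module just to show the encoding round-trips; the real experiment depends on PCRE2 features (`(?P>name)` recursion, `(*THEN)`) that `re` does not have:

```python
import re

TOKEN_BASE = 0x4000  # offset keeps token codepoints clear of ordinary text

def encode_tokens(token_ids):
    # One Unicode codepoint per token id: a lexed query becomes a short
    # string that a compiled grammar regex can match directly.
    return "".join(chr(TOKEN_BASE + t) for t in token_ids)

def terminal_atom(token_id):
    # Regex atom matching exactly one encoded terminal.
    return re.escape(chr(TOKEN_BASE + token_id))
```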

Results on the 69,576-query MySQL test corpus

  • Pattern: 76 KB source / 1,127 named subroutines (789 of the original 1,916 rules inlined transitively).
  • PCRE2 compile time: ~22 ms.
  • Match throughput with PCRE2 JIT: ~101,400 QPS
  • vs the optimised interpreter on this branch: ~62,500 QPS (best with tracing JIT)
  • vs the original trunk parser: ~23,700 QPS
  • ~62% faster than the optimised interpreter, ~4.3× the trunk parser

Caveats

  1. Match-only: the regex returns true/false for "is this a valid query?". It does not produce an AST, so it's not a drop-in replacement for the parser — the SQLite driver consumes the AST.
  2. ~0.15% spurious failures (102 out of 69,576): the grammar has ambiguities the regex doesn't model perfectly, mostly around SELECT ... INTO ... FOR UPDATE, weird identifiers, and nested versioned comments.
  3. Without (*THEN) it is catastrophically slow: a few CREATE TRIGGER queries with nested BEGIN ... IF ... END blocks individually took 3–4 seconds before the verb was added. With it, the worst-case queries drop to ~8 ms.
  4. PCRE2's bytecode size limit is the binding constraint. The full grammar without inlining is ~1.2 MB of regex source and PCRE2 refuses to compile it. Aggressive single-use rule inlining was needed to bring it down to a compilable form.

Conclusion

PCRE2 + (*THEN) can recognise this MySQL grammar at ~100 K QPS, ~60% faster than our pure-PHP interpreter. But it's a recogniser, not a parser — useful as a fast "does this parse?" gate, not as a replacement for the AST-producing parser the driver needs.

Committed in tests/tools/exp-regex-v3.php for documentation.

@JanJakes
Member Author

Follow-up: can the regex return / reconstruct an AST?

Short answer: no, and even using it as a pre-validator costs more than it saves.

What PCRE2 in PHP exposes

  • preg_match returns binary match / no-match, plus offsets and the LAST match per named capture group. A recursive (?<name>...) group, like our rules, only retains the outermost or final instance — useless for an AST.
  • (?J) (duplicate names) doesn't change this in PHP's preg API.
  • (?C) callouts compile cleanly but PHP has no pcre_callout callback exposed, so we can't intercept matching to record structural events.
  • (*MARK:NAME) only retains the last MARK along the successful path.

So we can't structurally extract a parse tree from a single PCRE2 match in PHP without writing an extension or using FFI.

Hybrid: regex pre-validate, then parser builds AST

Tested on the full 69,576-query MySQL test corpus, tracing JIT enabled:

| Approach | Time | QPS | Notes |
|----------|------|-----|-------|
| regex only (no AST) | 0.752 s | 92,519 | 102 spurious failures |
| parser only (AST) | 1.136 s | 61,240 | 9 baseline failures |
| regex + parser | 1.480 s | 47,008 | regex pre-validates; parser builds AST on success |

The hybrid is slower than the parser alone — the regex pre-check is pure overhead because the parser still has to run on every valid query (99.99% of the corpus), and valid is the common case. Hybrid only pays off when many inputs are invalid, where the regex can reject them faster than the parser; that's not our workload.
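The break-even point is simple arithmetic: the hybrid always pays the regex cost and pays the parser cost only on valid input, so it wins only when the invalid fraction exceeds t_regex / t_parser. A Python check using the measured per-query times from the table above:

```python
def hybrid_wins(p_invalid, t_regex, t_parser):
    # Expected per-query cost of the hybrid vs. parser-only:
    #   hybrid: t_regex + (1 - p_invalid) * t_parser
    #   parser: t_parser
    # so the hybrid wins iff p_invalid > t_regex / t_parser.
    return t_regex + (1 - p_invalid) * t_parser < t_parser

t_regex, t_parser = 1 / 92_519, 1 / 61_240  # seconds/query, from the table above
break_even = t_regex / t_parser  # ≈ 0.66: roughly two thirds of input must be invalid
```

For this corpus (~99.99% valid) the hybrid can never pay off, which matches the measurement.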

What it's actually good for

  • A fast "does this query parse?" gate: ~92K QPS with no AST, useful in defensive contexts where you want to reject malformed input cheaply before doing serious work.
  • An upper bound on PHP-side parsing speed for this grammar (~100K QPS) when AST shape is not required.
  • As-is, not a replacement for the production parser, because the SQLite driver consumes the AST.

Bottom line for the parser-performance PR

The optimised interpreter on this branch (~62K QPS with JIT, ~24K without) is the production path because it's the only approach that returns an AST. The regex experiment is committed in tests/tools/exp-regex-v3.php and the hybrid in tests/tools/exp-regex-hybrid.php purely as documentation of where PCRE2 can and can't help.

Tests whether running the regex match as a pre-validator before the
AST-building parser is faster than the parser alone.

Result on the 69,576-query MySQL corpus, tracing JIT enabled:
  regex only (no AST):  0.752 s, 92,519 QPS
  parser only (AST):    1.136 s, 61,240 QPS
  regex + parser:       1.480 s, 47,008 QPS

The hybrid is *slower* than the parser alone because the regex is pure
overhead - 99.99% of corpus queries are valid SQL, so the parser still
has to run on each query to build the AST. The pre-check only pays off
when many inputs are invalid; that is not our workload.

Confirms the regex experiment is a recogniser, not a parser
replacement: PCRE2 in PHP cannot return a structured tree from a
recursive named-group match (last-match-wins semantics) and PHP does
not expose user PCRE callouts that could intercept the match to record
structural events. Useful as a fast 'does this query parse?' gate; not
useful in workloads that need the AST.
… AST)

Tested whether FFI to libpcre2-8 could supply a callout callback so a
match could record (rule, offset) tuples. It cannot:

- pcre2_set_callout_8 takes a function pointer.
- PHP FFI does not allow PHP closures to be cast to C function
  pointers; libffi closure support is intentionally not enabled in
  PHP's FFI build.

So pure-PHP code can call pcre2_compile_8 / pcre2_match_8 via FFI but
cannot supply a callout function. The (?C) callouts in the pattern
have no observable effect.

Documents the surveyed paths to building a PCRE2-driven AST in PHP,
all of which are blocked or worse than the existing parser:

  1. Stock preg_*: ovector is last-match-wins per numbered group, even
     with (?J) duplicate names (each (?<name>...) occurrence has its
     own slot but each slot only retains the last match). Recursive
     named groups expose nothing about intermediate matches. (*MARK)
     only retains the last mark. PHP exposes no callout callback.
  2. FFI to libpcre2: blocked as described above.
  3. Multi-pass extraction with preg_match_all on simpler flat
     patterns: re-implements parsing with regex per layer; not faster
     than the recursive-descent interpreter.
  4. preg_match validate + parser builds AST (exp-regex-hybrid.php):
     net loss because the parser still has to run on every valid
     query, and valid is the common case.
  5. Custom PHP extension wrapping pcre2_set_callout: significant C
     work, out of scope.

Conclusion: in stock PHP the regex match is a fast yes/no validator
(~92K QPS) and an upper bound on PHP-side parsing speed when an AST
is not required (~100K QPS). It cannot replace the AST-producing
parser the SQLite driver consumes.
@JanJakes
Member Author

Deep dive: PCRE2 / PHP options for AST extraction

Surveyed every PCRE2 feature exposed in PHP and a few that aren't:

What PHP's preg_* exposes

| Feature | Behavior with recursive named groups | Useful? |
|---------|--------------------------------------|---------|
| `PREG_OFFSET_CAPTURE` | One offset pair per numbered group, last-match-wins | partial |
| `PREG_UNMATCHED_AS_NULL` | `NULL` vs empty string for unmatched | partial |
| `(?<name>...)` | Last match per name | no |
| `(?J)` duplicate named groups | `[name]` returns the first non-empty; numbered slots `[N]` each store the last match of that occurrence in the static pattern, not all match instances during execution | no |
| `(*MARK:NAME)` | Only the last MARK along the successful path is in `$matches['MARK']` | no |
| `\K` | Resets match start; cannot record multiple positions | no |
| `(?C)` callouts | Compile cleanly but PHP exposes no callback registration; they are silently no-ops | blocked |
| `preg_match_all` | Multiple top-level matches; for a recursive whole-input pattern, only one outer match | no |
| `preg_replace_callback` | Callback fires per outer match, not on sub-pattern entries | no |

Verified all of the above with concrete tests; results pasted in exp-regex-* tools.

What's NOT exposed

PCRE2's C API has callouts (pcre2_set_callout) where a callback fires at every (?C) site with full pcre2_callout_block info: current_position, offset_vector, capture_top, capture_last, mark, callout_flags (incl. PCRE2_CALLOUT_BACKTRACK). That's exactly what's needed to record entry/exit events for every rule and reconstruct an AST.

PHP's php_pcre.c does not expose this API. There is no pcre_callout() user function, no class, no INI-driven hook, and no other documented mechanism in PHP 8.5.

Tried FFI as an escape hatch

PHP 7.4+ has FFI. In principle we can call pcre2_compile_8, pcre2_match_8, pcre2_set_callout_8 directly. But PHP FFI deliberately does not support binding PHP closures to C function pointers — libffi closure support is not enabled. So we cannot supply a callout function from PHP. Source pasted in tests/tools/exp-pcre-ffi.php.

That leaves a custom PHP extension wrapping the callout API as the only path to PCRE2-driven AST extraction. Out of scope here.

Multi-pass extraction is just a parser

The remaining alternative — match the full pattern to validate, then preg_match_all repeatedly with simpler flat patterns to extract structure — is structurally equivalent to writing a parser in PHP that uses preg_match per rule. Per-call overhead (~55 ns for cached patterns) is fine, but the algorithmic structure is the same as the existing recursive-descent interpreter, with no speedup expected.

Summary

In stock PHP the compiled regex is a fast yes/no validator only:

  • ~92K QPS on the full corpus, no AST.
  • ~100K QPS upper bound on parse-recognition speed for this grammar.
  • Regex + parser hybrid is slower than the parser alone (~47K QPS vs ~62K), because the regex is pure overhead on valid input and ~99.99% of input is valid.

The existing recursive-descent interpreter (~62K QPS with tracing JIT, ~34K without; both delivering an AST) remains the production path. Three exp- scripts in tests/tools/ document the regex experiments for future reference:

  • exp-regex-v3.php — match-only, the fast version
  • exp-regex-hybrid.php — regex pre-validate + parser
  • exp-pcre-ffi.php — explains why FFI can't supply callouts

@JanJakes
Member Author

Re-checked the SO technique — partial correction

The SO answer at https://stackoverflow.com/a/24279157 uses preg_match_all with a flat pattern + a lookbehind anchor to get every instance of a repeated sub-pattern. That technique does work and I should have tried it more thoroughly before declaring "no AST".

Here's what I verified:

What preg_match_all actually does for our case

  • `(?<=\d),\d+` (flat + lookbehind): returns each match as a separate element; works fine for flat repeating data.
  • `(?P<paren>\((?:[^()]+|(?P>paren))*\))` (recursive): returns only the outermost match at each scope; nested instances have to be found by recursing into each match's body.

So the recursive named-group + preg_match_all combination can walk a tree, level by level. Each call returns the outermost matches at the current scope; you recurse into each match's body to find inner ones.

Speed for the tree-walking approach (10k iterations on SELECT a, (SELECT b, (SELECT c FROM t3) FROM t2), d FROM t1):

  • 1 outer preg_match_all (recursive paren pattern): ~12.5M QPS
  • 1 anchored extraction (/FROM\s+\K\w+/ for tables): bundled together ~3.8M QPS

That's two orders of magnitude faster than the parser per call. The economics make sense for focused extraction (e.g., "find all table refs", "find all subquery boundaries").

Where this stops being a path to a full AST

The MySQL grammar has ~1,900 rules. To produce a full AST via this technique you'd need a regex per rule applied at every level of the tree. For our 17-token-avg corpus with depth ~10 levels:

  • Partial AST (top-level statement structure + a handful of rules like tableRef / columnRef / whereClause): a few preg_match_all calls per query — easily faster than the parser, useful for queries that only need certain shapes.
  • Full AST (every rule, every nesting): would require ~50+ regex calls per query for typical queries, ~3–5M total over the corpus. At ~55 ns/cached call that's ~200 ms minimum, but each call's actual regex work is the dominant cost — and many of those patterns must include the full grammar definitions to disambiguate context. Estimated worse than the existing parser.

Why the answer to "AST from a single match?" is still no

preg_match_all works because it runs the pattern multiple times under the hood, advancing the cursor between matches. Each match is a fresh pattern execution. So you do not get "all captures from one match" — you get "many top-level matches, each with last-match-wins captures inside".

For the recursive grammar where the same rule is matched in nested positions during a single execution, those nested instances are not exposed because PCRE2 only stores last-match-wins per group at the C level too. PHP would need PCRE2 callouts — and as documented in exp-pcre-ffi.php, PHP doesn't expose them and FFI can't bridge the gap because PHP closures aren't bindable to C function pointers.

Practical takeaway

The SO technique gives a real win for partial / flat / repeated extraction:

// Anchor + capture: every table that follows FROM
preg_match_all('/FROM\s+\K\w+/', $sql, $m);

// Recursive: every top-level subquery span (non-overlapping)
preg_match_all('/(?P<paren>\((?:[^()]+|(?P>paren))*\))/', $sql, $m, PREG_OFFSET_CAPTURE);

Both run at multi-million QPS. Useful as a query-shape introspection layer alongside the parser — e.g. for fast "list referenced tables" or "is there a subquery here" without touching the AST.

For producing the AST the SQLite driver actually consumes, the recursive-descent interpreter on this branch (~62K QPS with JIT) remains the right tool.

Updated mental model: PCRE in PHP gives you (1) a fast yes/no recogniser, (2) cheap preg_match_all-driven flat/partial extraction, but not (3) a one-shot full nested parse tree.

@JanJakes
Member Author

Closing in favor of #378.

@JanJakes JanJakes closed this Apr 29, 2026