Parser performance: 2-3x speedup via grammar preprocessing and interpreter optimisations#373
Hot-path changes in WP_Parser::parse_recursive():

- Inline the terminal match in the branch loop instead of recursing into parse_recursive() for every token. Over the full MySQL test suite this eliminates ~1.6M function calls.
- Hoist grammar, rules, fragment_ids, rule_names, tokens, and token_count into local variables so the inner loops avoid repeated property lookups on $this->grammar.
- Cache the token count on the instance to avoid a count() per call.
- Build branch children in a local array and only instantiate the WP_Parser_Node once the branch has matched; on the MySQL corpus, ~75% of speculative nodes were previously created and thrown away.
- Drop a dead is_array($subnode) check that never fires in practice (subnodes are false, true, tokens, or nodes, never arrays).
- Inline fragment handling: read the fragment's children directly instead of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
Before: ~11,500 QPS
After: ~14,900 QPS (+29%)
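The "defer node allocation" idea can be sketched as follows. This is a minimal toy, not the actual WP_Parser internals: `Toy_Parser_Node` and `match_branch` are hypothetical names, and tokens are reduced to bare ids.

```php
<?php
// Toy stand-in for WP_Parser_Node (hypothetical; real node carries more state).
final class Toy_Parser_Node {
	public $rule_id;
	public $children;
	public function __construct( $rule_id, array $children ) {
		$this->rule_id  = $rule_id;
		$this->children = $children;
	}
}

// Speculatively match one branch: collect children in a plain local array
// and only pay for the object allocation once the whole branch has matched.
function match_branch( $rule_id, array $branch, array $token_ids, &$position ) {
	$start    = $position;
	$children = array(); // local buffer, no node object yet
	foreach ( $branch as $expected_id ) {
		if ( ! isset( $token_ids[ $position ] ) || $token_ids[ $position ] !== $expected_id ) {
			$position = $start; // backtrack; nothing was allocated for this failed attempt
			return null;
		}
		$children[] = $token_ids[ $position ];
		$position  += 1;
	}
	// The branch is known to match - allocate exactly one node.
	return new Toy_Parser_Node( $rule_id, $children );
}
```

On a corpus where ~75% of speculative branch attempts fail, every failed attempt above costs only array writes, never a constructor call.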
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes each rule's branches by the tokens that can start them. At parse time the parser jumps straight to the candidate branches for the current token instead of iterating every branch and letting most fail. On the full MySQL test suite, 59% of branch attempts previously failed because the first token could never match the branch's FIRST set; with per-branch lookahead those attempts are eliminated.

End-to-end parser benchmark:
Before: ~14,900 QPS
After: ~22,400 QPS (+50%)
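The FIRST/NULLABLE fixpoint can be sketched on a toy grammar encoding. Everything here is hypothetical (ids below 100 are terminals, ids at or above 100 are rules, an empty branch is epsilon); the real WP_MySQL_Grammar encoding differs, but the iterate-until-stable structure is the same.

```php
<?php
// Iterate over all rules until neither FIRST nor NULLABLE changes.
function compute_first_and_nullable( array $rules ) {
	$nullable = array();
	$first    = array();
	foreach ( $rules as $id => $_branches ) {
		$first[ $id ] = array();
	}
	do {
		$changed = false;
		foreach ( $rules as $id => $branches ) {
			foreach ( $branches as $branch ) {
				$all_nullable = true;
				foreach ( $branch as $symbol ) {
					if ( $symbol < 100 ) { // terminal: enters FIRST, blocks epsilon
						if ( ! isset( $first[ $id ][ $symbol ] ) ) {
							$first[ $id ][ $symbol ] = true;
							$changed                 = true;
						}
						$all_nullable = false;
						break;
					}
					// Non-terminal: union its FIRST set into ours.
					foreach ( $first[ $symbol ] as $t => $_true ) {
						if ( ! isset( $first[ $id ][ $t ] ) ) {
							$first[ $id ][ $t ] = true;
							$changed            = true;
						}
					}
					if ( ! isset( $nullable[ $symbol ] ) ) {
						$all_nullable = false;
						break; // a non-nullable symbol ends the FIRST walk
					}
				}
				if ( $all_nullable && ! isset( $nullable[ $id ] ) ) {
					$nullable[ $id ] = true;
					$changed         = true;
				}
			}
		}
	} while ( $changed );
	return array( $first, $nullable );
}
```

Indexing each rule's branches by their FIRST tokens then becomes a straight loop over this precomputed data.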
Two grammar/parser refinements that both reduce recursive calls:

- In parse_recursive(): when the rule has a per-token branch selector but the current token is not in any branch's FIRST set and the rule itself is nullable, return 'matched empty' immediately instead of descending into nullable branches that would recursively do the same thing. This alone eliminates ~460k recursive calls on the MySQL corpus.
- At grammar build time, expand every single-branch fragment rule into its call sites. Fragments exist only to factor shared sub-sequences and their children are already flattened into the parent AST node, so splicing them directly into parent branches is a no-op for the resulting tree but removes an entire recursive call per use. 480 of the grammar's fragments qualify.

Also drops the dead terminal branch at the top of parse_recursive() (the branch loop inlines terminal matching, so parse_recursive() is only ever called with non-terminal rule ids) and the always-false empty-branches guard.

End-to-end parser benchmark:
Before: ~22,400 QPS
After: ~27,500 QPS (+23%)
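Build-time fragment expansion can be sketched like this. The representation is the same toy encoding as above (branches are symbol-id arrays), the helper name is hypothetical, and the sketch assumes fragment definitions are acyclic, which holds for factoring-only fragments.

```php
<?php
// Splice every single-branch fragment's symbol sequence directly into the
// branches that reference it, repeating until chained fragments are gone.
function inline_single_branch_fragments( array $rules, array $fragment_ids ) {
	$inline = array();
	foreach ( $fragment_ids as $fid => $_true ) {
		if ( count( $rules[ $fid ] ) === 1 ) {
			$inline[ $fid ] = $rules[ $fid ][0]; // the fragment's only branch
		}
	}
	do {
		$changed = false;
		foreach ( $rules as $rule_id => $branches ) {
			foreach ( $branches as $b => $branch ) {
				$expanded = array();
				foreach ( $branch as $symbol ) {
					if ( isset( $inline[ $symbol ] ) ) {
						// Replace the fragment reference with its body in place.
						foreach ( $inline[ $symbol ] as $inner ) {
							$expanded[] = $inner;
						}
						$changed = true;
					} else {
						$expanded[] = $symbol;
					}
				}
				$rules[ $rule_id ][ $b ] = $expanded;
			}
		}
	} while ( $changed );
	return $rules;
}
```

Because fragment children are flattened into the parent node anyway, the resulting AST is identical; only the recursive call per fragment use disappears.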
Three minor reductions in per-call work:

- Strip explicit EMPTY_RULE_ID symbols out of rule branches at grammar build time. The parser loop would have 'continue'd over them anyway, so removing them ahead of time lets the hot symbol loop drop the epsilon check. Pure-epsilon branches become empty branches and still match empty via the existing empty-children fast path.
- Cache the grammar's rules, fragment_ids, rule_names, branches_for_token, nullable_branches, and highest_terminal_id as direct parser instance fields so parse_recursive() no longer pays for a $this->grammar->... double hop on every call.
- Collapse the two-step node construction (new + set_children) into a single constructor call that takes the children array directly. This saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but their runtime role is still trivial: match a sequence of symbols and have the caller splice the resulting children into its own node. The old code allocated a full WP_Parser_Node for each fragment match just to have the caller immediately copy its children out. Return the children array directly from fragments instead. The caller distinguishes via is_array($subnode) and splices in-place, saving a Parser_Node allocation per fragment match (~253k per 10k queries).

End-to-end parser benchmark:
Before: ~27,000 QPS (avg)
After: ~28,700 QPS (+6%)
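The caller-side splice can be sketched as follows; `splice_subnode` is a hypothetical helper, and the "fragment result is a plain array, rule result is an object" convention is the one described above.

```php
<?php
// Merge one subnode result into a parent's children list. A plain array
// signals a fragment match: its children are spliced in directly, so no
// throwaway node object was ever allocated for the fragment.
function splice_subnode( array $children, $subnode ) {
	if ( is_array( $subnode ) ) {
		foreach ( $subnode as $child ) {
			$children[] = $child;
		}
	} else {
		$children[] = $subnode; // real rule match: keep the node itself
	}
	return $children;
}
```

The is_array() check is cheap relative to the object allocation it replaces, and it only runs on the subnode kinds that can actually be arrays.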
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only token with id 0, is stripped by the lexer before tokens reach the parser), so the sentinel cannot match any real terminal. This lets the hot path drop the 'position < token_count' range check everywhere it reads the current token id: the selector lookup at method entry, the inline terminal match inside the branch loop, and the post-branch INTO negative lookahead for selectStatement. Any read past the last real token falls naturally into the nullable-fallback or branch-miss handling. Also drop a few dead locals ($token_count, $fragment_ids) that no longer appear in the hot path after the change.

End-to-end parser benchmark:
Before: ~28,700 QPS (avg)
After: ~29,800 QPS (+4%)
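The sentinel trick in isolation, as a minimal sketch (the function and the reduction of tokens to bare ids are hypothetical; the constant value matches the description above):

```php
<?php
const EMPTY_RULE_ID = 0; // never a real token id: WHITESPACE (id 0) is stripped by the lexer

// Match a fixed terminal sequence with no bounds check in the loop:
// any read past the last real token sees the sentinel, which equals no
// real terminal id, so the mismatch path handles end-of-input naturally.
function match_terminals( array $token_ids, array $expected ) {
	$token_ids[] = EMPTY_RULE_ID; // append the sentinel once, up front
	$position    = 0;
	foreach ( $expected as $id ) {
		// No 'position < token_count' guard needed here.
		if ( $token_ids[ $position ] !== $id ) {
			return false;
		}
		$position += 1;
	}
	return true;
}
```

One append per parse buys the removal of a comparison from every token read on the hot path.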
Previously the per-(rule, token) selector stored a list of branch indexes that the parser then had to look up in $rules[$rule_id] on every branch attempt. Store the branch symbol sequences themselves so the hot loop can iterate candidate branches directly. PHP arrays are copy-on-write, so sharing the same branch sequence across selector entries for many tokens costs negligible extra memory. The nullable_branches map shrinks to a bool marker since the parser only uses it for existence checks. Also cache the start rule id on the grammar so parse() skips its array_search() across rule_names on every call.

End-to-end parser benchmark:
Before: ~29,800 QPS (avg)
After: ~31,700 QPS (+6%)
Minor cleanup in parse_recursive(): cache the selectStatement rule id once and compare integers on every call instead of re-comparing the 'selectStatement' string against every rule's name. Also drops the $rules instance cache from the parser, which the hot path no longer touches now that branch sequences are embedded in the selector.
bench-parser-split.php pre-tokenizes the MySQL test suite once and then times only the parser across multiple runs, so parser-specific changes can be measured without lexer noise. The script accepts --runs=N and --limit=M for reproducible comparisons. Also adopts phpcbf's trivial whitespace alignment fixes in the grammar and parser source to keep 'composer run check-cs' clean.
The per-(rule, token) branch selector stored a separate inner array per token, even when many tokens within the same rule mapped to identical branch-id lists (a single branch's FIRST set covers many tokens, for example). Loading the MySQL grammar used ~40 MB of PHP memory, most of which was duplicated inner arrays. Deduplicate by signature during grammar build so all tokens that land on the same branch-id list share one inner array via copy-on-write. The selector is now stored as branch indexes again (instead of the cached symbol sequences from the previous commit) - the one extra $branches[$idx] lookup per branch attempt costs < 1% at runtime but allows the inner arrays to be tiny and to share aggressively. Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB. PHPUnit peak memory drops from 198 MB to 110 MB.
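The signature-based deduplication can be sketched like this (hypothetical helper; the real build step works over the full grammar, this shows only the sharing mechanism):

```php
<?php
// Deduplicate per-token candidate-branch lists within one rule's selector.
// Tokens whose lists are identical end up referencing one shared array;
// PHP's copy-on-write means the shared list is stored only once until
// (never, in the read-only parser) something writes to a copy.
function dedupe_selector( array $selector_per_token ) {
	$by_signature = array();
	$deduped      = array();
	foreach ( $selector_per_token as $token_id => $branch_indexes ) {
		$signature = implode( ',', $branch_indexes );
		if ( ! isset( $by_signature[ $signature ] ) ) {
			$by_signature[ $signature ] = $branch_indexes;
		}
		// Assignment shares the zval; no per-token duplicate inner array.
		$deduped[ $token_id ] = $by_signature[ $signature ];
	}
	return $deduped;
}
```

A single branch's FIRST set often covers hundreds of tokens, so in the MySQL grammar most tokens within a rule collapse onto a handful of shared lists.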
Re-embed branch symbol sequences directly in the selector entries so the hot loop can foreach over them without an extra $rules[$rule_id][$idx] indirection per branch attempt. Per-rule dedup maps tokens that land on the same branch list to a single sequences array via copy-on-write, so grammar memory stays at ~10 MB instead of the ~40 MB the naive form needed. Recovers the ~3% parse speedup lost in the memory-reduction commit while keeping the lower footprint.

Parser benchmark:
Best: ~32,400 QPS
Avg: ~31,300 QPS (compared to ~30,700 avg in the indexes-only variant)
This commit preserves the exploration of 'full grammar compilation'
as a follow-up to the parser performance work. It is intentionally kept
separate from the main parser because the compilation trade-off is not
a clear win.
Tools added:
- compile-grammar.php: walks the grammar and emits a self-contained
class (WP_MySQL_Compiled_Parser) with one method per reachable rule.
Single-branch fragments are inlined, single-use multi-branch
fragments are kept as methods. Dispatch uses switch-on-tid for
multi-group rules; a compact isset() lookup table for single-group
rules; and a 'one-of-N-terminals' fast path for the many
identifier-like rules with hundreds of single-terminal branches.
- bench-compiled-parser.php: side-by-side run of interpreter vs
compiled on the MySQL test corpus.
- compare-asts.php: verifies the compiled parser produces the same
AST as the interpreter on every query.
- dump-inflated-grammar.php: dumps the post-inflation grammar data
so the effect of skipping the FIRST/NULLABLE fixpoint at runtime
could be measured.
- bench-hot-rules.php: distribution of per-rule call counts for
deciding which rules are worth specialising.
Empirical findings on the full MySQL test corpus (69,576 queries):
Interpreter (current parser on this branch):
- no opcache: ~32,500 QPS, ~12 MB grammar
- opcache, no JIT: ~35,100 QPS
- opcache + JIT: ~52,600 QPS (tracing)
Compiled parser:
- no opcache: ~38,300 QPS (+18% over interpreter)
- opcache, no JIT: ~41,800 QPS (+19%)
- opcache + JIT: ~49,700 QPS (slightly below interpreter)
Compiled file size: ~2.6 MB (99k lines, 1,427 methods)
Compiled class load: ~22 ms / ~17 MB RAM (vs 38 ms / 12 MB to inflate
the compressed grammar).
Why compilation helps without JIT:
- eliminates the generic dispatch (branches_for_token, fragment_ids,
nullable_branches isset checks) baked into every interpreter call;
- resolves fragment vs non-fragment and nullable vs non-nullable at
compile time so the emitted code has no runtime type checks;
- collapses 'accepts any of N terminals' rules (251 of them) into an
8-line isset() + consume instead of ~2.8k-line switches.
Why compilation does not help with tracing JIT:
- the interpreter's hot loop is small and regular, which tracing JIT
optimises aggressively (9-10 ns per recursive call);
- the compiled parser generates a handful of very large methods (the
biggest is ~2.8k lines of a 406-branch fragment) that tracing JIT
struggles to optimise, and incurs a substantial JIT-compile penalty
on first use (~0.7 s of the first-run time).
Conclusion: keep the interpreter as the primary parser. The compiler
is preserved here as documentation and as a fallback path for
environments without JIT.
Previously the compiler emitted each case label on its own line (`\t\t\tcase 5:\n`), and case labels were 56% of all generated code. Group multiple labels per line instead (up to 10) so the switch dispatch is still readable but the file shrinks from ~99k lines (~2.63 MB) to ~51k lines (~2.48 MB) with no behaviour change.

No runtime impact: verified 0 AST mismatches across the 69k-query corpus and identical QPS to the previous output under all opcache/JIT configurations.
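The label-packing step in the emitter can be sketched as follows (`emit_case_labels` is a hypothetical helper; consecutive `case` labels fall through to the same body, so grouping them changes only the file's shape):

```php
<?php
// Emit switch case labels packed up to $per_line per source line instead
// of one label per line. Semantically identical fallthrough, smaller file.
function emit_case_labels( array $ids, $per_line = 10 ) {
	$lines = array();
	foreach ( array_chunk( $ids, $per_line ) as $chunk ) {
		$labels = array();
		foreach ( $chunk as $id ) {
			$labels[] = "case {$id}:";
		}
		$lines[] = "\t\t\t" . implode( ' ', $labels );
	}
	return $lines;
}
```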
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every (rule, token) entry points to exactly one branch. Those rules account for ~55% of parse_recursive() calls on the test corpus (722k of 1.3M per 10k queries). Flag those rules at grammar build time. In parse_recursive(), detect the flag and skip the outer 'foreach ($candidate_branches as ...)' by taking $candidate_branches[0] directly. The branch-match body is otherwise identical to the multi-candidate path.

End-to-end parser benchmark:
no JIT: ~31.6K -> ~32.6K QPS avg (+3%)
tracing JIT: ~52.6K -> ~55.7K QPS avg (+6%)
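The build-time flagging pass can be sketched like this (hypothetical helper over the toy selector shape used earlier: rule id → token id → list of candidate branches):

```php
<?php
// Mark every rule whose selector maps each token to exactly one branch.
// At parse time such rules can take $candidates[0] directly and skip the
// candidate foreach entirely.
function flag_single_candidate_rules( array $selector ) {
	$flags = array();
	foreach ( $selector as $rule_id => $per_token ) {
		$single = true;
		foreach ( $per_token as $branches ) {
			if ( count( $branches ) !== 1 ) {
				$single = false;
				break;
			}
		}
		if ( $single ) {
			$flags[ $rule_id ] = true;
		}
	}
	return $flags;
}
```

The flag is a pure build-time property, so the hot path pays only one isset() to pick the fast path.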
Replace the $branch_matches flag + break+reset sequence with direct '$this->position = $position; return false;' exits on each failure path. Removes one local variable and a pair of conditional branches from the hot inner loop. Minor but measurable improvement; the code is also simpler.
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache and tracing JIT specialise property access and method dispatch since the class layout is now fixed. Small but consistent improvement measured across multiple runs under tracing JIT (~+2% avg, ~+2% best).

End-to-end parser benchmark:
tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best
no JIT: ~33K -> ~34K QPS avg, 35K QPS best
Reports best/median/average QPS over N runs with the currently-loaded PHP interpreter configuration. Used to measure the effect of the interpreter changes on top of opcache and tracing JIT configurations.
CI runtime comparison — full matrix

Compared this PR (
The PHPUnit numbers above include ~30-40 s of SQLite build and ~10 s of composer install which are unchanged. Looking at just the
~1.85× faster phpunit across all PHP versions on real CI, matching the local end-to-end benchmark. E2E variance is browser/wp-env startup dominated and within normal run-to-run noise; not parser-related. All tests pass. Total saved per CI run on PHPUnit + WordPress PHPUnit alone: ~11 minutes.
Experiment: compile the grammar to a single PCRE2 pattern using:

- each token id encoded as a Unicode codepoint at offset 0x4000
- each rule emitted as a (?<rN>...) named subroutine
- (*THEN) on each branch's first symbol of *single-candidate* rules (where sibling-branch FIRST sets are disjoint, so committing is safe)
- aggressive transitive inlining of single-use non-recursive rules to shrink the bytecode below PCRE2's compiled-pattern size limit

Result on the 69,576-query MySQL test corpus (PCRE2 JIT enabled):

- Pattern: ~76 KB source, 1,127 named subroutines after 789 rules inlined
- Match throughput: ~97,600 QPS, vs the optimised interpreter's ~62K
- 99.82% accuracy: ~120 spurious failures, mostly the 'SELECT ... INTO' ambiguity that the interpreter handles via a runtime negative lookahead the regex doesn't model

Trade-offs:

1. Match-only: the regex doesn't build an AST, so it's not a drop-in replacement for the recursive-descent parser the SQLite driver needs.
2. Without (*THEN) the matcher backtracks catastrophically on nested compound statements (CREATE TRIGGER ... BEGIN ... IF ...).
3. With (*THEN) on every branch (not just single-candidate rules) the regex gives spurious failures because PCRE commits to the first first-symbol match and can't try a sibling alternative.
4. Pattern size is constrained by PCRE2's default LINK_SIZE=2 bytecode limit; aggressive rule inlining is needed to fit a non-trivial grammar.

Kept as documentation: an interesting upper bound on PHP-side parsing speed when the AST shape is not required.
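The token-stream encoding from the first bullet can be sketched without any grammar machinery. The offset value is the one stated above; the helper names are hypothetical, and the manual UTF-8 encoding assumes the shifted ids stay in the three-byte range U+0800..U+FFFF.

```php
<?php
const TOKEN_CODEPOINT_OFFSET = 0x4000; // offset from the experiment description

// Encode one codepoint as a 3-byte UTF-8 sequence (valid for U+0800..U+FFFF).
function codepoint_to_utf8( $cp ) {
	return chr( 0xE0 | ( $cp >> 12 ) )
		. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $cp & 0x3F ) );
}

// Turn a token-id stream into a short UTF-8 string: one codepoint per
// token, so the compiled pattern can match token sequences as characters.
function encode_token_ids( array $token_ids ) {
	$out = '';
	foreach ( $token_ids as $id ) {
		$out .= codepoint_to_utf8( TOKEN_CODEPOINT_OFFSET + $id );
	}
	return $out;
}
```

A terminal in the pattern then becomes a single `\x{40NN}` character class entry, which is what keeps the pattern source small relative to the grammar.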
Bonus experiment: compile the grammar to a single PCRE2 regex

Approach
Results on the 69,576-query MySQL test corpus
Caveats
Conclusion
Follow-up: can the regex return / reconstruct an AST?

Short answer: no, and even using it as a pre-validator costs more than it saves.

What PCRE2 in PHP exposes
So we can't structurally extract a parse tree from a single PCRE2 match in PHP without writing an extension or using FFI.

Hybrid: regex pre-validate, then parser builds AST

Tested on the full 69,576-query MySQL test corpus, tracing JIT enabled:
The hybrid is slower than the parser alone — the regex pre-check is pure overhead because the parser still has to run on every valid query (99.99% of the corpus), and valid is the common case. Hybrid only pays off when many inputs are invalid, where the regex can reject them faster than the parser; that's not our workload.

What it's actually good for
Bottom line for the parser-performance PR

The optimised interpreter on this branch (~62K QPS with JIT, ~24K without) is the production path because it's the only approach that returns an AST. The regex experiment is committed in
Tests whether running the regex match as a pre-validator before the AST-building parser is faster than the parser alone. Result on the 69,576-query MySQL corpus, tracing JIT enabled:

- regex only (no AST): 0.752 s, 92,519 QPS
- parser only (AST): 1.136 s, 61,240 QPS
- regex + parser: 1.480 s, 47,008 QPS

The hybrid is *slower* than the parser alone because the regex is pure overhead: 99.99% of corpus queries are valid SQL, so the parser still has to run on each query to build the AST. The pre-check only pays off when many inputs are invalid; that is not our workload. Confirms the regex experiment is a recogniser, not a parser replacement: PCRE2 in PHP cannot return a structured tree from a recursive named-group match (last-match-wins semantics) and PHP does not expose user PCRE callouts that could intercept the match to record structural events. Useful as a fast 'does this query parse?' gate; not useful in workloads that need the AST.
Tested whether FFI to libpcre2-8 could supply a callout callback so a
match could record (rule, offset) tuples. It cannot:
- pcre2_set_callout_8 takes a function pointer.
- PHP FFI does not allow PHP closures to be cast to C function
pointers; libffi closure support is intentionally not enabled in
PHP's FFI build.
So pure-PHP code can call pcre2_compile_8 / pcre2_match_8 via FFI but
cannot supply a callout function. The (?C) callouts in the pattern
have no observable effect.
Documents the surveyed paths to building a PCRE2-driven AST in PHP,
all of which are blocked or worse than the existing parser:
1. Stock preg_*: ovector is last-match-wins per numbered group, even
with (?J) duplicate names (each (?<name>...) occurrence has its
own slot but each slot only retains the last match). Recursive
named groups expose nothing about intermediate matches. (*MARK)
only retains the last mark. PHP exposes no callout callback.
2. FFI to libpcre2: blocked as described above.
3. Multi-pass extraction with preg_match_all on simpler flat
patterns: re-implements parsing with regex per layer; not faster
than the recursive-descent interpreter.
4. preg_match validate + parser builds AST (exp-regex-hybrid.php):
net loss because the parser still has to run on every valid
query, and valid is the common case.
5. Custom PHP extension wrapping pcre2_set_callout: significant C
work, out of scope.
Conclusion: in stock PHP the regex match is a fast yes/no validator
(~92K QPS) and an upper bound on PHP-side parsing speed when an AST
is not required (~100K QPS). It cannot replace the AST-producing
parser the SQLite driver consumes.
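The last-match-wins behaviour described in path 1 can be demonstrated in a few lines of stock PHP: a named group that matches repeatedly within one execution only exposes its final occurrence.

```php
<?php
// One pattern execution in which the <digit> group matches three times.
// The ovector retains only the last occurrence per group, so the earlier
// matches ('1' and '2') are unrecoverable from $m.
preg_match( '/(?:(?<digit>\d),?)+/', '1,2,3', $m );
$last_only = $m['digit']; // '3'
```

This is exactly why a recursive named-group match over a whole query cannot yield the per-rule structure an AST needs.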
Deep dive: PCRE2 / PHP options for AST extraction

Surveyed every PCRE2 feature exposed in PHP and a few that aren't.

What PHP's preg_* exposes
Verified all of the above with concrete tests; results pasted in the exp-regex-* tools.

What's NOT exposed

PCRE2's C API has callouts (pcre2_set_callout), but PHP's preg_* layer does not expose them.

Tried FFI as an escape hatch

PHP 7.4+ has FFI. In principle we can call pcre2_compile_8 / pcre2_match_8 directly, but PHP closures cannot be cast to C function pointers, so no callout callback can be supplied. That leaves a custom PHP extension wrapping the callout API as the only path to PCRE2-driven AST extraction. Out of scope here.

Multi-pass extraction is just a parser

The remaining alternative (match the full pattern to validate, then extract structure with further preg_match_all passes on simpler flat patterns) re-implements parsing with a regex per layer and is not faster than the recursive-descent interpreter.

Summary

In stock PHP the compiled regex is a fast yes/no validator only:
The existing recursive-descent interpreter (~62K QPS with tracing JIT, ~34K without; both delivering an AST) remains the production path. Three exp- scripts in
Re-checked the SO technique — partial correction

The SO answer at https://stackoverflow.com/a/24279157 uses preg_match_all. Here's what I verified:
| Pattern | Behavior |
|---|---|
| `(?<=\d),\d+` (flat + lookbehind) | Returns each match as a separate element. Works fine for flat repeating data. |
| `(?P<paren>\((?:[^()]+\|(?P>paren))*\))` (recursive) | Returns the outermost matches at the current scope; recurse into each match's body to find inner ones. |
So the recursive named-group + preg_match_all combination can walk a tree, level by level. Each call returns the outermost matches at the current scope; you recurse into each match's body to find inner ones.
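The level-by-level walk can be sketched directly with the recursive pattern from the table (the helper name is hypothetical):

```php
<?php
// Return the outermost parenthesised spans at the current scope.
// Each preg_match_all call is one "level"; recursing into a span's body
// (with the outer parens stripped) yields the next level down.
function outer_paren_spans( $sql ) {
	preg_match_all( '/(?P<paren>\((?:[^()]+|(?P>paren))*\))/', $sql, $m );
	return $m['paren'];
}

$sql    = 'SELECT a, (SELECT b, (SELECT c FROM t3) FROM t2), d FROM t1';
$level1 = outer_paren_spans( $sql );                        // the one top-level subquery
$level2 = outer_paren_spans( substr( $level1[0], 1, -1 ) ); // recurse into its body
```

Each level is a fresh pattern execution, which is what sidesteps the last-match-wins limitation at the cost of one call per scope.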
Speed for the tree-walking approach (10k iterations on `SELECT a, (SELECT b, (SELECT c FROM t3) FROM t2), d FROM t1`):

- 1 outer preg_match_all (recursive paren pattern): ~12.5M QPS
- 1 anchored extraction (`/FROM\s+\K\w+/` for tables): bundled together, ~3.8M QPS
That's two orders of magnitude faster than the parser per call. The economics make sense for focused extraction (e.g., "find all table refs", "find all subquery boundaries").
Where this stops being a path to a full AST
The MySQL grammar has ~1,900 rules. To produce a full AST via this technique you'd need a regex per rule applied at every level of the tree. For our 17-token-avg corpus with depth ~10 levels:
- Partial AST (top-level statement structure + a handful of rules like tableRef / columnRef / whereClause): a few preg_match_all calls per query — easily faster than the parser, useful for queries that only need certain shapes.
- Full AST (every rule, every nesting): would require ~50+ regex calls per query for typical queries, ~3–5M total over the corpus. At ~55 ns/cached call that's ~200 ms minimum, but each call's actual regex work is the dominant cost — and many of those patterns must include the full grammar definitions to disambiguate context. Estimated worse than the existing parser.
Why the answer to "AST from a single match?" is still no
preg_match_all works because it runs the pattern multiple times under the hood, advancing the cursor between matches. Each match is a fresh pattern execution. So you do not get "all captures from one match" — you get "many top-level matches, each with last-match-wins captures inside".
For the recursive grammar where the same rule is matched in nested positions during a single execution, those nested instances are not exposed because PCRE2 only stores last-match-wins per group at the C level too. PHP would need PCRE2 callouts — and as documented in exp-pcre-ffi.php, PHP doesn't expose them and FFI can't bridge the gap because PHP closures aren't bindable to C function pointers.
Practical takeaway
The SO technique gives a real win for partial / flat / repeated extraction:
```php
// Anchor + capture: every table that follows FROM
preg_match_all('/FROM\s+\K\w+/', $sql, $m);

// Recursive: every top-level subquery span (non-overlapping)
preg_match_all('/(?P<paren>\((?:[^()]+|(?P>paren))*\))/', $sql, $m, PREG_OFFSET_CAPTURE);
```

Both run at multi-million QPS. Useful as a query-shape introspection layer alongside the parser — e.g. for fast "list referenced tables" or "is there a subquery here" without touching the AST.
For producing the AST the SQLite driver actually consumes, the recursive-descent interpreter on this branch (~62K QPS with JIT) remains the right tool.
Updated mental model: PCRE in PHP gives you (1) a fast yes/no recogniser, (2) cheap preg_match_all-driven flat/partial extraction, but not (3) a one-shot full nested parse tree.
Closing in favor of #378.
Summary
This is a draft PR to measure CI runtime improvement from a series of parser-performance optimisations on the recursive descent parser. Aimed primarily at lowering the time tests spend in PHPUnit, which is dominated by the 70-batch MySQL server parser test suite (~70k queries parsed per run).
End-to-end (lex+parse, 69,577 queries, local benchmark):
CI runtime comparison
`Run PHPUnit tests` step duration on this branch vs the average of the last 3 trunk-merged feature PRs (#365 / #367 / #369):

Total: 20.4 min → 11.1 min of cumulative phpunit-step time across the matrix per run (~9 min saved per CI run).
The 70/70 server-suite parser test sub-suite is what dominates the step. Per-PHPUnit-step speedup is consistent with the local end-to-end benchmark.
Sequence of commits
- `f935d03` Inline terminal matching and defer node allocation
- `0e34f25` Per-branch FIRST sets to skip unreachable branches
- `6df347a` Short-circuit nullable fallback + inline single-branch fragments
- `9eea849` Strip epsilon markers + cache grammar refs
- `ddfe4b6` Return fragment results as children arrays
- `77756bf` End-of-input sentinel token (drop range checks)
- `daec1cb` Embed branch sequences in selector
- `2609898` Compare selectStatement by rule id
- `c4e7f8f` Add bench-parser-split.php tooling
- `cd0b609` Deduplicate selector entries (memory: 40 MB → 10 MB)
- `e8b006f` Cache branch sequences in deduped selector
- `b5959e8` Grammar compilation experiment + bench tools (kept as documentation, not used at runtime)
- `bf1b1fe` Pack switch case labels (compiled output cleanup)
- `fa2fd65` Single-candidate rule fast path
- `1e7e3cf` Direct-return refactor for single-candidate path
- `9fcfb27` Mark WP_Parser_Node as final
- `b9b64a6` bench-final.php helper

Tests: 667/667 phpunit tests pass with the same ~1.43M assertions as trunk.
Grammar memory: ~10 MB (up from ~1.5 MB on trunk; the cost of the new per-token branch selector + FIRST/NULLABLE precomputation).
PHPCS: clean.
Test plan
- `composer run test` (mysql-on-sqlite): green locally
- `composer run check-cs`: green
This PR also includes a grammar→PHP compiler experiment under tests/tools/compile-grammar.php. It was kept as documentation: it's faster than the interpreter without JIT (+14-19%) but slightly slower under tracing JIT, so the interpreter remains the production path.

Draft because the goal here is to compare CI runtimes against recent merges before deciding what to ship.