
Parser + lexer performance: consolidated 2–3× end-to-end speedup#378

Open
JanJakes wants to merge 18 commits into trunk from performance

Conversation


@JanJakes JanJakes commented Apr 28, 2026

Summary

This branch consolidates the parser-performance optimisations from #373, the lexer + token-construction wins from #375, and the has_child() micro-opt from #376 into a single clean, linear history. Squashes intermediate refactors and drops abandoned-experiment commits so each commit on this branch is one shippable change.

End-to-end (lex+parse) on the 69,577-query MySQL server corpus, best across 3 ABAB-alternated rounds × 5 timed iterations (with 2 warmup iters per round to fully heat the tracing JIT):

| Config | Trunk | This branch | Speedup |
| --- | --- | --- | --- |
| Lex-only, JIT | 188,323 QPS | 369,687 QPS | 1.96× |
| Parse-only, JIT | 27,091 QPS | 56,691 QPS | 2.09× |
| End-to-end, JIT | 23,855 QPS | 50,061 QPS | 2.10× |
| End-to-end, no opcache | 10,722 QPS | 28,362 QPS | 2.65× |

Numbers above are PHP 8.5. PHP 8.1 verified within ~5% on the same machine (1.62× / 2.24× / 2.13× / 2.64× across the four configs).

Methodology note. Tracing JIT is per-worker-process and shared across all requests served by that worker, so a steady-state production request always hits warm JIT. The numbers above use 2 warmup iterations to reach that state. Cold-JIT first-pass numbers (which only describe the very first request after a worker spawns) are roughly 1.2× for the JIT configs and don't represent typical traffic.
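The warmup-then-measure protocol above can be sketched as a small harness. This is an illustrative Python reconstruction of the ABAB-alternated rounds described in the summary, not the benchmark script behind the numbers; function and parameter names are invented:

```python
import time

def qps(run, queries, warmup=2, iters=5):
    """Best queries-per-second across timed iterations, after warmup
    passes that heat caches (the PHP analogue: the tracing JIT)."""
    for _ in range(warmup):
        for q in queries:
            run(q)
    best = 0.0
    for _ in range(iters):
        start = time.perf_counter()
        for q in queries:
            run(q)
        elapsed = time.perf_counter() - start
        best = max(best, len(queries) / elapsed)
    return best

def abab(run_a, run_b, queries, rounds=3):
    """Alternate A and B within every round so machine drift (thermal,
    background load) affects both variants roughly equally."""
    best_a = best_b = 0.0
    for _ in range(rounds):
        best_a = max(best_a, qps(run_a, queries))
        best_b = max(best_b, qps(run_b, queries))
    return best_a, best_b
```

Alternating within each round, rather than running all A rounds then all B rounds, is what makes small deltas (the 1–2% rows below) distinguishable from drift.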

Relationship to #373, #375, #376

This PR is built from the best parts of three independent efforts. The empirical decomposition (full-corpus, fresh measurements per leg) is the basis for the picks below.

Approach

  1. Empirical decomposition. Mixed trunk's lexer with #375's parser, and vice versa, across the full corpus to identify which side of each PR carried the wins. Result: the parser-side overlap is real and #373's implementations are stronger; the lexer side is independent and worth pulling in.
  2. Squash and clean history. Two intermediate refactors were squashed (selector dedup→re-cache, single-candidate fast-path→direct-return). Five experimental commits that lived only in tests/tools/ (regex grammar matcher, grammar compilation experiment, PCRE2 callouts research) were dropped — each is a documented dead-end that doesn't ship in src/. Two benchmark helper scripts that were used only for measurement were also dropped from the consolidated branch.
  3. Verify each step. Tests (684 / 1,427,724 assertions / 2 skipped / 2 incomplete — same as trunk) and PHPCS clean on the final tree.

Cost vs benefit

Per-commit marginal contributions, measured cumulatively against the previous state (end-to-end JIT, full corpus, with ABAB confirmation on small/suspect deltas). Source-file LOC excludes test-tool additions.

Big, robust wins

| Optimization | src LOC | Δ |
| --- | --- | --- |
| Per-token FIRST sets to skip unreachable branches | +122 | +40% |
| Inline terminal matching + defer parse-node allocation | +47 | +30% |
| Inline single-branch fragments + short-circuit nullable fallback | +50 | +16% |
| Mark the parse-node class as final (lets opcache/JIT specialise) | 0 (1 keyword) | +7% |
| Skip parent constructor in the lexer-token subclass | +4 | +5% |
| Deduplicate selector entries via copy-on-write while embedding branch sequences | +17 | +5% |
| Strip epsilon markers + cache grammar refs on the parser instance | +26 | +4% |
| Return fragment results as children-array, skip allocating an intermediate node | +12 | +4% |

These eight optimisations carry ~95% of the cumulative speedup. Total cost: ~278 src LOC for ~2.0× combined.

Lexer dispatch fast paths (added after the original measurement)

Three lexer-side commits restructure read_next_token()'s elseif chain into early fast paths for the most common token-start bytes. Marginal end-to-end JIT, measured per-commit on the same corpus:

| Optimization | src LOC | Δ end-to-end | Δ lex-only |
| --- | --- | --- | --- |
| Inline leading-whitespace skip (one unguarded strspn() per token loop, replacing per-WS-run read_next_token() round-trips) | +6 | +4% | +10% |
| Catch identifier and keyword tokens at the top of the chain (letter / _ / $ / UTF-8 starter as the first arm, with next_byte !== "'" to keep x'..'/n'..' on their dedicated branches) | +16 | +4% | +29% |
| Single-byte operator dispatch table (static byte → token id map for (, ), ,, ;, +, ~, %, ^, ?, {, }, = as a chain arm in front of the per-byte elseifs) | +20 | +4% | +9% |

Total cost: +42 src LOC for +13% end-to-end JIT / +55% lex-only on top of the previous tip. Most of the gain lives on the lexer side; the end-to-end number dilutes it because lexing is only ~13% of total parse time under JIT.

Small but plausibly real (1–2% range, near noise floor)

| Optimization | src LOC | Δ |
| --- | --- | --- |
| End-of-input sentinel token (drop range-checks in hot loop) | +17 | +1.6% |
| Embed branch-symbol sequences directly in the per-token selector | +39 | +1.6% |
| Compare select-statement by rule id instead of by name (string→int compare) | +3 | +1.3% |
| Direct-return fast path for single-candidate rules | +68 | +1.8% mean (high variance: +5.5 / +0.2 / −0.3 across ABAB rounds — could be +5% in clean conditions) |

Each is ≤2% individually, all positive. Total cost: ~127 LOC for an estimated +6% combined. The "Embed branch-symbol sequences" row introduces a small WP_Parser_Grammar::get_or_cache_rule_id($name) helper that the start-rule and selectStatement lookups now share; the "Compare select-statement" row drops the parser-side ad-hoc lazy cache in favour of that helper, which is why its LOC is small.

Effectively zero under JIT (production case)

| Optimization | src LOC | Δ JIT | Δ no-opcache | Recommendation |
| --- | --- | --- | --- | --- |
| Lexer byte-comparison swaps + strpos for comment-end and quote scans | +47 | ~0% | +1–2% | Keep — free for non-JIT users (many shared hosts); the new dispatch fast paths above sit on top of these byte-comparison primitives |
| `! empty()` instead of `count() > 0` in `has_child()` | 0 (1 line) | ~0% | ~0% | Keep — `has_child()` is only called from tests; one-line idiom hygiene |
| Coding-standard whitespace re-alignment after recent changes | 0 perf-relevant | ~0% | ~0% | Keep — required to keep `composer run check-cs` clean |

Verdict

Nothing on the branch is harming runtime; nothing kept is an "accidental attempt." The three "zero under JIT" commits either benefit non-JIT environments (lexer changes), are pure correctness/idiom improvements (one-line changes), or are required CS hygiene at zero LOC cost.

Optimizations not picked up

Multi-shape regex fast-path

A separate experimental implementation detects 13 common query shapes (INSERT/SELECT/UPDATE/DELETE/DROP/SHOW/USE/TRUNCATE/SET/EXPLAIN/BEGIN/COMMIT/ROLLBACK) via a single PCRE2 union pattern over a codepoint-encoded token stream, then builds the AST directly without descending into the recursive parser.

| Dimension | Value |
| --- | --- |
| LOC cost | ~1,271 src LOC (mostly one 1,209-line file) |
| End-to-end JIT speedup | +30% stacked on this branch (measured) |
| No-opcache speedup | +29% |
| Tests | 684/684 pass; AST output is byte-for-byte equivalent to the recursive parser |
| LOC cost per shape | ~93 LOC average; new shapes cost roughly 50–150 LOC each |
| Coverage | 13 shapes → ~30% hit rate on this corpus; doubling coverage roughly doubles the file size |

Tradeoff: the biggest individual win available, but at ~42 LOC per percent of speedup vs ~7 LOC per percent for the optimisations on this branch. Worth landing if the LOC vs maintenance cost is acceptable. Deferred from this PR to keep the consolidation scoped.

AST caching

Lex-once-then-cache by parameterised token-stream signature evaluates as the single highest-leverage optimisation for typical WordPress workloads (up to ~65× speedup on workloads that fit a 200-entry LRU cap, ~2.8 MB memory cost at that cap), and will be proposed in a separate PR.
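The lex-once-then-cache idea can be sketched as a small LRU keyed by a parameterised token-stream signature. This is an illustrative Python sketch under the 200-entry cap stated above; the class name, token shape, and signature scheme are assumptions, not the proposed implementation:

```python
from collections import OrderedDict

class AstCache:
    """Tiny LRU cache keyed by a parameterised token-stream signature.

    The signature replaces literal token values with a placeholder so
    'SELECT ... WHERE id = 1' and '... id = 2' share one cached AST.
    """
    def __init__(self, capacity=200):
        self.capacity = capacity
        self._entries = OrderedDict()

    @staticmethod
    def signature(tokens):
        # tokens are (kind, value) pairs; literals become a placeholder.
        return tuple(
            (kind, '<lit>' if kind == 'literal' else value)
            for kind, value in tokens
        )

    def get_or_parse(self, tokens, parse):
        key = self.signature(tokens)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as recently used
            return self._entries[key]
        ast = parse(tokens)
        self._entries[key] = ast
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return ast
```

A real implementation would additionally re-bind the stripped literal values into the cached AST; that step is omitted here.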

Unbuilt ideas

| Idea | Estimated win | Notes |
| --- | --- | --- |
| Pratt parser for the expression cascade | +5–25% on expression-heavy queries | Medium engineering; replaces the deep expr → boolPri → predicate → bitExpr → simpleExpr → … chain |
| Partial compile of top-N hot rules under JIT-trace limit | +5–15% | Pivot of the failed full-grammar PHP compile |
| LL(2) selectors to eliminate residual backtracking | +5–15% | High engineering cost |
| Lazy CST (defer node materialisation until consumers ask) | Approaches the validator-only ceiling for partial-tree consumers | Requires consumer cooperation |

Test plan

  • composer run test (mysql-on-sqlite) — 684/684, 1,427,724 assertions
  • composer run check-cs — clean
  • PHP 8.5 + PHP 8.1 verified (within ~5% of each other)
  • CI matrix on PHP 7.2–8.5
  • WordPress PHPUnit / E2E

Draft so the CI runtime can be compared against #373 before deciding which to land.

@JanJakes JanJakes force-pushed the performance branch 2 times, most recently from cdde227 to 1c666e2 Compare April 28, 2026 12:23
JanJakes added 15 commits April 29, 2026 09:32
Hot-path changes in WP_Parser::parse_recursive():

- Inline the terminal match in the branch loop instead of recursing into
  parse_recursive() for every token. Over the full MySQL test suite this
  eliminates ~1.6M function calls.
- Hoist grammar, rules, fragment_ids, rule_names, tokens, and token_count
  into local variables so the inner loops avoid repeated property lookups
  on $this->grammar.
- Cache the token count on the instance to avoid a count() per call.
- Build branch children in a local array and only instantiate the
  WP_Parser_Node once the branch has matched; on the MySQL corpus ~75% of
  speculative nodes were previously created and thrown away.
- Drop a dead is_array($subnode) check that never fires in practice
  (subnodes are false, true, tokens, or nodes - never arrays).
- Inline fragment inlining: read the fragment's children directly instead
  of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
  Before: ~11,500 QPS   After: ~14,900 QPS  (+29%)
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes
each rule's branches by the tokens that can start them. At parse time the
parser jumps straight to the candidate branches for the current token
instead of iterating every branch and letting most fail.

On the full MySQL test suite, 59% of branch attempts previously failed
because the first token could never match the branch's FIRST set; with
per-branch lookahead those attempts are eliminated.

End-to-end parser benchmark:
  Before: ~14,900 QPS   After: ~22,400 QPS  (+50%)
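The FIRST/NULLABLE fixpoint plus per-token branch index can be sketched for a toy grammar. A minimal Python illustration (the real implementation lives in the PHP grammar class; rule and token names here are invented):

```python
def first_sets(rules, terminals):
    """Fixpoint FIRST/NULLABLE for a toy grammar: rules maps a rule name
    to a list of branches, each branch a list of symbols; an empty
    branch makes the rule nullable."""
    first = {r: set() for r in rules}
    nullable = {r: any(len(b) == 0 for b in rules[r]) for r in rules}
    changed = True
    while changed:
        changed = False
        for rule, branches in rules.items():
            for branch in branches:
                for sym in branch:
                    add = {sym} if sym in terminals else first[sym]
                    if not add <= first[rule]:
                        first[rule] |= add
                        changed = True
                    if sym in terminals or not nullable.get(sym, False):
                        break
                else:  # every symbol in the branch was nullable
                    if not nullable[rule]:
                        nullable[rule] = True
                        changed = True
    return first, nullable

def branches_for_token(rules, terminals):
    """Index each rule's branches by the terminals that can start them,
    so the parser jumps straight to candidate branches."""
    first, nullable = first_sets(rules, terminals)
    selector = {}
    for rule, branches in rules.items():
        by_token = {}
        for branch in branches:
            starters = set()
            for sym in branch:
                starters |= {sym} if sym in terminals else first[sym]
                if sym in terminals or not nullable.get(sym, False):
                    break
            for tok in starters:
                by_token.setdefault(tok, []).append(branch)
        selector[rule] = by_token
    return selector
```

At parse time, a branch whose FIRST set does not contain the current token is never attempted, which is what eliminates the 59% of doomed branch attempts cited above.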
Two grammar/parser refinements that both reduce recursive calls:

* In parse_recursive(): when the rule has a per-token branch selector but
  the current token is not in any branch's FIRST and the rule itself is
  nullable, return 'matched empty' immediately instead of descending into
  nullable branches that would recursively do the same thing. This alone
  eliminates ~460k recursive calls on the MySQL corpus.

* At grammar build time, expand every single-branch fragment rule into
  its call sites. Fragments exist only to factor shared sub-sequences and
  their children are already flattened into the parent AST node, so
  splicing them directly into parent branches is a no-op for the
  resulting tree but removes an entire recursive call per use. 480 of the
  grammar's fragments qualify.

Also drops the dead terminal branch at the top of parse_recursive() (the
branch loop inlines terminal matching, so parse_recursive is only ever
called with non-terminal rule ids) and the always-false empty-branches
guard.

End-to-end parser benchmark:
  Before: ~22,400 QPS   After: ~27,500 QPS  (+23%)
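The build-time expansion of single-branch fragments can be sketched as follows; a Python illustration assuming acyclic fragment references (names invented):

```python
def inline_single_branch_fragments(rules, fragment_ids):
    """Splice each single-branch fragment's symbols into its call sites.

    Fragment children are flattened into the parent node anyway, so
    replacing a fragment reference with its symbol sequence leaves the
    resulting tree unchanged while removing one recursive call per use.
    Assumes fragment references are acyclic.
    """
    single = {r: rules[r][0] for r in fragment_ids if len(rules[r]) == 1}

    def expand(branch):
        out = []
        for sym in branch:
            if sym in single:
                out.extend(expand(single[sym]))  # resolve fragment chains
            else:
                out.append(sym)
        return out

    return {rule: [expand(b) for b in branches]
            for rule, branches in rules.items()}
```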
Three minor reductions in per-call work:

* Strip explicit EMPTY_RULE_ID symbols out of rule branches at grammar
  build time. The parser loop would have 'continue'd over them anyway, so
  removing them ahead of time lets the hot symbol loop drop the epsilon
  check. Pure-epsilon branches become empty branches and still match
  empty via the existing empty-children fast path.

* Cache the grammar's rules, fragment_ids, rule_names, branches_for_token,
  nullable_branches, and highest_terminal_id as direct parser instance
  fields so parse_recursive() no longer pays for a $this->grammar->...
  double hop on every call.

* Collapse the two-step node construction (new + set_children) into a
  single constructor call that takes the children array directly. This
  saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but
their runtime role is still trivial: match a sequence of symbols and have
the caller splice the resulting children into its own node. The old code
allocated a full WP_Parser_Node for each fragment match just to have the
caller immediately copy its children out.

Return the children array directly from fragments instead. The caller
distinguishes via is_array($subnode) and splices in-place, saving a
Parser_Node allocation per fragment match (~253k per 10k queries).

End-to-end parser benchmark:
  Before: ~27,000 QPS (avg)   After: ~28,700 QPS (+6%).
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of
the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only
token with id 0, is stripped by the lexer before tokens reach the
parser), so the sentinel cannot match any real terminal.

This lets the hot path drop the 'position < token_count' range check
everywhere it reads the current token id: the selector lookup at method
entry, the inline terminal match inside the branch loop, and the
post-branch INTO negative lookahead for selectStatement. Any read past
the last real token falls naturally into the nullable-fallback or
branch-miss handling.

Also drop a few dead locals ($token_count, $fragment_ids) that no
longer appear in the hot path after the change.

End-to-end parser benchmark:
  Before: ~28,700 QPS (avg)   After: ~29,800 QPS (+4%).
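The sentinel trick reduces to a few lines. A Python sketch: the real code appends a WP_Parser_Token with id 0, here reduced to bare token ids:

```python
EMPTY_RULE_ID = 0  # no real MySQL token reaches the parser with id 0

def with_sentinel(token_ids):
    """Append a sentinel id that no real terminal can match, so the hot
    path can read token_ids[position] without a range check."""
    return token_ids + [EMPTY_RULE_ID]

def match_terminal(token_ids, position, terminal_id):
    # No 'position < len(token_ids)' guard: a past-the-end read hits the
    # sentinel, which never equals a real (non-zero) terminal id.
    return token_ids[position] == terminal_id
```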
Previously the per-(rule, token) selector stored a list of branch
indexes that the parser then had to look up in $rules[$rule_id] on
every branch attempt. Store the branch symbol sequences themselves so
the hot loop can iterate candidate branches directly.

PHP arrays are copy-on-write, so sharing the same branch sequence
across selector entries for many tokens costs negligible extra memory.
The nullable_branches map shrinks to a bool marker since the parser
only uses it for existence checks.

Also cache the start rule id on the grammar so parse() skips its
array_search() across rule_names on every call.

End-to-end parser benchmark:
  Before: ~29,800 QPS (avg)   After: ~31,700 QPS (+6%).
Minor cleanup in parse_recursive(): cache the selectStatement rule id
once and compare integers on every call instead of re-comparing the
'selectStatement' string against every rule's name. Also drops the
$rules instance cache from the parser, which the hot path no longer
touches now that branch sequences are embedded in the selector.
Adopts phpcbf's trivial whitespace alignment fixes in the grammar and
parser source to keep `composer run check-cs` clean after the prior
optimisation commits added new local variables and reshaped the
selector-build code.
The per-(rule, token) branch selector stored a separate inner array per
token, even when many tokens within the same rule mapped to identical
branch lists (a single branch's FIRST set covers many tokens, for
example). Loading the MySQL grammar used ~40 MB of PHP memory, most of
which was duplicated inner arrays.

Deduplicate by signature during grammar build so all tokens that land
on the same branch list share one inner array via copy-on-write. The
inner arrays still embed the branch symbol sequences directly so the
hot loop iterates them without an extra $rules[$rule_id][$idx]
indirection per branch attempt.

Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB.
PHPUnit peak memory drops from 198 MB to 110 MB. Parser throughput is
unchanged from the previous (non-deduplicated) embedded-sequences form.
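The dedup-by-signature build step can be sketched as follows. Python shares list references much as PHP arrays share storage under copy-on-write, so one inner list serves every token that lands on the same branch list (an illustrative sketch, names invented):

```python
def dedupe_selector(selector):
    """Share one inner branch list among all tokens of a rule that map
    to identical branch lists, keyed by a structural signature."""
    shared = {}
    deduped = {}
    for rule, by_token in selector.items():
        out = {}
        for token, branches in by_token.items():
            sig = (rule, tuple(tuple(b) for b in branches))
            # setdefault keeps the first list seen for this signature
            # and reuses it for every later token with the same one.
            out[token] = shared.setdefault(sig, branches)
        deduped[rule] = out
    return deduped
```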
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every
(rule, token) entry points to exactly one branch. Those rules account
for ~55% of parse_recursive calls on the test corpus (722k of 1.3M per
10k queries).

Flag those rules at grammar build time. In parse_recursive, detect the
flag and take the only candidate branch directly, skipping the
candidate-iteration loop. On match failure, restore $position and
return false directly instead of going through the multi-candidate
branch_matches/break sequence.

End-to-end parser benchmark:
  no JIT:      ~31.6K -> ~32.6K QPS avg (+3%)
  tracing JIT: ~52.6K -> ~55.7K QPS avg (+6%)
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache
and tracing JIT specialize property access and method dispatch since
the class layout is now fixed. Small but consistent improvement
measured across multiple runs under tracing JIT (~+2% avg, ~+2% best).

End-to-end parser benchmark:
  tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best
  no JIT:      ~33K -> ~34K QPS avg, 35K QPS best
Apply lexer optimisations from PR #375:

- Cache `strlen($sql)` once in `$sql_length` instead of recomputing on each
  EOF check.
- Replace `strspn($byte, MASK) > 0` with direct byte comparisons
  (`$byte >= '0' && $byte <= '9'`, `false !== strpos(MASK, $byte)`,
  unrolled whitespace check).
- Use `strpos($sql, '*/', $pos)` instead of a manual scan loop in
  `read_comment_content()`.
- In `read_quoted_text()`, use `strpos()` to find the next quote, eliminating
  the separate end-of-input check that follows the `strcspn()` scan.
- Inline `next_token()` + `get_token()` in `remaining_tokens()` so the hot
  loop builds tokens directly.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
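The strpos()-based comment scan maps to a single substring search. A Python analogue using str.find (an illustrative sketch, not the PR's code):

```python
def read_comment_content(sql, pos):
    """Find the end of a C-style comment body starting at pos with one
    substring search, the analogue of PHP's strpos($sql, '*/', $pos)."""
    end = sql.find('*/', pos)
    if end == -1:
        return sql[pos:], len(sql)    # unterminated: consume the rest
    return sql[pos:end], end + 2      # content, position after '*/'
```

The win comes from replacing a per-byte PHP-level loop with one call that scans in native code.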
Token construction is on the lexer hot path; bypassing the
`WP_Parser_Token::__construct()` indirection and assigning the four
properties directly removes one method call per token.

Requires `$input` on `WP_Parser_Token` to be `protected` instead of
`private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
`! empty( $this->children )` short-circuits without calling `count()`,
saving one function call per invocation.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #376
Both next_token() and remaining_tokens() previously paid a
read_next_token() function call per whitespace run only to
recognise and skip the resulting WHITESPACE token. A single
unguarded strspn() at the top of each loop iteration absorbs
the run inline, saving the call overhead for ~one whitespace
run per real token across millions of tokens.

The strspn() call is unguarded because an unconditional strspn()
(which returns 0 in a single C-side call when nothing matches)
is faster than gating it on a five-arm '$byte === ...' precheck.
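The inline whitespace skip can be illustrated as follows. Python has no strspn(), so this sketch uses an explicit scan; in PHP the whole run is absorbed by one C-side strspn() call (names invented):

```python
WS = " \t\r\n"  # the whitespace bytes the lexer skips

def skip_whitespace(sql, pos):
    """Absorb a leading whitespace run inline. Like an unguarded
    strspn(), it is called unconditionally: with no whitespace present
    it simply advances by zero."""
    n = len(sql)
    while pos < n and sql[pos] in WS:
        pos += 1
    return pos
```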
ASCII letters and UTF-8 multibyte start bytes account for most
token-start bytes on the MySQL corpus. They previously fell into
the catch-all `else` at the bottom of read_next_token() after
walking every operator arm in between. The new branch sits at
the top of the elseif chain and dispatches them directly.

The `next_byte !== "'"` guard keeps the x'..', n'..' and similar
specials on their dedicated branches. `_` and `$` starters stay
on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
The ASCII bytes (, ), ',' ;, +, ~, %, ^, ?, {, }, and = each
map to a unique single-byte token type with no lookahead. A
static array + isset() arm dispatches them in one lookup,
short-circuiting the per-byte elseif arms further down the
chain.

'*' and '|' are deliberately excluded because their token type
depends on context (in_mysql_comment for '*/', SQL_MODE_PIPES_
AS_CONCAT for '||').
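The dispatch-table arm reduces to one hash lookup. A Python sketch with invented token-type names (the real map uses the lexer's token-id constants):

```python
# Single-byte operators whose token type needs no lookahead.
# '*' and '|' are excluded: their type depends on context
# (in_mysql_comment for '*/', PIPES_AS_CONCAT for '||').
SINGLE_BYTE_TOKENS = {
    '(': 'LPAREN', ')': 'RPAREN', ',': 'COMMA', ';': 'SEMI',
    '+': 'PLUS', '~': 'TILDE', '%': 'PERCENT', '^': 'CARET',
    '?': 'QUESTION', '{': 'LBRACE', '}': 'RBRACE', '=': 'EQ',
}

def dispatch(byte):
    """One hash lookup replaces a chain of per-byte comparisons;
    returns None for bytes that need the slower elseif arms."""
    return SINGLE_BYTE_TOKENS.get(byte)
```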
@JanJakes JanJakes changed the title Parser + lexer performance: consolidated 2.4× end-to-end speedup Parser + lexer performance: consolidated 2–3× end-to-end speedup Apr 29, 2026
@JanJakes JanJakes marked this pull request as ready for review April 29, 2026 15:04
@JanJakes JanJakes requested a review from adamziel April 29, 2026 15:04