
Parser + lexer performance: consolidated 2–3× end-to-end speedup#378

Open
JanJakes wants to merge 18 commits into trunk from performance

Conversation


@JanJakes JanJakes commented Apr 28, 2026

Summary

This branch consolidates the parser-performance optimisations from #373, the lexer + token-construction wins from #375, and the has_child() micro-opt from #376 into a single clean, linear history. Squashes intermediate refactors and drops abandoned-experiment commits so each commit on this branch is one shippable change.

End-to-end (lex+parse) on the 69,577-query MySQL server corpus, best across 3 ABAB-alternated rounds × 5 timed iterations (with 2 warmup iters per round to fully heat the tracing JIT):

| Config | Trunk | This branch | Speedup |
| --- | --- | --- | --- |
| Lex-only, JIT | 188,323 QPS | 369,687 QPS | 1.96× |
| Parse-only, JIT | 27,091 QPS | 56,691 QPS | 2.09× |
| End-to-end, JIT | 23,855 QPS | 50,061 QPS | 2.10× |
| End-to-end, no opcache | 10,722 QPS | 28,362 QPS | 2.65× |

Numbers above are PHP 8.5. PHP 8.1 verified within ~5% on the same machine (1.62× / 2.24× / 2.13× / 2.64× across the four configs).

Methodology note. Tracing JIT is per-worker-process and shared across all requests served by that worker, so a steady-state production request always hits warm JIT. The numbers above use 2 warmup iterations to reach that state. Cold-JIT first-pass numbers (which only describe the very first request after a worker spawns) are roughly 1.2× for the JIT configs and don't represent typical traffic.
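The warmup-then-measure protocol above can be sketched as a small harness. This is an illustrative Python reconstruction of the ABAB-alternated rounds described in the summary, not the benchmark script behind the numbers; function and parameter names are invented:

```python
import time

def qps(run, queries, warmup=2, iters=5):
    """Best queries-per-second across timed iterations, after warmup
    passes that heat caches (the PHP analogue: the tracing JIT)."""
    for _ in range(warmup):
        for q in queries:
            run(q)
    best = 0.0
    for _ in range(iters):
        start = time.perf_counter()
        for q in queries:
            run(q)
        elapsed = time.perf_counter() - start
        best = max(best, len(queries) / elapsed)
    return best

def abab(run_a, run_b, queries, rounds=3):
    """Alternate A and B within every round so machine drift (thermal,
    background load) affects both variants roughly equally."""
    best_a = best_b = 0.0
    for _ in range(rounds):
        best_a = max(best_a, qps(run_a, queries))
        best_b = max(best_b, qps(run_b, queries))
    return best_a, best_b
```

Alternating within each round, rather than running all A rounds then all B rounds, is what makes small deltas (the 1–2% rows below) distinguishable from drift.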

Relationship to #373, #375, #376

This PR is built from the best parts of three independent efforts. The empirical decomposition (full-corpus, fresh measurements per leg) is the basis for the picks below.

Approach

  1. Empirical decomposition. Mixed trunk's lexer with #375's parser, and vice versa, across the full corpus to identify which side of each PR carried the wins. Result: the parser-side overlap is real and #373's implementations are stronger; the lexer side is independent and worth pulling in.
  2. Squash and clean history. Two intermediate refactors were squashed (selector dedup→re-cache, single-candidate fast-path→direct-return). Five experimental commits that lived only in tests/tools/ (regex grammar matcher, grammar compilation experiment, PCRE2 callouts research) were dropped — each is a documented dead-end that doesn't ship in src/. Two benchmark helper scripts that were used only for measurement were also dropped from the consolidated branch.
  3. Verify each step. Tests (684 / 1,427,724 assertions / 2 skipped / 2 incomplete — same as trunk) and PHPCS clean on the final tree.

Cost vs benefit

Per-commit marginal contributions, measured cumulatively against the previous state (end-to-end JIT, full corpus, with ABAB confirmation on small/suspect deltas). Source-file LOC excludes test-tool additions.

Big, robust wins

| Optimization | src LOC | Δ |
| --- | --- | --- |
| Per-token FIRST sets to skip unreachable branches | +122 | +40% |
| Inline terminal matching + defer parse-node allocation | +47 | +30% |
| Inline single-branch fragments + short-circuit nullable fallback | +50 | +16% |
| Mark the parse-node class as final (lets opcache/JIT specialise) | 0 (1 keyword) | +7% |
| Skip parent constructor in the lexer-token subclass | +4 | +5% |
| Deduplicate selector entries via copy-on-write while embedding branch sequences | +17 | +5% |
| Strip epsilon markers + cache grammar refs on the parser instance | +26 | +4% |
| Return fragment results as children-array, skip allocating an intermediate node | +12 | +4% |

These eight optimisations carry ~95% of the cumulative speedup. Total cost: ~278 src LOC for ~2.0× combined.

Lexer dispatch fast paths (added after the original measurement)

Three lexer-side commits restructure read_next_token()'s elseif chain into early fast paths for the most common token-start bytes. Marginal end-to-end JIT, measured per-commit on the same corpus:

| Optimization | src LOC | Δ end-to-end | Δ lex-only |
| --- | --- | --- | --- |
| Inline leading-whitespace skip (one unguarded strspn() per token loop, replacing per-WS-run read_next_token() round-trips) | +6 | +4% | +10% |
| Catch identifier and keyword tokens at the top of the chain (letter / _ / $ / UTF-8 starter as the first arm, with next_byte !== "'" to keep x'..'/n'..' on their dedicated branches) | +16 | +4% | +29% |
| Single-byte operator dispatch table (static byte → token id map for (, ), ,, ;, +, ~, %, ^, ?, {, }, = as a chain arm in front of the per-byte elseifs) | +20 | +4% | +9% |

Total cost: +42 src LOC for +13% end-to-end JIT / +55% lex-only on top of the previous tip. Most of the gain lives on the lexer side; the end-to-end number dilutes it because lexing is only ~13% of total parse time under JIT.

Small but plausibly real (1–2% range, near noise floor)

| Optimization | src LOC | Δ |
| --- | --- | --- |
| End-of-input sentinel token (drop range-checks in hot loop) | +17 | +1.6% |
| Embed branch-symbol sequences directly in the per-token selector | +39 | +1.6% |
| Compare select-statement by rule id instead of by name (string→int compare) | +3 | +1.3% |
| Direct-return fast path for single-candidate rules | +68 | +1.8% mean (high variance: +5.5 / +0.2 / −0.3 across ABAB rounds — could be +5% in clean conditions) |

Each is ≤2% individually, all positive. Total cost: ~127 LOC for an estimated +6% combined. The "Embed branch-symbol sequences" row introduces a small WP_Parser_Grammar::get_or_cache_rule_id($name) helper that the start-rule and selectStatement lookups now share; the "Compare select-statement" row drops the parser-side ad-hoc lazy cache in favour of that helper, which is why its LOC is small.

Effectively zero under JIT (production case)

| Optimization | src LOC | Δ JIT | Δ no-opcache | Recommendation |
| --- | --- | --- | --- | --- |
| Lexer byte-comparison swaps + strpos for comment-end and quote scans | +47 | ~0% | +1–2% | Keep — free for non-JIT users (many shared hosts); the new dispatch fast paths above sit on top of these byte-comparison primitives |
| `! empty()` instead of `count() > 0` in `has_child()` | 0 (1 line) | ~0% | ~0% | Keep — `has_child()` is only called from tests; one-line idiom hygiene |
| Coding-standard whitespace re-alignment after recent changes | 0 perf-relevant | ~0% | ~0% | Keep — required to keep `composer run check-cs` clean |

Verdict

Nothing on the branch is harming runtime; nothing kept is an "accidental attempt." The three "zero under JIT" commits either benefit non-JIT environments (lexer changes), are pure correctness/idiom improvements (one-line changes), or are required CS hygiene at zero LOC cost.

Optimizations not picked up

Multi-shape regex fast-path

A separate experimental implementation detects 13 common query shapes (INSERT/SELECT/UPDATE/DELETE/DROP/SHOW/USE/TRUNCATE/SET/EXPLAIN/BEGIN/COMMIT/ROLLBACK) via a single PCRE2 union pattern over a codepoint-encoded token stream, then builds the AST directly without descending into the recursive parser.

| Dimension | Value |
| --- | --- |
| LOC cost | ~1,271 src LOC (mostly one 1,209-line file) |
| End-to-end JIT speedup | +30% stacked on this branch (measured) |
| No-opcache speedup | +29% |
| Tests | 684/684 pass; AST output is byte-for-byte equivalent to the recursive parser |
| LOC cost per shape | ~93 LOC average; new shapes cost roughly 50–150 LOC each |
| Coverage | 13 shapes → ~30% hit rate on this corpus; doubling coverage roughly doubles the file size |

Tradeoff: the biggest individual win available, but at ~42 LOC per percent of speedup vs ~7 LOC per percent for the optimisations on this branch. Worth landing if the LOC vs maintenance cost is acceptable. Deferred from this PR to keep the consolidation scoped.

AST caching

Lex-once-then-cache by parameterised token-stream signature evaluates as the single highest-leverage optimisation for typical WordPress workloads (up to ~65× speedup on workloads that fit a 200-entry LRU cap, ~2.8 MB memory cost at that cap), and will be proposed in a separate PR.
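The lex-once-then-cache idea can be sketched as a small LRU keyed by a parameterised token-stream signature. This is an illustrative Python sketch under the 200-entry cap stated above; the class name, token shape, and signature scheme are assumptions, not the proposed implementation:

```python
from collections import OrderedDict

class AstCache:
    """Tiny LRU cache keyed by a parameterised token-stream signature.

    The signature replaces literal token values with a placeholder so
    'SELECT ... WHERE id = 1' and '... id = 2' share one cached AST.
    """
    def __init__(self, capacity=200):
        self.capacity = capacity
        self._entries = OrderedDict()

    @staticmethod
    def signature(tokens):
        # tokens are (kind, value) pairs; literals become a placeholder.
        return tuple(
            (kind, '<lit>' if kind == 'literal' else value)
            for kind, value in tokens
        )

    def get_or_parse(self, tokens, parse):
        key = self.signature(tokens)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as recently used
            return self._entries[key]
        ast = parse(tokens)
        self._entries[key] = ast
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return ast
```

A real implementation would additionally re-bind the stripped literal values into the cached AST; that step is omitted here.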

Unbuilt ideas

| Idea | Estimated win | Notes |
| --- | --- | --- |
| Pratt parser for the expression cascade | +5–25% on expression-heavy queries | Medium engineering; replaces the deep expr → boolPri → predicate → bitExpr → simpleExpr → … chain |
| Partial compile of top-N hot rules under JIT-trace limit | +5–15% | Pivot of the failed full-grammar PHP compile |
| LL(2) selectors to eliminate residual backtracking | +5–15% | High engineering cost |
| Lazy CST (defer node materialisation until consumers ask) | Approaches the validator-only ceiling for partial-tree consumers | Requires consumer cooperation |

Test plan

  • composer run test (mysql-on-sqlite) — 684/684, 1,427,724 assertions
  • composer run check-cs — clean
  • PHP 8.5 + PHP 8.1 verified (within ~5% of each other)
  • CI matrix on PHP 7.2–8.5
  • WordPress PHPUnit / E2E

Draft so the CI runtime can be compared against #373 before deciding which to land.

@JanJakes JanJakes force-pushed the performance branch 2 times, most recently from cdde227 to 1c666e2 Compare April 28, 2026 12:23
JanJakes added 15 commits April 29, 2026 09:32
Hot-path changes in WP_Parser::parse_recursive():

- Inline the terminal match in the branch loop instead of recursing into
  parse_recursive() for every token. Over the full MySQL test suite this
  eliminates ~1.6M function calls.
- Hoist grammar, rules, fragment_ids, rule_names, tokens, and token_count
  into local variables so the inner loops avoid repeated property lookups
  on $this->grammar.
- Cache the token count on the instance to avoid a count() per call.
- Build branch children in a local array and only instantiate the
  WP_Parser_Node once the branch has matched; on the MySQL corpus ~75% of
  speculative nodes were previously created and thrown away.
- Drop a dead is_array($subnode) check that never fires in practice
  (subnodes are false, true, tokens, or nodes - never arrays).
- Inline fragment inlining: read the fragment's children directly instead
  of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
  Before: ~11,500 QPS   After: ~14,900 QPS  (+29%)
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes
each rule's branches by the tokens that can start them. At parse time the
parser jumps straight to the candidate branches for the current token
instead of iterating every branch and letting most fail.

On the full MySQL test suite, 59% of branch attempts previously failed
because the first token could never match the branch's FIRST set; with
per-branch lookahead those attempts are eliminated.

End-to-end parser benchmark:
  Before: ~14,900 QPS   After: ~22,400 QPS  (+50%)
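The FIRST/NULLABLE fixpoint plus per-token branch index can be sketched for a toy grammar. A minimal Python illustration (the real implementation lives in the PHP grammar class; rule and token names here are invented):

```python
def first_sets(rules, terminals):
    """Fixpoint FIRST/NULLABLE for a toy grammar: rules maps a rule name
    to a list of branches, each branch a list of symbols; an empty
    branch makes the rule nullable."""
    first = {r: set() for r in rules}
    nullable = {r: any(len(b) == 0 for b in rules[r]) for r in rules}
    changed = True
    while changed:
        changed = False
        for rule, branches in rules.items():
            for branch in branches:
                for sym in branch:
                    add = {sym} if sym in terminals else first[sym]
                    if not add <= first[rule]:
                        first[rule] |= add
                        changed = True
                    if sym in terminals or not nullable.get(sym, False):
                        break
                else:  # every symbol in the branch was nullable
                    if not nullable[rule]:
                        nullable[rule] = True
                        changed = True
    return first, nullable

def branches_for_token(rules, terminals):
    """Index each rule's branches by the terminals that can start them,
    so the parser jumps straight to candidate branches."""
    first, nullable = first_sets(rules, terminals)
    selector = {}
    for rule, branches in rules.items():
        by_token = {}
        for branch in branches:
            starters = set()
            for sym in branch:
                starters |= {sym} if sym in terminals else first[sym]
                if sym in terminals or not nullable.get(sym, False):
                    break
            for tok in starters:
                by_token.setdefault(tok, []).append(branch)
        selector[rule] = by_token
    return selector
```

At parse time, a branch whose FIRST set does not contain the current token is never attempted, which is what eliminates the 59% of doomed branch attempts cited above.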
Two grammar/parser refinements that both reduce recursive calls:

* In parse_recursive(): when the rule has a per-token branch selector but
  the current token is not in any branch's FIRST and the rule itself is
  nullable, return 'matched empty' immediately instead of descending into
  nullable branches that would recursively do the same thing. This alone
  eliminates ~460k recursive calls on the MySQL corpus.

* At grammar build time, expand every single-branch fragment rule into
  its call sites. Fragments exist only to factor shared sub-sequences and
  their children are already flattened into the parent AST node, so
  splicing them directly into parent branches is a no-op for the
  resulting tree but removes an entire recursive call per use. 480 of the
  grammar's fragments qualify.

Also drops the dead terminal branch at the top of parse_recursive() (the
branch loop inlines terminal matching, so parse_recursive is only ever
called with non-terminal rule ids) and the always-false empty-branches
guard.

End-to-end parser benchmark:
  Before: ~22,400 QPS   After: ~27,500 QPS  (+23%)
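The build-time expansion of single-branch fragments can be sketched as follows; a Python illustration assuming acyclic fragment references (names invented):

```python
def inline_single_branch_fragments(rules, fragment_ids):
    """Splice each single-branch fragment's symbols into its call sites.

    Fragment children are flattened into the parent node anyway, so
    replacing a fragment reference with its symbol sequence leaves the
    resulting tree unchanged while removing one recursive call per use.
    Assumes fragment references are acyclic.
    """
    single = {r: rules[r][0] for r in fragment_ids if len(rules[r]) == 1}

    def expand(branch):
        out = []
        for sym in branch:
            if sym in single:
                out.extend(expand(single[sym]))  # resolve fragment chains
            else:
                out.append(sym)
        return out

    return {rule: [expand(b) for b in branches]
            for rule, branches in rules.items()}
```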
Three minor reductions in per-call work:

* Strip explicit EMPTY_RULE_ID symbols out of rule branches at grammar
  build time. The parser loop would have 'continue'd over them anyway, so
  removing them ahead of time lets the hot symbol loop drop the epsilon
  check. Pure-epsilon branches become empty branches and still match
  empty via the existing empty-children fast path.

* Cache the grammar's rules, fragment_ids, rule_names, branches_for_token,
  nullable_branches, and highest_terminal_id as direct parser instance
  fields so parse_recursive() no longer pays for a $this->grammar->...
  double hop on every call.

* Collapse the two-step node construction (new + set_children) into a
  single constructor call that takes the children array directly. This
  saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but
their runtime role is still trivial: match a sequence of symbols and have
the caller splice the resulting children into its own node. The old code
allocated a full WP_Parser_Node for each fragment match just to have the
caller immediately copy its children out.

Return the children array directly from fragments instead. The caller
distinguishes via is_array($subnode) and splices in-place, saving a
Parser_Node allocation per fragment match (~253k per 10k queries).

End-to-end parser benchmark:
  Before: ~27,000 QPS (avg)   After: ~28,700 QPS (+6%).
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of
the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only
token with id 0, is stripped by the lexer before tokens reach the
parser), so the sentinel cannot match any real terminal.

This lets the hot path drop the 'position < token_count' range check
everywhere it reads the current token id: the selector lookup at method
entry, the inline terminal match inside the branch loop, and the
post-branch INTO negative lookahead for selectStatement. Any read past
the last real token falls naturally into the nullable-fallback or
branch-miss handling.

Also drop a few dead locals ($token_count, $fragment_ids) that no
longer appear in the hot path after the change.

End-to-end parser benchmark:
  Before: ~28,700 QPS (avg)   After: ~29,800 QPS (+4%).
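The sentinel trick reduces to a few lines. A Python sketch: the real code appends a WP_Parser_Token with id 0, here reduced to bare token ids:

```python
EMPTY_RULE_ID = 0  # no real MySQL token reaches the parser with id 0

def with_sentinel(token_ids):
    """Append a sentinel id that no real terminal can match, so the hot
    path can read token_ids[position] without a range check."""
    return token_ids + [EMPTY_RULE_ID]

def match_terminal(token_ids, position, terminal_id):
    # No 'position < len(token_ids)' guard: a past-the-end read hits the
    # sentinel, which never equals a real (non-zero) terminal id.
    return token_ids[position] == terminal_id
```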
Previously the per-(rule, token) selector stored a list of branch
indexes that the parser then had to look up in $rules[$rule_id] on
every branch attempt. Store the branch symbol sequences themselves so
the hot loop can iterate candidate branches directly.

PHP arrays are copy-on-write, so sharing the same branch sequence
across selector entries for many tokens costs negligible extra memory.
The nullable_branches map shrinks to a bool marker since the parser
only uses it for existence checks.

Also cache the start rule id on the grammar so parse() skips its
array_search() across rule_names on every call.

End-to-end parser benchmark:
  Before: ~29,800 QPS (avg)   After: ~31,700 QPS (+6%).
Minor cleanup in parse_recursive(): cache the selectStatement rule id
once and compare integers on every call instead of re-comparing the
'selectStatement' string against every rule's name. Also drops the
$rules instance cache from the parser, which the hot path no longer
touches now that branch sequences are embedded in the selector.
Adopts phpcbf's trivial whitespace alignment fixes in the grammar and
parser source to keep `composer run check-cs` clean after the prior
optimisation commits added new local variables and reshaped the
selector-build code.
The per-(rule, token) branch selector stored a separate inner array per
token, even when many tokens within the same rule mapped to identical
branch lists (a single branch's FIRST set covers many tokens, for
example). Loading the MySQL grammar used ~40 MB of PHP memory, most of
which was duplicated inner arrays.

Deduplicate by signature during grammar build so all tokens that land
on the same branch list share one inner array via copy-on-write. The
inner arrays still embed the branch symbol sequences directly so the
hot loop iterates them without an extra $rules[$rule_id][$idx]
indirection per branch attempt.

Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB.
PHPUnit peak memory drops from 198 MB to 110 MB. Parser throughput is
unchanged from the previous (non-deduplicated) embedded-sequences form.
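The dedup-by-signature build step can be sketched as follows. Python shares list references much as PHP arrays share storage under copy-on-write, so one inner list serves every token that lands on the same branch list (an illustrative sketch, names invented):

```python
def dedupe_selector(selector):
    """Share one inner branch list among all tokens of a rule that map
    to identical branch lists, keyed by a structural signature."""
    shared = {}
    deduped = {}
    for rule, by_token in selector.items():
        out = {}
        for token, branches in by_token.items():
            sig = (rule, tuple(tuple(b) for b in branches))
            # setdefault keeps the first list seen for this signature
            # and reuses it for every later token with the same one.
            out[token] = shared.setdefault(sig, branches)
        deduped[rule] = out
    return deduped
```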
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every
(rule, token) entry points to exactly one branch. Those rules account
for ~55% of parse_recursive calls on the test corpus (722k of 1.3M per
10k queries).

Flag those rules at grammar build time. In parse_recursive, detect the
flag and take the only candidate branch directly, skipping the
candidate-iteration loop. On match failure, restore $position and
return false directly instead of going through the multi-candidate
branch_matches/break sequence.

End-to-end parser benchmark:
  no JIT:      ~31.6K -> ~32.6K QPS avg (+3%)
  tracing JIT: ~52.6K -> ~55.7K QPS avg (+6%)
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache
and tracing JIT specialize property access and method dispatch since
the class layout is now fixed. Small but consistent improvement
measured across multiple runs under tracing JIT (~+2% avg, ~+2% best).

End-to-end parser benchmark:
  tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best
  no JIT:      ~33K -> ~34K QPS avg, 35K QPS best
Apply lexer optimisations from PR #375:

- Cache `strlen($sql)` once in `$sql_length` instead of recomputing on each
  EOF check.
- Replace `strspn($byte, MASK) > 0` with direct byte comparisons
  (`$byte >= '0' && $byte <= '9'`, `false !== strpos(MASK, $byte)`,
  unrolled whitespace check).
- Use `strpos($sql, '*/', $pos)` instead of a manual scan loop in
  `read_comment_content()`.
- In `read_quoted_text()`, use `strpos()` to find the next quote, eliminating
  the separate end-of-input check that follows the `strcspn()` scan.
- Inline `next_token()` + `get_token()` in `remaining_tokens()` so the hot
  loop builds tokens directly.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
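The strpos()-based comment scan maps to a single substring search. A Python analogue using str.find (an illustrative sketch, not the PR's code):

```python
def read_comment_content(sql, pos):
    """Find the end of a C-style comment body starting at pos with one
    substring search, the analogue of PHP's strpos($sql, '*/', $pos)."""
    end = sql.find('*/', pos)
    if end == -1:
        return sql[pos:], len(sql)    # unterminated: consume the rest
    return sql[pos:end], end + 2      # content, position after '*/'
```

The win comes from replacing a per-byte PHP-level loop with one call that scans in native code.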
Token construction is on the lexer hot path; bypassing the
`WP_Parser_Token::__construct()` indirection and assigning the four
properties directly removes one method call per token.

Requires `$input` on `WP_Parser_Token` to be `protected` instead of
`private` so the subclass can write to it.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #375
`! empty( $this->children )` short-circuits without calling `count()`,
saving one function call per invocation.

Co-authored-by: Adam Zieliński <adam@adamziel.com>

Adapted from #376
Both next_token() and remaining_tokens() previously paid a
read_next_token() function call per whitespace run only to
recognise and skip the resulting WHITESPACE token. A single
unguarded strspn() at the top of each loop iteration absorbs
the run inline, saving the call overhead for ~one whitespace
run per real token across millions of tokens.

The strspn() call is unguarded because an unconditional strspn()
(which returns 0 in a single C-side call when nothing matches)
is faster than gating it on a five-arm '$byte === ...' precheck.
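The inline whitespace skip can be illustrated as follows. Python has no strspn(), so this sketch uses an explicit scan; in PHP the whole run is absorbed by one C-side strspn() call (names invented):

```python
WS = " \t\r\n"  # the whitespace bytes the lexer skips

def skip_whitespace(sql, pos):
    """Absorb a leading whitespace run inline. Like an unguarded
    strspn(), it is called unconditionally: with no whitespace present
    it simply advances by zero."""
    n = len(sql)
    while pos < n and sql[pos] in WS:
        pos += 1
    return pos
```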
ASCII letters and UTF-8 multibyte start bytes account for most
token-start bytes on the MySQL corpus. They previously fell into
the catch-all `else` at the bottom of read_next_token() after
walking every operator arm in between. The new branch sits at
the top of the elseif chain and dispatches them directly.

The `next_byte !== "'"` guard keeps the x'..', n'..' and similar
specials on their dedicated branches. `_` and `$` starters stay
on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
The ASCII bytes (, ), ',' ;, +, ~, %, ^, ?, {, }, and = each
map to a unique single-byte token type with no lookahead. A
static array + isset() arm dispatches them in one lookup,
short-circuiting the per-byte elseif arms further down the
chain.

'*' and '|' are deliberately excluded because their token type
depends on context (in_mysql_comment for '*/', SQL_MODE_PIPES_
AS_CONCAT for '||').
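The dispatch-table arm reduces to one hash lookup. A Python sketch with invented token-type names (the real map uses the lexer's token-id constants):

```python
# Single-byte operators whose token type needs no lookahead.
# '*' and '|' are excluded: their type depends on context
# (in_mysql_comment for '*/', PIPES_AS_CONCAT for '||').
SINGLE_BYTE_TOKENS = {
    '(': 'LPAREN', ')': 'RPAREN', ',': 'COMMA', ';': 'SEMI',
    '+': 'PLUS', '~': 'TILDE', '%': 'PERCENT', '^': 'CARET',
    '?': 'QUESTION', '{': 'LBRACE', '}': 'RBRACE', '=': 'EQ',
}

def dispatch(byte):
    """One hash lookup replaces a chain of per-byte comparisons;
    returns None for bytes that need the slower elseif arms."""
    return SINGLE_BYTE_TOKENS.get(byte)
```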
@JanJakes JanJakes changed the title Parser + lexer performance: consolidated 2.4× end-to-end speedup Parser + lexer performance: consolidated 2–3× end-to-end speedup Apr 29, 2026
@JanJakes JanJakes marked this pull request as ready for review April 29, 2026 15:04
@JanJakes JanJakes requested a review from adamziel April 29, 2026 15:04