Parser + lexer performance: consolidated 2–3× end-to-end speedup#378
Hot-path changes in `WP_Parser::parse_recursive()`:

- Inline the terminal match in the branch loop instead of recursing into `parse_recursive()` for every token. Over the full MySQL test suite this eliminates ~1.6M function calls.
- Hoist `grammar`, `rules`, `fragment_ids`, `rule_names`, `tokens`, and `token_count` into local variables so the inner loops avoid repeated property lookups on `$this->grammar`.
- Cache the token count on the instance to avoid a `count()` per call.
- Build branch children in a local array and only instantiate the `WP_Parser_Node` once the branch has matched; on the MySQL corpus ~75% of speculative nodes were previously created and thrown away.
- Drop a dead `is_array($subnode)` check that never fires in practice (subnodes are `false`, `true`, tokens, or nodes, never arrays).
- Inline fragment inlining: read the fragment's children directly instead of building a fragment node and immediately merging it.

End-to-end parser benchmark on the MySQL server test corpus:
Before: ~11,500 QPS
After: ~14,900 QPS (+29%)
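The deferred-allocation idea above (collect children locally, allocate the node only on a full match) is language-agnostic. A minimal Python sketch, not the PR's PHP code; `Node`, `try_branch`, and the `False`/`True` return convention are assumptions for the example:

```python
class Node:
    """Parse-tree node; only allocated once its branch has fully matched."""
    __slots__ = ("rule_id", "children")

    def __init__(self, rule_id, children):
        self.rule_id = rule_id
        self.children = children

def try_branch(match_symbol, branch, rule_id):
    """Collect children in a plain list and only construct the Node when
    the whole branch matches; a failed branch allocates nothing.
    match_symbol returns False on mismatch, True for 'matched empty',
    or a child value."""
    children = []
    for sym in branch:
        child = match_symbol(sym)
        if child is False:
            return False  # speculative work discarded, no Node created
        if child is not True:  # True means "matched empty": nothing to keep
            children.append(child)
    return Node(rule_id, children)
```

The point is purely allocation order: a branch that fails at its third symbol never pays for a node object.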
The grammar now precomputes FIRST and NULLABLE via fixpoint, then indexes each rule's branches by the tokens that can start them. At parse time the parser jumps straight to the candidate branches for the current token instead of iterating every branch and letting most fail. On the full MySQL test suite, 59% of branch attempts previously failed because the first token could never match the branch's FIRST set; with per-branch lookahead those attempts are eliminated. End-to-end parser benchmark: Before: ~14,900 QPS After: ~22,400 QPS (+50%)
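The FIRST/NULLABLE fixpoint and the per-token branch index described above can be sketched as follows. This is a Python illustration of the technique, not the PR's PHP implementation; the grammar encoding (terminals as quoted strings, `{rule: [branch, ...]}` shape) is an assumption for the example:

```python
def is_terminal(sym):
    # Encoding assumption for this sketch: terminals are quoted strings.
    return sym.startswith("'")

def compute_nullable_and_first(grammar):
    """Fixpoint over {rule: [branch, ...]} where a branch is a list of
    symbols: NULLABLE rules can derive the empty string, and FIRST[rule]
    is the set of terminals that can begin the rule."""
    nullable = set()
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:  # iterate until nothing new is learned
        changed = False
        for rule, branches in grammar.items():
            for branch in branches:
                all_nullable = True
                for sym in branch:
                    if is_terminal(sym):
                        if sym not in first[rule]:
                            first[rule].add(sym)
                            changed = True
                        all_nullable = False
                        break
                    # non-terminal: its FIRST feeds ours
                    new = first[sym] - first[rule]
                    if new:
                        first[rule] |= new
                        changed = True
                    if sym not in nullable:
                        all_nullable = False
                        break
                if all_nullable and rule not in nullable:
                    nullable.add(rule)
                    changed = True
    return nullable, first

def branches_for_token(grammar, nullable, first):
    """Index each rule's branches by the tokens that can start them, so
    a parser can jump straight to the candidate branches for the
    current token instead of trying every branch."""
    selector = {}
    for rule, branches in grammar.items():
        sel = selector.setdefault(rule, {})
        for branch in branches:
            for sym in branch:
                starters = {sym} if is_terminal(sym) else first[sym]
                for t in starters:
                    bucket = sel.setdefault(t, [])
                    if branch not in bucket:
                        bucket.append(branch)
                # stop at the first symbol that cannot match empty
                if is_terminal(sym) or sym not in nullable:
                    break
    return selector
```

With this index, a token outside every branch's FIRST set fails the rule in one dictionary lookup, which is exactly the 59% of branch attempts the commit eliminates.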
Two grammar/parser refinements that both reduce recursive calls:

- In `parse_recursive()`: when the rule has a per-token branch selector but the current token is not in any branch's FIRST and the rule itself is nullable, return 'matched empty' immediately instead of descending into nullable branches that would recursively do the same thing. This alone eliminates ~460k recursive calls on the MySQL corpus.
- At grammar build time, expand every single-branch fragment rule into its call sites. Fragments exist only to factor shared sub-sequences and their children are already flattened into the parent AST node, so splicing them directly into parent branches is a no-op for the resulting tree but removes an entire recursive call per use. 480 of the grammar's fragments qualify.

Also drops the dead terminal branch at the top of `parse_recursive()` (the branch loop inlines terminal matching, so `parse_recursive()` is only ever called with non-terminal rule ids) and the always-false empty-branches guard.

End-to-end parser benchmark:
Before: ~22,400 QPS
After: ~27,500 QPS (+23%)
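The build-time fragment expansion can be sketched as a simple grammar rewrite. A Python illustration under stated assumptions (fragments are non-recursive, which holds for factored-out shared sub-sequences; the function name and data shape are hypothetical):

```python
def inline_single_branch_fragments(rules, fragment_ids):
    """Splice each single-branch fragment's symbol sequence directly
    into its call sites. Repeats until fixpoint because a fragment may
    itself reference other fragments. Assumes fragments are
    non-recursive, as factored shared sub-sequences are."""
    expandable = {f: rules[f][0] for f in fragment_ids if len(rules[f]) == 1}
    changed = True
    while changed:
        changed = False
        for branches in rules.values():
            for i, branch in enumerate(branches):
                if not any(sym in expandable for sym in branch):
                    continue  # nothing to splice in this branch
                new = []
                for sym in branch:
                    if sym in expandable:
                        new.extend(expandable[sym])  # splice, no call at runtime
                        changed = True
                    else:
                        new.append(sym)
                branches[i] = new
    return rules
```

Because fragment children are flattened into the parent node anyway, the rewritten grammar produces byte-identical trees while saving one recursive call per fragment use.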
Three minor reductions in per-call work:

- Strip explicit `EMPTY_RULE_ID` symbols out of rule branches at grammar build time. The parser loop would have `continue`'d over them anyway, so removing them ahead of time lets the hot symbol loop drop the epsilon check. Pure-epsilon branches become empty branches and still match empty via the existing empty-children fast path.
- Cache the grammar's `rules`, `fragment_ids`, `rule_names`, `branches_for_token`, `nullable_branches`, and `highest_terminal_id` as direct parser instance fields so `parse_recursive()` no longer pays for a `$this->grammar->...` double hop on every call.
- Collapse the two-step node construction (`new` + `set_children`) into a single constructor call that takes the children array directly. This saves a method call per allocated node (~820k across the MySQL corpus).

End-to-end parser benchmark: ~27,500 QPS -> ~28,500 QPS (+3.5%).
Multi-branch fragment rules can't be expanded at grammar build time, but their runtime role is still trivial: match a sequence of symbols and have the caller splice the resulting children into its own node. The old code allocated a full WP_Parser_Node for each fragment match just to have the caller immediately copy its children out. Return the children array directly from fragments instead. The caller distinguishes via is_array($subnode) and splices in-place, saving a Parser_Node allocation per fragment match (~253k per 10k queries). End-to-end parser benchmark: Before: ~27,000 QPS (avg) After: ~28,700 QPS (+6%).
Add a sentinel WP_Parser_Token with id EMPTY_RULE_ID (0) to the end of the token array. Real MySQL tokens never have id 0 (WHITESPACE, the only token with id 0, is stripped by the lexer before tokens reach the parser), so the sentinel cannot match any real terminal. This lets the hot path drop the 'position < token_count' range check everywhere it reads the current token id: the selector lookup at method entry, the inline terminal match inside the branch loop, and the post-branch INTO negative lookahead for selectStatement. Any read past the last real token falls naturally into the nullable-fallback or branch-miss handling. Also drop a few dead locals ($token_count, $fragment_ids) that no longer appear in the hot path after the change. End-to-end parser benchmark: Before: ~28,700 QPS (avg) After: ~29,800 QPS (+4%).
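The sentinel pattern above generalises beyond PHP. A minimal Python sketch (class and function names are illustrative, not the PR's API):

```python
EMPTY_RULE_ID = 0  # WHITESPACE's id; the lexer strips it before parsing

class Token:
    __slots__ = ("token_id", "text")

    def __init__(self, token_id, text):
        self.token_id = token_id
        self.text = text

def with_sentinel(tokens):
    """Append one sentinel token whose id (0) can never equal a real
    terminal id. Reads past the last real token see id 0 and simply
    fail to match, so the hot path needs no range check."""
    return tokens + [Token(EMPTY_RULE_ID, "")]

def terminal_matches(tokens, position, terminal_id):
    # Safe without a `position < token_count` guard: position never
    # advances past the sentinel, because the sentinel never matches.
    return tokens[position].token_id == terminal_id
```

The invariant doing the work is that no match ever consumes the sentinel, so `position` is bounded by the sentinel's index and every read stays in range.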
Previously the per-(rule, token) selector stored a list of branch indexes that the parser then had to look up in $rules[$rule_id] on every branch attempt. Store the branch symbol sequences themselves so the hot loop can iterate candidate branches directly. PHP arrays are copy-on-write, so sharing the same branch sequence across selector entries for many tokens costs negligible extra memory. The nullable_branches map shrinks to a bool marker since the parser only uses it for existence checks. Also cache the start rule id on the grammar so parse() skips its array_search() across rule_names on every call. End-to-end parser benchmark: Before: ~29,800 QPS (avg) After: ~31,700 QPS (+6%).
Minor cleanup in parse_recursive(): cache the selectStatement rule id once and compare integers on every call instead of re-comparing the 'selectStatement' string against every rule's name. Also drops the $rules instance cache from the parser, which the hot path no longer touches now that branch sequences are embedded in the selector.
Adopts phpcbf's trivial whitespace alignment fixes in the grammar and parser source to keep `composer run check-cs` clean after the prior optimisation commits added new local variables and reshaped the selector-build code.
The per-(rule, token) branch selector stored a separate inner array per token, even when many tokens within the same rule mapped to identical branch lists (a single branch's FIRST set covers many tokens, for example). Loading the MySQL grammar used ~40 MB of PHP memory, most of which was duplicated inner arrays. Deduplicate by signature during grammar build so all tokens that land on the same branch list share one inner array via copy-on-write. The inner arrays still embed the branch symbol sequences directly so the hot loop iterates them without an extra $rules[$rule_id][$idx] indirection per branch attempt. Grammar memory on the MySQL grammar drops from ~40 MB to ~10 MB. PHPUnit peak memory drops from 198 MB to 110 MB. Parser throughput is unchanged from the previous (non-deduplicated) embedded-sequences form.
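The deduplication step is a straightforward signature-keyed cache over the built selector. A Python sketch of the idea (in PHP the sharing is free under copy-on-write; in this illustration it simply avoids duplicate objects):

```python
def dedup_selector(selector):
    """Make all (rule, token) entries with the same branch-sequence
    signature share one inner list. Safe because the selector is
    read-only after grammar build (and copy-on-write in PHP)."""
    cache = {}
    for by_token in selector.values():
        for token, branches in by_token.items():
            sig = tuple(tuple(branch) for branch in branches)
            by_token[token] = cache.setdefault(sig, branches)
    return selector
```

After deduplication, every token whose candidate list is identical (e.g. all tokens in one branch's FIRST set) points at a single shared array, which is where the ~40 MB to ~10 MB drop comes from.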
On the MySQL grammar, 1,290 of 1,916 rules have a selector where every (rule, token) entry points to exactly one branch. Those rules account for ~55% of parse_recursive calls on the test corpus (722k of 1.3M per 10k queries). Flag those rules at grammar build time. In parse_recursive, detect the flag and take the only candidate branch directly, skipping the candidate-iteration loop. On match failure, restore $position and return false directly instead of going through the multi-candidate branch_matches/break sequence. End-to-end parser benchmark: no JIT: ~31.6K -> ~32.6K QPS avg (+3%) tracing JIT: ~52.6K -> ~55.7K QPS avg (+6%)
Nothing extends WP_Parser_Node. Marking it final lets PHP's opcache and tracing JIT specialize property access and method dispatch since the class layout is now fixed. Small but consistent improvement measured across multiple runs under tracing JIT (~+2% avg, ~+2% best). End-to-end parser benchmark: tracing JIT: ~57K -> ~57-58K QPS avg, 60-61K QPS best no JIT: ~33K -> ~34K QPS avg, 35K QPS best
Apply lexer optimisations from PR #375:

- Cache `strlen($sql)` once in `$sql_length` instead of recomputing on each EOF check.
- Replace `strspn($byte, MASK) > 0` with direct byte comparisons (`$byte >= '0' && $byte <= '9'`, `false !== strpos(MASK, $byte)`, unrolled whitespace check).
- Use `strpos($sql, '*/', $pos)` instead of a manual scan loop in `read_comment_content()`.
- In `read_quoted_text()`, use `strpos()` to find the next quote, eliminating the separate end-of-input check that follows the `strcspn()` scan.
- Inline `next_token()` + `get_token()` in `remaining_tokens()` so the hot loop builds tokens directly.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Adapted from #375
Token construction is on the lexer hot path; bypassing the `WP_Parser_Token::__construct()` indirection and assigning the four properties directly removes one method call per token. Requires `$input` on `WP_Parser_Token` to be `protected` instead of `private` so the subclass can write to it. Co-authored-by: Adam Zieliński <adam@adamziel.com> Adapted from #375
`! empty( $this->children )` short-circuits without calling `count()`, saving one function call per invocation. Co-authored-by: Adam Zieliński <adam@adamziel.com> Adapted from #376
Both next_token() and remaining_tokens() previously paid a read_next_token() function call per whitespace run only to recognise and skip the resulting WHITESPACE token. A single unguarded strspn() at the top of each loop iteration absorbs the run inline, saving the call overhead for ~one whitespace run per real token across millions of tokens. The strspn() call is unguarded because an unconditional strspn() (which returns 0 in a single C-side call when nothing matches) is faster than gating it on a five-arm '$byte === ...' precheck.
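The unguarded-scan reasoning can be made concrete with a Python stand-in for `strspn()` (Python has no direct equivalent; the helper and the toy tokenizer below are illustrative, not the lexer's actual code):

```python
WHITESPACE = " \t\n\r\f"
WORD_BYTES = ("abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_")

def strspn(s, mask, pos):
    """Stand-in for PHP's strspn($s, $mask, $pos): length of the run of
    mask characters starting at pos. Returns 0 when nothing matches."""
    n = pos
    while n < len(s) and s[n] in mask:
        n += 1
    return n - pos

def remaining_tokens(sql):
    """Toy token loop: one unguarded whitespace skip at the top of each
    iteration absorbs a whitespace run inline, instead of producing a
    WHITESPACE token and paying a function call to skip it."""
    tokens, pos = [], 0
    while pos < len(sql):
        pos += strspn(sql, WHITESPACE, pos)  # 0-cost no-op when no WS
        if pos >= len(sql):
            break
        start = pos
        pos += max(1, strspn(sql, WORD_BYTES, pos))
        tokens.append(sql[start:pos])
    return tokens
```

In C (and hence in PHP's `strspn`), the no-match case returns 0 from a single library call, which is the commit's argument for skipping the multi-arm byte precheck.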
ASCII letters and UTF-8 multibyte start bytes account for most token-start bytes on the MySQL corpus. They previously fell into the catch-all `else` at the bottom of read_next_token() after walking every operator arm in between. The new branch sits at the top of the elseif chain and dispatches them directly. The `next_byte !== "'"` guard keeps the x'..', n'..' and similar specials on their dedicated branches. `_` and `$` starters stay on the catch-all so the UNDERSCORE_CHARSET lookup still fires.
The ASCII bytes `(`, `)`, `,`, `;`, `+`, `~`, `%`, `^`, `?`, `{`, `}`, and `=` each map to a unique single-byte token type with no lookahead. A static array + `isset()` arm dispatches them in one lookup, short-circuiting the per-byte elseif arms further down the chain.

`*` and `|` are deliberately excluded because their token type depends on context (`in_mysql_comment` for `*/`, `SQL_MODE_PIPES_AS_CONCAT` for `||`).
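The dispatch-map idea in Python terms (a dict lookup playing the role of PHP's static array + `isset()`; the numeric token ids below are hypothetical, the real ones come from the MySQL lexer's token table):

```python
# Hypothetical token ids for illustration only.
SINGLE_BYTE_TOKENS = {
    "(": 1, ")": 2, ",": 3, ";": 4, "+": 5, "~": 6,
    "%": 7, "^": 8, "?": 9, "{": 10, "}": 11, "=": 12,
}

def single_byte_token_id(byte):
    """One hash lookup replaces a dozen per-byte elseif arms. Returns
    None for bytes whose token type is context-dependent ('*' inside
    comments, '|' under PIPES_AS_CONCAT) or that need lookahead."""
    return SINGLE_BYTE_TOKENS.get(byte)
```

Keeping `*` and `|` out of the map is what preserves the context-sensitive branches further down the chain.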
Summary
This branch consolidates the parser-performance optimisations from #373, the lexer + token-construction wins from #375, and the `has_child()` micro-opt from #376 into a single clean, linear history. It squashes intermediate refactors and drops abandoned-experiment commits so each commit on this branch is one shippable change.

End-to-end (lex+parse) on the 69,577-query MySQL server corpus, best across 3 ABAB-alternated rounds × 5 timed iterations (with 2 warmup iterations per round to fully heat the tracing JIT):
Numbers above are PHP 8.5. PHP 8.1 verified within ~5% on the same machine (1.62× / 2.24× / 2.13× / 2.64× across the four configs).
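The ABAB-alternated, best-of-N protocol used for these numbers can be sketched generically. A Python illustration (the benchmarks themselves are PHP scripts; function names here are hypothetical):

```python
import time

def bench(fn, iters=5, warmup=2):
    """Best-of-N wall-clock timing for one variant. Warmup iterations
    heat caches (and, for PHP under test, the tracing JIT) before any
    timed run."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

def abab(fn_a, fn_b, rounds=3):
    """Alternate the two variants inside every round so slow drift
    (thermal throttling, background load) penalises both variants
    roughly equally instead of biasing whichever ran last."""
    best_a = best_b = float("inf")
    for _ in range(rounds):
        best_a = min(best_a, bench(fn_a))
        best_b = min(best_b, bench(fn_b))
    return best_a, best_b
```

Taking the best rather than the mean is a deliberate choice for microbenchmarks: interference only ever adds time, so the minimum is the cleanest estimate of the code's true cost.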
Relationship to #373, #375, #376
This PR is built from the best parts of three independent efforts. The empirical decomposition (full-corpus, fresh measurements per leg) is the basis for the picks below.
From #375, the lexer-side wins were kept (cached `$sql_length`, byte comparisons replacing `strspn`/`strcspn` mask checks, `strpos`-based comment-end and quote scans, inlined `remaining_tokens()`), plus the `WP_MySQL_Token` constructor shortcut that bypasses `parent::__construct()`. The parser-side overlap with #375 ([codex] Speed up MySQL lexing and parsing) was deliberately not applied: #373 (Parser performance: 2-3x speedup via grammar preprocessing and interpreter optimisations) already has equivalent or stronger implementations of the same optimisations (single-candidate fast path, embedded branch sequences with copy-on-write dedup, end-of-input sentinel token, integer rule-id comparisons, etc.). Mixing the two parsers measured no incremental win; layering only the lexer side gave the cleanest result.

From #376, the `! empty( $children )` micro-opt was kept. The `parse_recursive()` outer-loop split was a different shape than the existing single-candidate fast path on this branch and didn't add value.

Approach
The experiments under `tests/tools/` (regex grammar matcher, grammar compilation experiment, PCRE2 callouts research) were dropped; each is a documented dead-end that doesn't ship in `src/`. Two benchmark helper scripts that were used only for measurement were also dropped from the consolidated branch.

Cost vs benefit
Per-commit marginal contributions, measured cumulatively against the previous state (end-to-end JIT, full corpus, with ABAB confirmation on small/suspect deltas). Source-file LOC excludes test-tool additions.
Big, robust wins
- marking `WP_Parser_Node` `final` (lets opcache/JIT specialise)

These eight optimisations carry ~95% of the cumulative speedup. Total cost: ~278 src LOC for ~2.0× combined.
Lexer dispatch fast paths (added after the original measurement)
Three lexer-side commits restructure `read_next_token()`'s elseif chain into early fast paths for the most common token-start bytes. Marginal end-to-end JIT, measured per-commit on the same corpus:

- whitespace skipping (one unguarded `strspn()` per token loop, replacing per-WS-run `read_next_token()` round-trips)
- identifier dispatch (ASCII-letter / UTF-8 starters, excluding `_`/`$`, as the first arm, with `next_byte !== "'"` to keep `x'..'`/`n'..'` on their dedicated branches)
- a static `byte → token id` map for `(`, `)`, `,`, `;`, `+`, `~`, `%`, `^`, `?`, `{`, `}`, `=` as a chain arm in front of the per-byte elseifs

Total cost: +42 src LOC for +13% end-to-end JIT / +55% lex-only on top of the previous tip. The lex side is where most of the gain lives; end-to-end dilutes it because lexing is only ~13% of total parse time under JIT.
Small but plausibly real (1–2% range, near noise floor)
Each is ≤2% individually, all positive. Total cost: ~127 LOC for an estimated +6% combined. The "Embed branch-symbol sequences" row introduces a small `WP_Parser_Grammar::get_or_cache_rule_id($name)` helper that the start-rule and `selectStatement` lookups now share; the "Compare select-statement" row drops the parser-side ad-hoc lazy cache in favour of that helper, which is why its LOC is small.

Effectively zero under JIT (production case)
- `strpos` for comment-end and quote scans
- `! empty()` instead of `count() > 0` in `has_child()`; `has_child()` is only called from tests, so this is one-line idiom hygiene
- keeping `composer run check-cs` clean

Verdict
Nothing on the branch is harming runtime; nothing kept is an "accidental attempt." The three "zero under JIT" commits either benefit non-JIT environments (lexer changes), are pure correctness/idiom improvements (one-line changes), or are required CS hygiene at zero LOC cost.
Optimizations not picked up
Multi-shape regex fast-path
A separate experimental implementation detects 13 common query shapes (INSERT/SELECT/UPDATE/DELETE/DROP/SHOW/USE/TRUNCATE/SET/EXPLAIN/BEGIN/COMMIT/ROLLBACK) via a single PCRE2 union pattern over a codepoint-encoded token stream, then builds the AST directly without descending into the recursive parser.
Tradeoff: the biggest individual win available, but at ~42 LOC per percent of speedup vs ~7 LOC per percent for the optimisations on this branch. Worth landing if the LOC vs maintenance cost is acceptable. Deferred from this PR to keep the consolidation scoped.
AST caching
Lex-once-then-cache by parameterised token-stream signature evaluates as the single highest-leverage optimisation for typical WordPress workloads (up to ~65× speedup on workloads that fit a 200-entry LRU cap, ~2.8 MB memory cost at that cap), and will be proposed in a separate PR.
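The proposed caching scheme can be sketched as an LRU map keyed on a parameterised token-stream signature. A Python illustration of the idea ahead of that separate PR; the signature scheme, class name, and token encoding below are assumptions for the example:

```python
from collections import OrderedDict

class AstCache:
    """Lex-once-then-cache sketch: key on the token stream with literal
    values parameterised out, so `SELECT * FROM t WHERE id = 1` and
    `... WHERE id = 2` share one cached AST shape. Capacity mirrors the
    200-entry LRU cap described above."""

    def __init__(self, capacity=200):
        self.capacity = capacity
        self.entries = OrderedDict()

    @staticmethod
    def signature(tokens):
        # Tokens are (kind, text) pairs in this sketch; literals are
        # replaced with a placeholder so parameter values don't matter.
        return tuple("?" if kind == "literal" else text
                     for kind, text in tokens)

    def get_or_parse(self, tokens, parse):
        key = self.signature(tokens)
        if key in self.entries:
            self.entries.move_to_end(key)  # LRU touch
            return self.entries[key]
        ast = parse(tokens)
        self.entries[key] = ast
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least-recently-used
        return ast
```

Note the cached value is shared across hits, so a real implementation must either return immutable ASTs or deep-copy on the way out.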
Unbuilt ideas
- collapsing the `expr → boolPri → predicate → bitExpr → simpleExpr → …` chain

Test plan
- `composer run test` (mysql-on-sqlite): 684/684, 1,427,724 assertions
- `composer run check-cs`: clean

Draft so the CI runtime can be compared against #373 before deciding which to land.