Multithreading for KeY: combined preview with the performance series#3843
Draft
unp1 wants to merge 45 commits into
Draft
Multithreading for KeY: combined preview with the performance series#3843unp1 wants to merge 45 commits into
unp1 wants to merge 45 commits into
Conversation
The Java program inside a modality was matched by a single monolithic MatchProgramInstruction delegating to ProgramElement.match. Make the generic program part (ordinary statements/expressions: class + exact arity + child recursion, and non-list program schema variables) matchable by the same instruction VM the rest of the find pattern uses: - MatchProgramElementInstruction (class + exact arity, generic over SyntaxElement) and MatchSubProgramInstruction (runs a sub-program over the modality's program via its own cursor). - The generator converts such programs into a VMInstruction sub-program; anything it does not handle falls back to MatchProgramInstruction. - Seams introduced here: MatchProgram (the match-program abstraction implemented by VMProgramInterpreter, and later by the compiled matcher) and ProgramChildrenMatcher (for matching a run of program children). Gated behind -Dkey.matcher.programInstructions (read at matcher construction so it can be toggled by reloading; default off → unchanged monolithic path); behaviour-preserving when on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The .. ... pattern of symbolic-execution taclets (ContextStatementBlock) is matched specially: variable-length prefix descent to the active statement, inner execution context, active-statement matching, and prefix/suffix position bookkeeping. Convert phase (3) -- the active statements -- to VM instructions while keeping the intricate phases (1)(2)(4) in place: - ContextStatementBlock.match gains a phase-(3) seam taking a ProgramChildrenMatcher; the default still uses matchChildren, but a supplied matcher (a VM sub-program here) can drive the active-statement matching instead. - MatchContextStatementBlockInstruction wires a context block to that seam. - VMProgramInterpreter.matchChildrenFrom runs a sub-program over a run of source children from a child offset (the active statements). - The generator emits the context-block instruction for a top-level context block, falling back when an active statement is not convertible. Same -Dkey.matcher.programInstructions gate; behaviour-preserving when on.
CompiledMatchProgram is a second MatchProgram backend that navigates the term and Java-program structure directly (term.op()/sub(i) and SyntaxElement.getChild), avoiding the PoolSyntaxElementCursor entirely. It compiles essentially the whole find-taclet base: ordinary operators and schema variables, bound variables (quantifiers/substitutions), modalities with their program (generic programs and context blocks), parametric function instances and elementary updates; program elements with their own match (value literals, type refs, loops) and variable-arity children (list SVs #slist) are reused cursor-free by delegating to their own match. VMTacletMatcher selects the compiled find-matcher when -Dkey.matcher.compiled is set (read at construction, so toggling it and reloading switches matchers; default off → the interpreter, which stays the source of truth). Behaviour-preserving; ~1.2-1.7x on matcher-bound proving. CompiledMatchProgramTest checks the compiled matcher against explicit expectations (propositional, function and bound-variable patterns, success and failure).
testRAP (generated runAllProofs regression suite, in-process ProveTest per fork) now runs on up to 10 parallel JVMs (-PrapForks=N), with a configurable per-fork heap (-PrapHeap), and forwards the compiled-matcher switch (-Pmatcher.compiled=true / -Dkey.matcher.compiled) to the proof runs so the regression suite can exercise the compiled matcher.
…RE MERGE] Development-only verification/measurement, not intended for the PR: - ProgramMatchDifferentialTest: every find-taclet matched by the interpreter oracle vs the converted/compiled matchers over a real-proof term corpus, asserting identical results (success/failure + instantiations). - CompiledMatchProgramBenchmark / ContextMatchBenchmark: isolated interpreter-vs-compiled matching-time measurements.
A new language-agnostic module holding the match-plan IR from which both find-matcher back-ends (the VMProgramInterpreter and the cursor-free compiled matcher) are derived from a single description, so a construct described once drives both. - MatchPlan: the IR node (emitInstructions for the interpreter, compile for the compiled matcher); OperatorPlan / SchemaVarPlan cover the term skeleton. - MatchHead: the operator-specific check (no subterm recursion); GenericOperatorHead handles ordinary operators. - BinderMatcher / ProgramMatchHook: the two cross-language SPIs (bound variables and the modality program AST), kept abstract here so other ncore-based provers (Rust, Solidity) can reuse the framework. The module depends only on key.ncore / key.ncore.calculus / key.util (no Java-DL types); key.core gains a dependency on it.
The single Java-DL dispatch (JavaMatchPlanBuilder.buildPlan) builds a MatchPlan for a find pattern; both back-ends are then derived from it. It covers the FOL term skeleton, elementary updates, parametric function instances and modalities, falling back to the legacy matchers only for term labels. - Heads ElementaryUpdateHead / ParametricFunctionHead / ModalityHead carry the operator-specific interpreter + compiled fragments, lifted verbatim from the hand-written generator and compiled matcher (so behaviour is preserved). - JavaBinderMatcher / JavaProgramMatchHook implement the two SPIs (bound-variable binding/renaming; the JavaBlock / ContextStatementBlock program matching). - CompiledMatchProgram.compiledProgramMatcher is extracted from compileModality (now keyed on the JavaBlock) so the compiled matcher and the program hook share one program-matching implementation; buildProgramInstruction is made package-visible for the hook's interpreter side. - JavaMatchPlanBuilder also exposes the production facades interpreterProgram / compiledProgram (framework-built, with legacy fallback for term labels).
The find and assumes matchers are now built via JavaMatchPlanBuilder (interpreterProgram / compiledProgram) instead of calling the two hand-written dispatches directly, making the match-plan IR the single source of truth in production. The facades fall back to the legacy generator / compiled matcher for the constructs the framework does not build yet (term labels), so behaviour is unchanged. The key.matcher.compiled / key.matcher.programInstructions flags keep their meaning.
…ROP BEFORE MERGE] Extends the dev-only differential test and micro-benchmark (added in the "matcher differential test + isolated benchmarks" drop commit) to also exercise the match-plan framework alongside the hand-written matchers: - ProgramMatchDifferentialTest builds the plan and verifies its interpreter (with program-instruction conversion both off and on, since production reads that flag) and its compiled matcher against the legacy oracle (24.8M comparisons). - CompiledMatchProgramBenchmark times the framework-built matchers next to the hand-written ones for both back-ends (the no-overhead check). Like the commit it extends, this is dropped before merge.
…e source)
With both back-ends now derived from the match-plan framework, the two
hand-written per-construct dispatches are redundant and are removed, leaving the
framework as the single source of truth for find-matching:
- delete CompiledMatchProgram's term-level dispatch (compile / compileStep /
compileCore / compile{ElementaryUpdate,ParametricFunction,Modality}); the heads
already carry that logic. Its reused Java-program helpers (compiledProgramMatcher,
compileProgram, delegateToMatch, compileActiveStatements) move to a small
JavaProgramCompiler used by the program hook.
- delete SyntaxElementMatchProgramGenerator's createProgram dispatch; only the
program-instruction conversion helpers (buildProgramInstruction & co), which the
hook reuses, remain.
- migrate term labels into the framework (JavaMatchPlanBuilder.LabelPlan, reusing
the matchTermLabelSV instruction) so buildPlan is total; the facades no longer
fall back -- an unsupported pattern raises a clear error naming the missing head
(no current taclet hits this; the whole standard base is covered).
- retarget CompiledMatchProgramTest to the framework facade.
Net: ~390 fewer lines of production matcher code, no duplicated dispatch. The
interpreter/compiled engines and the program helpers are unchanged.
…only benchmark The differential test and the context benchmark depend on the hand-written matchers as an independent oracle, which no longer exist in this branch. They are retained on a separate development branch (with the reference interpreter) as the regression net, and removed here. CompiledMatchProgramBenchmark is retargeted to compare the framework's interpreter vs its compiled matcher (no oracle).
Selecting the compiled find-matcher previously needed -Dkey.matcher.compiled,
which is awkward to pass through Gradle when trying it out. Expose it as a KeY
feature flag (Settings -> Feature Flags) too:
- VMTacletMatcher.COMPILED_MATCHER_FEATURE ("MATCHER_COMPILED"); the matcher is
selected when the system property OR the feature flag is set. The property is
kept for headless / CI (testRAP).
- SettingsManager references the flag so it is registered and shown in the
feature-settings panel on a fresh start, before any proof is loaded.
The flag is read per taclet at construction, so it applies to newly loaded proofs
(or after reloading the current one); the panel shows a "reload required" notice
(restartRequired = true). No on-the-fly switch of an open proof's matchers.
The cursor-free compiled find-matcher is now selected by default; the legacy interpreter becomes an opt-out via -Dkey.matcher.interpreter or the MATCHER_INTERPRETER feature flag (and remains the automatic fallback for patterns the compiler does not handle). Differential testing established the two back-ends are byte-identical, so this only changes which one runs, not any proof.
Carry a rule-app container's strategy cost forward across the per-round re-expansion instead of recomputing it, when the taclet's cost is a pure function of the app + find subterm (plus the always-refreshed age term and NonDuplicateApp vetoes). Sound-by-construction, annotation-driven classification (CostLocal/CostNonLocal, default non-local); generator-aware so a composite summing over a sequent-scanning generator stays non-local. A development aid -Dkey.strategy.costReuse.verify recomputes the cost and warns on any mismatch. Byte-identical on the perfTest/perfValidation corpus; ~7% automode speedup on cost-bound proofs.
introductionTime cached the not-introduced-yet answer (-1); the symbol may be introduced by a later rule, after which the real time would be found, so the frozen -1 made the value depend on whether the symbol was first compared before or after its introduction -- an access-pattern dependence that makes term ordering, and hence OneStepSimplifier rewriting, subtly non-deterministic. Only cache a real introduction time (stable once found).
Replace the per-call new Feature[0] / r.length==0 idiom with a shared INELIGIBLE constant and identity check. An eligible taclet always carries at least the top-level NonDuplicateApp veto, so identity is the clearer 'not eligible' test. Pure refactor: byte-identical (symmArray 14601 nodes, 0 verify-mode mismatches).
Age (goal time) was contributed inside each top-level strategy's computeCost (AgeFeature in ModularJavaDLStrategy's cost/inst sums; getTime() in FIFOStrategy and SimpleFilteredStrategy). Move it out into a single container-level term, RuleAppContainer.withAge, added exactly once when a container is built -- so strategies (and their components) compute only their age-free cost and age is added once regardless of how strategies are composed. AgeFeature is removed. This lets cost reuse carry the age-free base forward verbatim: TacletAppContainer stores the age-free cost and the reuse fast path is just 'base + current age' with no getTime()-getAge() reconstruction and no age>=0 guard (initial containers reuse soundly too). As a side effect the container's age field is decoupled from the cost and is now purely the AssumesInstantiator freshness key. Behaviour-preserving: byte-identical to the parent on SLL, saddleback, symmArray, median (verified by A/B against the legacy age-in-features path before it was removed; isolated timing showed the relocation is performance-neutral, so its value is code quality plus enabling the simpler, broader cost-reuse path).
Park assumes-incomplete taclet bases that re-expand to nothing (97-99.6% of base re-expansions) out of the active queue, and wake them by a precise operator index: index each parked base by the concrete top operator(s) of its \assumes formulas (resolved through the find-match's SV instantiations); wake exactly the bases whose operator matches a formula added/modified that round (Goal.fireSequentChanged -> sequentChanged), walking the changed formula's update-prefix spine (a sound superset, since the assumes matcher strips the update context). Only effectively-indexable bases are parked; unbound-generic tops stay in the queue. Insertion-ordered (LinkedHashMap/Set) for determinism; clone() deep-copies the index. Active by default. Provability-safe on the full real RAP suite (681 goals) once the redundant order-fragile lenOfSeqSubEQ is dropped from automode (see follow-up commit). Not byte-identical (reordering shifts proof shapes, all still close); ~40% faster automode on the perfTest goals.
lenOfSeqSubEQ rewrites seqLen(EQ) via an antecedent equation EQ = seqSub(...). It is redundant for completeness -- the direct lenOfSeqSub suffices (full RAP closes all 681 goals without it). It is also order-fragile: when the negated-goal equation seqSub(s,0,i)=seqSub(s,0,i+5) is reused as an \assumes to simplify, the simplification reproduces that same formula, and the duplicate/cycle guard then refuses to re-apply subSeqEqual -> the goal dead-ends. The original rule order avoided this by luck; any reordering (e.g. assumes-parking) can expose it. Only the \heuristics is commented out, so the taclet stays defined and existing proofs that applied it still load/replay.
…Labels removeIrrelevantLabels rebuilt the whole term tree on every call (stream().map()/filter() .collect() per node), the single biggest allocator during proof search (~20%), even though most subterms have no irrelevant label. Replace with an identity-preserving rebuild (plain loops, lazy sub-array, return the original term when its subtree has no irrelevant label). Behaviour-preserving (terms are immutable; result is structurally identical).
Objects.hash(first, second) allocates an Object[] on every call; Pair is heavily used as a hash-map key during proof search. Inline the same hash value without the array.
applyReplacewithHelper allocated a PiTIterator (posInTerm().iterator()) per rewrite-taclet application and consumed it in the recursive replace(). Thread the PosInTerm + a depth index instead (same indices/order), avoiding the per-application iterator object.
The KeY and JML ANTLR parsers build a prediction (DFA) cache lazily while parsing, held on the generated parsers' static fields, so it stays resident for the whole JVM -- including the (long) proof search, where it is unused (~17 MB retained on a large proof). It is a pure cache that ANTLR rebuilds transparently on the next parse, so dropping it after a problem/proof has finished loading is correctness-safe. Add ParsingFacade.clearParserCaches() (KeY/JavaDL) and JmlFacade.clearCaches() (JML) and call them from AbstractProblemLoader.load().
…former RewriteTaclet.checkPrefix walks the whole root-to-position prefix (PIOPathIterator) at every position it is asked about. During taclet-index construction / one-step simplification of a deep term that is O(depth) per position over ~d positions, i.e. O(d^2) -- the dominant cost on deeply nested terms such as chained if-then-else (a JFR profile showed 54% self-time in checkPrefix on a trivial 3-node proof). For an unrestricted (NONE) rewrite taclet -- the common case -- the only prefix-dependent outcome of that walk is a veto when a Transformer occurs on the path; the update/polarity/modality handling is guarded by a non-NONE restriction and the polarity is discarded. So if the formula provably contains no Transformer anywhere, no prefix can, and the walk can be skipped. "Formula contains a Transformer" is computed once and cached per term (JTerm.containsTransformerRecursive, mirroring containsJavaBlockRecursive), giving O(1) amortized and dropping the per-position prefix cost from O(depth) to O(1) in the transformer-free case; the O(d^2) on deep terms becomes O(d). Behaviour-preserving: it only short-circuits a provably-equivalent case; restricted taclets and transformer-bearing formulas still take the full walk.
…oofs.runOnlyOn=perfTest) Adds the curated 6-problem perfTest group used for the combined benchmark. By default all runAllProofs groups run (full regression, like main); pass -Dkey.runallproofs.runOnlyOn=perfTest to restrict to the perfTest group, and -PrapForks=1 for clean serial timing.
The active statement (and its position) were eagerly computed when a node's rule application was set, on the proving path. Compute them lazily on first access instead; nodes whose active statement is never inspected do no work.
The loop-invariant rule cached the last instantiation in a static field shared across proofs, which could leak an instantiation between unrelated proofs. Store it on the rule application instead.
Give the MethodCall metaconstruct a fresh per-application copy instead of mutating shared instance state, make the class final to lock in that argument, and make JmlAssert's condition final.
A stored settings value may deserialize as Integer or Long depending on its magnitude and the format, but the numeric property converters assumed a fixed boxed type and threw a ClassCastException on the other -- which could crash KeY on startup while loading the settings. Accept any Number instead.
A recent-file entry whose profile name was never set serialized the null value as the string "null"; reloading it then tried to resolve a profile named "null", failed, and showed an error dialog instead of opening the file. Treat a missing profile name (null or the legacy "null" string) as "use the default profile" on load, and stop writing the "null" placeholder when saving.
After a compound macro finished, the status line kept the partial count from the macro's last internal strategy pass (e.g. "Applied 2 rules, 1 goal" even though the macro had closed thousands) instead of the macro's own aggregate. Display the ProofMacroFinishedInfo's aggregate result when a macro completes.
ConcurrentLruCache (exact-LRU, single-lock, drop-in for synchronizedMap over an LRUCache) for eviction-sensitive caches, StripedLruCache for pure caches, and WeakValueInterner for identity-preserving interning under concurrency, with concurrency tests.
672aeef to
55089a1
Compare
Add the parallelProverEnabled / parallelProverThreadCount settings and a Profile.supportsParallelAutomode() capability: the standard Java profile opts in, the well-definedness, information-flow and symbolic-execution profiles keep the single-core fallback.
55089a1 to
b21eb6e
Compare
Add EssentialProofListener and Proof.suspendNonEssentialListeners(), which detaches every proof-tree and rule-app listener not tagged essential for the duration of a run and re-attaches them afterwards, so nothing unrelated to proving fires on a worker thread.
ParallelProver runs automatic proof search on a worker pool, one open goal per worker, behind the AutoProvers selection seam. A single GoalScheduler monitor hands out goals depth-first; Goal.apply is split into a concurrent compute phase and a commit phase serialized on one lock; fresh names come from a partitioned per-proof allocator (atomic Counter), namespace flushes are deferred, and the merge rule is disabled during parallel runs. The taclet-index cache key is immutable so siblings sharing the cache set cannot race on it. Includes the scheduler unit test and the proof-equivalence gate.
Harden the shared state the workers touch during search: route the matching and cost-path caches through the thread-safe cache utilities, intern parametric operators and elementary updates identity-preservingly, give the per-call instantiation caches and counters safe publication or per-worker copies, and guard the specification repository's proving-time maps. Behaviour-preserving.
Add a settings pane and a clickable status-line indicator to switch between the single-core and multi-core prover, a --threads CLI option, and gating of the single-core-only features: proof caching, slicing and the merge rule are disabled (with explanatory tooltips, updated live) while the multi-core prover is active, and restored when switching back.
Route StrategyProofMacro and TryCloseMacro through the AutoProvers seam. The try-close prune stays safe because start() only returns once every worker has stopped, so no worker is live during the tree mutation; a stress test exercises the prune-and-close path at eight workers.
Add the real-proof speedup benchmark, the synthetic best/worst-case benchmark, a JFR profiling probe, and a high-worker non-closure stress test wired into the CI integration-test matrix. Pin the test suite to the single-core prover by default; opt-in parallel tests override it at runtime.
b21eb6e to
448b709
Compare
448b709 to
84e593a
Compare
84e593a to
890f466
Compare
…) when the compiled matcher is selected When the compiled find-matcher is selected, the taclet's \assumes formulas were still matched by the interpreter (their Java program blocks via the monolithic program instruction). Build them with the same cursor-free compiled matcher (which also compiles the modality program, since the compiled back-end always uses the compiled program hook), falling back to the interpreter for any pattern the compiler has no head for. Opt out for A/B via -Dkey.matcher.interpreterAssumes. ~8% faster automode on a modality-heavy benchmark (behavior_run), byte-identical proofs.
…on-rebuild # Conflicts: # key.core/build.gradle
The merge of main into the branch dropped the closing brace of the if (key.mt.jfr) block in tasks.withType(Test), leaving tasks.withType unclosed so the following tasks.register(...) failed to parse.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Intended Change
This is an integrated preview branch: the multi-core prover (a revival of the 2018 HacKeYthon prototype; see PR1 for
the full multithreading description, architecture, problems and fixes) combined
with the 3.1 performance series. It
exists to show the combined effect of single-thread performance work and
goal-level parallelism on the same proofs.
It is not intended to be merged as-is; it tracks both lines of work together.
For everything about the multithreading design, the problems encountered and how
they are fixed and verified, see PR1 (#3842). For the
single-thread optimisations, see the perf series PRs and
Performance Optimizations (3.1).
Benchmarks
Same corpus and machine as PR1 (16-core, one isolated run per proof). Cells show
wall-clock time with speedup vs the single-threaded
mainin parentheses.mainis the unmodified single-threaded baseline (= PR1's single column); theperf SC column is the perf series alone (single-core, no MT); the 2/4/8×
columns are the perf series and the multi-core prover combined.
Real-world proofs — combined effect vs
main.The combined effect reaches ~4–5× at 4–8 workers on the wide proofs. The perf
series also lifts the narrow case: Saddleback_search no longer regresses at 8× compared to baseline main
(PR1 was 0.69×; here 2.36× vs
main), because cheaper matching/allocation shrinksthe cost of the off-critical-path speculative work.
Diagrams. Rendered speedup bar charts (multithreading-only and combined) are in the dev doc's Combined effect section.
Type of pull request
Ensuring quality
integrated branch.
adds the perf-series optimisations on top.
Configuration
Identical to PR1: GUI Settings → Prover (Single / Multi-Core) + status-line
indicator, CLI
--threads N, tests pinned single-core. The single-core-onlyfeatures (proof caching, slicing, merge rule) are gated the same way, and
selecting single-core re-enables them.
Additional information and contact(s)
This preview shows the benefit of stacking single-thread performance work and
multi-core proving, with the same safe single-core fallback as PR1.
This PR has been done with AI tooling support.
The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.