[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests#2154
gimballock wants to merge 31 commits into
11b2560 to
88d8d1d
Compare
|
The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |
The first row here shows the results of the convergence-time test for the default case (6 spm).
Convergence times range from 10 to 25 minutes, and trials fail to converge at all (within quiet_window_secs of simulated time) 17% of the time!
The next few rows show the results for faster share rates: the most extreme (p99) time drops from 25m to 15m, and the total-failure cases all but disappear at around 30 spm.
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better; ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |
These two metrics (proximity to the true hashrate and post-convergence adjustments) show a similar trend:
a lot of undesired behavior in the extreme cases (top 10%, top 1%) of the 6 shares/min case, which is alleviated at higher share rates.
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 spm | 12 spm | 30 spm | 60 spm | 120 spm |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |
These tables show how long it takes for vardiff to respond to an unexpected change in hashrate, where the change is a proportional increase or decrease of anywhere from 5% to 50%.
The first table looks specifically at a 50% drawdown and shows that a full 30% of the time vardiff fails to adjust within 5 min. The next few rows show that the situation worsens at higher share rates: at 120 spm, 91% of the trials failed to adjust within 5m.
The second table shows that this effect is basically the same for hashrate changes in the opposite direction, and also that changes of lesser magnitude are detected less often within the 5-minute window.
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`
Convergence rate: may fall no more than 1 percentage point (0.01) below the baseline convergence rate
Convergence p90: the p90 convergence time (the slowest 10% of trials) must be within 10% of the baseline's
Settled accuracy: the p50 and p90 errors must be within 15% of the baseline's
Jitter p50/p95: p50 must be within 0.02 (absolute) of the baseline, and p95 within 25% of it
...etc.
You see the pattern: there are lots of magic thresholds in this portion of the code that are arbitrarily chosen at this point and fair game for analysis.
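To make the threshold pattern above concrete, here is a minimal sketch of how such per-metric tolerance checks could be expressed. The `Tolerance` type and its variant names are hypothetical and not taken from the PR; only the inequalities mirror the quoted doc comment.

```rust
/// How far a metric may drift from its baseline before the regression check fails.
/// Hypothetical type for illustration; not the PR's actual API.
#[derive(Debug, Clone, Copy)]
pub enum Tolerance {
    /// Higher is better: pass if `current >= baseline - slack`.
    AtLeastAbs { slack: f64 },
    /// Lower is better: pass if `current <= baseline + slack`.
    AtMostAbs { slack: f64 },
    /// Lower is better: pass if `current <= baseline * factor`.
    AtMostRel { factor: f64 },
}

impl Tolerance {
    /// Returns true when `current` is within the allowed budget of `baseline`.
    pub fn passes(self, current: f64, baseline: f64) -> bool {
        match self {
            Tolerance::AtLeastAbs { slack } => current >= baseline - slack,
            Tolerance::AtMostAbs { slack } => current <= baseline + slack,
            Tolerance::AtMostRel { factor } => current <= baseline * factor,
        }
    }
}
```

A multiplicative budget is useless when the baseline is zero (any nonzero value fails), which is presumably why the jitter p50 check uses absolute slack instead.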
5cbed7c to
85d6f8b
Compare
|
after some optimization I got
|
I'm so excited to see other people nerding out on vardiff with me! Thank you! A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.
2d10f57 to
414afbb
Compare
211bc98 to
2a88fde
Compare
|
Some learnings I had @gimballock
|
63a19d0 to
a18c3a3
Compare
|
Thanks for these insights @adammwest, especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of

Phase 1: Basic metrics + simulation harness
Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff

Phase 2: Decomposed pipeline model
I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down: the components aren't truly independent. There's a sequential
The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.

Phase 3: Aggregate fitness metric
To your point about "how you combine all metrics into a final value": we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression

Phase 4: Realistic operating conditions
After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and

Current direction
I've backed off from prioritizing convergence speed after seeing overcorrection in practice.
The current focus is on:
On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better |
|
Hi, this is quite a detailed analysis and great decomposition of relevant parts. Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs Would be interesting to see how that algorithm performs in your benchmarks. |
Thanks for the info @paratoxicdev, I will add this to my investigation. I know that is an sv1-native pool, but I will see which parts of his research cross over to the sv2 context and whether it's competitive!
|
Here is a breakdown of the calibration comparisons I made between the real-hashrate tests and the simulation results. They confirm the predictions, with one caveat: with this algorithm, responsiveness scales with the age of the connection. I suggest including this detail in the simulation so that the metrics are more directly comparable:
006363a to
a58132b
Compare
…metrics
Complete redesign of the vardiff framework and a new production algorithm
that outperforms the legacy implementation on every operational metric except jitter.
## New Algorithm: AdaCUSUM
VardiffState::new() now returns the AdaCUSUM composition internally:
EwmaEstimator(120s) + AdaptiveCusumBoundary(s=1.5, f=0.05) + PartialRetarget(0.5)
Zero downstream API changes — sv2-apps continues calling VardiffState::new().
Performance vs legacy VardiffState at SPM=12:
- Convergence: 6 min vs 10 min (40% faster)
- React to -10% decline: 62% vs 14% (4.4x better)
- React to -50% decline: 99% vs 55% (near-perfect)
- Jitter: 0.107/min vs 0.018/min (acceptable trade-off)
- Overshoot: 26% vs 87% (3.3x less)
## Architecture: Three-Stage Pipeline
Estimator → Boundary → UpdateRule
- Statistic trait removed (deviation is inline arithmetic)
- Composed<E, S, B, U> → Composed<E, B, U>
- Estimator::reset() → on_fire(new_hashrate, old_hashrate)
- EstimatorSnapshot gains uncertainty: Option<Uncertainty>
- Boundary receives &EstimatorSnapshot (uncertainty-aware)
- UpdateRule receives threshold (margin-aware)
- EwmaEstimator::on_fire rescales instead of zeroing
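As a rough illustration of the three-stage shape described above, the sketch below wires an estimator, a boundary, and an update rule into a `Composed<E, B, U>`. All trait signatures, the `f64` hashrate representation, the `tick` method, and the toy components (`ConstEstimator`, `FixedBoundary`, `PartialRetarget`) are simplifying assumptions for illustration and are not the PR's actual definitions.

```rust
/// Simplified stand-in for the PR's snapshot type (fields are assumptions).
#[derive(Debug, Clone, Copy)]
pub struct EstimatorSnapshot {
    pub hashrate: f64,
    pub uncertainty: Option<f64>,
}

pub trait Estimator {
    fn observe(&mut self, shares: u32);
    fn snapshot(&self) -> EstimatorSnapshot;
    /// Called after a retarget so the estimator can rescale rather than reset.
    fn on_fire(&mut self, new_hashrate: f64, old_hashrate: f64);
}

pub trait Boundary {
    /// Relative-deviation threshold above which the pipeline fires.
    fn threshold(&mut self, snapshot: &EstimatorSnapshot) -> f64;
}

pub trait UpdateRule {
    fn next_hashrate(&mut self, current: f64, estimate: f64, threshold: f64) -> f64;
}

pub struct Composed<E, B, U> {
    pub estimator: E,
    pub boundary: B,
    pub update: U,
    pub current_hashrate: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Composed<E, B, U> {
    /// One vardiff tick: returns `Some(new_hashrate)` when a retarget fires.
    pub fn tick(&mut self) -> Option<f64> {
        let snap = self.estimator.snapshot();
        let threshold = self.boundary.threshold(&snap);
        // The deviation is inline arithmetic; there is no separate Statistic stage.
        let deviation = (snap.hashrate / self.current_hashrate - 1.0).abs();
        if deviation <= threshold {
            return None;
        }
        let old = self.current_hashrate;
        let new = self.update.next_hashrate(old, snap.hashrate, threshold);
        self.estimator.on_fire(new, old);
        self.current_hashrate = new;
        Some(new)
    }
}

// Toy components, only here to exercise the pipeline.
pub struct ConstEstimator(pub f64);
impl Estimator for ConstEstimator {
    fn observe(&mut self, _shares: u32) {}
    fn snapshot(&self) -> EstimatorSnapshot {
        EstimatorSnapshot { hashrate: self.0, uncertainty: None }
    }
    fn on_fire(&mut self, _new_hashrate: f64, _old_hashrate: f64) {}
}

pub struct FixedBoundary(pub f64);
impl Boundary for FixedBoundary {
    fn threshold(&mut self, _snapshot: &EstimatorSnapshot) -> f64 {
        self.0
    }
}

/// Moves a fixed fraction `eta` of the way from the current value to the estimate.
pub struct PartialRetarget(pub f64);
impl UpdateRule for PartialRetarget {
    fn next_hashrate(&mut self, current: f64, estimate: f64, _threshold: f64) -> f64 {
        current + self.0 * (estimate - current)
    }
}
```

Because each stage sits behind its own trait, swapping one component leaves the other two untouched, which is what makes grid comparisons across compositions possible.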
## operational_fitness Metric
fitness = 0.25 × reaction_rate(-10%)
+ 0.20 × reaction_rate(-50%)
+ 0.20 × clamp(1 - jitter/0.30, 0, 1)
+ 0.25 × convergence_rate × clamp(1 - conv_p50/600s, 0, 1)
+ 0.10 × clamp(1 - overshoot_p99, 0, 1)
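The formula above can be written out directly. This is a minimal sketch that assumes the inputs are already available as plain numbers; the `FitnessInputs` struct and its field names are hypothetical, and only the weights and clamps come from the commit message.

```rust
/// Inputs to the fitness formula. Field names are hypothetical.
pub struct FitnessInputs {
    /// Fraction of trials that reacted to a -10% step, in 0..=1.
    pub reaction_rate_minus10: f64,
    /// Fraction of trials that reacted to a -50% step, in 0..=1.
    pub reaction_rate_minus50: f64,
    /// Steady-state fires per minute.
    pub jitter_per_min: f64,
    /// Fraction of trials that converged, in 0..=1.
    pub convergence_rate: f64,
    /// Median convergence time in seconds.
    pub conv_p50_secs: f64,
    /// p99 overshoot as a fraction (0.26 means 26%).
    pub overshoot_p99: f64,
}

fn clamp01(x: f64) -> f64 {
    x.clamp(0.0, 1.0)
}

/// Weighted sum of the five terms; the result lies in 0..=1.
pub fn operational_fitness(m: &FitnessInputs) -> f64 {
    0.25 * m.reaction_rate_minus10
        + 0.20 * m.reaction_rate_minus50
        + 0.20 * clamp01(1.0 - m.jitter_per_min / 0.30)
        + 0.25 * m.convergence_rate * clamp01(1.0 - m.conv_p50_secs / 600.0)
        + 0.10 * clamp01(1.0 - m.overshoot_p99)
}
```

Note that the jitter term reaches zero at 0.30 fires/min and the convergence term at a 600 s median, so anything worse than those values is not penalized further.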
## Components Explored
New estimators: BayesianEstimator, KalmanEstimator
New boundaries: CredibleIntervalBoundary, CusumBoundary, AdaptiveCusumBoundary
New update rules: AdaptivePartialRetarget
New metrics: FireDecisiveness, StepCorrection, OperationalFitness
All 104 channels_sv2 + 141 sim tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds AsymmetricCusumBoundary that uses different thresholds based on whether firing would tighten or ease difficulty:
- Easing (miner slowing): uses base threshold (fire quickly, free action)
- Tightening (miner speeding): uses base × tighten_multiplier (cautious, because tightening rejects in-flight shares)
Rationale: SetTarget that makes difficulty harder invalidates shares already being computed by miners. SetTarget that makes difficulty easier has zero cost: old harder work is still valid under the new easier target.
Results with tighten_multiplier=3.0 (AsymCUSUM-t30):
- Mean fitness: 0.751 (vs symmetric AdaCUSUM 0.676, FullRemedy 0.565)
- Jitter: 0.045/min (vs 0.175 symmetric, 58% reduction)
- Convergence: 4 min (vs 6 min symmetric, 33% faster)
- Overshoot: 16.6% (vs 26.4% symmetric, 37% less)
- Detection -50%: 99.0% (unchanged)
- Detection -10%: 44.3% (vs 61.9%; the cost of asymmetry)
The jitter reduction directly translates to fewer share rejections in production: ~3 costly tighten-fires per hour instead of ~6.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
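The direction-dependent threshold can be sketched in a few lines. The `AsymmetricThreshold` type, the `rate_ratio` convention (realized share rate divided by target share rate), and the base value used in the usage note are assumptions for illustration; only the "base vs base × tighten_multiplier" rule comes from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    /// Miner is slower than expected; difficulty would be lowered.
    Ease,
    /// Miner is faster than expected; difficulty would be raised.
    Tighten,
}

/// Hypothetical stand-in for the asymmetric part of the boundary.
pub struct AsymmetricThreshold {
    pub base: f64,
    pub tighten_multiplier: f64,
}

impl AsymmetricThreshold {
    /// `rate_ratio` = realized share rate / target share rate.
    pub fn direction(rate_ratio: f64) -> Direction {
        if rate_ratio > 1.0 { Direction::Tighten } else { Direction::Ease }
    }

    /// Easing uses the base threshold; tightening needs `tighten_multiplier` times more evidence.
    pub fn threshold_for(&self, dir: Direction) -> f64 {
        match dir {
            Direction::Ease => self.base,
            Direction::Tighten => self.base * self.tighten_multiplier,
        }
    }

    /// Fires when the accumulated evidence exceeds the direction-specific threshold.
    pub fn should_fire(&self, accumulated_evidence: f64, rate_ratio: f64) -> bool {
        accumulated_evidence > self.threshold_for(Self::direction(rate_ratio))
    }
}
```

With an arbitrary base of 2.0 and a multiplier of 3.0, the same evidence of 2.5 would trigger an ease but not a tighten, which needs more than 6.0.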
Ran cargo fmt to fix line-length violations introduced during rebase. Regenerated baseline_VardiffState.toml to reflect the current AsymmetricCusumBoundary behavior (directional cost awareness produces expected negative asymmetry values at higher SPM rates). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…magnitude metric
Update OperationalFitness weights to penalize algorithms that raise
difficulty as aggressively as they lower it:
fitness = 0.25 × reaction(-10%) + 0.20 × reaction(-50%)
+ 0.20 × jitter + 0.15 × convergence
+ 0.10 × asymmetry_preference + 0.10 × overshoot
The asymmetry term rewards algorithms where reaction_rate(Step -10%)
exceeds reaction_rate(Step +10%) — i.e. faster detection of hashrate
drops than spikes. This encodes the operational insight that upward
difficulty adjustments are more disruptive (causing difficulty-too-low
share rejections) than downward ones.
New metric: upward_step_magnitude — tracks the ratio (new/old) of all
upward difficulty adjustments during steady state. The p95 value serves
as a deterministic proxy for difficulty-too-low rejection risk without
depending on stochastic share-rejection simulation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
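A small sketch of how the upward_step_magnitude metric described in this commit could be computed from a trial's difficulty history. The function names and the nearest-rank percentile definition are assumptions; the commit message only states that the metric tracks new/old ratios of upward adjustments and reports the p95. Restricting the history to the steady-state portion of a trial is left to the caller.

```rust
/// Ratios new/old for every upward difficulty adjustment in a history.
pub fn upward_step_ratios(difficulties: &[f64]) -> Vec<f64> {
    difficulties
        .windows(2)
        .filter(|w| w[1] > w[0])
        .map(|w| w[1] / w[0])
        .collect()
}

/// Nearest-rank percentile for `p` in 0..=100; returns None on empty input.
/// Panics if the input contains NaN.
pub fn percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in input"));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

Because the metric depends only on the sequence of difficulty values, it stays deterministic for a given seed and does not need a share-rejection model.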
…rity
Restructure OperationalFitness to reflect operational reality: "never surprise the miner" matters more than "detect changes fast."
New weights (harm-avoidance 60%, reactivity 25%, convergence 15%):
fitness = 0.15 × reaction(-10%) + 0.10 × reaction(-50%)
+ 0.25 × jitter_control + 0.25 × step_magnitude_safety
+ 0.15 × convergence + 0.10 × overshoot_safety
The step_magnitude_safety term directly penalizes large upward difficulty jumps: p95 step of 1.0× scores 1.0, 1.5× scores 0.0.
Result: FullRemedy now wins at 6-12 spm (realistic miner rates) where its cautious approach correctly avoids the timeout death spiral observed with physical miners. VardiffState still wins at 15-30 spm where aggressive reactivity is less harmful.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
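The restructured weighting can be sketched as follows. The commit message gives only the two anchor points for step_magnitude_safety (1.0× scores 1.0, 1.5× scores 0.0); the linear interpolation between them is an assumption, as are the `Scores` struct and its field names.

```rust
/// Maps the p95 upward step ratio to a 0..=1 score.
/// Assumes linear interpolation between the two anchors given in the commit message.
pub fn step_magnitude_safety(p95_step_ratio: f64) -> f64 {
    (1.0 - (p95_step_ratio - 1.0) / 0.5).clamp(0.0, 1.0)
}

/// Already-normalized component scores, each in 0..=1. Field names are hypothetical.
pub struct Scores {
    pub reaction_minus10: f64,
    pub reaction_minus50: f64,
    pub jitter_control: f64,
    pub step_magnitude_safety: f64,
    pub convergence: f64,
    pub overshoot_safety: f64,
}

/// Harm-avoidance terms sum to 0.60, reactivity to 0.25, convergence to 0.15.
pub fn operational_fitness(s: &Scores) -> f64 {
    0.15 * s.reaction_minus10
        + 0.10 * s.reaction_minus50
        + 0.25 * s.jitter_control
        + 0.25 * s.step_magnitude_safety
        + 0.15 * s.convergence
        + 0.10 * s.overshoot_safety
}
```

Under these weights an algorithm that never reacts and never converges but also never harms the miner still scores 0.60, which shows how strongly the metric now favors caution.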
Shape-proxy calibration testing on testnet4 revealed that the classic
vardiff algorithm's reaction time is dominated by counter age (time
since last retarget fire) — not share rate alone. A -50% hashrate drop
takes 2 minutes to detect at a 5-minute counter but 51 minutes at a
120-minute counter. The existing simulation framework tested only at
~15-minute counter ages, masking this critical dependency and making the
classic algorithm appear more capable than it is in production.
This commit folds those calibration findings into the core evaluation:
Scenario: SettledStep { settle_minutes, delta_pct }
Holds hashrate stable for settle_minutes (aging the counter), then
steps. The observation window is 60 minutes — long enough to capture
the slow reactions at mature counter ages. Also emits the *actual*
counter age at step time (time since last fire), which reveals that
jitter fires cap achievable counter age for certain algorithms.
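The SettledStep scenario described above amounts to a piecewise-constant hashrate profile. The sketch below assumes the trial starts at the beginning of the settle period and that time is tracked in whole minutes; the method names and the `OBSERVATION_MINUTES` constant are hypothetical, while the two fields and the 60-minute window come from the commit message.

```rust
/// Hold hashrate stable for `settle_minutes`, then step by `delta_pct` percent.
#[derive(Debug, Clone, Copy)]
pub struct SettledStep {
    pub settle_minutes: u32,
    pub delta_pct: f64,
}

impl SettledStep {
    /// Observation window after the step, per the commit message.
    pub const OBSERVATION_MINUTES: u32 = 60;

    /// True hashrate at `minute` since trial start (assumes no separate warm-up phase).
    pub fn hashrate_at(&self, base_hashrate: f64, minute: u32) -> f64 {
        if minute < self.settle_minutes {
            base_hashrate
        } else {
            base_hashrate * (1.0 + self.delta_pct / 100.0)
        }
    }

    /// Total simulated duration: settle period plus observation window.
    pub fn total_minutes(&self) -> u32 {
        self.settle_minutes + Self::OBSERVATION_MINUTES
    }
}
```

Note that `settle_minutes` only bounds the counter age from above: if the algorithm fires during the settle period, the actual counter age at step time is shorter, which is why the scenario also reports the measured value.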
Metric: CounterAgeSensitivity (derived, per share rate)
Score = P[detect a -10% drop within 60 min at a 60-minute counter].
This directly answers "can the algorithm notice partial degradation
(thermal throttle, failing ASICs) under typical steady-state
conditions?" Also emits diagnostic ratios (p50 reaction time at
mature vs young counter) for both -50% and -10% steps.
Results at 1000 trials:
VardiffState (AdaCUSUM): 100% detection — age-independent
FullRemedy, EWMA-60s: 100% — age-independent
SlidingWindow-10t: 95% — nearly age-independent
ClassicComposed: 3-30% — functionally blind at high SPM
Parametric: <1% — completely unable to detect
ClassicPartialRetarget: 0-19% — mostly blind
Metric: ComprehensiveFitness (derived, per share rate)
comprehensive = 0.80 × OperationalFitness + 0.20 × counter_age_score
Designed as the objective function for automated parameter search.
Preserves the existing OperationalFitness formula (harm-avoidance
dominant, backward-compatible) while adding a 20% weight for
counter-age robustness — enough to correctly penalize algorithms
that appear reactive at favorable counter ages but fail in production.
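As a worked instance of the two definitions above, a sketch under the assumption that the counter-age score is simply the fraction of trials that detected the -10% drop (function names are hypothetical):

```rust
/// Fraction of trials that detected the drop within the observation window.
pub fn counter_age_score(detected_trials: u32, total_trials: u32) -> f64 {
    if total_trials == 0 {
        return 0.0;
    }
    detected_trials as f64 / total_trials as f64
}

/// 80/20 blend of OperationalFitness and the counter-age score.
pub fn comprehensive_fitness(operational_fitness: f64, counter_age_score: f64) -> f64 {
    0.80 * operational_fitness + 0.20 * counter_age_score
}
```

Two algorithms with the same OperationalFitness of 0.70 end up at 0.76 and 0.56 if one detects every mature-counter drop and the other detects none, so the counter-age term can move the result by at most 0.20.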
Grid changes:
compare-algorithms and iterative-eval now include 4 SettledStep
scenarios (settle=5min and 60min, delta=-50% and -10%) alongside the
existing Step scenarios. Cost: +32 cells/algo (~20s at 1000 trials).
A standalone counter-age-sweep binary provides the full 5×2×8 grid
for deep characterization.
The top-tier ranking (VardiffState > FullRemedy > AdaCUSUM > EWMA-60s)
is unchanged — these algorithms were already best on OperationalFitness
and the counter-age dimension reinforces their lead. The mid-tier
reorders: Parametric/ParametricStrict drop below SlidingWindow because
their threshold-ladder boundary, while less extreme than Classic's,
still degrades at mature counters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigate PID-based vardiff approaches (P-only with pow2 quantization, and well-tuned P+I+D) as alternatives to the three-stage pipeline. Neither outperforms the Composed architecture, but the PID integral term concept reveals a gap in our UpdateRule: fixed η ignores fire history. The resulting improvements:
- AcceleratingPartialRetarget (new UpdateRule): η ramps from base→max on consecutive same-direction fires. Parameter sweep (500 trials × 80 cells) confirms acc=0.2, cap=0.6 gives 22% better convergence with zero jitter cost.
- SpmRatioEstimator (new Estimator): EWMA with direct SPM-ratio scaling; bypasses the hash_rate_from_target U256 path. Behaviorally equivalent, simpler code.
- SignPersistenceCusumBoundary (new Boundary): Lowers threshold when deviation sign persists across ticks. +6% detection on ±10% steps but +23% jitter; needs further tuning.
Also adds:
- pow2_pid.rs: reference pow2-quantized PID for simulation comparison
- pid_tuned.rs: well-tuned PID (P+I+D active) reference
- Four sim binaries (compare-pid, compare-best, convergence-time, sweep-accelerating) and PID_INVESTIGATION.md documenting the full analysis and tradeoffs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swap VardiffState's internal composition from fixed PartialRetarget(0.5) to the sweep-validated BestOfBest: SpmRatioEstimator(120s) + AsymmetricCusumBoundary(s=1.5, floor=0.05, tighten=3.0) + AcceleratingPartialRetarget(base=0.2, max=0.6, acc=0.2) The η acceleration (0.2→0.4→0.6 on consecutive same-direction fires) yields 22% better convergence with zero jitter cost. Trades ~2 min slower cold start for 2-3× better steady-state accuracy at SPM≥15. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
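The η acceleration described here can be sketched as a small state machine. This version assumes the retarget is a linear step in hashrate space and that any direction reversal resets η to the base value; the struct layout and method names are hypothetical, while the base/max/acc parameters and the 0.2→0.4→0.6 ramp come from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    Up,
    Down,
}

/// Partial retarget whose step fraction grows on consecutive same-direction fires.
pub struct AcceleratingPartialRetarget {
    base: f64,
    max: f64,
    acc: f64,
    eta: f64,
    last: Option<Direction>,
}

impl AcceleratingPartialRetarget {
    pub fn new(base: f64, max: f64, acc: f64) -> Self {
        Self { base, max, acc, eta: base, last: None }
    }

    /// Step fraction used by the most recent fire.
    pub fn eta(&self) -> f64 {
        self.eta
    }

    /// Called on each fire; returns the new hashrate setting.
    pub fn retarget(&mut self, current: f64, estimate: f64) -> f64 {
        let dir = if estimate >= current { Direction::Up } else { Direction::Down };
        self.eta = if self.last == Some(dir) {
            // Same direction as the previous fire: ramp up, capped at `max`.
            (self.eta + self.acc).min(self.max)
        } else {
            // First fire or a reversal: start again from `base`.
            self.base
        };
        self.last = Some(dir);
        current + self.eta * (estimate - current)
    }
}
```

Starting from 100 with a constant estimate of 200 and `new(0.2, 0.6, 0.2)`, three consecutive fires move the setting to 120, 152, and 180.8, whereas a fixed η of 0.2 would only reach 148.8 after the same three fires.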
SpmRatioEstimator's linear formula (h_estimate = current_h × realized/expected) assumes the target-hashrate-spm invariant holds perfectly across retargets. In production, floating-point drift between nominal_hashrate and the U256 target breaks this assumption, causing the estimator to produce inverted retarget decisions (easing when it should tighten). EwmaEstimator uses hash_rate_from_target() which derives hashrate directly from the actual target, making it immune to this drift. SpmRatioEstimator remains available for simulation (where the invariant holds by construction). Discovered via live deployment: slot 4 eased from 232K→5K difficulty while the miner was producing 2× target SPM (should have tightened). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… rejection Add ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS required by sv2-apps pool code when rejecting extended channel open requests on pools configured for standard jobs only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator assumes one observe() call per tick (60s), but the pool calls increment_shares_since_last_update() once per share arrival. With 12+ shares per tick, each observe(1) applies a full EWMA decay, collapsing the rate estimate toward 1.0 regardless of actual throughput. This causes the algorithm to consistently see under-performance and ease. CumulativeCounter simply accumulates shares and computes realized SPM at snapshot time from the total count ÷ elapsed time — immune to the per-share vs per-tick calling convention difference. Retains AcceleratingPartialRetarget and AsymmetricCusumBoundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator previously applied EWMA decay on every observe() call, assuming one call per tick. Production callers invoke observe(1) per share arrival (~12× per tick), causing the rate to collapse toward 1.0. Fix: observe() now accumulates into a pending counter. The EWMA decay is applied once per snapshot() call (one per vardiff tick), making the estimator produce identical results whether called as observe(12) once or observe(1) twelve times. Uses AtomicU64/AtomicU32 for interior mutability in snapshot() (the Estimator trait requires &self) to satisfy Send+Sync bounds. Also documents the calling convention contract in the Estimator trait: implementations MUST handle both per-share and per-tick observe patterns. Production VardiffState uses CumulativeCounter (immune to this issue) until EwmaEstimator is validated in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
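The fix described above can be illustrated with a small sketch in which observe() only accumulates and the decay runs once per snapshot(). The `PendingEwma` type, its constructor parameters, and the use of an `AtomicU64` holding `f64` bits are assumptions for illustration and not the PR's actual layout; the sketch also assumes a single caller of snapshot() per channel, since the load/store pair is not one atomic operation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// EWMA of shares per tick that is insensitive to how observe() calls are batched.
pub struct PendingEwma {
    /// Smoothing factor applied once per tick.
    alpha: f64,
    /// Shares accumulated since the last snapshot.
    pending: AtomicU64,
    /// Bit pattern of the smoothed shares-per-tick value (an f64).
    rate_bits: AtomicU64,
}

impl PendingEwma {
    pub fn new(tick_secs: f64, tau_secs: f64, initial_rate: f64) -> Self {
        Self {
            alpha: 1.0 - (-tick_secs / tau_secs).exp(),
            pending: AtomicU64::new(0),
            rate_bits: AtomicU64::new(initial_rate.to_bits()),
        }
    }

    /// Only accumulates; no decay is applied here.
    pub fn observe(&self, shares: u64) {
        self.pending.fetch_add(shares, Ordering::Relaxed);
    }

    /// Applies exactly one EWMA step per call, using all shares seen since the last call.
    pub fn snapshot(&self) -> f64 {
        let shares = self.pending.swap(0, Ordering::Relaxed) as f64;
        let old = f64::from_bits(self.rate_bits.load(Ordering::Relaxed));
        let new = old + self.alpha * (shares - old);
        self.rate_bits.store(new.to_bits(), Ordering::Relaxed);
        new
    }
}
```

With this structure, calling `observe(12)` once and calling `observe(1)` twelve times before a snapshot produce bit-identical results, which is the invariant the commit message describes.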
… is fixed EwmaEstimator is now safe for per-share observe() calls (pending shares accumulate, decay applied once per snapshot). Swap back from CumulativeCounter to get the EWMA's temporal smoothing benefits: better noise rejection and more stable estimates under Poisson variance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_time simulation
Port ckpool's vardiff algorithm (dual-window adaptive EWMA with per-share decay_time updates) to the three-stage pipeline.
Key finding: the correct way to port a continuous-time EMA to a tick-based framework is to simulate per-share updates within each tick, not to apply time-bias correction factors.
While tuning gets CkpoolRemedy within ~5% of FullRemedy's comprehensive fitness, it yields no Pareto improvement: the dual-window switching adds complexity without outperforming a single-window EWMA(120s).
New components:
- CkpoolEstimator: per-share decay_time() simulation with dual-window adaptive switching and configurable fast-threshold
- HysteresisGate: binary fire/no-fire boundary with data gate + dead band
- CkpoolRetarget: full retarget with oscillation guard
- TimeBiasEwmaEstimator: single-window EWMA with time-bias correction
Grid registrations: ckpool(), ckpool_remedy(), ckpool_remedy_ft(n), ckpool_narrow_hyst(), ckpool_with(), time_bias_remedy()
Also updates PID_INVESTIGATION.md with structural analysis of why PID fails (stage conflation makes the 41% dead zone undiagnosable).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
e0213c8 to
6aa27a9
Compare
The upstream commit df4e764 added ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS, which our branch already defined. Remove the duplicate to fix the build. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The baseline was stale — generated before the EwmaEstimator promotion (f4cd687) changed VardiffState's internal composition. Regenerated with default 1000 trials × 80 cells at the canonical seed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the analysis @gimballock! Why not consider share-based vardiff, if you've used it before? Since you're also Rust-based, I would have a look at the vardiff.rs + decay.rs files with Claude. The type system has allowed much better encapsulation and reasoning about behaviour, and I think that'll be easier to compare than the C code from ckpool. I do have to disagree with some of the things your Claude said:
Each channel has their own Vardiff so there's no shared data structure with contention. The pool is overall O(shares), where a couple of floating point operations for share based vardiff are completely dwarfed by the cost to deserialize a share from the wire and then do a double sha256 hash for nonce validation (which is done no matter what Vardiff you use). I'm not that caught up on SV2, so please correct me if that's not what it does.
So the ckpool algo not only specifies an estimator (exponentially weighted moving average, see decay.rs) but also a boundary (when to adjust difficulty). The two main parts which make up the boundary are the HYSTERESIS bounds and MIN_WINDOW_RATIO.

```rust
/// Minimum window ratio before considering adjustment.
/// Fraction of expected time (or shares) per window.
/// Derived from ckpool: 240s / 300s window = 0.8
const MIN_WINDOW_RATIO: f64 = 0.8;

/// Only decrease difficulty when rate drops below this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_LOW: f64 = 0.5;

/// Only increase difficulty when rate exceeds this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_HIGH: f64 = 1.33;

#[derive(Debug, Clone)]
pub(crate) struct Vardiff {
    period: Duration,
    window: Duration,
    min_shares_for_adjustment: u32,
    min_time_for_adjustment: Duration,
    dsps: DecayingAverage,
    current_diff: Difficulty,
    old_diff: Difficulty,
    first_share: Option<Instant>,
    last_diff_change: Instant,
    shares_since_change: u32,
    min_diff: Option<Difficulty>,
    max_diff: Option<Difficulty>,
    diff_change_job_id: Option<JobId>,
}
```

So not only is the responsiveness tuneable, you can see its responsiveness working on real mining machines in the wild: https://stats.ckpool.org/. From experience I can say that it responds beautifully to any scale of hashrate, be it a CPU miner, a bitaxe, or 1 EH/s of rental hashrate. I really like the theoretical work you're doing and it has deepened my understanding (I now know words like estimator and boundary) of what I originally just copied from ckpool. I feel like this algo should be the benchmark that any new vardiff is compared with, since it's proven to work with real hashrate. That's just my two cents, I like the work you're doing!
…h SPM
Replace the static AsymmetricCusumBoundary with AdaptivePoissonCusum, which selects the boundary based on the miner's configured share rate:
- Below SPM 10: PoissonCI (prevents overshoot on sparse data)
- At SPM 10+: AsymmetricCUSUM (fast reaction with abundant evidence)
This eliminates the FullRemedy vs VardiffState trade-off: PoissonCI was better at low SPM (bitaxe/small miners) while CUSUM was better at high SPM (large hashrate). The adaptive boundary gets both.
Also caps AcceleratingPartialRetarget η at 0.4 (was 0.6). The lower cap prevents cold-start overshoot while still accelerating convergence after step changes. Parameter sweep (compare_out7/) confirmed this is the Pareto-optimal cap.
New components:
- AdaptivePoissonCusum: SPM-threshold boundary selector
- GuardedAccelRetarget: acceleration only after first direction reversal (experimental, not used in production)
- sweep-adaptive.rs: parameter sweep binary for boundary tuning
Simulation results vs previous production (comprehensive fitness):
SPM 4: 0.706 vs 0.636 (+11%)
SPM 8: 0.771 vs 0.737 (+5%)
SPM 10: 0.783 vs 0.774 (+1%)
SPM 15: 0.810 vs 0.803 (+1%)
SPM 20: 0.894 vs 0.882 (+1%)
SPM 25: 0.903 vs 0.894 (+1%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
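The selection rule in this commit reduces to a single comparison against an SPM threshold. A minimal sketch, with hypothetical type and field names (only the "below SPM 10 vs SPM 10+" split comes from the commit message):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BoundaryKind {
    /// Conservative boundary for sparse data.
    PoissonCi,
    /// Faster-reacting boundary for abundant data.
    AsymmetricCusum,
}

/// Chooses a boundary from the miner's configured share rate.
pub struct AdaptiveSelector {
    /// Shares-per-minute threshold (not a share count).
    pub spm_threshold: f64,
}

impl AdaptiveSelector {
    pub fn select(&self, configured_spm: f64) -> BoundaryKind {
        if configured_spm < self.spm_threshold {
            BoundaryKind::PoissonCi
        } else {
            BoundaryKind::AsymmetricCusum
        }
    }
}
```

Because the choice depends on the configured share rate rather than on observed data, it is fixed for the lifetime of a channel and adds no per-tick state.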
…estimator equivalence test
Adds two follow-up investigations to CKPOOL_INVESTIGATION.md:
1. Hysteresis boundary sweep: tested band widths from [0.5,1.33] through [0.9,1.1] with varying data gates and update rules. Conclusion: hysteresis achieves excellent reaction rates (96-100%) but at 10× jitter cost vs statistical boundaries. No parameterization achieves competitive comprehensive fitness.
2. Estimator equivalence test (revised): paired CkpoolEstimator with the production-tuned boundary (AdaptivePoissonCusum) and update (AcceleratingPartialRetarget). Result: NOT equivalent. The per-share decay_time() simulation is noisier per tick than batch EWMA, and this difference is amplified by CUSUM's aggressive firing. EwmaEstimator(120s) is genuinely better, not just simpler.
Also adds ckpool estimator + hysteresis variants to compare-algorithms binary for reproducibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@paratoxicdev You're right on both points: I should have posted my actual findings rather than a raw Claude transcript.

On scalability: You're correct. Each channel owns its own Vardiff, with no shared state and no contention. The FP cost of

On "strictly better": Also wrong. The reality is more nuanced than either of us stated; I'll explain below.

What I found after reading your

I ported ckpool's algorithm and benchmarked it across SPM 4–30. Full writeup in commit 6aa27a9, updated in b77eff8.

On your code specifically:

Why we didn't use ckpool's components in production: I tested all three stages of the pipeline (estimator, boundary, update) independently:

Estimator: After correct porting via per-share simulation,

The per-share

Boundary: I swept hysteresis from your native [0.5, 1.33] through [0.9, 1.1] with varying data gates:

Narrower bands get excellent reaction rates (96–100%, better than anything else), but at 5–10× jitter cost. The fundamental issue: hysteresis fires whenever the rate ratio crosses the band, with no evidence accumulation. Statistical boundaries (PoissonCI, CUSUM) require cumulative evidence before firing, so they distinguish real changes from noise.

Update: ckpool's full retarget (η=1.0) overshoots in the tick framework. Our

What we ended up with: The investigation led to a new production composition that adapts its boundary strategy based on the miner's share rate: conservative (PoissonCI) for low-SPM miners where data is sparse, aggressive (CUSUM) for high-SPM miners where evidence is abundant. This is analogous to what ckpool does with its dual-window adaptive switching, just at the boundary layer instead of the estimator layer. Your algorithm's idea of "be conservative when data is sparse, aggressive when data is abundant" is exactly right; we just implement it differently.
|
|
If you have any idea or suggestions to improve the comparison, let me know. |
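The difference between a hysteresis band and an evidence-accumulating boundary, as argued in the comment above, can be shown with a textbook two-sided CUSUM. This is a generic sketch, not the PR's AdaCUSUM or ckpool's gate: the slack `k`, the decision threshold `h`, and the function and type names are all hypothetical.

```rust
/// Hysteresis: fires as soon as a single rate ratio leaves the band.
pub fn hysteresis_fires(rate_ratio: f64, low: f64, high: f64) -> bool {
    rate_ratio < low || rate_ratio > high
}

/// Two-sided CUSUM on the relative deviation `rate_ratio - 1`.
pub struct Cusum {
    /// Per-tick slack: deviations smaller than this decay the sums.
    k: f64,
    /// Decision threshold on the accumulated sum.
    h: f64,
    pos: f64,
    neg: f64,
}

impl Cusum {
    pub fn new(k: f64, h: f64) -> Self {
        Self { k, h, pos: 0.0, neg: 0.0 }
    }

    /// Feeds one tick; returns true (and resets) when accumulated evidence exceeds `h`.
    pub fn update(&mut self, rate_ratio: f64) -> bool {
        let x = rate_ratio - 1.0;
        self.pos = (self.pos + x - self.k).max(0.0);
        self.neg = (self.neg - x - self.k).max(0.0);
        let fire = self.pos > self.h || self.neg > self.h;
        if fire {
            self.pos = 0.0;
            self.neg = 0.0;
        }
        fire
    }
}
```

With `k = 0.1` and `h = 0.5`, ticks alternating between ratios of 1.3 and 0.7 never fire the CUSUM, while a [0.9, 1.1] hysteresis band fires on every one of them; a sustained ratio of 0.7 fires the CUSUM on the third tick.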
The field was documented as a share count but actually used as an SPM threshold — align the name with the semantics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean extraction of the best-performing vardiff algorithm from the simulation framework in stratum-mining#2154, with all test scaffolding, traits, and alternative algorithm implementations removed.
The previous VardiffState used a fixed time-dependent threshold ladder and full retarget. This produced:
- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)
The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):
- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06), half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because loosening is free but tightening rejects in-flight shares
Breaking: adds private fields to VardiffState (previously all-pub). Requires channels_sv2 major version bump. Public constructor API (new, new_with_min) and Vardiff trait interface are unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add AsymmetricPoissonCI: applies a tighten_multiplier to the PoissonCI
threshold when the miner is over-performing, reflecting the asymmetric
cost of tightening (rejects in-flight shares) vs loosening (free).
Parameter sweep results (500 trials/cell, SPM 4-30):
Comprehensive fitness at low SPM (where PoissonCI is active):
symmetric (t=1.0): SPM4=0.706, SPM6=0.744, SPM8=0.758
t=1.5: SPM4=0.743, SPM6=0.771, SPM8=0.784
t=2.0: SPM4=0.799, SPM6=0.821, SPM8=0.801
t=3.0: SPM4=0.858, SPM6=0.872, SPM8=0.882
t=3.0 matches CUSUM's tighten_multiplier and is the clear winner
(+21% at SPM 4, +17% at SPM 6). SPM 10+ is unchanged (CUSUM active).
New files:
- AsymmetricPoissonCI in boundary.rs
- sweep-asymmetric-poisson.rs: parameter sweep binary
- asymmetric_poisson_sweep.md: sweep results
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
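For readers skimming the timeline, the asymmetric threshold described in this commit can be sketched roughly as follows. This is not the `AsymmetricPoissonCI` code from `boundary.rs`: the function name `should_fire`, the `z` parameter, and the normal approximation to the Poisson interval are assumptions made for illustration. Only the idea of scaling the threshold by `tighten_multiplier` when the miner over-performs comes from the commit message.

```rust
/// Decide whether a share count deviates enough from expectation to retarget.
///
/// Assumptions of this sketch: the Poisson confidence interval is
/// approximated as `z * sqrt(expected)`, and `tighten_multiplier` widens the
/// threshold only on the over-performing side, where a retarget would tighten
/// the target and reject in-flight shares.
fn should_fire(observed: f64, expected: f64, z: f64, tighten_multiplier: f64) -> bool {
    let base_threshold = z * expected.sqrt();
    let deviation = observed - expected;
    if deviation > 0.0 {
        // More shares than expected: tightening is costly, demand more evidence.
        deviation > base_threshold * tighten_multiplier
    } else {
        // Fewer shares than expected: loosening is free, use the plain threshold.
        -deviation > base_threshold
    }
}
```

With 36 expected shares and z = 2, a shortfall of 16 shares fires, while a surplus of 16 does not fire at t = 3.0 and needs to exceed 36 before it does.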



Adds a deterministic in-process simulation framework that characterizes
any `Vardiff` implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.
The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.
The finding that motivates this
Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:
Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):
Higher share rates produce less responsive algorithms, counter to
expectations. The mechanism is mechanical (post-convergence `delta_time`
grows indefinitely, diluting the post-step signal in the cumulative
window) and surfaced clearly in the data despite taking months of
deployment observation to half-articulate. Full numbers and analysis
in `vardiff_baseline.md`.
This isn't a fix. It's the measurement that lets the fix be evaluated.
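The dilution effect described above is easy to see with a little arithmetic. The sketch below assumes a cumulative window that has been accumulating at the true rate for `settled_secs` since the last fire, followed by `post_step_secs` at a rate scaled by `step_factor`. The function name and parameters are hypothetical and are not taken from the sim crate.

```rust
/// Ratio of observed to expected shares over a cumulative window that spans
/// `settled_secs` at the original rate plus `post_step_secs` at
/// `step_factor` times that rate (0.5 models a -50% step).
fn observed_ratio(settled_secs: f64, post_step_secs: f64, step_factor: f64) -> f64 {
    let observed = settled_secs + step_factor * post_step_secs;
    let expected = settled_secs + post_step_secs;
    observed / expected
}
```

If the window was reset just before the step, five minutes of post-step data gives a ratio of 0.5, which is unmistakable. If the algorithm had been settled for 55 minutes without firing, the same five minutes only pull the ratio down to about 0.958, a deviation of roughly 4% that is hard to tell apart from noise.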
What's in the PR
3 commits, ~2000 LOC plus baseline data:
- `feat(vardiff): inject Clock trait + add_shares trait method`
  Minimum API additions to `channels_sv2` for testability and simulation performance. Production behavior unchanged: existing constructors default to `SystemClock`, and the new trait method has a default implementation that keeps existing impls compiling.
- `feat(vardiff_sim): in-process simulation framework`
  New crate at `sv2/channels-sv2/sim/`. Per-tick Poisson share sampling, five behavioral metrics with percentile distributions, 50-cell parameterized sweep (5 share rates × 10 scenarios), CLI binary for baseline generation, regression test asserting against a committed baseline.
- `data(vardiff_sim): design doc + baseline characterization`
  The design proposal documenting metric definitions and tolerance policy, plus the measured baseline as both TOML (consumed by the regression test) and Markdown (for human review).
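As a rough illustration of what "per-tick Poisson share sampling" with a fixed seed can look like, here is a dependency-free sketch. It is not the sim crate's code: the `SplitMix64` generator, `sample_poisson` (Knuth's multiplication method), and `simulate_ticks` are hypothetical stand-ins, and the real crate may use a different RNG and sampler.

```rust
/// Small deterministic PRNG (SplitMix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's method: multiply uniforms until the product drops to exp(-lambda)
/// or below. Fine for the small per-tick lambdas used here.
fn sample_poisson(rng: &mut SplitMix64, lambda: f64) -> u32 {
    let limit = (-lambda).exp();
    let mut count = 0u32;
    let mut product = rng.next_f64();
    while product > limit {
        count += 1;
        product *= rng.next_f64();
    }
    count
}

/// Share counts for `ticks` consecutive ticks of `tick_secs` seconds each,
/// for a miner submitting `shares_per_min` on average.
fn simulate_ticks(seed: u64, shares_per_min: f64, tick_secs: f64, ticks: usize) -> Vec<u32> {
    let mut rng = SplitMix64(seed);
    let lambda = shares_per_min * tick_secs / 60.0;
    (0..ticks).map(|_| sample_poisson(&mut rng, lambda)).collect()
}
```

Because the generator state is fully determined by the seed, two runs with the same seed produce identical share sequences, which is the property a byte-for-byte reproducible baseline depends on.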
What the framework measures
Five behavioral attributes, each as a distribution across 1000
independent trials per cell:
Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
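One way to picture the three-way outcome implied by this policy is the sketch below. It is an interpretation, not the crate's implementation: the `Verdict` enum, `relative_delta`, and `classify` are hypothetical, and treating a delta above the upper bound as a hard failure is an assumption, since the text only states that the 10-25% range is reported without being asserted.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,       // small delta, nothing to flag
    ReportOnly, // mid-range delta, surfaced for a reviewer but not asserted
    Fail,       // large delta (assumed to fail the regression test)
}

/// Relative change of `current` against `baseline`.
/// Returns None when the baseline is zero and a ratio is undefined.
fn relative_delta(baseline: f64, current: f64) -> Option<f64> {
    if baseline == 0.0 {
        None
    } else {
        Some(((current - baseline) / baseline).abs())
    }
}

/// Classify a relative delta against the two bounds of the report-only band.
fn classify(delta: f64, report_from: f64, fail_above: f64) -> Verdict {
    if delta > fail_above {
        Verdict::Fail
    } else if delta >= report_from {
        Verdict::ReportOnly
    } else {
        Verdict::Pass
    }
}
```

For example, a metric that moves from 0.200 to 0.230 has a relative delta of about 15% and lands in the report-only band when the bounds are 0.10 and 0.25.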
How to run
From `sv2/channels-sv2/sim/`:
What this enables
For any future vardiff proposal:
- `Vardiff` impl
- `cargo run --release --bin generate-baseline` to produce comparable measurements
No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."
Where to look in this PR
- rationale): `sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md`
- workflow): `sv2/channels-sv2/sim/README.md`
- `sv2/channels-sv2/sim/vardiff_baseline.md`
What this PR is NOT
- `VardiffState` behavior is unchanged. The only public-API additions are `Vardiff::add_shares` (with a default impl) and the `Clock` trait. Production code defaults to `SystemClock` and behaves identically to before.
- data suggests 12-30 spm is the operational sweet spot, but this PR doesn't touch any defaults.
- a GitHub Action to be a true CI gate. Follow-up.
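The two API additions mentioned in this list follow a common Rust pattern, sketched below under stated assumptions. The names `Clock` and `SystemClock` come from this PR, but the `now_secs` signature is a guess. `MockClock`, `Timer`, `ShareCounter`, and `LegacyCounter` are hypothetical stand-ins, with `ShareCounter` standing in for the real `Vardiff` trait.

```rust
use std::cell::Cell;
use std::time::{SystemTime, UNIX_EPOCH};

/// Time source abstraction. The method signature is an assumption.
trait Clock {
    fn now_secs(&self) -> u64;
}

/// Production clock backed by the operating system.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test clock that only moves when told to (hypothetical helper).
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn new(start: u64) -> Self {
        MockClock { now: Cell::new(start) }
    }

    fn advance(&self, secs: u64) {
        self.now.set(self.now.get() + secs);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.get()
    }
}

/// Code written against `Clock` runs unchanged with either implementation.
struct Timer<'a, C: Clock> {
    clock: &'a C,
    started: u64,
}

impl<'a, C: Clock> Timer<'a, C> {
    fn start(clock: &'a C) -> Self {
        Timer { clock, started: clock.now_secs() }
    }

    fn elapsed_secs(&self) -> u64 {
        self.clock.now_secs() - self.started
    }
}

/// Stand-in for a trait gaining a bulk method with a default implementation,
/// so existing implementors keep compiling without changes.
trait ShareCounter {
    fn increment(&mut self);
    fn count(&self) -> u32;

    /// Added later: the default simply loops over the existing method.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment();
        }
    }
}

/// An implementor written before `add_shares` existed.
struct LegacyCounter {
    shares: u32,
}

impl ShareCounter for LegacyCounter {
    fn increment(&mut self) {
        self.shares += 1;
    }

    fn count(&self) -> u32 {
        self.shares
    }
}
```

A simulation can then advance a `MockClock` by minutes at a time without sleeping, and an implementor that cares about performance can override `add_shares` with a single addition instead of the default loop.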
Open follow-ups
- `cargo test --release --lib -- --ignored` into CI on PRs touching `vardiff/*` or the sim crate.
- `channels_sv2` 5.0.0 → 5.1.0 once the workspace lockfile situation allows (the trait-method addition is technically a minor-version semver change). TODO comment in `Cargo.toml` tracks this.
- surfaces the problem; fixing it is a separate proposal that this framework will be the right tool to evaluate.
Test plan
- `cargo test -p channels_sv2 --lib vardiff`: 17 tests, all pass
- `cargo test` from `sv2/channels-sv2/sim/`: 53 fast unit tests
- `cargo test --release --lib -- --ignored` from sim/: slow regression test passes against committed baseline
- `cargo run --release --bin generate-baseline`: reproduces the committed `vardiff_baseline.toml` byte-for-byte at the same seed