[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests#2154

Open
gimballock wants to merge 31 commits into stratum-mining:main from marafoundation:vardiff/simulation-framework

@gimballock

Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.

The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.

The finding that motivates this

Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:

Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):

| share/min | sensitivity |
| --- | --- |
| 6 | 0.70 |
| 12 | 0.55 |
| 30 | 0.33 |
| 60 | 0.16 |
| 120 | 0.09 |

Higher share rates produce less responsive algorithms — counter to
expectations. The mechanism is mechanical (post-convergence delta_time
grows indefinitely, diluting the post-step signal in the cumulative
window) and surfaced clearly in the data despite taking months of
deployment observation to half-articulate. Full numbers and analysis
in vardiff_baseline.md.
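
The dilution can be made concrete with a little arithmetic. The sketch below assumes an idealized, noiseless miner and a cumulative window that started `age_secs` before the step; `cumulative_deviation` and `secs_until_deviation` are illustrative names, not functions from this PR, and the threshold is a free parameter rather than a value taken from the algorithm.

```rust
/// Relative deviation |realized/target - 1| seen by a cumulative window that
/// ran `age_secs` at the target rate and then `since_step_secs` at
/// `step_ratio` times the target rate (0.5 means a -50% step).
/// Noise is ignored; this is an expected-value model only.
fn cumulative_deviation(age_secs: f64, since_step_secs: f64, step_ratio: f64) -> f64 {
    let total = age_secs + since_step_secs;
    if total <= 0.0 {
        return 0.0;
    }
    let realized_over_target = (age_secs + step_ratio * since_step_secs) / total;
    (realized_over_target - 1.0).abs()
}

/// Seconds after the step until the deviation reaches `threshold`.
/// Returns None if the threshold is not positive or can never be reached
/// (the deviation only approaches |step_ratio - 1| asymptotically).
fn secs_until_deviation(age_secs: f64, step_ratio: f64, threshold: f64) -> Option<f64> {
    let step = (step_ratio - 1.0).abs();
    if threshold <= 0.0 || step <= threshold {
        return None;
    }
    // Solve step * t / (age + t) = threshold for t.
    Some(threshold * age_secs / (step - threshold))
}
```

Under these assumptions the waiting time is linear in the window age: for a -50% step and a 0.25 deviation target, a 300 s window reaches the target 300 s after the step, while a 3000 s window needs 3000 s.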

This isn't a fix. It's the measurement that lets the fix be evaluated.

What's in the PR

3 commits, ~2000 LOC plus baseline data:

  1. feat(vardiff): inject Clock trait + add_shares trait method
    Minimum API additions to channels_sv2 for testability and
    simulation performance. Production behavior unchanged — existing
    constructors default to SystemClock, the new trait method has a
    default implementation that keeps existing impls compiling.

  2. feat(vardiff_sim): in-process simulation framework
    New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
    sampling, five behavioral metrics with percentile distributions,
    50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
    binary for baseline generation, regression test asserting against
    a committed baseline.

  3. data(vardiff_sim): design doc + baseline characterization
    The design proposal documenting metric definitions and tolerance
    policy, plus the measured baseline as both TOML (consumed by the
    regression test) and Markdown (for human review).
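
As a rough illustration of the clock injection in commit 1, the sketch below shows one way a `Clock` trait with a `SystemClock` default and a test double could look. Only the names `Clock` and `SystemClock` come from this PR; the `now_secs` method, `MockClock`, and `elapsed_since` are assumptions made for the example and may not match the actual signatures in channels_sv2.

```rust
use std::cell::Cell;
use std::time::{SystemTime, UNIX_EPOCH};

/// Source of "now" in whole seconds. The method name is hypothetical.
trait Clock {
    fn now_secs(&self) -> u64;
}

/// Production clock backed by the operating system.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test clock that only moves when told to (hypothetical helper).
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn new(start_secs: u64) -> Self {
        MockClock { now: Cell::new(start_secs) }
    }

    fn advance(&self, secs: u64) {
        self.now.set(self.now.get() + secs);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.get()
    }
}

/// Code under test takes any Clock, so simulated hours cost no wall time.
fn elapsed_since<C: Clock>(clock: &C, start_secs: u64) -> u64 {
    clock.now_secs().saturating_sub(start_secs)
}
```

A simulation would construct a `MockClock`, call `advance` once per tick, and pass it wherever production code passes `SystemClock`.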

What the framework measures

Five behavioral attributes, each as a distribution across 1000
independent trials per cell:

| Metric | Better is | What it tells you |
| --- | --- | --- |
| Convergence time | Smaller | How fast the algorithm settles after cold start |
| Settled accuracy | Smaller | How close to truth the algorithm lands |
| Steady-state jitter | Smaller | How often it fires on noise post-settle |
| Reaction time | Smaller | How fast it responds to genuine load changes |
| Reaction sensitivity | ≈ 1 for real Δ, ≈ 0 for noise | Whether it distinguishes signal from noise |

Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.

How to run

From sv2/channels-sv2/sim/:

# Fast unit tests (~1 second)
cargo test

# Generate a fresh baseline (~5-15 seconds)
cargo run --release --bin generate-baseline

# Run the slow regression test (~5-15 seconds; #[ignore]-d by default)
cargo test --release --lib -- --ignored

What this enables

For any future vardiff proposal:

  1. Implement the new algorithm as a Vardiff impl
  2. cargo run --release --bin generate-baseline to produce comparable
    measurements
  3. Diff against the committed baseline
  4. Make the case with numbers

No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."

Where to look in this PR

What this PR is NOT

  • Not an algorithm change. VardiffState behavior is unchanged.
    The only public-API additions are Vardiff::add_shares (with a
    default impl) and the Clock trait. Production code defaults to
    SystemClock and behaves identically to before.
  • Not a recommendation about share rate defaults. The baseline
    data suggests 12-30 spm is the operational sweet spot, but this
    PR doesn't touch any defaults.
  • Not a CI workflow. The regression test works locally but needs
    a GitHub Action to be a true CI gate. Follow-up.

Open follow-ups

  • Wire cargo test --release --lib -- --ignored into CI on PRs
    touching vardiff/* or the sim crate.
  • Bump channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
    situation allows (the trait-method addition is technically a
    minor-version semver change). TODO comment in Cargo.toml tracks
    this.
  • Investigate the reactivity-degrades-with-rate finding. The framework
    surfaces the problem; fixing it is a separate proposal that this
    framework will be the right tool to evaluate.

Test plan

  • cargo test -p channels_sv2 --lib vardiff — 17 tests, all pass
  • cargo test from sv2/channels-sv2/sim/ — 53 fast unit tests
  • cargo test --release --lib -- --ignored from sim/ — slow
    regression test passes against committed baseline
  • cargo run --release --bin generate-baseline — reproduces the
    committed vardiff_baseline.toml byte-for-byte at the same seed

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06
@gimballock

gimballock commented May 13, 2026

The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:

  • We can play dice with share-received events to simulate running the vardiff algorithms over arbitrary ranges of time, but we need to mock SystemTime::now() and add a way to bulk-add new shares.
  • With fake-time simulations we can run large-scale vardiff trials on whatever metrics we want and contrast them against correlated attributes like target shares-per-minute.
    • I was interested in convergence time, steady-state jitter, and convergence accuracy.
    • But responsiveness to external change is also a key capability (how fast vardiff adjusts to a 50% spike/dip in hashrate).
  • With these reproducible test results compiled into a profile, we can use integration tests to lock in established performance thresholds and ratchet up the expectations if we find better algorithms.
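
"Playing dice" with share arrivals can be sketched as Poisson sampling per tick. The example below is a minimal, dependency-free illustration and not the sim crate's code: the SplitMix64 generator, Knuth's sampling method, and the function names are all choices made for this sketch.

```rust
/// Small deterministic PRNG (SplitMix64) so trials are reproducible per seed.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's multiplication method; adequate for the small per-tick means
/// used here (it gets slow and imprecise for very large lambda).
fn sample_poisson(rng: &mut SplitMix64, lambda: f64) -> u32 {
    let limit = (-lambda).exp();
    let mut product = rng.next_f64();
    let mut count = 0;
    while product > limit {
        count += 1;
        product *= rng.next_f64();
    }
    count
}

/// Shares arriving in each tick of `tick_secs` for a miner at `shares_per_min`.
fn simulate_ticks(seed: u64, shares_per_min: f64, tick_secs: f64, ticks: usize) -> Vec<u32> {
    let mut rng = SplitMix64(seed);
    let lambda = shares_per_min * tick_secs / 60.0;
    (0..ticks).map(|_| sample_poisson(&mut rng, lambda)).collect()
}
```

Each per-tick count can then be handed to the algorithm in one bulk call while a mocked clock advances by `tick_secs`, which is what makes hours of simulated time cheap.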

Comment on lines +7 to +13
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |

@gimballock gimballock May 13, 2026

The first row here shows results of the convergence time test for the default case (6 spm).
The convergence times are between 10 and 25 minutes, with total failures to converge (within quiet_window_secs of simulated time) occurring 17% of the time!

The next few rows describe the results for faster share rates. The most extreme times shrink (p99 falls from 25m to 15m), and the total failure cases mostly disappear around 30 spm.

Comment on lines +15 to +37
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better — ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |

These two metrics (proximity to true hashrate and post-convergence adjustments) show a similar trend: lots of undesired behavior in the extreme cases (top 10%, top 1%) of the 6 shares/min case, which is alleviated at higher share rates.
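
For reference, the settled-accuracy definition and a percentile summary can be sketched as follows. The error formula is the one stated in the baseline file; the nearest-rank percentile convention is an assumption, since the convention used by the sim crate is not stated here.

```rust
/// Settled accuracy for one trial: |final_hashrate / true_hashrate - 1|.
fn settled_error(final_hashrate: f64, true_hashrate: f64) -> f64 {
    (final_hashrate / true_hashrate - 1.0).abs()
}

/// Nearest-rank percentile (p in 0..=100) of a non-empty sample.
/// Panics on an empty slice or on NaN values.
fn percentile(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("no NaN in samples"));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}
```

Collecting `settled_error` over all trials of a cell and calling `percentile` with 10, 50, 90, and 99 would produce one row of the settled-accuracy table.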

Comment on lines +39 to +60
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 | 12 | 30 | 60 | 120 |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |

These tables show how vardiff responds to an unexpected change in hashrate, where the change is an increase or decrease by a proportional amount anywhere from 5% to 50%.

The first table looks specifically at a 50% drawdown, showing that at 6 spm vardiff fails to adjust within 5 min a full 30% of the time. The next few rows show that the situation worsens at higher share rates: at 120 spm, 91% of the trials failed to adjust within 5m.

The second table shows that this effect is basically the same for hashrate changes in the opposite direction, and that changes of lesser magnitude trigger an adjustment less often.

Comment thread sv2/channels-sv2/sim/src/regression.rs Outdated
Comment on lines +16 to +24
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`

Convergence rate: the fraction of trials that converge may drop by at most 0.01 (1 percentage point) below the baseline
Convergence p90: the p90 convergence time must be no more than 10% above the baseline's
Settled accuracy: the p50 and p90 errors must be no more than 15% above the baseline's
Jitter p50/p95: p50 may exceed the baseline by at most 0.02 (absolute); p95 by at most 25%
...etc.

You see the pattern: there are lots of magic thresholds in this portion of the code that are arbitrarily chosen at this point and fair game for analysis.
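
The three shapes of rule in the doc comment (multiplicative ceiling, absolute ceiling, absolute floor) can be captured in a small checker. This is a sketch only: `Tolerance`, `within_tolerance`, and `check` are hypothetical names, not the types used in regression.rs.

```rust
/// How a metric may move relative to its baseline before the test fails.
#[derive(Clone, Copy)]
enum Tolerance {
    /// Smaller is better: current <= baseline * factor.
    MaxRatio(f64),
    /// Smaller is better: current <= baseline + slack.
    MaxAbsIncrease(f64),
    /// Larger is better: current >= baseline - slack.
    MaxAbsDecrease(f64),
}

fn within_tolerance(baseline: f64, current: f64, tol: Tolerance) -> bool {
    match tol {
        Tolerance::MaxRatio(factor) => current <= baseline * factor,
        Tolerance::MaxAbsIncrease(slack) => current <= baseline + slack,
        Tolerance::MaxAbsDecrease(slack) => current >= baseline - slack,
    }
}

/// Returns a failure message naming the cell and metric, or None on pass.
fn check(cell: &str, metric: &str, baseline: f64, current: f64, tol: Tolerance) -> Option<String> {
    if within_tolerance(baseline, current, tol) {
        None
    } else {
        Some(format!("{cell}/{metric}: baseline={baseline:.3} current={current:.3}"))
    }
}
```

The absolute variants matter when the baseline is at or near zero, as with jitter p50: a multiplicative ceiling on a baseline of 0.0 would reject any nonzero value.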

@gimballock gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from 5cbed7c to 85d6f8b Compare May 17, 2026 14:48
@gimballock gimballock changed the title feat(vardiff): add in-process simulation framework + baseline regression tests [Draft] feat(vardiff): add in-process simulation framework + baseline regression tests May 17, 2026
@adammwest

after some optimization I got
https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md
with the Bayesian model

| Method | Result |
| --- | --- |
| Bayesian model | Good, Best result |
| Kalman filters | Good |
| Jurik moving averages | Good but artifacts on hashrate changes |
| Thompson sampling | Bad |

@gimballock

gimballock commented May 18, 2026

> after some optimization I got https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md with the Bayesian model
>
> | Method | Result |
> | --- | --- |
> | Bayesian model | Good, Best result |
> | Kalman filters | Good |
> | Jurik moving averages | Good but artifacts on hashrate changes |
> | Thompson sampling | Bad |

I'm so excited to see other people nerding out on vardiff with me! Thank you!

A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 2d10f57 to 414afbb Compare May 19, 2026 14:25
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 4 times, most recently from 211bc98 to 2a88fde Compare May 20, 2026 15:09
@adammwest

Some learnings I had @gimballock

  • This task is hard; I think the current implementation is already optimized to a degree.
  • The most critical thing is the fitness. Currently there are many metrics, and how you combine them into a final value is what determines the goodness of any algorithm. As it's a summary, there are ways to game it.
  • Use every value in the toml file; if you don't, the unused values will naturally degrade.
  • There are many ways to combine many numbers, which lead to slightly better or slightly worse performance.
    One thing I did which helped was to separate fitness into 2 categories, improvement and regression. Even better is separating these per group (e.g. stable, coldstart, ...) so that you can decompose the value.
  • Normalize each metric/group, otherwise the more numerous or larger numbers will be the focus.
  • There are many cases where you can get a good score, but the fitness prefers optimizing 1 variable or a set of variables at the expense of others.
  • Grid parameter sweeps usually discretize the domain, so you are bounded in improvement only by the range of each dimension and the number of queries. You need to constantly increase queries or shrink ranges, and usually you are limited by time. For this reason I prefer random-restart hill climbing, which I find is generally pretty good when you don't make assumptions about the data.
  • If you have too many parameters to optimize you can overfit, and end up just gaming the test.
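
For readers unfamiliar with it, random-restart hill climbing can be sketched in a few lines. This is a generic textbook version, not the optimizer used for the results above; the `Lcg` generator, the 10% step size, and the `hill_climb` signature are assumptions made for the example, and `bounds` must be non-empty.

```rust
/// Tiny deterministic PRNG (64-bit LCG) so runs are reproducible.
struct Lcg(u64);

impl Lcg {
    /// Uniform float in [0, 1) built from the top 53 bits of the state.
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Maximizes `fitness` over the box `bounds` with random-restart hill climbing.
/// Each restart begins at a random point and accepts only improving moves.
fn hill_climb<F: Fn(&[f64]) -> f64>(
    fitness: F,
    bounds: &[(f64, f64)],
    restarts: usize,
    steps: usize,
    seed: u64,
) -> (Vec<f64>, f64) {
    let mut rng = Lcg(seed);
    let mut best: (Vec<f64>, f64) = (Vec::new(), f64::NEG_INFINITY);
    for _ in 0..restarts {
        let mut x: Vec<f64> = bounds
            .iter()
            .map(|&(lo, hi)| lo + (hi - lo) * rng.next_f64())
            .collect();
        let mut fx = fitness(&x);
        for _ in 0..steps {
            // Perturb one random coordinate by up to 10% of its range.
            let i = (rng.next_f64() * bounds.len() as f64) as usize;
            let (lo, hi) = bounds[i];
            let mut candidate = x.clone();
            let delta = (rng.next_f64() - 0.5) * 0.2 * (hi - lo);
            candidate[i] = (candidate[i] + delta).clamp(lo, hi);
            let fc = fitness(&candidate);
            if fc > fx {
                x = candidate;
                fx = fc;
            }
        }
        if fx > best.1 {
            best = (x, fx);
        }
    }
    best
}
```

In this setting `fitness` would run the simulation sweep for a candidate parameter vector and return the combined score, so the overfitting caveat in the last bullet applies directly: the more parameters in `bounds`, the easier it is to fit the test rather than the problem.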

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 63a19d0 to a18c3a3 Compare May 21, 2026 14:28
@gimballock

gimballock commented May 21, 2026

Thanks for these insights @adammwest — especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of
how the approach has matured:

Phase 1: Basic metrics + simulation harness

Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff
algorithm. This gave us reproducible, large-scale trials (50 cells × 1000 trials) against correlated attributes like target shares-per-minute.

Phase 2: Decomposed pipeline model

I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
implementations at each slot for the best composite.

This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down — the components aren't truly independent. There's a sequential
data flow: the estimator needs to communicate its belief to the boundary ("should we respond?") and to the update rule. Additionally, since vardiff triggers on a timer rather than on share arrival, the decision rule needs
to call back to the estimator to update state when adjustments occur. I also dropped the "statistic" component as it wasn't pulling its weight.

The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.
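
To make the data flow concrete, here is a minimal sketch of a three-stage composition. The stage names and `on_fire(new_hashrate, old_hashrate)` follow this PR's description; every other signature, the `Snapshot` fields, and the three toy implementations (`Cumulative`, `RelativeThreshold`, `FullRetarget`) are assumptions for illustration and are much simpler than the real components.

```rust
/// What the estimator currently believes (field names are illustrative).
struct Snapshot {
    estimated_spm: f64,
    target_spm: f64,
}

trait Estimator {
    fn observe(&mut self, shares: u32, dt_secs: f64);
    fn snapshot(&self) -> Snapshot;
    /// Called after a retarget so the estimator can adjust its state.
    fn on_fire(&mut self, new_hashrate: f64, old_hashrate: f64);
}

trait Boundary {
    /// Returns true when the evidence justifies a retarget.
    fn should_fire(&mut self, snapshot: &Snapshot) -> bool;
}

trait UpdateRule {
    fn new_hashrate(&self, current_hashrate: f64, snapshot: &Snapshot) -> f64;
}

struct Composed<E, B, U> {
    estimator: E,
    boundary: B,
    update: U,
    hashrate: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Composed<E, B, U> {
    /// One timer tick: feed shares, maybe retarget, return the new hashrate if fired.
    fn tick(&mut self, shares: u32, dt_secs: f64) -> Option<f64> {
        self.estimator.observe(shares, dt_secs);
        let snap = self.estimator.snapshot();
        if !self.boundary.should_fire(&snap) {
            return None;
        }
        let old = self.hashrate;
        self.hashrate = self.update.new_hashrate(old, &snap);
        self.estimator.on_fire(self.hashrate, old);
        Some(self.hashrate)
    }
}

/// Toy estimator: cumulative share rate since the last fire.
struct Cumulative { shares: f64, secs: f64, target_spm: f64 }

impl Estimator for Cumulative {
    fn observe(&mut self, shares: u32, dt_secs: f64) {
        self.shares += shares as f64;
        self.secs += dt_secs;
    }
    fn snapshot(&self) -> Snapshot {
        let spm = if self.secs > 0.0 { self.shares * 60.0 / self.secs } else { self.target_spm };
        Snapshot { estimated_spm: spm, target_spm: self.target_spm }
    }
    fn on_fire(&mut self, _new_hashrate: f64, _old_hashrate: f64) {
        self.shares = 0.0;
        self.secs = 0.0;
    }
}

/// Toy boundary: fire when the relative deviation exceeds a fixed threshold.
struct RelativeThreshold(f64);

impl Boundary for RelativeThreshold {
    fn should_fire(&mut self, s: &Snapshot) -> bool {
        (s.estimated_spm / s.target_spm - 1.0).abs() > self.0
    }
}

/// Toy update rule: scale the hashrate by realized_spm / target_spm.
struct FullRetarget;

impl UpdateRule for FullRetarget {
    fn new_hashrate(&self, current_hashrate: f64, s: &Snapshot) -> f64 {
        current_hashrate * s.estimated_spm / s.target_spm
    }
}
```

The callback from `tick` into `on_fire` is the sequential dependency described above: the estimator has to learn about the retarget, so the stages cannot be fully independent.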

Phase 3: Aggregate fitness metric

To your point about "how you combine all metrics into a final value" — we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
raised: rather than optimizing one metric at the expense of others, we can define a weighted composite that represents our desired tradeoff. The regression baseline locks in the full vector of metrics so we catch
regressions in any dimension, not just the aggregate.

Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression
can't hide behind a stable-state improvement, but making this more explicit in the scoring would help.

Phase 4: Realistic operating conditions

After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and
network slowdowns on an established channel is valuable functionality — even though in practice many hashrate changes currently cause miner reconnections (which resets vardiff anyway). We've been doing live testing with
physical miners on testnet4 and confirmed this pattern: when vardiff ramps difficulty too aggressively, it can interact with firmware timeout behaviors in ways that force reconnections, making reactivity testing harder
than expected.

Current direction

I've backed off from prioritizing convergence speed after seeing overcorrection in practice. The current focus is on:

  • Stability under steady-state (minimal oscillation once converged)
  • Reasonable reactivity (detect genuine changes within 2–3 retarget windows, not 1)
  • Asymmetric cost awareness — difficulty increases are more disruptive than decreases. An overshoot upward causes difficulty-too-low share rejections (wasted miner work), while an undershoot downward just means slightly
    more shares than optimal (cheap). The AsymmetricCusumBoundary encodes this: it requires stronger evidence before raising difficulty than lowering it. We can now actually measure the impact via the share-rejection metrics
    (shares_rejected_total{reason="difficulty-too-low"}) that were recently added to the pool's monitoring (sv2-apps PR Docs: Channel Factory #491).
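
The asymmetry can be illustrated with a textbook two-sided CUSUM whose tightening side has a higher bar. This is a sketch of the idea only: the internals of AsymmetricCusumBoundary are not shown in this thread, so the struct below, its fields, and the use of raw per-tick share counts are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum Fire {
    /// Raise difficulty (miner looks faster than targeted).
    Tighten,
    /// Lower difficulty (miner looks slower than targeted).
    Ease,
}

/// Two-sided CUSUM over per-tick share counts with a higher bar for tightening.
struct AsymmetricCusum {
    expected_per_tick: f64,
    slack: f64,              // drift allowance subtracted on every tick
    base_threshold: f64,     // bar for easing
    tighten_multiplier: f64, // > 1.0 makes difficulty increases harder to trigger
    up: f64,                 // accumulated evidence of a faster miner
    down: f64,               // accumulated evidence of a slower miner
}

impl AsymmetricCusum {
    fn new(expected_per_tick: f64, slack: f64, base_threshold: f64, tighten_multiplier: f64) -> Self {
        AsymmetricCusum {
            expected_per_tick,
            slack,
            base_threshold,
            tighten_multiplier,
            up: 0.0,
            down: 0.0,
        }
    }

    /// Feeds one tick's share count; returns a decision when a bar is crossed.
    fn observe(&mut self, shares_in_tick: u32) -> Option<Fire> {
        let dev = shares_in_tick as f64 - self.expected_per_tick;
        self.up = (self.up + dev - self.slack).max(0.0);
        self.down = (self.down - dev - self.slack).max(0.0);
        let fired = if self.down > self.base_threshold {
            Some(Fire::Ease)
        } else if self.up > self.base_threshold * self.tighten_multiplier {
            Some(Fire::Tighten)
        } else {
            None
        };
        if fired.is_some() {
            // Start collecting fresh evidence after any retarget.
            self.up = 0.0;
            self.down = 0.0;
        }
        fired
    }
}
```

With 12 expected shares per tick, a slack of 1, a base threshold of 10, and a multiplier of 2, a miner delivering 6 shares per tick triggers an ease on the third tick, while one delivering 18 per tick needs five ticks to trigger a tighten.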

On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better
normalization approaches.

@gimballock

gimballock commented May 30, 2026

I've been working on a simple proxy to validate the vardiff responsiveness findings from the simulations. My first test aimed to confirm the assessment that the existing vardiff algorithm is slow to respond to hashrate changes after it has already converged.

Test Setup:

I configured an S21 miner → tproxy (vardiff disabled) → shape-proxy (controllable share rate) → SRI pool. The shape-proxy maintains a smoothed share rate based on current pool difficulty and can selectively drop shares to simulate hashrate changes on command.

Methodology:

I ran two parallel instances of the unmodified SRI pool server (main branch) and triggered a 50% share rate drop via API command. I then measured how long it took for the pool's hashrate estimate and share acceptance rate to adjust to the new rate.

Results:

The pool required approximately 40 minutes to complete the initial hashrate adjustment, while the simulation predicted that 70% of miners fail to adjust at all and that those that do adjust take 5m. So although it may look like a 20× discrepancy between simulation and observed behavior, it's more of a confirmation that the adjustment takes a very long time.

I will investigate if there is any discrepancy and re-calibrate the simulation if so. The attached image shows the resulting hashrate and accepted share rate for both trials.

Screenshot 2026-05-30 at 1 33 10 PM

The blue lines are the pool hashrate. The initial spike after startup is the convergence spike; it then settles and stays flat for not quite an hour before I activate the 50% hashrate drop, which is immediately visible as a drop in the share rate. The share rate stays low until vardiff eventually fires the adjustment, allowing it to return to normal.

@gimballock

Screenshot 2026-05-30 at 2 15 21 PM

Here you can see the previous 12h of consistent hashrate w/ zero rejected shares as evidence that the 'smoothing' of the hashrate via my proxy is stable enough to test against.

@paratoxicdev

Hi, this is quite a detailed analysis and great decomposition of relevant parts.

Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool

I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs

Would be interesting to see how that algorithm performs in your benchmarks.

@gimballock

> Hi, this is quite a detailed analysis and great decomposition of relevant parts.
>
> Have you had a look at how ckpool implements this? I think he has quite naturally arrived at a very optimal state, balancing the different metrics. This is the repo, you'll have to grep through to find the vardiff implementation: https://github.com/ckolivas/ckpool
>
> I've re-implemented his approach in Rust as well and made it a bit more configurable here: https://github.com/parasitepool/para/blob/master/src/vardiff.rs
>
> Would be interesting to see how that algorithm performs in your benchmarks.

Thanks for the info @paratoxicdev, I will add this to my investigation. I know that is an sv1-native pool, but I will see which bits of his research cross over to the sv2 context and whether it's competitive!

@gimballock

Here is a breakdown of the calibration comparisons I made between the real hashrate tests and the simulation results. They confirm the predictions, with the understanding that with this algorithm responsiveness scales with the age of the connection. I suggest including this detail in the simulation so the metrics are more directly comparable:

Vardiff Calibration Summary: Simulation vs Real Miner Testing

  Algorithm under test: Classic Parametric vardiff (VardiffState from stratum-core, branch vardiff/parametric-thresholds)

  Hardware: Antminer S21 (~200 TH/s), testnet4
  Pool config: shares_per_minute = 6 and shares_per_minute = 20
  Tool: shape-proxy with Step{1.0→0.5, at_secs:300} and Track{1.0} reset

  ---
  Structural findings confirmed
  
  | Property | Sim prediction | Real observation | Match |
  | --- | --- | --- | --- |
  | Steady-state jitter | 0.000 fires/min | Zero fires during 30+ min stable operation (both rates) | Yes |
  | Per-fire step magnitude | Deterministic: realized_spm / target_spm ratio | -16.7% consistently (5 consecutive fires) | Yes |
  | Fire cadence after counter reset | Exactly 300s (15% threshold at delta_time≥300) | 22:39→22:44→22:49→22:54→22:59 (5-min cadence) | Yes |
  | Overshoot after staircase descent | p99 = 69% at 6 spm | ~60% overshoot → share flood → oscillation | Yes |
  | Sensitivity depends on counter age | Implicit in algorithm (accumulating counter never resets except on fire) | 5-min counter: 4.4 min reaction; 51-min counter: 51.8 min reaction | Yes |
  | Algorithm symmetric for ±50% | Sim shows similar reaction rates both directions | Confirmed: similar timescales for step-down and step-up | Yes |

  ---
  Reaction time: -50% step (hashrate halved)
  
  | Condition | Sim prediction | Real result |
  | --- | --- | --- |
  | 6 spm, fresh counter (~5 min) | 3.7% fire within 5 min; those that fire: p50=5m | Slot 3: 4.4 min, Slot 4: 8.0 min |
  | 6 spm, settled counter (~51 min) | ~96% don't fire within 5 min (tail extends indefinitely) | Slot 3: 51.8 min |
  | 20 spm, moderate counter (~27 min) | 14.2% fire within 5 min | Slot 3: 6.9 min, Slot 4: 8.4 min |

  ---
  Reaction time: +50% step (return to full rate)
  
  | Condition | Sim prediction | Real result |
  | --- | --- | --- |
  | 6 spm, 51-min counter | Deep in tail (>96% non-reactive) | Slot 3: 51.8 min |
  | 6 spm, 68-min counter | Deep in tail | Slot 4: 15.4 min |
  | 20 spm, 20-21 min counter | Higher spm improves detection | Slot 3: 5.9 min, Slot 4: 6.3 min |

  ---
  Key insight: counter age is the dominant variable
  
  The sim's test design (step at t=15min, 5 min after cold-start convergence) always tests with a fresh counter. This produces the "3.7% react within 5 min"
  figure. In reality, a pool that hasn't fired in hours has a massive accumulated counter that dilutes any step signal — explaining 40-minute to 2.5-hour
  response times observed in earlier tests.

  Going from 6 spm (51-min counter) to 20 spm (~20-min counter), reaction time improves ~8x (51.8 min → 5.9 min). The counter ages differ between
  these runs, so not all of the gain can be attributed to share rate, but more shares per minute does mean the new rate accumulates statistical
  weight faster against the counter history.

  ---
  Simulation validity assessment
  
  The simulation is trustworthy for relative algorithm comparisons. It correctly models:
  - The threshold table mechanics (fire/no-fire decisions)
  - The deterministic staircase behavior post-fire
  - The zero-jitter steady state
  - The share-rate dependence on detection speed
  - The fundamental sensitivity-decay-over-time flaw

  Gap to address: The sim should add a "counter age" axis (test steps at t=30m, t=60m, t=120m) to characterize the tail distribution, which is where
  real-world pools spend most of their time. The current 5-minute observation window and 15-minute step timing understate the algorithm's poor real-world
  responsiveness.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 006363a to a58132b Compare June 3, 2026 04:51
@gimballock

gimballock commented Jun 3, 2026

Ok here is evidence that the top algorithm (deployed side-by-side with the current vardiff) is immune to the age-dependence effect. This reproduces results predicted in the simulation but now seen in real life.

I started a mining channel against both pools and let them mine overnight.
In the morning I cut the hashrate in half for both pools at roughly the same time and waited until both pools' vardiffs finished adjusting to the new hashrate. The annotated image below shows the results from the Grafana dashboard.

Screenshot 2026-06-03 at 5 09 20 PM

The current vardiff algorithm not only took several hours to respond, but the response it eventually made was in the wrong direction! The new algorithm, by contrast, responded in a few minutes and settled on the correct value.

Eric Price and others added 14 commits June 4, 2026 12:57
…metrics

Complete redesign of the vardiff framework and a new production algorithm
that dominates the legacy implementation on every operational metric.

## New Algorithm: AdaCUSUM

VardiffState::new() now returns the AdaCUSUM composition internally:
  EwmaEstimator(120s) + AdaptiveCusumBoundary(s=1.5, f=0.05) + PartialRetarget(0.5)

Zero downstream API changes — sv2-apps continues calling VardiffState::new().

Performance vs legacy VardiffState at SPM=12:
  - Convergence: 6 min vs 10 min (40% faster)
  - React to -10% decline: 62% vs 14% (4.4x better)
  - React to -50% decline: 99% vs 55% (near-perfect)
  - Jitter: 0.107/min vs 0.018/min (acceptable trade-off)
  - Overshoot: 26% vs 87% (3.3x less)

## Architecture: Three-Stage Pipeline

  Estimator → Boundary → UpdateRule

- Statistic trait removed (deviation is inline arithmetic)
- Composed<E, S, B, U> → Composed<E, B, U>
- Estimator::reset() → on_fire(new_hashrate, old_hashrate)
- EstimatorSnapshot gains uncertainty: Option<Uncertainty>
- Boundary receives &EstimatorSnapshot (uncertainty-aware)
- UpdateRule receives threshold (margin-aware)
- EwmaEstimator::on_fire rescales instead of zeroing

## operational_fitness Metric

  fitness = 0.25 × reaction_rate(-10%)
          + 0.20 × reaction_rate(-50%)
          + 0.20 × clamp(1 - jitter/0.30, 0, 1)
          + 0.25 × convergence_rate × clamp(1 - conv_p50/600s, 0, 1)
          + 0.10 × clamp(1 - overshoot_p99, 0, 1)
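
The weighted sum above can be sketched as a small function. This is a minimal illustration, not the PR's implementation: the struct `FitnessInputs`, its field names, and the units (jitter in fires per minute, convergence time in seconds, overshoot as a fraction) are assumptions chosen to match the formula.

```rust
/// Hypothetical bundle of per-cell measurements. Field names are
/// illustrative and are not taken from the PR's types.
pub struct FitnessInputs {
    pub reaction_rate_minus_10: f64, // P[fire] after a -10% step, in [0, 1]
    pub reaction_rate_minus_50: f64, // P[fire] after a -50% step, in [0, 1]
    pub jitter_per_min: f64,         // steady-state fires per minute
    pub convergence_rate: f64,       // fraction of trials that converge
    pub conv_p50_secs: f64,          // median convergence time, seconds
    pub overshoot_p99: f64,          // p99 overshoot as a fraction (0.26 = 26%)
}

/// Weighted sum from the commit message. The jitter, convergence-time,
/// and overshoot terms are clamped to [0, 1] before weighting.
pub fn operational_fitness(m: &FitnessInputs) -> f64 {
    let jitter_score = (1.0 - m.jitter_per_min / 0.30).clamp(0.0, 1.0);
    let speed_score = (1.0 - m.conv_p50_secs / 600.0).clamp(0.0, 1.0);
    let overshoot_score = (1.0 - m.overshoot_p99).clamp(0.0, 1.0);
    0.25 * m.reaction_rate_minus_10
        + 0.20 * m.reaction_rate_minus_50
        + 0.20 * jitter_score
        + 0.25 * m.convergence_rate * speed_score
        + 0.10 * overshoot_score
}
```

Because the weights sum to 1.0, an algorithm that detects every step, never fires in steady state, converges instantly, and never overshoots scores exactly 1.0.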

## Components Explored

New estimators: BayesianEstimator, KalmanEstimator
New boundaries: CredibleIntervalBoundary, CusumBoundary, AdaptiveCusumBoundary
New update rules: AdaptivePartialRetarget
New metrics: FireDecisiveness, StepCorrection, OperationalFitness

All 104 channels_sv2 + 141 sim tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds AsymmetricCusumBoundary that uses different thresholds based on
whether firing would tighten or ease difficulty:

- Easing (miner slowing): uses base threshold (fire quickly, free action)
- Tightening (miner speeding): uses base × tighten_multiplier (cautious,
  because tightening rejects in-flight shares)

Rationale: SetTarget that makes difficulty harder invalidates shares
already being computed by miners. SetTarget that makes difficulty easier
has zero cost — old harder work is still valid under the new easier target.
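
A minimal sketch of the direction-dependent threshold described above. The names `Direction`, `effective_threshold`, and `should_fire`, and the sign convention for the accumulated deviation, are assumptions for illustration; the PR's `AsymmetricCusumBoundary` may be structured differently.

```rust
/// Direction a retarget would move difficulty (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Direction {
    Ease,    // miner is slower than expected: lower difficulty
    Tighten, // miner is faster than expected: raise difficulty
}

/// Picks the firing threshold for a signed accumulated deviation.
/// Assumed convention: `deviation > 0` means the miner is over-performing.
pub fn effective_threshold(
    base: f64,
    tighten_multiplier: f64,
    deviation: f64,
) -> (Direction, f64) {
    if deviation > 0.0 {
        // Tightening invalidates in-flight shares, so demand more evidence.
        (Direction::Tighten, base * tighten_multiplier)
    } else {
        // Easing costs nothing, so the base threshold applies.
        (Direction::Ease, base)
    }
}

/// Fires when the magnitude of the deviation exceeds the threshold
/// chosen for its direction.
pub fn should_fire(base: f64, tighten_multiplier: f64, deviation: f64) -> bool {
    let (_, threshold) = effective_threshold(base, tighten_multiplier, deviation);
    deviation.abs() > threshold
}
```

With `tighten_multiplier = 3.0`, a deviation of magnitude 1.5 against a base threshold of 1.0 fires when it points toward easing but not when it points toward tightening.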

Results with tighten_multiplier=3.0 (AsymCUSUM-t30):
  Mean fitness: 0.751 (vs symmetric AdaCUSUM 0.676, FullRemedy 0.565)
  Jitter: 0.045/min (vs 0.175 symmetric, 58% reduction)
  Convergence: 4 min (vs 6 min symmetric, 33% faster)
  Overshoot: 16.6% (vs 26.4% symmetric, 37% less)
  Detection -50%: 99.0% (unchanged)
  Detection -10%: 44.3% (vs 61.9% — the cost of asymmetry)

The jitter reduction directly translates to fewer share rejections in
production: ~3 costly tighten-fires per hour instead of ~6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ran cargo fmt to fix line-length violations introduced during rebase.
Regenerated baseline_VardiffState.toml to reflect the current
AsymmetricCusumBoundary behavior (directional cost awareness produces
expected negative asymmetry values at higher SPM rates).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…magnitude metric

Update OperationalFitness weights to penalize algorithms that raise
difficulty as aggressively as they lower it:

  fitness = 0.25 × reaction(-10%) + 0.20 × reaction(-50%)
          + 0.20 × jitter + 0.15 × convergence
          + 0.10 × asymmetry_preference + 0.10 × overshoot

The asymmetry term rewards algorithms where reaction_rate(Step -10%)
exceeds reaction_rate(Step +10%) — i.e. faster detection of hashrate
drops than spikes. This encodes the operational insight that upward
difficulty adjustments are more disruptive (causing difficulty-too-low
share rejections) than downward ones.

New metric: upward_step_magnitude — tracks the ratio (new/old) of all
upward difficulty adjustments during steady state. The p95 value serves
as a deterministic proxy for difficulty-too-low rejection risk without
depending on stochastic share-rejection simulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rity

Restructure OperationalFitness to reflect operational reality:
"never surprise the miner" matters more than "detect changes fast."

New weights (harm-avoidance 60%, reactivity 25%, convergence 15%):
  0.15 × reaction(-10%) + 0.10 × reaction(-50%)
  + 0.25 × jitter_control + 0.25 × step_magnitude_safety
  + 0.15 × convergence + 0.10 × overshoot_safety

The step_magnitude_safety term directly penalizes large upward
difficulty jumps: p95 step of 1.0× scores 1.0, 1.5× scores 0.0.
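
The two anchor points for step_magnitude_safety and the restructured weights can be sketched as follows. The commit message fixes only the endpoints (1.0× scores 1.0, 1.5× scores 0.0); the straight line between them, the struct `Scores`, and the function names are assumptions of this sketch.

```rust
/// Maps the p95 upward step ratio (new/old difficulty) to a score.
/// Endpoints come from the commit message; linear interpolation
/// between them is an assumption.
pub fn step_magnitude_safety(p95_upward_ratio: f64) -> f64 {
    (1.0 - (p95_upward_ratio - 1.0) / 0.5).clamp(0.0, 1.0)
}

/// Hypothetical, already-normalized component scores, each in [0, 1].
pub struct Scores {
    pub reaction_minus_10: f64,
    pub reaction_minus_50: f64,
    pub jitter_control: f64,
    pub step_magnitude_safety: f64,
    pub convergence: f64,
    pub overshoot_safety: f64,
}

/// Harm-avoidance-dominant weighting: 60% harm avoidance,
/// 25% reactivity, 15% convergence.
pub fn harm_weighted_fitness(s: &Scores) -> f64 {
    0.15 * s.reaction_minus_10
        + 0.10 * s.reaction_minus_50
        + 0.25 * s.jitter_control
        + 0.25 * s.step_magnitude_safety
        + 0.15 * s.convergence
        + 0.10 * s.overshoot_safety
}
```

Under the linear assumption, a p95 upward step of 1.25× lands halfway between the anchors and scores 0.5.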

Result: FullRemedy now wins at 6-12 spm (realistic miner rates)
where its cautious approach correctly avoids the timeout death
spiral observed with physical miners. VardiffState still wins at
15-30 spm where aggressive reactivity is less harmful.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shape-proxy calibration testing on testnet4 revealed that the classic
vardiff algorithm's reaction time is dominated by counter age (time
since last retarget fire) — not share rate alone. A -50% hashrate drop
takes 2 minutes to detect at a 5-minute counter but 51 minutes at a
120-minute counter. The existing simulation framework tested only at
~15-minute counter ages, masking this critical dependency and making the
classic algorithm appear more capable than it is in production.

This commit folds those calibration findings into the core evaluation:

Scenario: SettledStep { settle_minutes, delta_pct }
  Holds hashrate stable for settle_minutes (aging the counter), then
  steps. The observation window is 60 minutes — long enough to capture
  the slow reactions at mature counter ages. Also emits the *actual*
  counter age at step time (time since last fire), which reveals that
  jitter fires cap achievable counter age for certain algorithms.
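
As a rough illustration of the scenario shape, the hashrate profile can be written as a step function of elapsed time. The struct below reuses the field names from the commit message, but its methods (`hashrate_at`, `observe_until_secs`) are hypothetical and not the framework's API.

```rust
/// Illustrative stand-in for the scenario described above.
pub struct SettledStep {
    pub settle_minutes: u64, // how long hashrate is held stable
    pub delta_pct: f64,      // step size in percent, e.g. -50.0
}

impl SettledStep {
    /// True hashrate at `t_secs` since the start of the trial.
    pub fn hashrate_at(&self, base_hashrate: f64, t_secs: u64) -> f64 {
        if t_secs < self.settle_minutes * 60 {
            base_hashrate
        } else {
            base_hashrate * (1.0 + self.delta_pct / 100.0)
        }
    }

    /// End of the 60-minute observation window, in seconds from trial start.
    pub fn observe_until_secs(&self) -> u64 {
        (self.settle_minutes + 60) * 60
    }
}
```

A trial with `settle_minutes = 60` and `delta_pct = -50.0` therefore runs for two simulated hours: one to age the counter and one to observe the reaction.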

Metric: CounterAgeSensitivity (derived, per share rate)
  Score = P[detect a -10% drop within 60 min at a 60-minute counter].
  This directly answers "can the algorithm notice partial degradation
  (thermal throttle, failing ASICs) under typical steady-state
  conditions?" Also emits diagnostic ratios (p50 reaction time at
  mature vs young counter) for both -50% and -10% steps.

  Results at 1000 trials:
    VardiffState (AdaCUSUM): 100% detection — age-independent
    FullRemedy, EWMA-60s:   100% — age-independent
    SlidingWindow-10t:       95% — nearly age-independent
    ClassicComposed:          3-30% — functionally blind at high SPM
    Parametric:               <1% — completely unable to detect
    ClassicPartialRetarget:   0-19% — mostly blind

Metric: ComprehensiveFitness (derived, per share rate)
  comprehensive = 0.80 × OperationalFitness + 0.20 × counter_age_score

  Designed as the objective function for automated parameter search.
  Preserves the existing OperationalFitness formula (harm-avoidance
  dominant, backward-compatible) while adding a 20% weight for
  counter-age robustness — enough to correctly penalize algorithms
  that appear reactive at favorable counter ages but fail in production.
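
The blend is a plain convex combination. A minimal sketch, assuming both inputs are already normalized to [0, 1]; the function name is illustrative.

```rust
/// 80/20 blend of operational fitness and the counter-age score,
/// as stated in the commit message.
pub fn comprehensive_fitness(operational_fitness: f64, counter_age_score: f64) -> f64 {
    0.80 * operational_fitness + 0.20 * counter_age_score
}
```

An algorithm with perfect operational fitness that never detects a -10% drop at a mature counter is capped at 0.80.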

Grid changes:
  compare-algorithms and iterative-eval now include 4 SettledStep
  scenarios (settle=5min and 60min, delta=-50% and -10%) alongside the
  existing Step scenarios. Cost: +32 cells/algo (~20s at 1000 trials).
  A standalone counter-age-sweep binary provides the full 5×2×8 grid
  for deep characterization.

The top-tier ranking (VardiffState > FullRemedy > AdaCUSUM > EWMA-60s)
is unchanged — these algorithms were already best on OperationalFitness
and the counter-age dimension reinforces their lead. The mid-tier
reorders: Parametric/ParametricStrict drop below SlidingWindow because
their threshold-ladder boundary, while less extreme than Classic's,
still degrades at mature counters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Investigate PID-based vardiff approaches (P-only with pow2 quantization,
and well-tuned P+I+D) as alternatives to the three-stage pipeline.
Neither outperforms the Composed architecture, but the PID integral
term concept reveals a gap in our UpdateRule: fixed η ignores fire
history. The resulting improvements:

AcceleratingPartialRetarget (new UpdateRule):
  η ramps from base→max on consecutive same-direction fires. Parameter
  sweep (500 trials × 80 cells) confirms acc=0.2, cap=0.6 gives 22%
  better convergence with zero jitter cost.

SpmRatioEstimator (new Estimator):
  EWMA with direct SPM-ratio scaling — bypasses hash_rate_from_target
  U256 path. Behaviorally equivalent, simpler code.

SignPersistenceCusumBoundary (new Boundary):
  Lowers threshold when deviation sign persists across ticks. +6%
  detection on ±10% steps but +23% jitter — needs further tuning.

Also adds:
- pow2_pid.rs: reference pow2-quantized PID for simulation comparison
- pid_tuned.rs: well-tuned PID (P+I+D active) reference
- Four sim binaries (compare-pid, compare-best, convergence-time,
  sweep-accelerating) and PID_INVESTIGATION.md documenting the full
  analysis and tradeoffs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swap VardiffState's internal composition from fixed PartialRetarget(0.5)
to the sweep-validated BestOfBest:

  SpmRatioEstimator(120s)
  + AsymmetricCusumBoundary(s=1.5, floor=0.05, tighten=3.0)
  + AcceleratingPartialRetarget(base=0.2, max=0.6, acc=0.2)

The η acceleration (0.2→0.4→0.6 on consecutive same-direction fires)
yields 22% better convergence with zero jitter cost. Trades ~2 min
slower cold start for 2-3× better steady-state accuracy at SPM≥15.
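
The η ramp can be sketched as a streak counter. The parameters (base=0.2, max=0.6, acc=0.2) come from the commit message; the linear ramp, the reset on a direction change, and the names `AcceleratingEta` and `partial_retarget` are assumptions that reproduce the stated 0.2→0.4→0.6 sequence and may differ from the PR's code.

```rust
/// Illustrative sketch of an accelerating partial retarget.
pub struct AcceleratingEta {
    base: f64,
    max: f64,
    acc: f64,
    streak: u32,           // consecutive same-direction fires so far
    last_up: Option<bool>, // direction of the previous fire, if any
}

impl AcceleratingEta {
    pub fn new(base: f64, max: f64, acc: f64) -> Self {
        Self { base, max, acc, streak: 0, last_up: None }
    }

    /// Records a fire and returns the η to apply for it.
    /// `up` is true when the fire raises difficulty.
    pub fn on_fire(&mut self, up: bool) -> f64 {
        if self.last_up == Some(up) {
            self.streak += 1; // same direction again: accelerate
        } else {
            self.streak = 0; // first fire or reversal: back to base
        }
        self.last_up = Some(up);
        (self.base + self.acc * self.streak as f64).min(self.max)
    }
}

/// Moves a fraction `eta` of the way from the current value to the estimate.
pub fn partial_retarget(current: f64, estimate: f64, eta: f64) -> f64 {
    current + eta * (estimate - current)
}
```

Resetting on reversal is what keeps the jitter cost low in this sketch: isolated noise fires alternate direction and stay at the base η, while a genuine step produces a same-direction streak.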

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SpmRatioEstimator's linear formula (h_estimate = current_h × realized/expected)
assumes the target-hashrate-spm invariant holds perfectly across retargets.
In production, floating-point drift between nominal_hashrate and the U256
target breaks this assumption, causing the estimator to produce inverted
retarget decisions (easing when it should tighten).

EwmaEstimator uses hash_rate_from_target() which derives hashrate directly
from the actual target, making it immune to this drift. SpmRatioEstimator
remains available for simulation (where the invariant holds by construction).

Discovered via live deployment: slot 4 eased from 232K→5K difficulty while
the miner was producing 2× target SPM (should have tightened).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… rejection

Add ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS
required by sv2-apps pool code when rejecting extended channel open
requests on pools configured for standard jobs only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator assumes one observe() call per tick (60s), but the pool
calls increment_shares_since_last_update() once per share arrival.
With 12+ shares per tick, each observe(1) applies a full EWMA decay,
collapsing the rate estimate toward 1.0 regardless of actual throughput.
This causes the algorithm to consistently see under-performance and ease.

CumulativeCounter simply accumulates shares and computes realized SPM
at snapshot time from the total count ÷ elapsed time — immune to the
per-share vs per-tick calling convention difference.

Retains AcceleratingPartialRetarget and AsymmetricCusumBoundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EwmaEstimator previously applied EWMA decay on every observe() call,
assuming one call per tick. Production callers invoke observe(1) per
share arrival (~12× per tick), causing the rate to collapse toward 1.0.

Fix: observe() now accumulates into a pending counter. The EWMA decay
is applied once per snapshot() call (one per vardiff tick), making the
estimator produce identical results whether called as observe(12) once
or observe(1) twelve times.
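
A minimal sketch of the accumulate-then-decay pattern described above. It uses `&mut self` for brevity, whereas the commit uses atomics so that `snapshot()` can take `&self`. The type `BatchedEwma`, the method `tick`, and the exact smoothing formula are assumptions for illustration, not the PR's `EwmaEstimator`.

```rust
/// EWMA rate estimator that is insensitive to how shares are batched.
pub struct BatchedEwma {
    tau_secs: f64, // smoothing time constant
    rate_spm: f64, // smoothed shares per minute
    pending: u64,  // shares observed since the last tick
}

impl BatchedEwma {
    pub fn new(tau_secs: f64, initial_spm: f64) -> Self {
        Self { tau_secs, rate_spm: initial_spm, pending: 0 }
    }

    /// Only accumulates; no decay is applied here, so the call pattern
    /// (one call per share or one per tick) cannot change the result.
    pub fn observe(&mut self, n_shares: u64) {
        self.pending += n_shares;
    }

    /// Applies exactly one decay step per tick of length `dt_secs`.
    pub fn tick(&mut self, dt_secs: f64) -> f64 {
        let realized_spm = self.pending as f64 * 60.0 / dt_secs;
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        self.rate_spm += alpha * (realized_spm - self.rate_spm);
        self.pending = 0;
        self.rate_spm
    }
}
```

Calling `observe(12)` once or `observe(1)` twelve times before `tick(60.0)` leaves the same pending count, so both paths yield the same estimate.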

Uses AtomicU64/AtomicU32 for interior mutability in snapshot() (the
Estimator trait requires &self) to satisfy Send+Sync bounds.

Also documents the calling convention contract in the Estimator trait:
implementations MUST handle both per-share and per-tick observe patterns.

Production VardiffState uses CumulativeCounter (immune to this issue)
until EwmaEstimator is validated in production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… is fixed

EwmaEstimator is now safe for per-share observe() calls (pending shares
accumulate, decay applied once per snapshot). Swap back from
CumulativeCounter to get the EWMA's temporal smoothing benefits:
better noise rejection and more stable estimates under Poisson variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_time simulation

Port ckpool's vardiff algorithm (dual-window adaptive EWMA with per-share
decay_time updates) to the three-stage pipeline. Key finding: the correct
way to port a continuous-time EMA to a tick-based framework is to simulate
per-share updates within each tick, not to apply time-bias correction
factors. While tuning gets CkpoolRemedy within ~5% of FullRemedy's
comprehensive fitness, it yields no Pareto improvement — the dual-window
switching adds complexity without outperforming a single-window EWMA(120s).

New components:
- CkpoolEstimator: per-share decay_time() simulation with dual-window
  adaptive switching and configurable fast-threshold
- HysteresisGate: binary fire/no-fire boundary with data gate + dead band
- CkpoolRetarget: full retarget with oscillation guard
- TimeBiasEwmaEstimator: single-window EWMA with time-bias correction

Grid registrations: ckpool(), ckpool_remedy(), ckpool_remedy_ft(n),
ckpool_narrow_hyst(), ckpool_with(), time_bias_remedy()

Also updates PID_INVESTIGATION.md with structural analysis of why PID
fails (stage conflation makes the 41% dead zone undiagnosable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from e0213c8 to 6aa27a9 Compare June 4, 2026 16:59
Eric Price and others added 3 commits June 4, 2026 13:00
The upstream commit df4e764 added ERROR_CODE_OPEN_MINING_CHANNEL_EXTENDED_CHANNELS_NOT_SUPPORTED_FOR_STANDARD_JOBS,
which our branch already defined. Remove the duplicate to fix the build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The baseline was stale — generated before the EwmaEstimator promotion
(f4cd687) changed VardiffState's internal composition. Regenerated with
default 1000 trials × 80 cells at the canonical seed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@paratoxicdev


Thanks for the analysis, @gimballock!

Why not consider share-based vardiff, if you've used it before?

Since you're also Rust-based I would have a look at the vardiff.rs + decay.rs files with Claude. The type system has allowed much better encapsulation and reasoning about behaviour and I think that'll be easier to compare than the C code from ckpool.

I do have to disagree with some of the things your Claude said:

> Where timer-based genuinely wins: Scalability. In SV2 with multiplexed channels, evaluating vardiff on every share means the hot path (share validation → channel dispatch → vardiff state update)
> gets heavier per share. Timer-based decouples evaluation from the ingestion path — shares go into a counter, evaluation happens on a separate cadence.

Each channel has its own Vardiff, so there's no shared data structure with contention. The pool is overall O(shares), where a couple of floating-point operations for share-based vardiff are completely dwarfed by the cost of deserializing a share from the wire and then doing a double SHA-256 hash for nonce validation (which is done no matter which vardiff you use). I'm not that caught up on SV2, so please correct me if that's not what it does.

> The real answer: Timer-based with a well-tuned boundary (PoissonCI or CUSUM) is strictly better for pool architecture. The "responsiveness" argument for share-based evaluation is illusory — it's
> the boundary's evidence threshold that gates reaction time, not how often you ask the question. Asking more often with less data per ask doesn't help.

So the ckpool algo not only specifies an estimator (exponentially weighted moving average, see decay.rs) but also a boundary (when to adjust difficulty). The two main parts which make up the boundary are the HYSTERESIS bounds and MIN_WINDOW_RATIO.

/// Minimum window ratio before considering adjustment.
/// Fraction of expected time (or shares) per window.
/// Derived from ckpool: 240s / 300s window = 0.8
const MIN_WINDOW_RATIO: f64 = 0.8;

/// Only decrease difficulty when rate drops below this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_LOW: f64 = 0.5;

/// Only increase difficulty when rate exceeds this fraction of target.
/// Copied from ckpool.
const HYSTERESIS_HIGH: f64 = 1.33;

#[derive(Debug, Clone)]
pub(crate) struct Vardiff {
    period: Duration,
    window: Duration,
    min_shares_for_adjustment: u32,
    min_time_for_adjustment: Duration,
    dsps: DecayingAverage,
    current_diff: Difficulty,
    old_diff: Difficulty,
    first_share: Option<Instant>,
    last_diff_change: Instant,
    shares_since_change: u32,
    min_diff: Option<Difficulty>,
    max_diff: Option<Difficulty>,
    diff_change_job_id: Option<JobId>,
}

So not only is the responsiveness tuneable, you can see it working on real mining machines in the wild: https://stats.ckpool.org/. From experience I can say that it responds beautifully to any scale of hashrate, be it a CPU miner, a Bitaxe, or 1 EH/s of rental hashrate.

I really like the theoretical work you're doing, and it has deepened my understanding of what I originally just copied from ckpool (I now know words like estimator and boundary). I feel like this algo should be the benchmark that any new vardiff is compared with, since it's proven to work with real hashrate.

That's just my two cents, like the work you're doing!

Eric Price and others added 2 commits June 5, 2026 20:58
…h SPM

Replace the static AsymmetricCusumBoundary with AdaptivePoissonCusum,
which selects the boundary based on the miner's configured share rate:

- Below SPM 10: PoissonCI (prevents overshoot on sparse data)
- At SPM 10+: AsymmetricCUSUM (fast reaction with abundant evidence)

This eliminates the FullRemedy vs VardiffState trade-off — PoissonCI
was better at low SPM (bitaxe/small miners) while CUSUM was better at
high SPM (large hashrate). The adaptive boundary gets both.
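
The selection rule amounts to a single comparison against the configured share rate. A minimal sketch with illustrative names (`BoundaryKind`, `select_boundary`); the real `AdaptivePoissonCusum` presumably wraps two boundary implementations rather than returning a tag.

```rust
/// Which boundary strategy handles a given channel (illustrative).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BoundaryKind {
    PoissonCi,       // conservative: sparse data, avoid overshoot
    AsymmetricCusum, // reactive: abundant evidence
}

/// Selects a boundary from the configured shares-per-minute target.
/// The commit message uses a threshold of 10 SPM.
pub fn select_boundary(configured_spm: f64, spm_threshold: f64) -> BoundaryKind {
    if configured_spm < spm_threshold {
        BoundaryKind::PoissonCi
    } else {
        BoundaryKind::AsymmetricCusum
    }
}
```

Keying the choice on the configured rate rather than the realized rate means the boundary does not flip back and forth as the estimate fluctuates.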

Also caps AcceleratingPartialRetarget η at 0.4 (was 0.6). The lower
cap prevents cold-start overshoot while still accelerating convergence
after step changes. Parameter sweep (compare_out7/) confirmed this is
the Pareto-optimal cap.

New components:
- AdaptivePoissonCusum: SPM-threshold boundary selector
- GuardedAccelRetarget: acceleration only after first direction reversal
  (experimental, not used in production)
- sweep-adaptive.rs: parameter sweep binary for boundary tuning

Simulation results vs previous production (comprehensive fitness):
  SPM 4:  0.706 vs 0.636 (+11%)
  SPM 8:  0.771 vs 0.737 (+5%)
  SPM 10: 0.783 vs 0.774 (+1%)
  SPM 15: 0.810 vs 0.803 (+1%)
  SPM 20: 0.894 vs 0.882 (+1%)
  SPM 25: 0.903 vs 0.894 (+1%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…estimator equivalence test

Adds two follow-up investigations to CKPOOL_INVESTIGATION.md:

1. Hysteresis boundary sweep: tested band widths from [0.5,1.33] through
   [0.9,1.1] with varying data gates and update rules. Conclusion:
   hysteresis achieves excellent reaction rates (96-100%) but at 10×
   jitter cost vs statistical boundaries. No parameterization achieves
   competitive comprehensive fitness.

2. Estimator equivalence test (revised): paired CkpoolEstimator with the
   production-tuned boundary (AdaptivePoissonCusum) and update
   (AcceleratingPartialRetarget). Result: NOT equivalent — the per-share
   decay_time() simulation is noisier per tick than batch EWMA, and this
   difference is amplified by CUSUM's aggressive firing. EwmaEstimator(120s)
   is genuinely better, not just simpler.

Also adds ckpool estimator + hysteresis variants to compare-algorithms
binary for reproducibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gimballock

Author

@paratoxicdev You're right on both points — I should have posted my actual findings rather than a raw Claude transcript.

On scalability: You're correct. Each channel owns its own Vardiff — no shared state, no contention. The FP cost of decay_time() is trivial next to deserialize + SHA256d. Timer-based in SRI is an architectural choice (the pool already runs a 60s tick loop), not a performance win.

On "strictly better": Also wrong. The reality is more nuanced than either of us stated — I'll explain below.

What I found after reading your vardiff.rs + decay.rs:

I ported ckpool's algorithm and benchmarked it across SPM 4–30. Full writeup in commit 6aa27a9, updated in b77eff8.

On your code specifically:

  1. Your single-window DecayingAverage is simpler than ckpool's dual-window. My sweep confirmed — once τ_long is shortened to match our tick cadence, the dual-window switching adds nothing. You already knew this.

  2. Your time_bias uses time_since_first (miner lifetime), not time since last retarget. This is critical — for any miner running >5 minutes, bias ≈ 1.0 and it's effectively inert. When I naively reset bias on each retarget (as our pipeline's on_fire() does), I got 1 / (1 - e^(-60/300)) ≈ 5.5× amplification on the first tick after every difficulty change — catastrophic overshoot (177–275% settled accuracy). The fix was to simulate per-share decay_time() calls within each tick, letting the EMA warm up organically.

  3. Your value_at() (read-without-mutate) is a clean pattern — prevents observation from altering state.
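
The amplification figure in point 2 follows directly from the warm-up correction factor. A small sketch that evaluates it, using the formula quoted above; the function name `time_bias` is illustrative.

```rust
/// Warm-up correction factor for an EMA with time constant `tau_secs`
/// after `elapsed_secs` of data: 1 / (1 - e^(-elapsed / tau)).
pub fn time_bias(elapsed_secs: f64, tau_secs: f64) -> f64 {
    1.0 / (1.0 - (-elapsed_secs / tau_secs).exp())
}
```

`time_bias(60.0, 300.0)` evaluates to roughly 5.5, matching the first-tick amplification seen when the bias clock is reset on every retarget. The factor decays toward 1.0 as the elapsed time grows to several multiples of the time constant, which is why measuring from the miner's first share makes it inert for long-running miners.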

Why we didn't use ckpool's components in production:

I tested all three stages of the pipeline (estimator, boundary, update) independently:

Estimator: After correct porting via per-share simulation, CkpoolEstimator initially appeared equivalent to EwmaEstimator(120s) — they matched within ~5% under a conservative boundary (PoissonCI). But when I paired both with our production-tuned boundary (which uses CUSUM for high-SPM miners), the ckpool estimator underperformed significantly:

| Composition | SPM 4 | SPM 8 | SPM 12 | SPM 20 | SPM 30 |
| --- | --- | --- | --- | --- | --- |
| EwmaEstimator(120s) + tuned boundary | 0.689 | 0.768 | 0.787 | 0.882 | 0.869 |
| CkpoolEstimator(60,300) + same boundary | 0.698 | 0.638 | 0.696 | 0.777 | 0.858 |
| CkpoolEstimator(60,120) + same boundary | 0.598 | 0.688 | 0.764 | 0.805 | 0.858 |

The per-share decay_time() simulation running N decay steps per tick introduces more per-tick noise than a single batch EWMA update. Under the conservative PoissonCI boundary this noise is masked (the boundary requires large deviations to fire). Under CUSUM (which fires on smaller accumulated deviations), the extra noise triggers more false fires → higher jitter → lower fitness. So the evaluation cadence isn't purely scheduling — the numerical stability of the estimate depends on how the information is processed.

Boundary: I swept hysteresis from your native [0.5, 1.33] through [0.9, 1.1] with varying data gates:

| Boundary | SPM 4 | SPM 10 | SPM 20 | SPM 30 | Jitter (SPM 10) |
| --- | --- | --- | --- | --- | --- |
| AdaptivePoissonCusum (production) | 0.689 | 0.774 | 0.882 | 0.869 | 0.060/min |
| Hyst [0.7, 1.3] | 0.621 | 0.695 | 0.666 | 0.608 | 0.043/min |
| Hyst [0.8, 1.2] | 0.572 | 0.675 | 0.723 | 0.726 | 0.152/min |
| Hyst [0.85, 1.15] | 0.578 | 0.634 | 0.706 | 0.730 | 0.254/min |
| Hyst [0.5, 1.33] (native) | 0.630 | 0.542 | 0.508 | 0.504 | 0.005/min |

Narrower bands get excellent reaction rates (96–100%, better than anything else), but at 5–10× jitter cost. The fundamental issue: hysteresis fires whenever the rate ratio crosses the band, with no evidence accumulation. Statistical boundaries (PoissonCI, CUSUM) require cumulative evidence before firing, so they distinguish real changes from noise.
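
The difference between a band check and evidence accumulation can be made concrete with a small sketch. The hysteresis gate below is a plain band test; the `Cusum` type is a textbook two-sided CUSUM, not the PR's `CusumBoundary`, and its `slack` and `threshold` values in the usage note are hypothetical.

```rust
/// Hysteresis gate: fires as soon as realized/target leaves [low, high].
pub fn hysteresis_fires(rate_ratio: f64, low: f64, high: f64) -> bool {
    rate_ratio < low || rate_ratio > high
}

/// Two-sided CUSUM over per-tick deviations (rate_ratio - 1.0).
pub struct Cusum {
    slack: f64,     // per-tick deviation treated as noise
    threshold: f64, // accumulated evidence required to fire
    pos: f64,       // evidence that the miner is faster than target
    neg: f64,       // evidence that the miner is slower than target
}

impl Cusum {
    pub fn new(slack: f64, threshold: f64) -> Self {
        Self { slack, threshold, pos: 0.0, neg: 0.0 }
    }

    /// Feeds one tick; returns true when accumulated evidence in either
    /// direction crosses the threshold, then resets both sums.
    pub fn update(&mut self, rate_ratio: f64) -> bool {
        let dev = rate_ratio - 1.0;
        self.pos = (self.pos + dev - self.slack).max(0.0);
        self.neg = (self.neg - dev - self.slack).max(0.0);
        let fired = self.pos > self.threshold || self.neg > self.threshold;
        if fired {
            self.pos = 0.0;
            self.neg = 0.0;
        }
        fired
    }
}
```

With `Cusum::new(0.05, 0.5)`, a single tick at a rate ratio of 0.8 does not fire, while a narrow hysteresis band of [0.85, 1.15] fires on that same tick. Four consecutive ticks at 0.8 do cross the CUSUM threshold, which is the sense in which it separates a sustained change from one noisy sample.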

Update: ckpool's full retarget (η=1.0) overshoots in the tick framework. Our AcceleratingPartialRetarget (η ramps 0.2→0.4) achieves the same convergence speed without overshoot.

What we ended up with:

The investigation led to a new production composition that adapts its boundary strategy based on the miner's share rate — conservative (PoissonCI) for low-SPM miners where data is sparse, aggressive (CUSUM) for high-SPM miners where evidence is abundant. This is analogous to what ckpool does with its dual-window adaptive switching, just at the boundary layer instead of the estimator layer. Your algorithm's idea of "be conservative when data is sparse, aggressive when data is abundant" is exactly right — we just implement it differently.

| SPM | Old VardiffState | New Production | Improvement |
| --- | --- | --- | --- |
| 4 | 0.636 | 0.706 | +11% |
| 8 | 0.737 | 0.771 | +5% |
| 12 | 0.787 | 0.793 | +1% |
| 20 | 0.882 | 0.894 | +1% |
| 30 | 0.869 | 0.874 | +1% |

CkpoolEstimator is preserved in the simulation grid as a benchmark contender — it runs alongside every other algorithm in our characterization suite and any future changes are measured against it.

@gimballock

Author

If you have any ideas or suggestions to improve the comparison, let me know.

The field was documented as a share count but actually used as an SPM
threshold — align the name with the semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 9, 2026
Clean extraction of the best-performing vardiff algorithm from the
simulation framework in stratum-mining#2154, with all test scaffolding, traits, and
alternative algorithm implementations removed.

The previous VardiffState used a fixed time-dependent threshold ladder
and full retarget. This produced:

- 6.6% median settled error (p99: 30% at low SPM)
- 5–9 minute cold-start convergence (p90)
- 33% detection rate for 10% hashrate declines (thermal throttle, failing ASICs)
- 28% target overshoot during cold-start ramp (p99 at SPM 6)

The new algorithm (EWMA + adaptive boundary + accelerating partial retarget):

- Settled accuracy: <3% median error across all SPM
- Cold-start overshoot bounded to <10% (was 28%)
- Jitter: 0.03 fires/min at low SPM (was 0.06) — half the unnecessary retargets
- Small-change detection: 85% reaction to -10% steps at SPM 6 (was 33%)
- Transient disconnects recover in 1–2 fires rather than requiring a full
  cold-start ramp (20%/fire partial retarget vs old algo's 50–67% slash)
- Asymmetric cost: loosening fires 3x faster than tightening, because
  loosening is free but tightening rejects in-flight shares

Breaking: adds private fields to VardiffState (previously all-pub).
Requires channels_sv2 major version bump. Public constructor API
(new, new_with_min) and Vardiff trait interface are unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add AsymmetricPoissonCI: applies a tighten_multiplier to the PoissonCI
threshold when the miner is over-performing, reflecting the asymmetric
cost of tightening (rejects in-flight shares) vs loosening (free).

Parameter sweep results (500 trials/cell, SPM 4-30):

  Comprehensive fitness at low SPM (where PoissonCI is active):
    symmetric (t=1.0): SPM4=0.706, SPM6=0.744, SPM8=0.758
    t=1.5:             SPM4=0.743, SPM6=0.771, SPM8=0.784
    t=2.0:             SPM4=0.799, SPM6=0.821, SPM8=0.801
    t=3.0:             SPM4=0.858, SPM6=0.872, SPM8=0.882

  t=3.0 matches CUSUM's tighten_multiplier and is the clear winner
  (+21% at SPM 4, +17% at SPM 6). SPM 10+ is unchanged (CUSUM active).

New files:
- AsymmetricPoissonCI in boundary.rs
- sweep-asymmetric-poisson.rs: parameter sweep binary
- asymmetric_poisson_sweep.md: sweep results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 10, 2026
gimballock pushed a commit to marafoundation/stratum that referenced this pull request Jun 10, 2026