[WIP] Use Kahan (Neumaier) summation for float sums#8373
[WIP] Use Kahan (Neumaier) summation for float sums#8373dimitarvdimitrov wants to merge 3 commits into
Conversation
Add f64 cases to the aggregate_sum and aggregate_grouped benchmarks to establish a baseline for upcoming float summation changes. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
SumState::Float now carries a KahanSum (running sum plus compensation term) instead of a bare f64, so float sum aggregates accumulate with Neumaier's variant of Kahan summation. The compensation update is skipped when the addition produces a non-finite value, keeping the existing overflow-to-infinity and inf + -inf => NaN (saturated) semantics. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Float sum partials are now a struct {sum: f64, compensation: f64}
instead of a plain f64, so compensated-summation precision survives
partial flush/combine boundaries (chunked arrays, kernel partials,
accumulator merges).
- Sum::partial_dtype diverges from return_dtype for floats;
to_scalar emits the struct, finalize/finalize_scalar collapse it
back to sum + compensation.
- Sum::combine_partials accepts both the struct partial and a plain
f64 (constant-multiply path).
- The legacy stats bridge in Accumulator::accumulate now falls
through on a partial dtype mismatch instead of erroring, and
Sum::try_accumulate consumes the cached f64 Stat::Sum for floats.
Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Merging this PR will degrade performance by 20.43%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | sum_f64_all_valid |
294.2 µs | 509.8 µs | -42.28% |
| ❌ | Simulation | sum_f64_nulls_clustered |
629.1 µs | 1,033.5 µs | -39.13% |
| ❌ | Simulation | sum_f64 |
711.9 µs | 1,105.9 µs | -35.63% |
| ❌ | Simulation | sum_f64_clustered_nulls |
553.3 µs | 768.5 µs | -28% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.0)] |
845.5 µs | 1,024.6 µs | -17.48% |
| ❌ | Simulation | decompress_rd[f32, (100000, 0.0)] |
499.1 µs | 587.2 µs | -15% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] |
244.4 ns | 273.6 ns | -10.66% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] |
309 µs | 272.8 µs | +13.24% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] |
361 µs | 326.3 µs | +10.63% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing mitko/kahan-float-aggregations (86623b1) with mitko/sum/f64-bench (4c8059c)
| Some(match &return_dtype { | ||
| // Float partials carry the Kahan compensation term so that precision is preserved | ||
| // when partial sums are flushed and re-combined. | ||
| DType::Primitive(ptype, _) if ptype.is_float() => float_partial_dtype(), |
There was a problem hiding this comment.
CC @joseph-isaacs - but this would be a breaking change for serialized SUM statistics if we had started to serialize them in this way.
Do other engines use this?
There was a problem hiding this comment.
In general, we should treat Vortex as a collection of physical kernels that some query engine may resolve to based on the engine's own semantics.
Therefore, this should be its own aggregate function. And DuckDB / DataFusion aggregate push-down can pick this one if it needs to have matching execution semantics.
There was a problem hiding this comment.
Datafusion supports reporting Sum statistic, that was blocked though until 54 that released this week since sum stats were very slow
There was a problem hiding this comment.
we spoke with Joe offline too. Keeping backwards compat would make this change more complicated.
the only one I could find was duckdb which has a dedicated fsum (aka sum_kahan) function. Because of that I think it's ok to keep a less accurate sum aggregation
Stacked on #8367 so the CodSpeed run shows the cost of compensated summation against the freshly-added benchmarks.
In draft because we probably don't want to eat the perf cost
What this pr does
SumState::Floatnow holds aKahanSum(running sum + compensation term) implementing Neumaier's variant of Kahan compensated summation. When an addition produces a non-finite value the compensation update is skipped, preserving the existing overflow-to-infinity andinf + -inf => NaN(saturated) semantics.{sum: f64, compensation: f64}instead of a plainf64, so precision survives partial flush/combine boundaries (chunked arrays, kernel partials, accumulator merges):Sum::combine_partialsaccepts both the struct partial and a plainf64.Accumulator::accumulatefalls through on a partial dtype mismatch instead of erroring, andSum::try_accumulateconsumes the cached f64Stat::Sumfor floats so the stats short-circuit is preserved.