feat: add cast_value and scale_offset codecs#3874
d-v-b wants to merge 26 commits into zarr-developers:main from feat/scale-offset-cast-value
Conversation
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec. The `cast_value` codec requires an optional dependency on the `cast-value-rs` package.
Codecov Report

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #3874    +/-  ##
=======================================
+ Coverage   93.11%   93.27%   +0.16%
=======================================
  Files          85       87       +2
  Lines       11365    11672     +307
=======================================
+ Hits        10582    10887     +305
- Misses        783      785       +2
```
the coverage gaps are spurious because some of these routines only run when the optional dependency is installed
👋 This is great. With this codec we could reduce the size of our stored dataset 🙏
maxrjones left a comment:
Nice, thanks @d-v-b! A few comments:
protect against data corruption
I think you need to protect against integer overflow/underflow. Here are a couple of tests that demonstrate corrupted data:
```python
# note: the ScaleOffset import path below is assumed; adjust to wherever
# this PR exposes the codec.
import numpy as np
import pytest

import zarr
from zarr.codecs import ScaleOffset


def test_encode_uint8_underflow_should_error():
    """uint8(0) - offset(1) wraps to 255, then * 2 = 254. Decode gives 128, not 0."""
    arr = zarr.create_array(
        store={},
        shape=(3,),
        dtype="uint8",
        chunks=(3,),
        filters=[ScaleOffset(offset=1, scale=2)],
        compressors=None,
        fill_value=1,
    )
    data = np.array([0, 1, 2], dtype="uint8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data


def test_encode_int8_overflow_should_error():
    """int8(50) * scale(3) = 150 wraps to -106. Decode gives -36, not 50."""
    arr = zarr.create_array(
        store={},
        shape=(1,),
        dtype="int8",
        chunks=(1,),
        filters=[ScaleOffset(offset=0, scale=3)],
        compressors=None,
        fill_value=0,
    )
    data = np.array([50], dtype="int8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data
```
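For reference, the wraparound these tests catch can be reproduced directly in numpy, which wraps fixed-width integer arithmetic silently rather than erroring:

```python
import numpy as np

# The silent wraparound the tests above guard against:
# encode is (x - offset) * scale, done here in uint8 arithmetic.
x = np.array([0], dtype="uint8")
encoded = (x - np.uint8(1)) * np.uint8(2)       # 0 - 1 wraps to 255; 255 * 2 wraps to 254
decoded = encoded // np.uint8(2) + np.uint8(1)  # 254 // 2 + 1 = 128, not the original 0
print(int(encoded[0]), int(decoded[0]))         # 254 128
```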
user-guide documentation
I think it'd be worth having some documentation page for users coming from other data formats, since they could easily mix up the scale and offset parameters. Here's a summary:
The main ones across scientific/geospatial data:

- CF conventions (NetCDF): `unpacked = packed * scale_factor + add_offset`. Scale means "multiply when unpacking".
- GDAL/GeoTIFF: `value = raw * scale + offset`. Same formula as CF; GDAL uses this for raster band transformations.
- HDF-EOS5: `value = scale_factor * (stored - offset)`. Subtracts the offset before scaling, unlike CF, which adds the offset after scaling. This is a common source of confusion between HDF-EOS and CF.
- GRIB (WMO): `value = reference_value + (raw * 2^binary_scale_factor) / 10^decimal_scale_factor`. Two scale factors (binary and decimal) plus a reference value; much more complex.
- numcodecs `FixedScaleOffset` (zarr v2): encode `round((x - offset) * scale)`, decode `x / scale + offset`. Same formula as the new zarr spec, plus explicit rounding on encode.
- Zarr v3 `scale_offset`: encode `(x - offset) * scale`, decode `x / scale + offset`. Same as numcodecs but without built-in rounding (that's now delegated to `cast_value`).
- ML quantization (e.g., TensorFlow Lite): `real = (quantized - zero_point) * scale`. Scale means "multiply when unpacking" (like CF); `zero_point` is the integer that maps to real zero.
The key axes of variation:

| Format | Scale means | Order | Rounding |
|---|---|---|---|
| CF / GDAL / ML quant | multiply to unpack | scale then offset | implicit |
| HDF-EOS5 | multiply to unpack | offset then scale | implicit |
| Zarr v3 / numcodecs | multiply to pack | offset then scale | explicit (`cast_value`) |
| GRIB | two scale factors | its own thing | N/A |
The biggest gotcha is that CF's `scale_factor=0.1` means the same thing as zarr's `scale=10`. Users migrating CF data need to invert the scale value.
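To make the inversion concrete, here is a minimal sketch of a CF-to-zarr parameter conversion (the `cf_to_zarr` helper is hypothetical, not part of this PR):

```python
import numpy as np

# Hypothetical helper: CF unpacks with packed * scale_factor + add_offset,
# while the zarr v3 scale_offset codec decodes with x / scale + offset,
# so the scale must be inverted and the offset carried over as-is.
def cf_to_zarr(scale_factor: float, add_offset: float) -> dict:
    return {"scale": 1 / scale_factor, "offset": add_offset}

params = cf_to_zarr(scale_factor=0.1, add_offset=20.0)  # {'scale': 10.0, 'offset': 20.0}

# Both formulas recover the same physical values from the packed integers.
packed = np.array([5, 100], dtype="int16")
cf_unpacked = packed * 0.1 + 20.0
zarr_decoded = packed / params["scale"] + params["offset"]
assert np.allclose(cf_unpacked, zarr_decoded)
```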
sustainability
Will you add me or the zarr team as a maintainer on PyPI for cast-value-rs if this is merged?
```python
offset = self._to_scalar(self.offset, chunk_spec.dtype)
scale = self._to_scalar(self.scale, chunk_spec.dtype)
if np.issubdtype(arr.dtype, np.integer):
    result = (arr // scale) + offset  # type: ignore[operator]
```
per the spec, should this use `/` and error if the result isn't representable in the data type?
Yes, I fixed this in recent commits. The checks are much more stringent, and the code is quite a bit slower. We are going against the grain of numpy here, as numpy defaults to mathematically sensible operations at the expense of silent casting. If this is a performance problem worth fixing, we might be looking at a scale-offset-rs situation :/
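For illustration, the kind of stringent check being described might look like this numpy sketch (the function name and error type are hypothetical, not the PR's actual implementation):

```python
import numpy as np

def checked_decode(arr: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # Decode with true division, then verify the result survives a round trip
    # back through the integer dtype instead of silently flooring or wrapping.
    exact = arr / scale + offset        # promotes to float64
    result = exact.astype(arr.dtype)    # lossy cast back to the chunk dtype
    if not np.array_equal(result.astype(exact.dtype), exact):
        raise ValueError("decoded values are not representable in the chunk dtype")
    return result

checked_decode(np.array([4, 6], dtype="uint8"), scale=2, offset=1)  # ok: [3, 4]
# checked_decode(np.array([3], dtype="uint8"), scale=2, offset=0)   # raises ValueError
```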
```toml
[tool.hatch.envs.cast-value]
template = "test"
features = ["cast-value-rs"]

[[tool.hatch.envs.cast-value.matrix]]
python = ["3.12"]

[tool.hatch.envs.cast-value.scripts]
run = "pytest tests/test_codecs/test_cast_value.py {args:}"
```
these should be tested in the CI
Good catch. I added cast-value-rs to the optional branch of our test matrix, which should cover this, and I removed the cast-value-rs-specific hatch env.
Absolutely! I made you (via the zarr-python core devs team) admins of cast-value-rs on PyPI.
I totally agree, but I wonder if this should be a separate PR? I don't know those conventions well at all, and I worry I would mess something up.
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec. The `cast_value` codec requires an optional dependency on the `cast-value-rs` package. The reason for bringing in Rust here is that it gives us better average speed and much better memory usage than a numpy implementation. Happy to show this with benchmarks if needed.

fwiw, I would rather not define these codecs in zarr-python. In fact I already defined these codec classes in another package. But it's not possible to make zarr-python depend on an externally-defined codec without creating a circular dependency, so here we are. See discussion in the related issue.

Related, and possibly closes #3867.

Edit: if depending on a new optional package for `cast_value` is too much, then we should revisit #3774. The conclusion there was that defining the `cast_value` codec in zarr-python itself was too much. And if the conclusion is that both defining the codec externally is too much and defining it internally is too much, then we should do something like deprecate `numcodecs.fixedscaleoffset` and direct people to the python packages that implement its replacement.