feat: add cast_value and scale_offset codecs#3874
d-v-b wants to merge 26 commits into zarr-developers:main from feat/scale-offset-cast-value
Conversation
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec. The `cast_value` codec requires an optional dependency on the `cast-value-rs` package.
Codecov Report

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #3874    +/-  ##
=======================================
+ Coverage   93.11%   93.27%   +0.16%
=======================================
  Files          85       87       +2
  Lines       11365    11672     +307
=======================================
+ Hits        10582    10887     +305
- Misses        783      785       +2
```
the coverage gaps are spurious because some of these routines only run when the optional dependency is installed
👋 This is great. With this codec we could reduce the size of our stored dataset 🙏
maxrjones left a comment:
Nice, thanks @d-v-b! A few comments:
protect against data corruption
I think you need to protect against integer overflow/underflow. Here are a couple of tests that demonstrate corrupted data:
```python
# note: the ScaleOffset import path below is assumed; adjust to wherever
# this PR exposes the codec.
import numpy as np
import pytest

import zarr
from zarr.codecs import ScaleOffset


def test_encode_uint8_underflow_should_error():
    """uint8(0) - offset(1) wraps to 255, then * 2 = 254. Decode gives 128, not 0."""
    arr = zarr.create_array(
        store={},
        shape=(3,),
        dtype="uint8",
        chunks=(3,),
        filters=[ScaleOffset(offset=1, scale=2)],
        compressors=None,
        fill_value=1,
    )
    data = np.array([0, 1, 2], dtype="uint8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data


def test_encode_int8_overflow_should_error():
    """int8(50) * scale(3) = 150 wraps to -106. Decode gives -36, not 50."""
    arr = zarr.create_array(
        store={},
        shape=(1,),
        dtype="int8",
        chunks=(1,),
        filters=[ScaleOffset(offset=0, scale=3)],
        compressors=None,
        fill_value=0,
    )
    data = np.array([50], dtype="int8")
    with pytest.raises((ValueError, OverflowError)):
        arr[:] = data
```
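For reference, the wraparound these tests catch can be reproduced directly in numpy, which wraps fixed-width integer arithmetic silently rather than erroring:

```python
import numpy as np

# The silent wraparound the tests above guard against:
# encode is (x - offset) * scale, done here in uint8 arithmetic.
x = np.array([0], dtype="uint8")
encoded = (x - np.uint8(1)) * np.uint8(2)       # 0 - 1 wraps to 255; 255 * 2 wraps to 254
decoded = encoded // np.uint8(2) + np.uint8(1)  # 254 // 2 + 1 = 128, not the original 0
print(int(encoded[0]), int(decoded[0]))         # 254 128
```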
user-guide documentation
I think it'd be worth having some documentation page for users coming from other data formats, since they could easily mix up the scale and offset parameters. Here's a summary:
The main ones across scientific/geospatial data:

- CF conventions (NetCDF): `unpacked = packed * scale_factor + add_offset`. Scale means "multiply when unpacking".
- GDAL/GeoTIFF: `value = raw * scale + offset`. Same formula as CF; GDAL uses this for raster band transformations.
- HDF-EOS5: `value = scale_factor * (stored - offset)`. Subtracts the offset before scaling, unlike CF, which adds the offset after scaling. This is a common source of confusion between HDF-EOS and CF.
- GRIB (WMO): `value = reference_value + (raw * 2^binary_scale_factor) / 10^decimal_scale_factor`. Two scale factors (binary and decimal) plus a reference value; much more complex.
- numcodecs `FixedScaleOffset` (zarr v2): encode `round((x - offset) * scale)`, decode `x / scale + offset`. Same formula as the new zarr spec, plus explicit rounding on encode.
- Zarr v3 `scale_offset`: encode `(x - offset) * scale`, decode `x / scale + offset`. Same as numcodecs but without built-in rounding (that's now delegated to `cast_value`).
- ML quantization (e.g., TensorFlow Lite): `real = (quantized - zero_point) * scale`. Scale means "multiply when unpacking" (like CF); `zero_point` is the integer that maps to real zero.
The key axes of variation:

| Format | Scale means | Order | Rounding |
|---|---|---|---|
| CF / GDAL / ML quant | multiply to unpack | scale then offset | implicit |
| HDF-EOS5 | multiply to unpack | offset then scale | implicit |
| Zarr v3 / numcodecs | multiply to pack | offset then scale | explicit (`cast_value`) |
| GRIB | two scale factors | its own thing | N/A |
The biggest gotcha is that CF's `scale_factor=0.1` means the same thing as zarr's `scale=10`. Users migrating CF data need to invert the scale value.
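To make the inversion concrete, here is a minimal sketch of a CF-to-zarr parameter conversion (the `cf_to_zarr` helper is hypothetical, not part of this PR):

```python
import numpy as np

# Hypothetical helper: CF unpacks with packed * scale_factor + add_offset,
# while the zarr v3 scale_offset codec decodes with x / scale + offset,
# so the scale must be inverted and the offset carried over as-is.
def cf_to_zarr(scale_factor: float, add_offset: float) -> dict:
    return {"scale": 1 / scale_factor, "offset": add_offset}

params = cf_to_zarr(scale_factor=0.1, add_offset=20.0)  # {'scale': 10.0, 'offset': 20.0}

# Both formulas recover the same physical values from the packed integers.
packed = np.array([5, 100], dtype="int16")
cf_unpacked = packed * 0.1 + 20.0
zarr_decoded = packed / params["scale"] + params["offset"]
assert np.allclose(cf_unpacked, zarr_decoded)
```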
sustainability
Will you add me or the zarr team as a maintainer on PyPI for cast-value-rs if this is merged?
```python
offset = self._to_scalar(self.offset, chunk_spec.dtype)
scale = self._to_scalar(self.scale, chunk_spec.dtype)
if np.issubdtype(arr.dtype, np.integer):
    result = (arr // scale) + offset  # type: ignore[operator]
```
per the spec, should this use `/` and error if the result isn't representable in the data type?
Yes, I fixed this in recent commits. The checks are much more stringent, and the code is quite a bit slower. We are going against the grain of numpy here, as numpy defaults to mathematically sensible operations at the expense of silent casting. If this is a performance problem worth fixing, we might be looking at a scale-offset-rs situation :/
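For illustration, the kind of stringent check being described might look like this numpy sketch (the function name and error type are hypothetical, not the PR's actual implementation):

```python
import numpy as np

def checked_decode(arr: np.ndarray, scale: float, offset: float) -> np.ndarray:
    # Decode with true division, then verify the result survives a round trip
    # back through the integer dtype instead of silently flooring or wrapping.
    exact = arr / scale + offset        # promotes to float64
    result = exact.astype(arr.dtype)    # lossy cast back to the chunk dtype
    if not np.array_equal(result.astype(exact.dtype), exact):
        raise ValueError("decoded values are not representable in the chunk dtype")
    return result

checked_decode(np.array([4, 6], dtype="uint8"), scale=2, offset=1)  # ok: [3, 4]
# checked_decode(np.array([3], dtype="uint8"), scale=2, offset=0)   # raises ValueError
```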
```toml
[tool.hatch.envs.cast-value]
template = "test"
features = ["cast-value-rs"]

[[tool.hatch.envs.cast-value.matrix]]
python = ["3.12"]

[tool.hatch.envs.cast-value.scripts]
run = "pytest tests/test_codecs/test_cast_value.py {args:}"
```
these should be tested in the CI
Good catch. I added cast-value-rs to the optional branch of our test matrix, which should cover this, and I removed the cast-value-rs-specific hatch env.
Absolutely! I made you (via the zarr-python core devs team) admins of cast-value-rs on PyPI.
I totally agree, but I wonder if this should be a separate PR? I don't know those conventions well at all, and I worry I would mess something up.
Defines two new codecs that together provide a v3-native replacement for the existing `numcodecs.fixedscaleoffset` codec. The `cast_value` codec requires an optional dependency on the `cast-value-rs` package. The reason for bringing in Rust here is that it gives us better average speed and much better memory usage than a numpy implementation. Happy to show this with benchmarks if needed.

fwiw, I would rather not define these codecs in zarr-python. In fact I already defined these codec classes in another package. But it's not possible to make zarr-python depend on an externally-defined codec without creating a circular dependency, so here we are. See discussion in the related issue.

Related, and possibly closes #3867.

Edit: if depending on a new optional package for `cast_value` is too much, then we should revisit #3774. The conclusion there was that defining the `cast_value` codec in zarr-python itself was too much. And if the conclusion is that both defining the codec externally is too much and defining it internally is too much, then we should do something like deprecate `numcodecs.fixedscaleoffset` and direct people to the python packages that implement its replacement.