numcodecs improvement plan #705

d-v-b · 2025-02-12T10:16:18Z

d-v-b
Feb 12, 2025
Maintainer

The zarr v2 / v3 transition has highlighted some room for improvement in numcodecs. I would like to propose the following changes to this library, some of them breaking:

type annotations
Type annotations are largely absent from this library, which makes it hard to integrate numcodecs into heavily annotated libraries, namely zarr-python 3.x. I don't think any of the codec methods are particularly hard to annotate, but the existence of the pickle codec means that we are committed to accepting object as an input / output type in the abc. We should consider deprecating pickle for this and other reasons. Two main groups of methods would benefit from annotations: encode / decode, so consumers can know what kind of data goes into and comes out of a codec, and get_config / from_config, so users can know how a codec will serialize to / from JSON. Relevant PRs: (chore) type hints for tests #698, (chore) Type hints for abc codec codec_id attribute #702, (chore) Type hints for GZip #701
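To make the two groups of methods concrete, here is a minimal sketch of what an annotated codec interface could look like. It is not the current numcodecs abc: `TypedCodec`, `BytesLike`, and `ZlibSketch` are hypothetical names invented for illustration, and restricting inputs to bytes-like objects is an assumption that would only hold if object-accepting codecs such as pickle were deprecated.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Union

# Hypothetical alias: the bytes-like inputs a typed codec might accept.
BytesLike = Union[bytes, bytearray, memoryview]


class TypedCodec(ABC):
    """Hypothetical annotated base class (not the real numcodecs abc)."""

    codec_id: ClassVar[str]

    @abstractmethod
    def encode(self, buf: BytesLike) -> bytes:
        """Encode a bytes-like buffer and return the encoded bytes."""

    @abstractmethod
    def decode(self, buf: BytesLike) -> bytes:
        """Decode a bytes-like buffer and return the decoded bytes."""

    def get_config(self) -> dict[str, Any]:
        # v2-style JSON form: the codec id plus any parameters.
        return {"id": self.codec_id}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "TypedCodec":
        # Drop the id and pass the remaining keys to the constructor.
        params = {k: v for k, v in config.items() if k != "id"}
        return cls(**params)


class ZlibSketch(TypedCodec):
    """Toy codec built on the standard-library zlib module."""

    codec_id = "zlib-sketch"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def encode(self, buf: BytesLike) -> bytes:
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf: BytesLike) -> bytes:
        return zlib.decompress(bytes(buf))

    def get_config(self) -> dict[str, Any]:
        return {"id": self.codec_id, "level": self.level}
```

With signatures like these, a type checker can tell a consumer that `encode` returns `bytes` and that `get_config` returns a JSON-style dict, which is the information zarr-python 3.x currently has to guess at.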
package layout
Everything is in the same module namespace, including the abc, and each codec is a single .py or .pyx file. Let's organize this a bit: put all the codecs in their own module, separate from the abc, and give each codec its own module (e.g. put blosc.pyx inside a module called blosc). Let's also pull the tests out of the source directory and use a src/numcodecs layout. Relevant issues: copy zarr-python dev environment #697
build
numcodecs does not use a declarative build process: compiling various codecs depends on libraries that are not declared in pyproject.toml, and requires creating a conda environment. We can do better than this. I know we can use pixi to declare our compiler dependencies in pyproject.toml, but there might be other tools that would work. Relevant issues: declarative build #703
v2 and v3 codec serialization
Zarr v2 defines the JSON serialization of a codec as {'id': <str>, **config}. Zarr v3 defines the JSON serialization of a codec as {'name': <str>, 'configuration': {**config}}. As far as I know, this is the only material difference between v2 and v3 codecs, so the exact same codec class should work for both zarr v2 and zarr v3. We just need a way for users in a v2 context to use the v2 serialization, and likewise for v3. I can think of a few solutions here. Relevant issues: Integrate Zarr v3 compatibility module with registry. #699, formalize old and new styles of json serialization #686
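As a rough illustration of how small the difference is, the two JSON shapes described above can be converted into each other with a pair of helper functions. This is only a sketch: `v2_to_v3` and `v3_to_v2` are hypothetical names, not existing numcodecs or zarr-python functions, and the sketch assumes the codec identifier is carried over unchanged between the two forms.

```python
from typing import Any


def v2_to_v3(config: dict[str, Any]) -> dict[str, Any]:
    """Convert {'id': ..., **config} to {'name': ..., 'configuration': {...}}."""
    params = dict(config)      # copy so the caller's dict is not mutated
    name = params.pop("id")    # raises KeyError if the v2 config has no id
    return {"name": name, "configuration": params}


def v3_to_v2(meta: dict[str, Any]) -> dict[str, Any]:
    """Convert {'name': ..., 'configuration': {...}} to {'id': ..., **config}."""
    # A missing 'configuration' key is treated as "no parameters".
    return {"id": meta["name"], **meta.get("configuration", {})}


v2_config = {"id": "gzip", "level": 5}
v3_meta = v2_to_v3(v2_config)
```

One possible design would be for a codec to keep a single internal parameter dict and expose both forms, so that the serialization style is chosen by the caller's context and not by the codec class.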

I think we should be open to breaking changes here.
