Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pcodec #501

Merged
merged 14 commits into from
Feb 24, 2024
Merged

Add pcodec #501

merged 14 commits into from
Feb 24, 2024

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Jan 24, 2024

This PR adds an exciting new codec to numcodecs: pcodec.

From the point of view of the V3 codec spec pcodec is an array -> bytes. It takes a numpy array of dtype i4, i8, u4, u8, f4, f8 and turns it into encoded bytes (and reverses this for decoding). It seems like numcodecs hasn't quite formalized this way of categorizing codecs (e.g. there is no numcodecs.abc.ArrayToBytesCodec). I've done my best to apply appropriate testing.

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

@pep8speaks
Copy link

pep8speaks commented Jan 24, 2024

Hello @rabernat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 119:1: E402 module level import not at top of file
Line 120:23: W292 no newline at end of file

Comment last updated at 2024-02-23 21:25:17 UTC

Copy link

codecov bot commented Jan 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.91%. Comparing base (0878717) to head (d637773).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #501   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files          57       59    +2     
  Lines        2265     2323   +58     
=======================================
+ Hits         2263     2321   +58     
  Misses          2        2           
Files Coverage Δ
numcodecs/__init__.py 100.00% <100.00%> (ø)
numcodecs/pcodec.py 100.00% <100.00%> (ø)
numcodecs/tests/common.py 100.00% <100.00%> (ø)
numcodecs/tests/test_pcodec.py 100.00% <100.00%> (ø)

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of test files!

@rabernat
Copy link
Contributor Author

That's a lot of test files!

The test files were auto generated. Do you think this is a problem?

@martindurant
Copy link
Member

If you can generate them reliably at CI/test time even better; but they are not big, so I am not worried.

@rabernat
Copy link
Contributor Author

If you can generate them reliably at CI/test time even better

My understanding is that the reason we store files in the repo is to ensure backwards compatibility of the codecs, i.e. to guard against changes to the implementation that would render existing data unreadable. That wouldn't work if you regenerate them each time.

@martindurant
Copy link
Member

That wouldn't work if you regenerate them each time.

Right, it would need to be a totally independent and repeatable way to make them. If there is no such way, you store them.

@rabernat
Copy link
Contributor Author

Would anyone care to review this PR?

@@ -115,3 +115,7 @@

from numcodecs.fletcher32 import Fletcher32
register_codec(Fletcher32)

with suppress(ImportError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it better to make sure this import succeeds (import pcodec inside encode/decode, below), but presenting a helpful message whenever someone tries to use Pcodec without the necessary deps? Otherwise, the user would just see some "codec not found" message without any idea of what to do about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this comment. The approach used here was copied from other implementations, but I'm going to change it to what you suggested.

self.float_mult_spec = float_mult_spec
self.max_page_n = max_page_n

def encode(self, buf):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def encode(self, buf):
def encode(self, buf) -> bytes:

?
(This may be a good way to label array-bytes codecs; maybe also type for buf should be ndarray)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that there is much value in adding these sorts of type hints if we are not actually running type checking on the library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not for other users of the library, and their IDEs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But incorrect type hints are worse than none at all! For example, is ndarray really the correct type for buf? Maybe, but who knows? I could add it, but without running mypy, we'll never know for sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that correctness is important, of course; but this is like the "light" version of array->array V array->bytes V bytes->bytes. Still useful. You can always get around mypy too, if you want.

@rabernat
Copy link
Contributor Author

I am sad that the CI is now failing. I don't understand why. (Everything is good locally).

Going to try reverting the change to how we import.

@rabernat rabernat merged commit 4abe4be into zarr-developers:main Feb 24, 2024
27 checks passed
DimitriPapadopoulos pushed a commit to DimitriPapadopoulos/numcodecs that referenced this pull request Aug 10, 2024
* added PCodec

* fix line length and print statements

* docs

* mock pcodec on rtd

* fix typo

* add dtype details

* changed import style for pcodec

* fix flake8

* revert import changes

* fix errors due to changes in pcodec API

* change import style

* skip coverage of failed import path

* skip pcodec tests if not installed
@dstansby dstansby mentioned this pull request Aug 11, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants