convert numpy scalars to python types before yaml encoding #1605

Merged
merged 1 commit into from
Jul 27, 2023

@braingram braingram commented Jul 26, 2023

NEP 51 changes the repr of numpy scalars.

This change conflicts with represent_float/represent_int (from pyyaml, which use repr). With numpy 2.0 (which implements NEP 51), scalars are encoded as, for example, 'np.float64(3.14)' instead of '3.14'. To work around this, convert the values to builtin python types before passing them to pyyaml for encoding.
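
A minimal sketch of that workaround using plain pyyaml rather than asdf's own dumper. SketchDumper and the two representer functions are hypothetical names made up for this illustration; only add_representer, represent_float and represent_int are pyyaml calls.

```python
import numpy as np
import yaml


class SketchDumper(yaml.SafeDumper):
    """Hypothetical stand-in for asdf's dumper, used only in this sketch."""


def represent_numpy_float(dumper, data):
    # Convert first, so pyyaml calls repr() on a builtin float
    # and never sees the NEP 51 style 'np.float64(...)' string.
    return dumper.represent_float(float(data))


def represent_numpy_int(dumper, data):
    # Same idea for integer scalars.
    return dumper.represent_int(int(data))


for scalar_type in (np.float16, np.float32, np.float64):
    SketchDumper.add_representer(scalar_type, represent_numpy_float)
for scalar_type in (np.int8, np.int32, np.int64):
    SketchDumper.add_representer(scalar_type, represent_numpy_int)

tree = {"x": np.float64(3.14), "n": np.int64(7)}
text = yaml.dump(tree, Dumper=SketchDumper)
loaded = yaml.safe_load(text)
```

With the conversion in place the dumped text contains plain numbers, so yaml.safe_load reads them back as builtin float and int regardless of the installed numpy version.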

This PR also disables pyyaml devdeps testing, as pyyaml is incompatible with cython 3.0 and its failure to install stops devdeps testing.

@braingram braingram added Downstream CI development No backport required labels Jul 26, 2023
@github-actions github-actions bot modified the milestone: 3.0.0 Jul 26, 2023
@braingram braingram marked this pull request as ready for review July 26, 2023 16:30
@braingram braingram requested a review from a team as a code owner July 26, 2023 16:30
@braingram
Contributor Author

jwst devdeps failure is because of an unreleased fix in stdatamodels:
spacetelescope/stdatamodels#184
and could also be fixed by:
#1594

 for scalar_type in util.iter_subclasses(np.floating):
-    AsdfDumper.add_representer(scalar_type, AsdfDumper.represent_float)
+    AsdfDumper.add_representer(scalar_type, lambda dumper, data: dumper.represent_float(float(data)))
Contributor

Will this fully preserve floating point precision?

Alternatively, you can set the numpy representation version you want using np.set_printoptions.

Adding something like this snippet in the correct place (such as conftest.py, to fix unit tests) should also fix the issue:

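from asdf.util import minversion
import numpy as np

# Only numpy 2.0 and later understand the "1.25" legacy mode,
# so request it only when that version (or a dev build of it) is installed.
if minversion(np, "2.0.dev"):
    np.set_printoptions(legacy="1.25")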

For example, adding this to romancal's top-level conftest.py fixes the errors resulting from the proposed numpy 2.0 changes.

See astropy/astropy#15096

Contributor

Note that astropy/astropy#15065, in particular its astropy.io.yaml changes, suggests a much clearer way to do this than my suggestion above.

Contributor Author

Thanks for taking a look and for sharing the links to how astropy is handling this change. astropy/astropy#15065 looks interesting (it appears numpy calls str first and repr then adds the dtype portion around it). However, due to the way pyyaml deserializes floats, I'm not sure we want to continue using str or repr on the numpy scalars (more on that below).

The short answer to the precision question is that this PR should have no negative impact (and in some cases a positive impact) on roundtripping floating point scalars.

The long answer is more complicated.

There has been some recent discussion on scalar handling and ASDF (see #1519). As discussed, the YAML standard is not overly prescriptive on float precision (https://yaml.org/type/float.html). ASDF does not further define float handling, and we should document that users should use arrays stored in ASDF blocks for accurate roundtripping and control of precision. I'm pinging @perrygreenfield @eslavich and @nden as they can hopefully fill in details from the discussion that I've forgotten and/or failed to note.

The simplest case is float128 (and anything else wider than 64 bits). asdf currently deserializes floating point scalars in the tree as python native float. This means that, on systems that support floats wider than 64 bits, precision was already lost before this PR, and is still lost with it, whenever these values are written to the tree instead of to ASDF blocks.
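
A rough way to see this on a given machine, as a sketch that uses only numpy and does not touch asdf. np.longdouble is platform dependent (on some systems it is just a 64 bit float), so the result is compared against the platform's own precision instead of being assumed.

```python
import numpy as np

ld_eps = np.finfo(np.longdouble).eps

# Smallest longdouble strictly greater than 1 on this platform.
wide = np.longdouble(1) + ld_eps

# What survives once the value becomes a builtin (64 bit) float.
narrowed = float(wide)

# True if the trip through a builtin float changed the value.
precision_lost = bool(np.longdouble(narrowed) != wide)

# True if longdouble really carries more precision than float64 here.
longdouble_is_wider = bool(ld_eps < np.finfo(np.float64).eps)
```

On platforms where longdouble is wider than 64 bits, precision_lost is expected to be True; where it is an alias for a 64 bit float, both flags are False.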

For 64 bit floats both numpy and python reprs (by default) select the shortest string that will roundtrip (see https://docs.python.org/3/tutorial/floatingpoint.html). So converting a np.float64 to float prior to serialization has no impact on precision.
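
A quick check of that claim, as a sketch; the sample values are arbitrary choices for illustration and are not taken from the asdf test suite.

```python
import numpy as np

# Arbitrary 64 bit values, including some with long decimal expansions.
samples = [0.1, 3.14, 1 / 3, 1e-300, 2.0**53 + 2]

checks = []
for x in samples:
    as_builtin = float(np.float64(x))  # the conversion this PR adds
    text = repr(as_builtin)            # what pyyaml would write
    # The conversion keeps the value, and the text parses back to it.
    checks.append(as_builtin == x and float(text) == x)
```

Every entry in checks should be True, since the builtin float and np.float64 share the same 64 bit representation.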

For floats with fewer than 64 bits the situation is the messiest, and this is where the changes in this PR will have a small impact on what is written to and roundtripped through an ASDF file. The difference comes from the numpy repr choosing the shortest string that roundtrips with the precision of the datatype (whereas asdf will always convert the value to 64 bits on read). It's probably helpful to use an example.
Prior to this PR, np.float16(3.143) did not roundtrip:

>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
False

This boils down to the numpy repr choosing '3.143' to represent the value, which, when parsed as a 64 bit python float, produces a value that prints the same but differs slightly from the 16 bit original.

>>> repr(v)
'3.143'
>>> fv = float(repr(v))  # repr(fv) is also '3.143'
>>> fv == v
False
>>> abs(fv - v)
0.00042187499999979394

However, as expected, casting the read value back to 16 bit allows the comparison to pass:

>>> np.float16(fv) == v
True

To summarize, prior to this PR, what was written for a <64 bit numpy floating point scalar was the shortest string that would roundtrip at the precision of the original float. However, because pyyaml (and asdf) read the floats as 64 bits, the read values can fail a 64 bit comparison.
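
The same effect can be reproduced with only the standard library, as a sketch. narrow is a hypothetical helper written for this illustration; it uses struct's 'e' (half) and 'f' (single) formats to round a builtin float to the narrower precision. The float32 case with '0.1' is an added example and does not come from the discussion above.

```python
import struct


def narrow(value, fmt):
    """Round a builtin float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, value))[0]


half = narrow(3.143, "e")    # the value a 16 bit float actually stores
single = narrow(0.1, "f")    # the value a 32 bit float actually stores

# Reading the short strings back as 64 bit floats gives different values...
half_mismatch = float("3.143") != half
single_mismatch = float("0.1") != single

# ...but narrowing the read value again recovers the stored one.
half_recovered = narrow(float("3.143"), "e") == half
single_recovered = narrow(float("0.1"), "f") == single
```

Here half is 3.142578125, which matches the float(v) value that comes up later in this comment.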

With this PR the values are converted to a float prior to conversion to a string for serialization. Using the above example, with this PR:

>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
np.True_  # different repr due to NEP51

Casting the value back to 16 bits also works without issue:

>>> np.float16(asdf.testing.helpers.roundtrip_object(v)) == v
np.True_

However, it should be noted that to achieve this the string written to the file is different, because the value is converted to 64 bits before repr is called (so repr now selects the shortest string that reconstructs the 64 bit float instead of the 16 bit one).

>>> v = np.float16(3.143)
>>> v
np.float16(3.143)
>>> float(v)
3.142578125

So for this example with this PR asdf will write 3.142578125 instead of 3.143 to the file.
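
The two behaviours can be compared side by side without asdf, as a sketch. str(v) is used here as a stand-in for the old repr output (an assumption made so the snippet gives the same digits with or without NEP 51), and the comparisons are done against float(v) so they do not depend on numpy's scalar promotion rules.

```python
import numpy as np

v = np.float16(3.143)

# Text written to the file under each approach.
old_text = str(v)            # shortest string at 16 bit precision
new_text = repr(float(v))    # shortest string at 64 bit precision

# What a YAML reader sees: both strings parsed as 64 bit floats.
old_read = float(old_text)
new_read = float(new_text)

# Compare against the exact value held by v, as a builtin float.
old_matches = old_read == float(v)
new_matches = new_read == float(v)

# Casting the old read value back to 16 bits still recovers v.
old_matches_after_cast = bool(np.float16(old_read) == v)
```

Only the new text reads back to exactly the value stored in v; the old text needs the extra cast to 16 bits before it compares equal.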

Contributor

Thanks for the explanation. As I stated before, I wanted to make sure ASDF can "roundtrip" scalars correctly.

Now that light has been shed on the issue of numpy scalars, I think ASDF should work towards serializing numpy scalars differently from the builtin python scalars. NEP 41 is working towards more complicated dtypes (which would apply to scalars too). One of the main motivations for that effort is to encode units into the dtype itself (for performance). With that in mind, ASDF might want to start considering how to encode the dtype for scalars.

@braingram
Contributor Author

3.9 devdeps failure is because the scipy nightly for 3.9 failed to build due to an anaconda server 500:
https://github.com/scipy/scipy/actions/runs/5682181723/job/15400012240

@braingram braingram merged commit b43b2c5 into asdf-format:main Jul 27, 2023
@braingram braingram deleted the nep51 branch July 27, 2023 17:26
braingram added a commit to braingram/asdf that referenced this pull request Aug 1, 2023
convert numpy scalars to python types before yaml encoding

(cherry picked from commit b43b2c5)
@braingram braingram mentioned this pull request Aug 1, 2023