convert numpy scalars to python types before yaml encoding #1605
The jwst devdeps failure is because of an unreleased fix in stdatamodels:
  for scalar_type in util.iter_subclasses(np.floating):
-     AsdfDumper.add_representer(scalar_type, AsdfDumper.represent_float)
+     AsdfDumper.add_representer(scalar_type, lambda dumper, data: dumper.represent_float(float(data)))
Will this fully preserve floating point precision?
Alternatively, you can just set the numpy representation version you want using np.set_printoptions. Adding something like this snippet in the correct place (such as conftest.py to fix unit tests) should also fix the issue:
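from asdf.util import minversion
import numpy as np

# legacy="1.25" is only understood by numpy 2.0 and later, where it restores the pre-NEP 51 scalar repr
if minversion(np, "2.0.dev"):
    np.set_printoptions(legacy="1.25")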
For example, adding this to romancal's top-level conftest.py fixes the errors resulting from the proposed numpy 2.0 changes.
Note that astropy/astropy#15065, in particular its handling of astropy.io.yaml, suggests a much clearer way to do this than my suggestion above.
Thanks for taking a look and for sharing the links to how astropy is handling this change. astropy/astropy#15065 looks interesting (as it appears numpy calls str prior to repr, where it adds the dtype portion of repr). However, due to the way pyyaml deserializes floats, I'm not sure we want to continue using str or repr on the numpy scalars (more on that below).
The short answer to the precision question is that this PR should have no negative impact (and in some cases a positive impact) on roundtripping floating point scalars.
The long answer is more complicated.
There has been some recent discussion on scalar handling and ASDF (see #1519). As discussed, the YAML standard is not overly prescriptive on float precision (see https://yaml.org/type/float.html). ASDF does not further define float handling, and we should document that users should use arrays stored in ASDF blocks for accurate roundtripping and control of precision. I'm pinging @perrygreenfield @eslavich and @nden as they can hopefully fill in details from the discussion that I've forgotten and/or failed to note.
The simplest case is float128 (and anything else with more than 64 bits). asdf currently deserializes floating point scalars in the tree as python native float. This means that there is already, and continues to be, a loss of precision for (systems that support) floats with more than 64 bits when these values are written to the tree and not the ASDF blocks.
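As a rough illustration of that loss (a sketch, not code from asdf), the snippet below adds one extended-precision epsilon to 1.0 and then narrows the result to a python float. It assumes only that np.longdouble maps to whatever extended type the platform offers; on platforms where np.longdouble is itself a 64 bit float, nothing is lost.

```python
import numpy as np

# Smallest step above 1.0 that the platform's long double can represent.
wide = np.longdouble(1) + np.finfo(np.longdouble).eps

# Narrowing to a builtin float, which is what happens to scalars read from the tree.
narrowed = float(wide)

# True only when long double really carries more precision than float64.
has_extra_precision = bool(np.finfo(np.longdouble).eps < np.finfo(np.float64).eps)

# True when the narrowed value no longer equals the original.
lost = bool(np.longdouble(narrowed) != wide)

print("extra precision available:", has_extra_precision)
print("precision lost on narrowing:", lost)
```

The two printed values should agree: precision is lost exactly when the platform has extra precision to lose.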
For 64 bit floats, both numpy and python reprs (by default) select the shortest string that will roundtrip (see https://docs.python.org/3/tutorial/floatingpoint.html). So converting a np.float64 to float prior to serialization has no impact on precision.
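A small check of this claim, written as a sketch with a handful of arbitrarily chosen sample values (the samples are assumptions for illustration, not values from the PR):

```python
import numpy as np

# Arbitrary 64 bit samples chosen for illustration.
samples = [np.float64(0.1), np.float64(3.143), np.float64(1) / 3, np.float64(2.5e-300)]

for v in samples:
    as_builtin = float(v)    # exact conversion: both are 64 bit IEEE 754 doubles
    text = repr(as_builtin)  # shortest string that reconstructs the same double
    assert float(text) == v  # parsing the string recovers the original value
    print(text)
```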
For floats with less than 64 bits the situation is the messiest and is where the changes in this PR will have a small impact on what is written to and roundtripped through an ASDF file. The difference comes from the numpy repr choosing the shortest string that roundtrips with the precision of the datatype (whereas asdf will always convert the value to 64 bits on read). It's probably helpful to use an example.
Prior to this PR, np.float16(3.143) did not roundtrip.
>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
False
This can be boiled down to numpy repr choosing '3.143' to represent the value, which when loaded as a float produces a slightly different '3.143'.
>>> repr(v)
'3.143'
>>> fv = float(repr(v)) # repr also 3.143
>>> fv == v
False
>>> abs(fv - v)
0.00042187499999979394
However, as expected, casting the read value back to 16 bit allows the comparison to pass:
>>> np.float16(fv) == v
True
To summarize, prior to this PR, what was written for a <64 bit numpy floating point scalar was the shortest string that would roundtrip if the precision of the original float was used. However, because pyyaml (and asdf) read the floats as 64 bits, the read values can fail a 64 bit comparison.
With this PR, the values are converted to a float prior to conversion to a string for serialization. Using the above example, with this PR:
>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
np.True_ # different repr due to NEP51
Casting the value back to 16 bits also works without issue:
>>> np.float16(asdf.testing.helpers.roundtrip_object(v)) == v
np.True_
However it should be noted that to achieve this, the string written to the file is different due to the conversion to 64 bits prior to the call to repr (which will now select the shortest string that reconstructs the 64 instead of 16 bit float).
>>> v = np.float16(3.143)
>>> v
np.float16(3.143)
>>> float(v)
3.142578125
So for this example, with this PR asdf will write 3.142578125 instead of 3.143 to the file.
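To see the text that ends up in the YAML, here is a minimal sketch that uses plain pyyaml (yaml.safe_dump and yaml.safe_load) as a stand-in for asdf's own dumper. That substitution is an assumption made to keep the example self-contained; asdf's actual write path involves more than this.

```python
import numpy as np
import yaml

v = np.float16(3.143)

# Convert to a builtin float before handing the value to pyyaml.
text = yaml.safe_dump(float(v))
written = text.splitlines()[0]  # the scalar itself, without the document end marker

# pyyaml reads the scalar back as a builtin (64 bit) float.
loaded = yaml.safe_load(text)

print(written)
print(loaded == v)              # comparison against the original float16 value
print(np.float16(loaded) == v)  # comparison after casting back to 16 bits
```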
Thanks for the explanation. As I stated before, I wanted to make sure ASDF can "roundtrip" scalars correctly.
Now that light has been shed on the issue of numpy scalars, I think ASDF should work towards serializing numpy scalars differently than the builtin python scalars. NEP 41 is working towards having more complicated dtypes (which would apply to scalars too). One of the main motivations for this effort is to encode units into the dtype itself (for performance). Looking ahead, this suggests that ASDF might want to start considering how to encode the dtype for scalars.
NEP 51 changes the repr of numpy scalars. This change conflicts with represent_float/int (which use repr), and with numpy 2.0 (which implements NEP 51) scalars are being encoded as, for example, 'np.float64(3.14)' instead of '3.14'. To work around this change, convert the values to builtin python types before passing them to pyyaml for encoding.
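The workaround can be sketched with plain pyyaml. SketchDumper and represent_numpy_float are hypothetical names used only for this illustration (the PR itself registers representers on AsdfDumper), and the explicit tuple of scalar types stands in for iterating over all np.floating subclasses.

```python
import numpy as np
import yaml


class SketchDumper(yaml.SafeDumper):
    """Hypothetical stand-in for AsdfDumper, used only in this sketch."""


def represent_numpy_float(dumper, data):
    # Convert to a builtin float first so pyyaml's repr-based encoding
    # never sees the numpy scalar repr.
    return dumper.represent_float(float(data))


# Register the converter for a few numpy floating point scalar types.
for scalar_type in (np.float16, np.float32, np.float64):
    SketchDumper.add_representer(scalar_type, represent_numpy_float)

text = yaml.dump({"value": np.float64(3.14)}, Dumper=SketchDumper)
print(text)
```

Because the value is a builtin float by the time pyyaml calls repr on it, the output should not depend on which scalar repr the installed numpy uses.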
The 3.9 devdeps failure is because the scipy nightly for 3.9 failed to build due to an anaconda server 500:
convert numpy scalars to python types before yaml encoding (cherry picked from commit b43b2c5)
NEP 51 changes the repr of numpy scalars.
This change conflicts with represent_float/int (from pyyaml, which use repr). With numpy 2.0 (which implements NEP 51), scalars are being encoded as, for example, 'np.float64(3.14)' instead of '3.14'. To work around this change, convert the values to builtin python types before passing them to pyyaml for encoding.
This PR also disables pyyaml devdeps testing, as pyyaml is incompatible with Cython 3.0 and its failure to install stops devdeps testing.