convert numpy scalars to python types before yaml encoding #1605

Merged
merged 1 commit into from
Jul 27, 2023
2 changes: 2 additions & 0 deletions CHANGES.rst
@@ -15,6 +15,8 @@ The ASDF Standard is at v1.6.0
 - Fix issue opening files that don't support ``fileno`` [#1557]
 - Allow Converters to defer conversion to other Converters
   by returning ``None`` in ``Converter.select_tag`` [#1561]
+- Convert numpy scalars to python types during yaml encoding
+  to handle NEP51 changes for numpy 2.0 [#1605]
 
 2.15.0 (2023-03-28)
 -------------------
8 changes: 7 additions & 1 deletion asdf/_tests/test_yaml.py
@@ -287,4 +287,10 @@ def test_numpy_scalar(numpy_value, expected_value):
     yamlutil.dump_tree(tree, buffer, ctx)
     buffer.seek(0)
 
-    assert yamlutil.load_tree(buffer)["value"] == expected_value
+    loaded_value = yamlutil.load_tree(buffer)["value"]
+    if isinstance(numpy_value, np.floating):
+        abs_diff = abs(expected_value - loaded_value)
+        eps = np.finfo(numpy_value.dtype).eps
+        assert abs_diff < eps, abs_diff
+    else:
+        assert loaded_value == expected_value
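
The tolerance in the updated assertion can be illustrated without numpy. This is only a sketch under assumptions: `to_float16` is a hypothetical helper that uses the standard library's `struct` half-precision format as a stand-in for `np.float16`, `EPS16` hard-codes the float16 machine epsilon (2**-10), and the value 3.143 is borrowed from the review discussion rather than from the test's actual parameters.

```python
import struct

# Machine epsilon of IEEE 754 half precision (10 stored mantissa bits).
EPS16 = 2.0 ** -10


def to_float16(x: float) -> float:
    """Round x to the nearest half-precision value (stand-in for np.float16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


expected_value = 3.143                      # plain Python literal
loaded_value = to_float16(expected_value)   # what a 16-bit scalar holds after float()

abs_diff = abs(expected_value - loaded_value)
exact_match = loaded_value == expected_value   # exact comparison, as in the old assert
within_eps = abs_diff < EPS16                  # tolerance comparison, as in the new assert

print(exact_match, within_eps)
```

In this sketch the exact comparison fails while the tolerance comparison passes, which is the situation the floating point branch of the test is written to accept.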
5 changes: 3 additions & 2 deletions asdf/yamlutil.py
@@ -117,11 +117,12 @@ def represent_ordereddict(dumper, data):
 # ----------------------------------------------------------------------
 # Handle numpy scalars


 for scalar_type in util.iter_subclasses(np.floating):
-    AsdfDumper.add_representer(scalar_type, AsdfDumper.represent_float)
+    AsdfDumper.add_representer(scalar_type, lambda dumper, data: dumper.represent_float(float(data)))
Contributor


Will this fully preserve floating point precision?

Alternatively, you can set the numpy representation version you want using np.set_printoptions.

Adding something like this snippet in the right place (such as conftest.py, to fix unit tests) should also fix the issue:

from asdf.util import minversion
import numpy as np

# legacy="1.25" is only understood by numpy 2.0 and later
if minversion(np, "2.0.dev"):
    np.set_printoptions(legacy="1.25")

For example, adding this to romancal's top-level conftest.py fixes the errors resulting from the proposed numpy 2.0 changes.

See astropy/astropy#15096

Contributor


Note that astropy/astropy#15065, in particular its astropy.io.yaml changes, suggests a much clearer way to do this than my suggestion above.

Contributor Author


Thanks for taking a look and for sharing the links to how astropy is handling this change. astropy/astropy#15065 looks interesting (it appears numpy calls str prior to repr, where it adds the dtype portion of the repr). However, due to the way pyyaml deserializes floats, I'm not sure we want to continue using str or repr on the numpy scalars (more on that below).

The short answer to the precision question is that this PR should have no negative impact (and in some cases a positive impact) on roundtripping floating point scalars.

The long answer is more complicated.

There has been some recent discussion on scalar handling in ASDF (see #1519). As discussed, the YAML standard is not overly prescriptive on float precision (https://yaml.org/type/float.html). ASDF does not further define float handling, and we should document that users should store arrays in ASDF blocks for accurate roundtripping and control of precision. I'm pinging @perrygreenfield @eslavich and @nden as they can hopefully fill in details from the discussion that I've forgotten and/or failed to note.

The simplest case is float128 (and anything else wider than 64 bits). asdf currently deserializes floating point scalars in the tree as native python floats. This means that, on systems that support floats wider than 64 bits, precision is already lost (and will continue to be lost) when these values are written to the tree rather than to ASDF blocks.
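
The loss for wider-than-64-bit values can be sketched with exact rational arithmetic. This is an illustration under assumptions, not asdf code: `extended` is a hypothetical stand-in for an extended-precision scalar, built with `fractions.Fraction` so the example does not depend on platform support for wide floats.

```python
from fractions import Fraction

# A value that needs 64 significand bits; a 64-bit float only has 53.
# (Hypothetical stand-in for an extended-precision numpy scalar.)
extended = Fraction(1) + Fraction(1, 2 ** 63)

as_float64 = float(extended)              # rounds to the nearest 64-bit float
lost = extended - Fraction(as_float64)    # exact amount dropped by the conversion

print(as_float64)
print(lost == Fraction(1, 2 ** 63))
```

The low-order part of the value is gone once it becomes a 64-bit float, regardless of how that float is later written as text.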

For 64-bit floats, both the numpy and python reprs (by default) select the shortest string that will roundtrip (see https://docs.python.org/3/tutorial/floatingpoint.html). So converting a np.float64 to float prior to serialization has no impact on precision.
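
A quick way to check the roundtrip property for Python's own 64-bit floats is sketched below. The helper name `roundtrips_via_repr` and the sample values are illustrative choices, not part of asdf or this PR.

```python
import random


def roundtrips_via_repr(x: float) -> bool:
    """True if parsing repr(x) gives back exactly the same 64-bit float."""
    return float(repr(x)) == x


rng = random.Random(0)  # fixed seed so the sample is reproducible
samples = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
samples += [0.1, 1 / 3, 1e-300, 5e-324, 1.7976931348623157e308]

all_ok = all(roundtrips_via_repr(x) for x in samples)
print(all_ok, repr(0.1))
```

Because the repr is the shortest roundtripping string, 0.1 is written as `0.1` rather than as its longer exact binary expansion.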

For floats with fewer than 64 bits the situation is messiest, and this is where the changes in this PR will have a small impact on what is written to and roundtripped through an ASDF file. The difference comes from the numpy repr choosing the shortest string that roundtrips at the precision of the datatype (whereas asdf always converts the value to 64 bits on read). An example helps.
Prior to this PR, np.float16(3.143) did not roundtrip:

>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
False

This boils down to the numpy repr choosing '3.143' to represent the value, which, when loaded as a 64-bit float, produces a value slightly different from the one stored in the float16.

>>> repr(v)
'3.143'
>>> fv = float(repr(v))  # repr also 3.143
>>> fv == v
False
>>> abs(fv - v)
0.00042187499999979394

However, as expected, casting the read value back to 16 bit allows the comparison to pass:

>>> np.float16(fv) == v
True

To summarize: prior to this PR, what was written for a numpy floating point scalar narrower than 64 bits was the shortest string that would roundtrip at the precision of the original float. However, because pyyaml (and asdf) read floats as 64 bits, the read values can fail a 64-bit comparison.
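
The pre-PR behaviour summarized here can be reproduced with the standard library alone. In this sketch, `to_float16` is a hypothetical helper that uses `struct`'s half-precision format as a stand-in for `np.float16`, and the string `"3.143"` is the repr reported in the example above rather than something the code computes.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest half-precision value (stand-in for np.float16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


v = to_float16(3.143)        # exact value held by the 16-bit scalar
written = "3.143"            # shortest string that roundtrips at 16-bit precision
read_back = float(written)   # a YAML loader yields a 64-bit float

same_at_64_bits = read_back == v               # comparison at full 64-bit precision
same_at_16_bits = to_float16(read_back) == v   # comparison after narrowing again

print(v, same_at_64_bits, same_at_16_bits)
```

The value only compares equal once it is narrowed back to 16 bits, matching the behaviour described for the old representer.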

With this PR, the values are converted to a python float before being converted to a string for serialization. Using the above example, with this PR:

>>> v = np.float16(3.143)
>>> asdf.testing.helpers.roundtrip_object(v) == v
np.True_  # different repr due to NEP51

Casting the value back to 16 bits also works without issue:

>>> np.float16(asdf.testing.helpers.roundtrip_object(v)) == v
np.True_

However, to achieve this, the string written to the file is different, because the value is converted to 64 bits before the call to repr (which now selects the shortest string that reconstructs the 64-bit rather than the 16-bit float).

>>> v = np.float16(3.143)
>>> v
np.float16(3.143)
>>> float(v)
3.142578125

So for this example, with this PR, asdf will write 3.142578125 instead of 3.143 to the file.
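
The same widen-then-repr step can be sketched for both 16-bit and 32-bit values. This is an illustration under assumptions: `narrow` is a hypothetical helper built on `struct` format codes (`"e"` for 16-bit, `"f"` for 32-bit) standing in for the numpy scalar types, and `repr` on a Python float stands in for what the representer writes.

```python
import struct


def narrow(x: float, fmt: str) -> float:
    """Round x to the precision of struct format fmt ('e' = 16-bit, 'f' = 32-bit)."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]


results = {}
for fmt in ("e", "f"):
    v = narrow(3.143, fmt)       # value the narrow scalar actually holds
    written = repr(v)            # shortest string that rebuilds the 64-bit float
    read_back = float(written)   # what a YAML loader would return
    results[fmt] = (written, read_back == v, narrow(read_back, fmt) == v)

print(results["e"])
```

The written string is longer than `3.143`, but the value read back now compares equal both at 64 bits and after narrowing to the original width.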

Contributor


Thanks for the explanation. As I stated before, I wanted to make sure ASDF can "roundtrip" scalars correctly.

Now that light has been shed on the issue of numpy scalars, I think ASDF should work towards serializing numpy scalars differently from the built-in python scalars. NEP 41 is working towards more complicated dtypes (which would apply to scalars too). One of the main motivations for that effort is to encode units into the dtype itself (for performance). Looking ahead, this suggests that ASDF might want to start considering how to encode the dtype for scalars.


 for scalar_type in util.iter_subclasses(np.integer):
-    AsdfDumper.add_representer(scalar_type, AsdfDumper.represent_int)
+    AsdfDumper.add_representer(scalar_type, lambda dumper, data: dumper.represent_int(int(data)))


 def represent_numpy_str(dumper, data):
3 changes: 2 additions & 1 deletion requirements-dev.txt
@@ -6,8 +6,9 @@ git+https://github.com/asdf-format/asdf-unit-schemas.git
 git+https://github.com/asdf-format/asdf-wcs-schemas
 git+https://github.com/astropy/astropy
 git+https://github.com/spacetelescope/gwcs
-git+https://github.com/yaml/pyyaml.git
+#git+https://github.com/yaml/pyyaml.git
 # jsonschema 4.18 contains incompatible changes: https://github.com/asdf-format/asdf/issues/1485
 #git+https://github.com/python-jsonschema/jsonschema

 numpy>=0.0.dev0
 scipy>=0.0.dev0