Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely convert fitsrec in tree before serializing with asdf #205

Merged
merged 3 commits into from
Sep 11, 2023

Conversation

braingram
Copy link
Collaborator

@braingram braingram commented Sep 5, 2023

Prior to asdf 3.0, FITS_rec (a subclass of ndarray) was serialized like an ndarray. In asdf 3.0, this will result in a warning:

asdf.exceptions.AsdfConversionWarning: A ndarray subclass (<class 'astropy.io.fits.fitsrec.FITS_rec'>) was converted as a ndarray. This behavior will be removed from a future version of ASDF. See https://asdf.readthedocs.io/en/latest/asdf/config.html#convert-unknown-ndarray-subclasses

This is done out of abundance of caution for oddities that can occur when treating FITS_rec like an ndarray. See below for an example.

There are many places where stdatamodels passes a FITS_recto asdf in a way that is unaware of these oddities (see this run log for a full list of warnings that will appear when asdf 3.0 is released or if testing stdatamodels against current asdf main: https://github.com/asdf-format/asdf/actions/runs/6088051343/job/16588221134)

This PR converts all FITS_rec instances to ndarray instances prior to asdf serialization/validation. The conversion is done using a new method _fits_rec_to_array which corrects columns containing unsigned integers.

Example of where treating FITS_rec like a ndarray can be problematic

from astropy.io import fits
import numpy as np

arr = np.zeros([6], dtype=[('a', 'uint32')])
arr['a'] = np.arange(6)

h = fits.HDUList()
h.append(fits.PrimaryHDU())
h.append(fits.BinTableHDU())
h[-1].data = arr
h.writeto('foo.fits', overwrite=True)

with fits.open('foo.fits') as ff:
    print(ff[-1].data)
    print(ff[-1].data.view(np.ndarray))

Produces invalid data in the view created from the FITS_rec.

[(          0,) (          1,) (          2,) (          3,)
 (          4,) (          5,)]
[(-2147483648,) (-2147483647,) (-2147483646,) (-2147483645,)
 (-2147483644,) (-2147483643,)]

Checklist

  • added entry in CHANGES.rst (either in Bug Fixes or Changes to API)
  • updated relevant tests
  • updated relevant documentation
  • updated relevant milestone(s)
  • added relevant label(s)

@braingram braingram changed the title strip fitsrec from tree before serializing with asdf TEST: strip fitsrec from tree before serializing with asdf Sep 5, 2023
@braingram
Copy link
Collaborator Author

braingram commented Sep 5, 2023

@braingram braingram changed the title TEST: strip fitsrec from tree before serializing with asdf TEST: safely convert fitsrec in tree before serializing with asdf Sep 5, 2023
@codecov
Copy link

codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage is 100.00% of modified lines.

Files Changed Coverage
src/stdatamodels/fits_support.py 100.00%
src/stdatamodels/model_base.py 100.00%
src/stdatamodels/util.py 100.00%
src/stdatamodels/validate.py 100.00%

📢 Thoughts on this report? Let us know!.

@@ -775,7 +775,8 @@ def callback(node, json_id):
data = hdulist[pair].data
return data
return node
af.tree = treeutil.walk_and_modify(af.tree, callback)
# don't assign to af.tree to avoid an extra validation
af._tree = treeutil.walk_and_modify(af.tree, callback)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this assignment has 2 benefits.

  1. (required for this PR) the modified tree returned from walk_and_modify contains FITS_rec instances (this function, _map_hdulist_to_arrays is responsible for replacing the arrays referenced in the tree with the data from the hdus which might come back as FITS_rec instances). If af.tree was assigned, asdf would attempt to validate the assigned tree which would result in warnings (in asdf 3.0) if the tree contained FITS_rec instances
  2. The assignment to af.tree results in validation of the tree. This is unnecessary as this function (_map_hdulist_to_arrays) is called right after a call to asdf.open (see from_fits_asdf in the same file) which by default will validate the tree on read. Assigning to af._tree avoids this duplicate validation.

# don't open_asdf(tree) as this will cause a second validation of the tree
# instead open an empty tree, then assign to the hidden '_tree'
asdffile = self.open_asdf(None, **kwargs)
asdffile._tree = tree
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assignment to af._tree instead of af.tree is for the same reasons as described in the comment and in https://github.com/spacetelescope/stdatamodels/pull/205/files#r1316260255

@braingram braingram changed the title TEST: safely convert fitsrec in tree before serializing with asdf Safely convert fitsrec in tree before serializing with asdf Sep 5, 2023
@braingram braingram marked this pull request as ready for review September 5, 2023 18:38
@braingram braingram requested a review from a team as a code owner September 5, 2023 18:38
Copy link
Collaborator

@eslavich eslavich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just something I'm curious about, PR looks good



def _fits_rec_to_array(fits_rec):
bad_columns = [n for n in fits_rec.dtype.fields if np.issubdtype(fits_rec[n].dtype, np.unsignedinteger)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rebuild_fits_rec_dtype there is a somewhat different definition of a bad column:

        table_dtype = dtype[field_name]
        field_dtype = fits_rec.field(field_name).dtype
        if np.issubdtype(table_dtype, np.signedinteger) and np.issubdtype(field_dtype, np.unsignedinteger):

Is that first check unnecessary? Since unsigned integers in FITS are always "pseudo unsigned"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. To put it another way, the lack of unsigned integer support in FITS means that all unsigned integers in the fitsrec (so field_dtype is unsigned) will be stored as a signed integer with an offset (so table_dtype will be signed).

I'm happy to remove the first part of the check in rebuild_fits_rec_dtype if it would be preferable to have that change in this PR. Alternatively I can open a follow-up PR (or just a tracking issue if the optimization is of lower priority).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened an issue to track optimizing the comparison.

#206

That way this PR can rely on the previously run regtests and a follow-up PR (after the next jwst release) can look into optimizing the comparison.

@hbushouse
Copy link
Contributor

CI failure is unrelated.

@hbushouse hbushouse merged commit 91c7d8a into spacetelescope:master Sep 11, 2023
@braingram braingram deleted the asdf_no_fitsrec branch September 11, 2023 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants