Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use deepdiff instead of asdf.commands.diff for output and truth file comparisons #868

Merged
merged 11 commits into from
Sep 20, 2023

Conversation

braingram
Copy link
Collaborator

@braingram braingram commented Sep 14, 2023

There are many uses of compare_asdf in the regression tests like the following:

assert compare_asdf(rtdata.output, rtdata.truth, **ignore_asdf_paths) is None

These asserts are always True as compare_asdf does not return a value (so it's output will always be None).

def compare_asdf(result, truth, **kwargs):
f = StringIO()
asdf_diff([result, truth], minimal=False, iostream=f, **kwargs)
if f.getvalue():
f.getvalue()

Furthermore, asdf.commands.diff produces unhelpful output when comparisons fail (see: #867 (comment)).

This PR replaces asdf.commands.diff with deepdiff (and adds it as a test dependency). deepdiff appears to be a well supported and flexible library for comparing nested python objects (although I have little experience with it).

deepdiff comparisons can be configured to support custom objects by implementing Operator subclasses. This PR includes Operator subclasses for:

  • ndarray and NDArrayType (which also covers Quantity)
  • astropy.time.Time
  • gwcs.WCS

Each of these implements a custom comparison which allows controlling "equality" and the output that is generated when objects are not equal.

Fixing the above issue with compare_asdf has exposed a number of regression test failures.

Regtest run: https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-Developers-Pull-Requests/detail/Roman-Developers-Pull-Requests/362/pipeline/205

With this PR test_dark_current_outfile_step is failing with:

AssertionError: {'arrays_differ': {"root['roman']['data']": {'abs_diff': <Quantity 14526909. DN>,
                                               'n_diffs': 895316,
                                               'worst_abs_diff': {'index': (5,
                                                                            3304,
                                                                            3273),
                                                                  'value': <Quantity 8.09581 DN>},
                                               'worst_fractional_diff': {'index': (5,
                                                                                   4095,
                                                                                   2278),
                                                                         'value': <Quantity 29022.484>}},
                     "root['roman']['pixeldq']": {'abs_diff': 2078764369920,
                                                  'n_diffs': 16776188,
                                                  'worst_abs_diff': {'index': (5,
                                                                               1680),
                                                                     'value': 4294965248},
                                                  'worst_fractional_diff': {'index': (17,
                                                                                      669),
                                                                            'value': inf}}},
   'dictionary_item_removed': [root['roman']['meta']['observation']['ma_table_name']],
   'times_differ': {"root['roman']['meta']['file_date']": {'difference': <TimeDelta object: scale='tai' format='jd' value=0.00023103009259251017>}},
   'values_changed': {"root['roman']['meta']['ephemeris']['time']": {'new_value': 60170.820353270276,
                                                                     'old_value': 60170.82058109169},
                      "root['roman']['meta']['exposure']['effective_exposure_time']": {'new_value': 127.68,
                                                                                       'old_value': 169.26},
                      "root['roman']['meta']['exposure']['elapsed_exposure_time']": {'new_value': 152.04000000000002,
                                                                                     'old_value': 193.44},
                      "root['roman']['meta']['exposure']['end_time_mjd']": {'new_value': 59215.001759722225,
                                                                            'old_value': 59458.00344814815},
                      "root['roman']['meta']['exposure']['frame_time']": {'new_value': 3.04,
                                                                          'old_value': 4.03},
                      "root['roman']['meta']['exposure']['group_time']": {'new_value': 18.24,
                                                                          'old_value': 24.18},
                      "root['roman']['meta']['exposure']['integration_time']": {'new_value': 148.96,
                                                                                'old_value': 197.47},
                      "root['roman']['meta']['exposure']['mid_time_mjd']": {'new_value': 59215.00087986111,
                                                                            'old_value': 59458.00258611111},
                      "root['roman']['meta']['exposure']['start_time_mjd']": {'new_value': 59215.0,
                                                                              'old_value': 59458.00172407407},
                      "root['roman']['meta']['exposure']['type']": {'new_value': 'WFI_IMAGE',
                                                                    'old_value': 'WFI_GRISM'},
                      "root['roman']['meta']['filename']": {'new_value': 'r0000101001001001001_01101_0001_WFI01_uncal.asdf',
                                                            'old_value': 'r0000201001001001002_01101_0001_WFI01_uncal.asdf'},
                      "root['roman']['meta']['instrument']['optical_element']": {'new_value': 'F158',
                                                                                 'old_value': 'GRISM'},
                      "root['roman']['meta']['observation']['program']": {'new_value': '00001',
                                                                          'old_value': '00002'},
                      "root['roman']['meta']['observation']['visit']": {'new_value': 1,
                                                                        'old_value': 2},
                      "root['roman']['meta']['ref_file']['dark']": {'new_value': 'crds://roman_wfi_dark_0549.asdf',
                                                                    'old_value': 'crds://roman_wfi_dark_0546.asdf'}}}
assert False

To enable the output to be pretty_printed the usage of compare_asdf was changed to match similar code in jwst where a diff is computed then the result inspected and optionally a report generated. This also allows for an easy fix for #870

Fixes #870

Checklist

  • added entry in CHANGES.rst under the corresponding subsection
  • updated relevant tests
  • updated relevant documentation
  • updated relevant milestone(s)
  • added relevant label(s)

@github-actions github-actions bot added dependencies Pull requests that update a dependency file regression_testing labels Sep 14, 2023
@codecov
Copy link

codecov bot commented Sep 14, 2023

Codecov Report

Patch coverage is 75.32% of modified lines.

Files Changed Coverage
romancal/regtest/regtestdata.py 75.32%

📢 Thoughts on this report? Let us know!.

@braingram braingram mentioned this pull request Sep 14, 2023
5 tasks
@braingram braingram changed the title use deep diff instead of asdf.commands.diff use deepdiff instead of asdf.commands.diff for output and truth file comparisons Sep 14, 2023
@braingram braingram marked this pull request as ready for review September 14, 2023 21:48
@braingram braingram requested a review from a team as a code owner September 14, 2023 21:48
Copy link
Collaborator

@nden nden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@schlafly
Copy link
Collaborator

This looks good to me. I agree that there's enough complexity in these comparisons that it makes sense to separate this from asdf---the thorniest example being the WCS object.

Copy link
Collaborator

@schlafly schlafly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Some comments:

  • At some point we should add some tests, but I don't think we need those in this PR.
  • The scientifically relevant WCS consistency is about ~1/100 pix, which would correspond to 0.001 arcsec atol. One might hope that for this kind of regression testing we could do 10 or 100x better than that. That would be in 2D, so it would be something like np.all(coord1.separation(coord2) < 0.001 * u.arcsec).
  • It seems likely that at some point in the future we'll want to specify different levels of precision on different arrays. That will take some changes but I don't think we should do it until we need it.

@braingram
Copy link
Collaborator Author

This looks good. Some comments:

* At some point we should add some tests, but I don't think we need those in this PR.

* The scientifically relevant WCS consistency is about ~1/100 pix, which would correspond to 0.001 arcsec atol.  One might hope that for this kind of regression testing we could do 10 or 100x better than that.  That would be in 2D, so it would be something like np.all(coord1.separation(coord2) < 0.001 * u.arcsec).

* It seems likely that at some point in the future we'll want to specify different levels of precision on different arrays.  That will take some changes but I don't think we should do it until we need it.

Thanks for taking a look and for the response.

Adding some tests is a great idea. If the general approach and format look good I can add a few to get the bulk of the coverage back. Perhaps at least a failing and passing compare_asdf with an array, time, and wcs.

I'm also happy to update the wcs output based on your suggestion. I'd have to look a bit more at the wcs comparsion to figure out: does the output of the bounding box projection contain units? If not, what is the corresponding unit?

Adjusting the comparisons per-array seems doable (I believe the comparison is aware of the key/path of the object being compared). I'd have to think about this a bit to figure out if there is a low-complexity way of adding this configuration (possibly passing in comparison functions from the regtest that calls compare_asdf would be the most flexible way). Given the complexity, perhaps opening a follow-up issue would allow us to accumulate some experience with the comparison before attempting to implement a solution. If that sounds good I'm happy to open the issue.

For the above changes, would you prefer that I update this PR or open follow-up issues (so that this PR can be merged sooner to start using the included changes for testing other PRs)?

Copy link
Collaborator

@mairanteodoro mairanteodoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Collaborator

@WilliamJamieson WilliamJamieson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two requests so that we can insure that the fix works and remains working, can you add two tests:

  1. A test which runs a step twice and generates two identical outputs in different files and then runs compare_asdf on them and asserts they are identical. This way we know that in theory it passes on identical results.
  2. More importantly, a test which runs a step twice (or two different steps), in such a way that we expect the output files to be different, but the same in structure. Then run compare_asdf on them and asserts the files are not identical. This way we know its now correctly identifying when two files are different under the circumstances that most of the regtests are run.

@schlafly
Copy link
Collaborator

Thanks Brett. Yes, I think we want to proceed with this approach. I don't think we need the improved WCS comparison at the moment, and think we'd want to explore a bit how well we actually do. But I was trying to work out what to expect to need in the future. And yeah, let's just defer adding extra flexibility / complexity into the numpy comparisons until we find we actually need it.

Any objections to going ahead and merging? Thanks!

@WilliamJamieson
Copy link
Collaborator

Thanks Brett. Yes, I think we want to proceed with this approach. I don't think we need the improved WCS comparison at the moment, and think we'd want to explore a bit how well we actually do. But I was trying to work out what to expect to need in the future. And yeah, let's just defer adding extra flexibility / complexity into the numpy comparisons until we find we actually need it.

Any objections to going ahead and merging? Thanks!

I agree with this; however, the sanity tests I requested are to guard against root problem that this PR is trying to solve. Namely, compare_asdf currently does not work as expected meaning it passes comparisons which should have failed. The second requested test demonstrates that it correctly detects failures in a case we know should fail but was previously passing without problem. Moreover the first request protects us against the possibility that something breaks in deepdiff or the supporting code. In that event all or most of the regression tests would be failing, this test will help smoke out the issue to something in compare_asdf rather than a pipeline code change.

@braingram
Copy link
Collaborator Author

Two requests so that we can insure that the fix works and remains working, can you add two tests:

1. A test which runs a step twice and generates two identical outputs in different files and then runs `compare_asdf` on them and asserts they are identical. This way we know that in theory it passes on identical results.

2. More importantly, a test which runs a step twice (or two different steps), in such a way that we expect the output files to be different, but the same in structure. Then run `compare_asdf` on them and asserts the files are not identical. This way we know its now correctly identifying when two files are different under the circumstances that most of the regtests are run.

Thanks! I added 2 tests in: eb69b3f

They are generating and saving level 2 data models (rather than running steps). Let me know if these are sufficient for the current needs.

Copy link
Collaborator

@WilliamJamieson WilliamJamieson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding those two tests. Everything else looks good to me.

@braingram
Copy link
Collaborator Author

@braingram
Copy link
Collaborator Author

braingram commented Sep 20, 2023

The regression tests show 8 errors (and a slew of warnings, many of these occurred in run 376 which used #872 so they're now and probably were already in main).

[stable-deps] test_jump_detection_step – romancal.regtest.test_jump_det1m 57s
[stable-deps] test_linearity_step – romancal.regtest.test_linearity1m 38s
[stable-deps] test_linearity_outfile_step – romancal.regtest.test_linearity1m 36s
[stable-deps] test_ramp_fitting_step – romancal.regtest.test_ramp_fitting3m 28s
[stable-deps] test_tweakreg – romancal.regtest.test_tweakreg2m 4s
[stable-deps] test_flat_field_image_step – romancal.regtest.test_wfi_flat_field28s
[stable-deps] test_level2_image_processing_pipeline – romancal.regtest.test_wfi_pipeline8m 14s
[stable-deps] test_level2_grism_processing_pipeline – romancal.regtest.test_wfi_pipeline

All errors are using the new compare_asdf which may indicate issues with the truth files.

@schlafly
Copy link
Collaborator

Perfect. Ready to merge?

@braingram
Copy link
Collaborator Author

Works for me! Thanks.

@braingram braingram merged commit b7413cc into spacetelescope:main Sep 20, 2023
@braingram braingram deleted the chicago_deep_diff branch September 20, 2023 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tests contain multiple identical calls to compare_asdf
5 participants