use `deepdiff` instead of `asdf.commands.diff` for output and truth file comparisons #868

braingram · 2023-09-14T19:58:59Z

There are many uses of compare_asdf in the regression tests like the following:

romancal/romancal/regtest/test_wfi_pipeline.py

Line 43 in 68e6e0d

assert compare_asdf(rtdata.output, rtdata.truth, **ignore_asdf_paths) is None

These asserts are always True as compare_asdf does not return a value (so its output will always be None).

romancal/romancal/regtest/regtestdata.py

Lines 528 to 532 in 68e6e0d

    
    def compare_asdf(result, truth, **kwargs):
        f = StringIO()
        asdf_diff([result, truth], minimal=False, iostream=f, **kwargs)
        if f.getvalue():
            f.getvalue()
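The failure mode can be reproduced without asdf. The sketch below is illustrative only: `fake_diff` is a hypothetical stand-in for `asdf_diff` that always reports a difference, and the two `compare_*` functions differ only in whether the captured text is returned.

```python
from io import StringIO


def fake_diff(paths, iostream):
    # Hypothetical stand-in for asdf_diff: always reports a difference.
    iostream.write(f"{paths[0]} and {paths[1]} differ\n")


def compare_without_return(result, truth):
    f = StringIO()
    fake_diff([result, truth], iostream=f)
    if f.getvalue():
        f.getvalue()  # the text is computed and then discarded


def compare_with_return(result, truth):
    f = StringIO()
    fake_diff([result, truth], iostream=f)
    if f.getvalue():
        return f.getvalue()  # hand the diff text back to the caller


# The function falls off its end, so the result is None even though
# a difference was written to the stream.
assert compare_without_return("output.asdf", "truth.asdf") is None

# With the return in place the same comparison is no longer None.
assert compare_with_return("output.asdf", "truth.asdf") is not None
```

An `assert ... is None` wrapped around the first variant can therefore never fail, regardless of the file contents.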

Furthermore, asdf.commands.diff produces unhelpful output when comparisons fail (see: #867 (comment)).

This PR replaces asdf.commands.diff with deepdiff (and adds it as a test dependency). deepdiff appears to be a well supported and flexible library for comparing nested python objects (although I have little experience with it).

deepdiff comparisons can be configured to support custom objects by implementing Operator subclasses. This PR includes Operator subclasses for:

* ndarray and NDArrayType (which also covers Quantity)
* astropy.time.Time
* gwcs.WCS

Each of these implements a custom comparison which allows controlling "equality" and the output that is generated when objects are not equal.

Fixing the above issue with compare_asdf has exposed a number of regression test failures.

Regtest run: https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-Developers-Pull-Requests/detail/Roman-Developers-Pull-Requests/362/pipeline/205

With this PR test_dark_current_outfile_step is failing with:

AssertionError: {'arrays_differ': {"root['roman']['data']": {'abs_diff': <Quantity 14526909. DN>,
                                               'n_diffs': 895316,
                                               'worst_abs_diff': {'index': (5,
                                                                            3304,
                                                                            3273),
                                                                  'value': <Quantity 8.09581 DN>},
                                               'worst_fractional_diff': {'index': (5,
                                                                                   4095,
                                                                                   2278),
                                                                         'value': <Quantity 29022.484>}},
                     "root['roman']['pixeldq']": {'abs_diff': 2078764369920,
                                                  'n_diffs': 16776188,
                                                  'worst_abs_diff': {'index': (5,
                                                                               1680),
                                                                     'value': 4294965248},
                                                  'worst_fractional_diff': {'index': (17,
                                                                                      669),
                                                                            'value': inf}}},
   'dictionary_item_removed': [root['roman']['meta']['observation']['ma_table_name']],
   'times_differ': {"root['roman']['meta']['file_date']": {'difference': <TimeDelta object: scale='tai' format='jd' value=0.00023103009259251017>}},
   'values_changed': {"root['roman']['meta']['ephemeris']['time']": {'new_value': 60170.820353270276,
                                                                     'old_value': 60170.82058109169},
                      "root['roman']['meta']['exposure']['effective_exposure_time']": {'new_value': 127.68,
                                                                                       'old_value': 169.26},
                      "root['roman']['meta']['exposure']['elapsed_exposure_time']": {'new_value': 152.04000000000002,
                                                                                     'old_value': 193.44},
                      "root['roman']['meta']['exposure']['end_time_mjd']": {'new_value': 59215.001759722225,
                                                                            'old_value': 59458.00344814815},
                      "root['roman']['meta']['exposure']['frame_time']": {'new_value': 3.04,
                                                                          'old_value': 4.03},
                      "root['roman']['meta']['exposure']['group_time']": {'new_value': 18.24,
                                                                          'old_value': 24.18},
                      "root['roman']['meta']['exposure']['integration_time']": {'new_value': 148.96,
                                                                                'old_value': 197.47},
                      "root['roman']['meta']['exposure']['mid_time_mjd']": {'new_value': 59215.00087986111,
                                                                            'old_value': 59458.00258611111},
                      "root['roman']['meta']['exposure']['start_time_mjd']": {'new_value': 59215.0,
                                                                              'old_value': 59458.00172407407},
                      "root['roman']['meta']['exposure']['type']": {'new_value': 'WFI_IMAGE',
                                                                    'old_value': 'WFI_GRISM'},
                      "root['roman']['meta']['filename']": {'new_value': 'r0000101001001001001_01101_0001_WFI01_uncal.asdf',
                                                            'old_value': 'r0000201001001001002_01101_0001_WFI01_uncal.asdf'},
                      "root['roman']['meta']['instrument']['optical_element']": {'new_value': 'F158',
                                                                                 'old_value': 'GRISM'},
                      "root['roman']['meta']['observation']['program']": {'new_value': '00001',
                                                                          'old_value': '00002'},
                      "root['roman']['meta']['observation']['visit']": {'new_value': 1,
                                                                        'old_value': 2},
                      "root['roman']['meta']['ref_file']['dark']": {'new_value': 'crds://roman_wfi_dark_0549.asdf',
                                                                    'old_value': 'crds://roman_wfi_dark_0546.asdf'}}}
assert False

To enable the output to be pretty-printed, the usage of compare_asdf was changed to match similar code in jwst, where a diff is computed, then the result is inspected and optionally a report is generated. This also allows for an easy fix for #870.
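The compute-then-inspect pattern can be sketched as follows. The commit list names a `DiffResult` class, but its attributes are not shown in this thread, so `identical`, `report`, and the flat-dict `compare` function here are assumptions for illustration.

```python
from pprint import pformat


class DiffResult:
    """Hypothetical container for the outcome of a comparison."""

    def __init__(self, diff):
        self.diff = diff  # mapping of report type -> details

    @property
    def identical(self):
        # An empty diff means no differences were found.
        return not self.diff

    def report(self):
        # Pretty-print so nested paths stay readable in test summaries.
        return pformat(self.diff)


def compare(result, truth):
    # Stand-in that compares two flat dicts rather than ASDF trees.
    changed = {
        key: {"new_value": result[key], "old_value": truth[key]}
        for key in truth
        if key in result and result[key] != truth[key]
    }
    return DiffResult({"values_changed": changed} if changed else {})


diff = compare({"frame_time": 3.04}, {"frame_time": 4.03})
if not diff.identical:
    print(diff.report())
```

In a test this would typically be written as `assert diff.identical, diff.report()`, so the report only appears when the comparison fails.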

Fixes #870

Checklist

added entry in CHANGES.rst under the corresponding subsection
updated relevant tests
updated relevant documentation
updated relevant milestone(s)
added relevant label(s)

codecov · 2023-09-14T20:08:31Z

Codecov Report

Patch coverage is 75.32% of modified lines.

Files Changed	Coverage
romancal/regtest/regtestdata.py	`75.32%`

nden

🎉

schlafly · 2023-09-20T12:24:38Z

This looks good to me. I agree that there's enough complexity in these comparisons that it makes sense to separate this from asdf---the thorniest example being the WCS object.

schlafly

This looks good. Some comments:

* At some point we should add some tests, but I don't think we need those in this PR.
* The scientifically relevant WCS consistency is about ~1/100 pix, which would correspond to 0.001 arcsec atol. One might hope that for this kind of regression testing we could do 10 or 100x better than that. That would be in 2D, so it would be something like np.all(coord1.separation(coord2) < 0.001 * u.arcsec).
* It seems likely that at some point in the future we'll want to specify different levels of precision on different arrays. That will take some changes but I don't think we should do it until we need it.
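For scale, assuming a pixel scale of roughly 0.11 arcsec per pixel (a value not stated in this thread), 1/100 pixel is about 0.001 arcsec. The dependency-free sketch below shows what a 2D separation check at that tolerance could look like; `separation_arcsec`, `wcs_points_agree`, and the sample coordinates are illustrative, and with astropy the `separation` expression quoted above would play the same role.

```python
import math

ARCSEC_PER_RAD = math.degrees(1.0) * 3600.0


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec for inputs in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return 2 * math.asin(math.sqrt(h)) * ARCSEC_PER_RAD


def wcs_points_agree(coords1, coords2, atol_arcsec=0.001):
    # Every pair of sky positions must agree to within the tolerance.
    return all(
        separation_arcsec(ra1, dec1, ra2, dec2) < atol_arcsec
        for (ra1, dec1), (ra2, dec2) in zip(coords1, coords2)
    )


truth = [(270.0, 66.0), (270.1, 66.1)]
# Second point shifted by 0.0005 arcsec in declination: within tolerance.
close = [(270.0, 66.0), (270.1, 66.1 + 0.0005 / 3600)]
# Second point shifted by 0.01 arcsec instead: outside tolerance.
far = [(270.0, 66.0), (270.1, 66.1 + 0.01 / 3600)]

assert wcs_points_agree(truth, close)
assert not wcs_points_agree(truth, far)
```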

braingram · 2023-09-20T14:33:59Z

> This looks good. Some comments:
>
> * At some point we should add some tests, but I don't think we need those in this PR.
>
> * The scientifically relevant WCS consistency is about ~1/100 pix, which would correspond to 0.001 arcsec atol. One might hope that for this kind of regression testing we could do 10 or 100x better than that. That would be in 2D, so it would be something like np.all(coord1.separation(coord2) < 0.001 * u.arcsec).
>
> * It seems likely that at some point in the future we'll want to specify different levels of precision on different arrays. That will take some changes but I don't think we should do it until we need it.

Thanks for taking a look and for the response.

Adding some tests is a great idea. If the general approach and format look good I can add a few to get the bulk of the coverage back. Perhaps at least a failing and passing compare_asdf with an array, time, and wcs.

I'm also happy to update the wcs output based on your suggestion. I'd have to look a bit more at the wcs comparison to figure out: does the output of the bounding box projection contain units? If not, what is the corresponding unit?

Adjusting the comparisons per-array seems doable (I believe the comparison is aware of the key/path of the object being compared). I'd have to think about this a bit to figure out if there is a low-complexity way of adding this configuration (possibly passing in comparison functions from the regtest that calls compare_asdf would be the most flexible way). Given the complexity, perhaps opening a follow-up issue would allow us to accumulate some experience with the comparison before attempting to implement a solution. If that sounds good I'm happy to open the issue.

For the above changes, would you prefer that I update this PR or open follow-up issues (so that this PR can be merged sooner to start using the included changes for testing other PRs)?

mairanteodoro

Looks good to me.

WilliamJamieson

Two requests so that we can ensure that the fix works and remains working. Can you add two tests:

1. A test which runs a step twice and generates two identical outputs in different files and then runs compare_asdf on them and asserts they are identical. This way we know that in theory it passes on identical results.
2. More importantly, a test which runs a step twice (or two different steps), in such a way that we expect the output files to be different, but the same in structure. Then run compare_asdf on them and assert the files are not identical. This way we know it's now correctly identifying when two files are different under the circumstances that most of the regtests are run.
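The shape of these two sanity tests can be sketched without asdf or the pipeline. Everything here is a stand-in: `diff_paths` is a hypothetical nested-dict comparison playing the role of compare_asdf, and `make_tree` replaces running a step and writing a file.

```python
def diff_paths(a, b, path="root"):
    """Return the paths at which two nested dicts differ (hypothetical helper)."""
    if isinstance(a, dict) and isinstance(b, dict):
        paths = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}[{key!r}]"
            if key not in a or key not in b:
                paths.append(child)  # key present on one side only
            else:
                paths.extend(diff_paths(a[key], b[key], child))
        return paths
    return [] if a == b else [path]


def make_tree(exposure_type="WFI_IMAGE"):
    # Stand-in for generating a data model and saving it to a file.
    return {
        "roman": {
            "meta": {"exposure": {"type": exposure_type}},
            "data": [0.0, 1.0],
        }
    }


def test_identical_trees_compare_equal():
    assert diff_paths(make_tree(), make_tree()) == []


def test_different_trees_are_detected():
    # Same structure, different content: the comparison must not pass.
    paths = diff_paths(make_tree("WFI_IMAGE"), make_tree("WFI_GRISM"))
    assert paths == ["root['roman']['meta']['exposure']['type']"]


test_identical_trees_compare_equal()
test_different_trees_are_detected()
```

The second test is the one that would have caught the original bug, since a comparison that always passes fails it immediately.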

schlafly · 2023-09-20T16:24:58Z

Thanks Brett. Yes, I think we want to proceed with this approach. I don't think we need the improved WCS comparison at the moment, and think we'd want to explore a bit how well we actually do. But I was trying to work out what to expect to need in the future. And yeah, let's just defer adding extra flexibility / complexity into the numpy comparisons until we find we actually need it.

Any objections to going ahead and merging? Thanks!

WilliamJamieson · 2023-09-20T16:36:34Z

> Thanks Brett. Yes, I think we want to proceed with this approach. I don't think we need the improved WCS comparison at the moment, and think we'd want to explore a bit how well we actually do. But I was trying to work out what to expect to need in the future. And yeah, let's just defer adding extra flexibility / complexity into the numpy comparisons until we find we actually need it.
>
> Any objections to going ahead and merging? Thanks!

I agree with this; however, the sanity tests I requested are to guard against the root problem that this PR is trying to solve. Namely, compare_asdf currently does not work as expected, meaning it passes comparisons which should have failed. The second requested test demonstrates that it correctly detects failures in a case we know should fail but was previously passing without problem. Moreover, the first request protects us against the possibility that something breaks in deepdiff or the supporting code. In that event all or most of the regression tests would fail, and this test would help trace the issue to compare_asdf rather than to a pipeline code change.

braingram · 2023-09-20T16:51:30Z

> Two requests so that we can ensure that the fix works and remains working. Can you add two tests:
>
> 1. A test which runs a step twice and generates two identical outputs in different files and then runs `compare_asdf` on them and asserts they are identical. This way we know that in theory it passes on identical results.
>
> 2. More importantly, a test which runs a step twice (or two different steps), in such a way that we expect the output files to be different, but the same in structure. Then run `compare_asdf` on them and assert the files are not identical. This way we know it's now correctly identifying when two files are different under the circumstances that most of the regtests are run.

Thanks! I added 2 tests in: eb69b3f

They are generating and saving level 2 data models (rather than running steps). Let me know if these are sufficient for the current needs.

WilliamJamieson

Thanks for adding those two tests. Everything else looks good to me.

braingram · 2023-09-20T17:13:14Z

Rebased and running regression tests:
https://plwishmaster.stsci.edu:8081/job/RT/job/Roman-Developers-Pull-Requests/377/

braingram · 2023-09-20T17:35:11Z

The regression tests show 8 errors (and a slew of warnings; many of these also occurred in run 376, which used #872, so they are now in main and probably were already).

[stable-deps] test_jump_detection_step – romancal.regtest.test_jump_det (1m 57s)
[stable-deps] test_linearity_step – romancal.regtest.test_linearity (1m 38s)
[stable-deps] test_linearity_outfile_step – romancal.regtest.test_linearity (1m 36s)
[stable-deps] test_ramp_fitting_step – romancal.regtest.test_ramp_fitting (3m 28s)
[stable-deps] test_tweakreg – romancal.regtest.test_tweakreg (2m 4s)
[stable-deps] test_flat_field_image_step – romancal.regtest.test_wfi_flat_field (28s)
[stable-deps] test_level2_image_processing_pipeline – romancal.regtest.test_wfi_pipeline (8m 14s)
[stable-deps] test_level2_grism_processing_pipeline – romancal.regtest.test_wfi_pipeline

All errors are using the new compare_asdf which may indicate issues with the truth files.

braingram · 2023-09-20T17:36:26Z

@schlafly I opened up some follow-up issues:

schlafly · 2023-09-20T17:38:45Z

Perfect. Ready to merge?

braingram · 2023-09-20T17:40:02Z

Works for me! Thanks.

github-actions bot added the dependencies (Pull requests that update a dependency file) and regression_testing labels Sep 14, 2023

braingram mentioned this pull request Sep 14, 2023

return a value for compare_asdf #867

Closed

github-actions bot added the automation label Sep 14, 2023

braingram changed the title ~~use deep diff instead of asdf.commands.diff~~ use deepdiff instead of asdf.commands.diff for output and truth file comparisons Sep 14, 2023

braingram marked this pull request as ready for review September 14, 2023 21:48

braingram requested a review from a team as a code owner September 14, 2023 21:48

github-actions bot added automation, Dark Current, Photom, ramp_fitting, jump, Saturation, linearity, pipeline, dq_init, Wide Field Instrument (WFI) and removed automation labels Sep 15, 2023

nden approved these changes Sep 17, 2023

braingram force-pushed the chicago_deep_diff branch from 541cd48 to 0ebaead Compare September 18, 2023 14:33

schlafly approved these changes Sep 20, 2023

mairanteodoro approved these changes Sep 20, 2023

braingram force-pushed the chicago_deep_diff branch from 537210d to a21a8b5 Compare September 20, 2023 15:58

WilliamJamieson requested changes Sep 20, 2023

braingram force-pushed the chicago_deep_diff branch from a21a8b5 to eb69b3f Compare September 20, 2023 16:45

braingram requested a review from WilliamJamieson September 20, 2023 16:51

WilliamJamieson approved these changes Sep 20, 2023

braingram added 11 commits September 20, 2023 13:10

use deep diff instead of asdf.commands.diff

28ab091

enable verbose pytest output in jenkins tests

9d512d0

update changelog

253e284

change wcs test to evaluate ra/dec

47b3a78

disable lazy loading and memmaping for asdf comparison

ecb43e7

document compare_asdf, pretty print output

7c0086c

disable pytest verbose output in jenkins files as pretty print output should appear in test summaries

ignore cal_logs not meta.cal_logs

a63ff45

re-enable pytest verbose output in jenkins files

1acf8f6

change compare_asdf to return DiffResult instance

7caecc1

swap inputs to DeepDiff

2cd83e9

add initial tests for compare_asdf

9f703eb

braingram force-pushed the chicago_deep_diff branch from e894ad1 to 9f703eb Compare September 20, 2023 17:11

This was referenced Sep 20, 2023

Improve wcs comparison performed by compare_asdf in regression tests. #879

Open

Expand compare_asdf tests #880

Open

braingram merged commit b7413cc into spacetelescope:main Sep 20, 2023

braingram deleted the chicago_deep_diff branch September 20, 2023 17:40
