Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make corrXY work with multiple GEN_KWs #9426

Closed
wants to merge 1 commit into from

Conversation

dafeda
Copy link
Contributor

@dafeda dafeda commented Dec 3, 2024

Introduction

A while back we introduced a call-back to the iterative_ensemble_smoother to allow fetching the cross-correlation matrix between parameters and responses.
This matrix was saved into ert's internal storage as corr_XY.nc and the load_cross_correlations API end-point was created.
The idea was to provide users with the raw data and let them analyse and visualise it as they saw fit.
This feature was not clearly communicated to our users and was therefore not used at all.
Also, it turns out that the implementation is buggy in that it overwrites existing data such that only the GEN_KW that was specified last in ert's config is available.
Works for the poly-case, but not for configs with multiple GEN_KW's.

Approach

Save cross correlations to a parquet file with the following schema:

image

The obs_name is created by appending a suffix to obs_group. The suffix is a counter that depends on the number of observations within each group.
Data is stored in "tidy format" to make it easy for users to query the data however they want.

Testing using the poly-case (see Plot_correlations.ipynb in the poly_example test folder)

Note that the poly-case has just one GEN_KW.

image

Testing using the heat equation (see Plot_correlations.ipynb in the heat_equation test folder)

To test with multiple GEN_KWs, I've added two GEN_KWs to the heat-equation, and created a Jupyter Notebook that plots correlations for all iterations. See diff for all plots. Here's an example of one of the parameters at iteration 0:

image

Testing using Drogon

The FMU-research team has done testing using Drogon.

Here's an example I created using Drogon:

image

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure unit tests pass locally after every commit (git rebase -i main --exec 'pytest tests/ert/unit_tests -n logical -m "not integration_test"')

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@dafeda dafeda changed the title Commitin Make corrXY work with multiple GEN_KWs Dec 3, 2024
@dafeda dafeda self-assigned this Dec 3, 2024
@dafeda dafeda force-pushed the heat-equation-gen-kws branch 13 times, most recently from b04af00 to a147425 Compare December 5, 2024 08:10
@dafeda dafeda requested a review from oyvindeide December 6, 2024 09:27
@dafeda dafeda force-pushed the heat-equation-gen-kws branch 4 times, most recently from 1c392a1 to c44eac7 Compare December 11, 2024 07:52
@dafeda dafeda marked this pull request as ready for review December 11, 2024 07:52
@dafeda dafeda force-pushed the heat-equation-gen-kws branch from c44eac7 to 124777c Compare December 11, 2024 08:15
@xjules
Copy link
Contributor

xjules commented Dec 12, 2024

There are some failing tests though.

@dafeda dafeda force-pushed the heat-equation-gen-kws branch from 124777c to 1e656a4 Compare December 12, 2024 08:29
@dafeda dafeda force-pushed the heat-equation-gen-kws branch 2 times, most recently from d29f1c3 to 359eb5a Compare December 13, 2024 12:30
@dafeda dafeda force-pushed the heat-equation-gen-kws branch 3 times, most recently from 7eaa3e4 to 766c012 Compare January 3, 2025 10:23
@dafeda dafeda added release-notes:bug-fix Automatically categorise as bug fix in release notes and removed christmas-review labels Jan 3, 2025
Copy link

codspeed-hq bot commented Jan 3, 2025

CodSpeed Performance Report

Merging #9426 will degrade performances by 11.04%

Comparing dafeda:heat-equation-gen-kws (5ad84b8) with main (a6d2dba)

Summary

❌ 1 regressions
✅ 23 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main dafeda:heat-equation-gen-kws Change
test_direct_dark_performance_with_storage[gen_x: 20, sum_x: 20 reals: 10-gen_data_with_obs-get_record_observations] 2.3 ms 2.6 ms -11.04%

@dafeda dafeda force-pushed the heat-equation-gen-kws branch from 766c012 to 29ab039 Compare January 3, 2025 10:53
@dafeda dafeda force-pushed the heat-equation-gen-kws branch from 29ab039 to 5ad84b8 Compare January 7, 2025 08:27
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.73%. Comparing base (d82fc01) to head (5ad84b8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9426      +/-   ##
==========================================
+ Coverage   82.69%   91.73%   +9.03%     
==========================================
  Files         426      426              
  Lines       26559    26566       +7     
==========================================
+ Hits        21964    24371    +2407     
+ Misses       4595     2195    -2400     
Flag Coverage Δ
cli-tests 39.72% <100.00%> (?)
everest-models-test 34.14% <21.42%> (-0.01%) ⬇️
gui-tests 71.86% <85.71%> (?)
integration-test 38.03% <21.42%> (-0.01%) ⬇️
performance-tests 51.63% <28.57%> (-0.01%) ⬇️
test 39.40% <21.42%> (-0.01%) ⬇️
unit-tests 74.18% <28.57%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

# Used as labels for observations in cross-correlation matrix.
# Say we have two observation groups "FOPR" and "WOPR" where "FOPR" has
# 2 responses and "WOPR" has 3.
# In this case we create a list [FOPR_0, FOPR_1, WOPR_0, WOPR_1, WOPR_2]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use the index for the observations here instead? I guess it is ok for summary observations, though perhaps a bit confusing as this number will not align with report_step, but for GEN_OBS which already have an int index it would probably not align with that, and could be confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Do I remember correctly that the index is a time-stamp or date for summary observations? We could perhaps still use it but perhaps a bit messy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, was thinking a bit more about this, and should we consider doing it like we do for the snapshot in the gui showing the scaling factors when doing auto scaling? Those events are automatically written to csv by the client, so will be available to the user.

Something like:

AnalysisDataEvent(
, which alread uses the indices:
indexes[obs_group_mask],

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, these are the two types of indexes we have:

Screenshot 2025-01-08 at 09 02 25 Screenshot 2025-01-08 at 09 03 47

Are you suggesting we use the second part of the string as an index?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting we use the second part of the string as an index?

Yes, or perhaps send the whole thing as an event instead of saving it in storage. Then everyone gets the benefit right away.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow the event thing. Would we then still be able to analyse the data from a Jupyter Notebook?

file_path = os.path.join(self.mount_point, "corr_XY.parquet")
if os.path.exists(file_path):
existing_df = pl.read_parquet(file_path)
df = pl.concat([existing_df, new_df])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bit confusing if we just merge with existing data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you find confusing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is something on disk, we merge with that? Couldnt that potentially be done with a different threshold, and so no longer valid? I might be misunderstanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. self.mount_point points to a specific ensemble. Can we run the same ensemble with different thresholds?

image

@dafeda
Copy link
Contributor Author

dafeda commented Jan 28, 2025

Closing this for now as it's getting somewhat stale and I think EnIF is a better approach.

@dafeda dafeda closed this Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:bug-fix Automatically categorise as bug fix in release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants