Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incorrect results for group averaging with missing data #320

Merged
merged 5 commits into from
Aug 25, 2022

Conversation

pochedls
Copy link
Collaborator

Description

This PR fixes an issue that leads to incorrect results for group averaging with missing data; the PR ensures that these data points receive no weight when averaging.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@tomvothecoder tomvothecoder self-requested a review August 18, 2022 01:02
data=np.array([[[2.0]], [[0.0]], [[1.0]], [[1.0]], [[2.0]]]),
data=np.array([[[2.0]], [[np.nan]], [[1.0]], [[1.0]], [[2.0]]]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value that was 0.0 is taking a monthly average for a month with one data point, which is np.nan. The average of one np.nan value should be np.nan.

@tomvothecoder tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. Priority: High labels Aug 18, 2022
Comment on lines 923 to 924
dv_attrs = dv.attrs
dv_name = dv.name
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I updated dv = self._group_data(dv).sum() to dv = self._group_data(dv).sum() / self._group_data(weights).sum() these attributes were dropped, so I held onto them to put them back on.

@codecov-commenter
Copy link

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #320 (61e1700) into main (6343139) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #320   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1191      1196    +5     
=========================================
+ Hits          1191      1196    +5     
Impacted Files Coverage Δ
xcdat/temporal.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

dv *= self._weights
dv = self._group_data(dv).sum()
# cast the weights to match the dv.shape / dims
weights, x = xr.broadcast(self._weights, dv)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weights is one-dimensional (with a temporal dimension), but I want to make sure that grid cells that are np.nan in dv receive zero weight. Since dv is multidimensional, I need my weights array to have the same shape in order to reflect this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporated this PR comment in the code comment.

# ensure missing data receives no weight
weights = xr.where(np.isnan(dv), 0.0, weights)
# perform weighted average
dv = self._group_data(dv).sum() / self._group_data(weights).sum()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New weighted average corresponding to WA = sum(data*weights) / sum(weights). The denominator wasn't included before because it was always one. This is not true when there is missing data, so we need to include the denominator.

Copy link
Collaborator

@tomvothecoder tomvothecoder Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorporated this PR comment in the code comment.

Comment on lines 957 to 958
dv[self._dim].attrs = lt_attrs
dv[self._dim].encoding = lt_encoding
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines used to be

dv[self._dim].attrs = self._labeled_time.attrs
dv[self._dim].encoding = self._labeled_time.encoding

Somehow this information was getting dropped in the new weighted average call (dv = self._group_data(dv).sum() / self._group_data(weights).sum()).

I'm not sure if this can be improved...I store lt_attrs and lt_encoding in two places: under if self._weighted: and else because these values are only populated after ._group_data() (in the else statement these values are populated after calling _get_weights() because _get_weights() calls _group_data().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this in my recent review

tomvothecoder and others added 2 commits August 18, 2022 10:58
* Updates the dependencies in the conda env yml files
* Updates the pre-commit hook versions
* Address `mypy` warnings
  * Remove `type: ignore` inline comments to silence mypy warnings related to xarray, can address in the future
  * Remove unused `type: ignore` based on latest version of mypy
@tomvothecoder tomvothecoder force-pushed the bugfix/319-incorrect-group-avg-with-missing-data branch from 428675f to 63e9537 Compare August 18, 2022 17:58
@tomvothecoder
Copy link
Collaborator

Thanks for opening this PR. FYI, I realized I won't have time to review this until next week.

- Preserve data variable attributes using `xr.set_options(keep_attrs=True)`
- Reuse `self._labeled_time` if it is already set in a previous call to `_group_data()`
- Update group average tests to check data variable test attr is preserved
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Steve, I did some minor refactoring and updated the code comments based on your PR comments.

The tests pass, so you can merge if you are in agreement with the changes.

Thanks!

Comment on lines +1037 to +1042
# Reuse the labeled time coordinates if they are already generated.
# This specifically applies when generating weights, which calls
# this method again after it has already been called to group
# the data variable with the same time coordinates.
if not hasattr(self, "_labeled_time"):
self._labeled_time = self._label_time_coords(dv[self._dim])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of your previous comments, you mentioned that the self._labeled_time attrs were being dropped.

This was because we call _group_data() twice, one for dv and one for weights.
weights doesn't preserve the time dimension attrs because it is unnecessary. However, this caused a problem where the second call to _group_data() with weights overwrites self._labeled_time (resulting in attrs being dropped).

This new conditional fixes resolves this problem.

Comment on lines 954 to 955
dv[self._dim].attrs = self._labeled_time.attrs
dv[self._dim].encoding = self._labeled_time.encoding
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can restore these lines of code since self._labeled_time attrs are preserved with the new conditional in _group_data().

Comment on lines +935 to +936
with xr.set_options(keep_attrs=True):
dv = self._group_data(dv).sum() / self._group_data(weights).sum()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xr.set_options(keep_attrs=True) will preserve the attrs for the dv. It does not restore the attrs for the time dimension, so we manually restore those in lines of code below.


# After grouping and aggregating the data variable values, the
# original time dimension is replaced with the grouped time dimension.
# For example, grouping on "year_season" replaces the time dimension
# with "year_season". This dimension needs to be renamed back to
# the original time dimension name before the data variable is added
# back to the dataset so that the CF compliant name is maintained.
# back to the dataset so that the original name is preserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this comment since the name of the time dimension isn't always necessarily CF compliant.

@@ -418,6 +418,7 @@ def setup(self):
),
coords={"time": self.ds.time, "lat": self.ds.lat, "lon": self.ds.lon},
dims=["time", "lat", "lon"],
attrs={"test_attr": "test"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now test that the attrs for ts is preserved.

@@ -467,6 +469,7 @@ def test_weighted_annual_averages(self):

xr.testing.assert_allclose(result, expected)
assert result.ts.attrs == expected.ts.attrs
assert result.time.attrs == expected.time.attrs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check the time attrs are preserved.

# achieve this, first broadcast the one-dimensional (temporal
# dimension) shape of the `weights` DataArray to the
# multi-dimensional shape of its corresponding data variable.
weights, _ = xr.broadcast(self._weights, dv)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the unused x variable with _.

@pochedls pochedls merged commit 007fb8a into main Aug 25, 2022
@pochedls pochedls deleted the bugfix/319-incorrect-group-avg-with-missing-data branch August 25, 2022 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: incorrect group averaging with missing data
3 participants