fix incorrect results for group averaging with missing data #320

pochedls · 2022-08-18T01:00:07Z

Description

This PR fixes an issue that leads to incorrect results for group averaging with missing data; the PR ensures that these data points receive no weight when averaging.

Closes [Bug]: incorrect group averaging with missing data #319

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules

If applicable:

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass with my changes (locally and CI/CD build)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

pochedls · 2022-08-18T01:03:05Z

tests/test_temporal.py

-            data=np.array([[[2.0]], [[0.0]], [[1.0]], [[1.0]], [[2.0]]]),
+            data=np.array([[[2.0]], [[np.nan]], [[1.0]], [[1.0]], [[2.0]]]),


The value that was 0.0 is taking a monthly average for a month with one data point, which is np.nan. The average of one np.nan value should be np.nan.

pochedls · 2022-08-18T01:04:19Z

xcdat/temporal.py

+            dv_attrs = dv.attrs
+            dv_name = dv.name


When I updated dv = self._group_data(dv).sum() to dv = self._group_data(dv).sum() / self._group_data(weights).sum() these attributes were dropped, so I held onto them to put them back on.

codecov-commenter · 2022-08-18T01:04:30Z

Codecov Report

Merging #320 (61e1700) into main (6343139) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #320   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1191      1196    +5     
=========================================
+ Hits          1191      1196    +5

Impacted Files	Coverage Δ
xcdat/temporal.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

pochedls · 2022-08-18T01:06:54Z

xcdat/temporal.py

            dv *= self._weights
-            dv = self._group_data(dv).sum()
+            # cast the weights to match the dv.shape / dims
+            weights, x = xr.broadcast(self._weights, dv)


weights is one-dimensional (with a temporal dimension), but I want to make sure that grid cells that are np.nan in dv receive zero weight. Since dv is multidimensional, I need my weights array to have the same shape in order to reflect this.

Incorporated this PR comment in the code comment.

xcdat/temporal.py

pochedls · 2022-08-18T01:09:15Z

xcdat/temporal.py

+            # ensure missing data receives no weight
+            weights = xr.where(np.isnan(dv), 0.0, weights)
+            # perform weighted average
+            dv = self._group_data(dv).sum() / self._group_data(weights).sum()


New weighted average corresponding to WA = sum(data*weights) / sum(weights). The denominator wasn't included before because it was always one. This is not true when there is missing data, so we need to include the denominator.

Incorporated this PR comment in the code comment.

pochedls · 2022-08-18T01:15:04Z

xcdat/temporal.py

+        dv[self._dim].attrs = lt_attrs
+        dv[self._dim].encoding = lt_encoding


These lines used to be

dv[self._dim].attrs = self._labeled_time.attrs dv[self._dim].encoding = self._labeled_time.encoding

Somehow this information was getting dropped in the new weighted average call (dv = self._group_data(dv).sum() / self._group_data(weights).sum()).

I'm not sure if this can be improved...I store lt_attrs and lt_encoding in two places: under if self._weighted: and else because these values are only populated after ._group_data() (in the else statement these values are populated after calling _get_weights() because _get_weights() calls _group_data().

Refactored this in my recent review

* Updates the dependencies in the conda env yml files * Updates the pre-commit hook versions * Address `mypy` warnings * Remove `type: ignore` inline comments to silence mypy warnings related to xarray, can address in the future * Remove unused `type: ignore` based on latest version of mypy

tomvothecoder · 2022-08-18T23:35:31Z

Thanks for opening this PR. FYI, I realized I won't have time to review this until next week.

- Preserve data variable attributes using `xr.set_options(keep_attrs=True)` - Reuse `self._labeled_time` if it is already set in a previous call to `_group_data()` - Update group average tests to check data variable test attr is preserved

tomvothecoder

Hi Steve, I did some minor refactoring and updated the code comments based on your PR comments.

The tests pass, so you can merge if you are in agreement with the changes.

Thanks!

tomvothecoder · 2022-08-24T22:38:11Z

xcdat/temporal.py

+            # Reuse the labeled time coordinates if they are already generated.
+            # This specifically applies when generating weights, which calls
+            # this method again after it has already been called to group
+            # the data variable with the same time coordinates.
+            if not hasattr(self, "_labeled_time"):
+                self._labeled_time = self._label_time_coords(dv[self._dim])


In one of your previous comments, you mentioned that the self._labeled_time attrs were being dropped.

This was because we call _group_data() twice, one for dv and one for weights.
weights doesn't preserve the time dimension attrs because it is unnecessary. However, this caused a problem where the second call to _group_data() with weights overwrites self._labeled_time (resulting in attrs being dropped).

This new conditional fixes resolves this problem.

tomvothecoder · 2022-08-24T22:38:48Z

xcdat/temporal.py

+        dv[self._dim].attrs = self._labeled_time.attrs
+        dv[self._dim].encoding = self._labeled_time.encoding


We can restore these lines of code since self._labeled_time attrs are preserved with the new conditional in _group_data().

tomvothecoder · 2022-08-24T22:40:17Z

xcdat/temporal.py

+            with xr.set_options(keep_attrs=True):
+                dv = self._group_data(dv).sum() / self._group_data(weights).sum()


xr.set_options(keep_attrs=True) will preserve the attrs for the dv. It does not restore the attrs for the time dimension, so we manually restore those in lines of code below.

tomvothecoder · 2022-08-24T22:40:53Z

xcdat/temporal.py


        # After grouping and aggregating the data variable values, the
        # original time dimension is replaced with the grouped time dimension.
        # For example, grouping on "year_season" replaces the time dimension
        # with "year_season". This dimension needs to be renamed back to
        # the original time dimension name before the data variable is added
-        # back to the dataset so that the CF compliant name is maintained.
+        # back to the dataset so that the original name is preserved.


Updated this comment since the name of the time dimension isn't always necessarily CF compliant.

tomvothecoder · 2022-08-24T22:41:18Z

tests/test_temporal.py

@@ -418,6 +418,7 @@ def setup(self):
            ),
            coords={"time": self.ds.time, "lat": self.ds.lat, "lon": self.ds.lon},
            dims=["time", "lat", "lon"],
+            attrs={"test_attr": "test"},


Now test that the attrs for ts is preserved.

tomvothecoder · 2022-08-24T22:41:34Z

tests/test_temporal.py

@@ -467,6 +469,7 @@ def test_weighted_annual_averages(self):

        xr.testing.assert_allclose(result, expected)
        assert result.ts.attrs == expected.ts.attrs
+        assert result.time.attrs == expected.time.attrs


Also check the time attrs are preserved.

tomvothecoder · 2022-08-24T22:41:59Z

xcdat/temporal.py

+            # achieve this, first broadcast the one-dimensional (temporal
+            # dimension) shape of the `weights` DataArray to the
+            # multi-dimensional shape of its corresponding data variable.
+            weights, _ = xr.broadcast(self._weights, dv)


Replace the unused x variable with _.

…data

tomvothecoder self-requested a review August 18, 2022 01:02

pochedls commented Aug 18, 2022

View reviewed changes

tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. Priority: High labels Aug 18, 2022

tomvothecoder assigned pochedls Aug 18, 2022

pochedls commented Aug 18, 2022

View reviewed changes

xcdat/temporal.py Show resolved Hide resolved

pochedls commented Aug 18, 2022

View reviewed changes

tomvothecoder force-pushed the main branch from afd7a82 to d3511aa Compare August 18, 2022 17:07

tomvothecoder and others added 2 commits August 18, 2022 10:58

initial fix for #304

63e9537

tomvothecoder force-pushed the bugfix/319-incorrect-group-avg-with-missing-data branch from 428675f to 63e9537 Compare August 18, 2022 17:58

Revert accidental rebase edit

3f5b7cb

Refactor _group_average()

8f8494a

- Preserve data variable attributes using `xr.set_options(keep_attrs=True)` - Reuse `self._labeled_time` if it is already set in a previous call to `_group_data()` - Update group average tests to check data variable test attr is preserved

tomvothecoder approved these changes Aug 24, 2022

View reviewed changes

Merge branch 'main' into bugfix/319-incorrect-group-avg-with-missing-…

61e1700

…data

pochedls merged commit 007fb8a into main Aug 25, 2022

pochedls deleted the bugfix/319-incorrect-group-avg-with-missing-data branch August 25, 2022 02:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix incorrect results for group averaging with missing data #320

fix incorrect results for group averaging with missing data #320

pochedls commented Aug 18, 2022

pochedls Aug 18, 2022

pochedls Aug 18, 2022

codecov-commenter commented Aug 18, 2022 •

edited

Loading

pochedls Aug 18, 2022

tomvothecoder Aug 24, 2022

pochedls Aug 18, 2022

tomvothecoder Aug 24, 2022 •

edited

Loading

pochedls Aug 18, 2022

tomvothecoder Aug 24, 2022

tomvothecoder commented Aug 18, 2022

tomvothecoder left a comment

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

tomvothecoder Aug 24, 2022

		data=np.array([[[2.0]], [[0.0]], [[1.0]], [[1.0]], [[2.0]]]),
		data=np.array([[[2.0]], [[np.nan]], [[1.0]], [[1.0]], [[2.0]]]),

		dv[self._dim].attrs = lt_attrs
		dv[self._dim].encoding = lt_encoding

		dv[self._dim].attrs = self._labeled_time.attrs
		dv[self._dim].encoding = self._labeled_time.encoding

		with xr.set_options(keep_attrs=True):
		dv = self._group_data(dv).sum() / self._group_data(weights).sum()

fix incorrect results for group averaging with missing data #320

fix incorrect results for group averaging with missing data #320

Conversation

pochedls commented Aug 18, 2022

Description

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Aug 18, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomvothecoder Aug 24, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomvothecoder commented Aug 18, 2022

tomvothecoder left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Aug 18, 2022 •

edited

Loading

tomvothecoder Aug 24, 2022 •

edited

Loading