-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix incorrect results for group averaging with missing data #320
fix incorrect results for group averaging with missing data #320
Conversation
data=np.array([[[2.0]], [[0.0]], [[1.0]], [[1.0]], [[2.0]]]), | ||
data=np.array([[[2.0]], [[np.nan]], [[1.0]], [[1.0]], [[2.0]]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value that was 0.0
is taking a monthly average for a month with one data point, which is np.nan
. The average of one np.nan
value should be np.nan
.
xcdat/temporal.py
Outdated
dv_attrs = dv.attrs | ||
dv_name = dv.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I updated dv = self._group_data(dv).sum()
to dv = self._group_data(dv).sum() / self._group_data(weights).sum()
these attributes were dropped, so I held onto them to put them back on.
Codecov Report
@@ Coverage Diff @@
## main #320 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 14 14
Lines 1191 1196 +5
=========================================
+ Hits 1191 1196 +5
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
xcdat/temporal.py
Outdated
dv *= self._weights | ||
dv = self._group_data(dv).sum() | ||
# cast the weights to match the dv.shape / dims | ||
weights, x = xr.broadcast(self._weights, dv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weights
is one-dimensional (with a temporal dimension), but I want to make sure that grid cells that are np.nan
in dv
receive zero weight. Since dv
is multidimensional, I need my weights
array to have the same shape in order to reflect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorporated this PR comment in the code comment.
xcdat/temporal.py
Outdated
# ensure missing data receives no weight | ||
weights = xr.where(np.isnan(dv), 0.0, weights) | ||
# perform weighted average | ||
dv = self._group_data(dv).sum() / self._group_data(weights).sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New weighted average corresponding to WA = sum(data*weights) / sum(weights)
. The denominator wasn't included before because it was always one. This is not true when there is missing data, so we need to include the denominator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorporated this PR comment in the code comment.
xcdat/temporal.py
Outdated
dv[self._dim].attrs = lt_attrs | ||
dv[self._dim].encoding = lt_encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines used to be
dv[self._dim].attrs = self._labeled_time.attrs
dv[self._dim].encoding = self._labeled_time.encoding
Somehow this information was getting dropped in the new weighted average call (dv = self._group_data(dv).sum() / self._group_data(weights).sum()
).
I'm not sure if this can be improved...I store lt_attrs
and lt_encoding
in two places: under if self._weighted:
and else
because these values are only populated after ._group_data()
(in the else
statement these values are populated after calling _get_weights()
because _get_weights()
calls _group_data()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this in my recent review
* Updates the dependencies in the conda env yml files * Updates the pre-commit hook versions * Address `mypy` warnings * Remove `type: ignore` inline comments to silence mypy warnings related to xarray, can address in the future * Remove unused `type: ignore` based on latest version of mypy
428675f
to
63e9537
Compare
Thanks for opening this PR. FYI, I realized I won't have time to review this until next week. |
- Preserve data variable attributes using `xr.set_options(keep_attrs=True)` - Reuse `self._labeled_time` if it is already set in a previous call to `_group_data()` - Update group average tests to check data variable test attr is preserved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Steve, I did some minor refactoring and updated the code comments based on your PR comments.
The tests pass, so you can merge if you are in agreement with the changes.
Thanks!
# Reuse the labeled time coordinates if they are already generated. | ||
# This specifically applies when generating weights, which calls | ||
# this method again after it has already been called to group | ||
# the data variable with the same time coordinates. | ||
if not hasattr(self, "_labeled_time"): | ||
self._labeled_time = self._label_time_coords(dv[self._dim]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one of your previous comments, you mentioned that the self._labeled_time
attrs were being dropped.
This was because we call _group_data()
twice, one for dv
and one for weights
.
weights
doesn't preserve the time dimension attrs because it is unnecessary. However, this caused a problem where the second call to _group_data()
with weights
overwrites self._labeled_time
(resulting in attrs being dropped).
This new conditional fixes resolves this problem.
xcdat/temporal.py
Outdated
dv[self._dim].attrs = self._labeled_time.attrs | ||
dv[self._dim].encoding = self._labeled_time.encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can restore these lines of code since self._labeled_time
attrs are preserved with the new conditional in _group_data()
.
with xr.set_options(keep_attrs=True): | ||
dv = self._group_data(dv).sum() / self._group_data(weights).sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xr.set_options(keep_attrs=True)
will preserve the attrs for the dv
. It does not restore the attrs for the time dimension, so we manually restore those in lines of code below.
|
||
# After grouping and aggregating the data variable values, the | ||
# original time dimension is replaced with the grouped time dimension. | ||
# For example, grouping on "year_season" replaces the time dimension | ||
# with "year_season". This dimension needs to be renamed back to | ||
# the original time dimension name before the data variable is added | ||
# back to the dataset so that the CF compliant name is maintained. | ||
# back to the dataset so that the original name is preserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this comment since the name of the time dimension isn't always necessarily CF compliant.
@@ -418,6 +418,7 @@ def setup(self): | |||
), | |||
coords={"time": self.ds.time, "lat": self.ds.lat, "lon": self.ds.lon}, | |||
dims=["time", "lat", "lon"], | |||
attrs={"test_attr": "test"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now test that the attrs for ts
is preserved.
@@ -467,6 +469,7 @@ def test_weighted_annual_averages(self): | |||
|
|||
xr.testing.assert_allclose(result, expected) | |||
assert result.ts.attrs == expected.ts.attrs | |||
assert result.time.attrs == expected.time.attrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check the time attrs are preserved.
# achieve this, first broadcast the one-dimensional (temporal | ||
# dimension) shape of the `weights` DataArray to the | ||
# multi-dimensional shape of its corresponding data variable. | ||
weights, _ = xr.broadcast(self._weights, dv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace the unused x
variable with _
.
Description
This PR fixes an issue that leads to incorrect results for group averaging with missing data; the PR ensures that these data points receive no weight when averaging.
Checklist
If applicable: