Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated hash key generation to handle missing encoding dtype. #167

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/releases/development.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,5 @@
Next release (in development)
=============================

* ...
* Fix datasets hash_key generation when geometry encoding
is missing a dtype (:pr:`166`).
david-sh-csiro marked this conversation as resolved.
Show resolved Hide resolved
2 changes: 1 addition & 1 deletion src/emsarray/conventions/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1973,7 +1973,7 @@ def hash_geometry(self, hash: "hashlib._Hash") -> None:
# Include the dtype of the data array.
# A float array and an int array mean very different things,
# but could have identical byte patterns.
hash_string(hash, data_array.encoding['dtype'].name)
hash_string(hash, data_array.encoding.get('dtype', data_array.values.dtype).name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtype encoding value should be present most of the time. It being missing is an unexpected exception to the norm. Without the context of this bug checking data_array.values.dtype looks unnecessary.

Please add brief one or two line description of why we are doing this extra step, and include a link to the xarray bug report tracking this issue: pydata/xarray#2436

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment with issue link and a simple description.


# Include the size and shape of the data.
# 1D coordinate arrays are very different to 2D coordinate arrays,
Expand Down
22 changes: 22 additions & 0 deletions tests/operations/test_cache.py
Original file line number Diff line number Diff line change
Expand Up @@ -200,3 +200,25 @@ def test_cache_key_cfgrid1d_sha1(datasets: pathlib.Path):
assert result_cache_key_cf is not None

assert result_cache_key_cf == cache_key_hash_cf1d_sha1


def test_cache_key_with_missing_data_array_encoding_type(datasets: pathlib.Path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is too specific. What we actually care about is whether datasets opened via xarray.open_mfdataset() produce a valid geometry hash and we still don't have a test for that. Please update this test to ensure that a dataset opened via xarray.open_mfdataset() produces a valid hash. This means the test will continue to pass even if xarray fix the issue with data_array.encoding being empty, and will fail if some other issue with xarray.open_mfdataset() pops up.

Copy link
Collaborator Author

@david-sh-csiro david-sh-csiro Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added multifile dataset tests. I've included both ugrid and cfgrid specific tests. The cfgrid fixture does lose the encoding as expected, but the test should handle both cases for if and when xarray fixes the issue.

dataset_ugrid = emsarray.open_dataset(datasets / 'ugrid_mesh2d.nc')

data_array = dataset_ugrid.ems.topology.face_node_connectivity
data_array_dtype_dropped = data_array.copy()
data_array_dtype_dropped.encoding.pop('dtype', None)

with_dtype_hash = hashlib.sha1()
without_dtype_hash = hashlib.sha1()

dataset_ugrid.ems.hash_geometry(with_dtype_hash)
dataset_ugrid['mesh_face_node'] = data_array_dtype_dropped
dataset_ugrid.ems.hash_geometry(without_dtype_hash)

with_dtype_digest = with_dtype_hash.hexdigest()
without_dtype_digest = without_dtype_hash.hexdigest()

assert with_dtype_digest != without_dtype_digest

assert data_array_dtype_dropped.equals(data_array)
Loading