Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated hash key generation to handle missing encoding dtype. #167

Merged

Conversation

david-sh-csiro
Copy link
Collaborator

Updated hash key generation and added tests to validate the fix.

…added tests to validate fix and updated the development change notes.
@david-sh-csiro david-sh-csiro linked an issue Dec 12, 2024 that may be closed by this pull request
Copy link
Contributor

@mx-moth mx-moth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue seems to be due to pydata/xarray#2436

docs/releases/development.rst Outdated Show resolved Hide resolved
@@ -1973,7 +1973,7 @@ def hash_geometry(self, hash: "hashlib._Hash") -> None:
# Include the dtype of the data array.
# A float array and an int array mean very different things,
# but could have identical byte patterns.
hash_string(hash, data_array.encoding['dtype'].name)
hash_string(hash, data_array.encoding.get('dtype', data_array.values.dtype).name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtype encoding value should be present most of the time. It being missing is an unexpected exception to the norm. Without the context of this bug checking data_array.values.dtype looks unnecessary.

Please add brief one or two line description of why we are doing this extra step, and include a link to the xarray bug report tracking this issue: pydata/xarray#2436

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment with issue link and a simple description.

@@ -200,3 +200,25 @@ def test_cache_key_cfgrid1d_sha1(datasets: pathlib.Path):
assert result_cache_key_cf is not None

assert result_cache_key_cf == cache_key_hash_cf1d_sha1


def test_cache_key_with_missing_data_array_encoding_type(datasets: pathlib.Path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is too specific. What we actually care about is whether datasets opened via xarray.open_mfdataset() produce a valid geometry hash and we still don't have a test for that. Please update this test to ensure that a dataset opened via xarray.open_mfdataset() produces a valid hash. This means the test will continue to pass even if xarray fix the issue with data_array.encoding being empty, and will fail if some other issue with xarray.open_mfdataset() pops up.

Copy link
Collaborator Author

@david-sh-csiro david-sh-csiro Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added multifile dataset tests. I've included both ugrid and cfgrid specific tests. The cfgrid fixture does lose the encoding as expected, but the test should handle both cases for if and when xarray fixes the issue.

@david-sh-csiro david-sh-csiro changed the title Updated hash key generation to handling missing encoding dtype. Updated hash key generation to handle missing encoding dtype. Dec 12, 2024
Copy link
Contributor

@mx-moth mx-moth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@david-sh-csiro david-sh-csiro merged commit 0579a7c into main Jan 21, 2025
15 checks passed
@david-sh-csiro david-sh-csiro deleted the 166-multifile-datasets-dont-work-with-cache-key-generation branch January 21, 2025 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multifile Datasets don't work with cache key generation
2 participants