Add method that generates a hash of geometry variables for caching #153

Open
mx-moth opened this issue Aug 22, 2024 · 0 comments
mx-moth commented Aug 22, 2024

A proposed feature from a discussion with @frizwi.

Some applications transform the geometry generated by emsarray, for example by triangulating the polygons, or otherwise derive application-specific data from the dataset geometry. These transformations can be computationally expensive. It would be beneficial if the derived data could be cached between runs of the application, and shared between different instances of the same dataset geometry, for example when a dataset is partitioned across time into multiple files.

This proposal is for a new module emsarray.operations.cache which can generate a hash of the geometry of a dataset to help applications cache transformed geometry data. Specifically:

A new module emsarray.operations.cache with the function:

def make_cache_key(dataset: xarray.Dataset, hash_type: type[hashlib._Hash] = hashlib.sha1) -> str:
    """
    Generate a key suitable for caching data derived from the geometry of a dataset.

    Parameters
    ----------
    dataset : xarray.Dataset
        The dataset to generate a cache key from
    hash_type : hash class
        The kind of hash to use.
        Defaults to `hashlib.sha1`, which is secure enough and fast enough for most purposes.
        The hash algorithm does not need to be cryptographically secure,
        so faster algorithms such as `xxhash` can be swapped in if desired.

    Returns
    -------
    cache_key : str
        A string suitable for use as a cache key.
        The string will be safe for use as part of a filename if data are to be cached to disk.

    Notes
    -----
    The cache key will depend on the Convention class,
    the emsarray version, and a hash of the geometry of the dataset.
    The specific structure of the cache key may change between emsarray versions
    and should not be relied upon.
    """

A new method Convention.hash_geometry(hash) to hash the dataset geometry:

class Convention:
    def hash_geometry(self, hash: hashlib._Hash) -> None:
        """
        Update the provided hash with all of the relevant geometry data for this dataset.
        This method must be deterministic based only on the dataset,
        resulting in the same hash if called multiple times,
        even across different instances of the Python interpreter.
        Ideally the hash should be independent of any variable data
        such that datasets partitioned across time into multiple files on disk
        will result in identical hashes.

        Parameters
        ----------
        hash : hashlib-style hash instance
            The hash instance to update with geometry data.
            This must follow the interface defined in :mod:`hashlib`.
        """

This method would generate a hash of the dataset geometry using whichever properties are relevant for that convention. A default implementation that hashes the name, data, and attributes of all variables in Convention.get_geometry_variables() might be appropriate. Specific Convention subclasses can either extend this to hash additional information such as any relevant global attributes, or provide an entirely separate implementation.
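
A default implementation along these lines might look like the following sketch. It uses only the standard library, so the `Variable` dataclass is a hypothetical stand-in for an xarray variable, and `hash_variables` is an illustrative name rather than part of emsarray. It assumes the geometry variables are supplied in a stable order.

```python
import hashlib
import json
import struct
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Hypothetical stand-in for a geometry variable (name, data, attributes)."""
    name: str
    dtype: str
    shape: tuple
    data: bytes
    attrs: dict = field(default_factory=dict)


def _update(hash, payload: bytes) -> None:
    # Length-prefix every field so that adjacent fields cannot run together:
    # ("ab", "c") and ("a", "bc") must not produce the same hash.
    hash.update(struct.pack("<Q", len(payload)))
    hash.update(payload)


def hash_variables(hash, variables) -> None:
    """Update `hash` with the name, data, and attributes of each variable."""
    for variable in variables:
        _update(hash, variable.name.encode("utf-8"))
        _update(hash, variable.dtype.encode("utf-8"))
        _update(hash, json.dumps(list(variable.shape)).encode("utf-8"))
        _update(hash, variable.data)
        # Sort attribute keys so the hash does not depend on dict ordering.
        attrs = json.dumps(variable.attrs, sort_keys=True)
        _update(hash, attrs.encode("utf-8"))


faces = Variable(
    name="face_node",
    dtype="int32",
    shape=(2, 3),
    data=struct.pack("<6i", 0, 1, 2, 1, 2, 3),
    attrs={"cf_role": "face_node_connectivity", "start_index": 0},
)
geometry_hash = hashlib.sha1()
hash_variables(geometry_hash, [faces])
```

Because only geometry variables are hashed, two files holding different time slices of the same mesh would produce the same digest, provided their geometry variables and attributes are identical.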

Applications can use this hash as a key when caching transformed data between different application instances or to reuse transformed data between different partitions of the same dataset:

# `cache` is the module proposed in this issue.
from emsarray.operations import cache, triangulate


class DatasetTriangulation:
    def __init__(self):
        self.triangulations = {}

    def triangulate(self, dataset):
        cache_key = cache.make_cache_key(dataset)
        if cache_key in self.triangulations:
            return self.triangulations[cache_key]
        triangulation = triangulate.triangulate_dataset(dataset)
        self.triangulations[cache_key] = triangulation
        return triangulation
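
To reuse results between runs of an application, the same key could name a file on disk. The following is a minimal sketch of one way an application might do that, not part of the proposal: `load_or_compute`, the `triangulation-` filename prefix, and the use of `pickle` are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_dir: Path, cache_key: str, compute):
    """Return the cached result for `cache_key`, computing it on a miss."""
    path = cache_dir / f"triangulation-{cache_key}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)

    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and rename it into place, so an interrupted
    # run does not leave a half-written cache entry behind.
    tmp_path = path.with_suffix(".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(result, f)
    tmp_path.replace(path)
    return result
```

An application would call this with something like `load_or_compute(cache_dir, cache.make_cache_key(dataset), lambda: triangulate.triangulate_dataset(dataset))`. Pickle files should only be loaded from a cache directory the application itself controls.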

This feature proposal does not include any caching inside emsarray itself, either in memory or on disk. Future extensions to emsarray may use these functions to cache and reuse the generated geometry for datasets.

This feature proposal does not include any methods that cache arbitrary derived geometry data. Actual cache implementations are left to applications.

@mx-moth mx-moth added the enhancement New feature or request label Aug 22, 2024
@mx-moth mx-moth self-assigned this Aug 22, 2024