Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggest the correct name when no key matches in the dataset #9943

Merged
merged 9 commits into from
Jan 17, 2025

Conversation

Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Jan 12, 2025

I found the error when I make a typo on the dataset keys not so helpful. The truncated list of variables hides all the ones that I wanted to see. Instead, add a fuzzy matching function that does the typical "Did you mean X".

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
possibilities = ("blech", "gray_r", 1, None, (2, 56))
# possibilities = tuple(f"long_name_{i}" for i in range(0, 1000))
t = np.arange(80)
var = xr.Variable(dims=("time",), data=np.arange(t.size))
data_vars = {v: ("time", var) for v in possibilities}
ds = xr.Dataset(data_vars, coords={"T": t})
ds["bluch"]
# KeyError: "No variable named 'bluch'. Did you mean one of ('blech',)?"
# Performance is pretty much linear with number of variables:
%timeit xr.core.utils.did_you_mean("long_name9o5", tuple(f"long_name_{i}" for i in range(0, 1000)))
35.1 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit xr.core.utils.did_you_mean("long_name9o5", tuple(f"long_name_{i}" for i in range(0, 100)))
3.29 ms ± 104 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit xr.core.utils.did_you_mean("long_name9o5", tuple(f"long_name_{i}" for i in range(0, 10)))
320 μs ± 10.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# Number of suggestions has no significant effect:
%timeit xr.core.utils.did_you_mean("long_name9o5", tuple(f"long_name_{i}" for i in range(0, 1000)), n=10)
34.6 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Further reading:
python/cpython#16850
matplotlib/matplotlib#28115
https://en.wikipedia.org/wiki/Levenshtein_distance

@@ -1611,6 +1611,11 @@ def __getitem__(
return self._construct_dataarray(key)
except KeyError as e:
message = f"No variable named {key!r}. Variables on the dataset include {shorten_list_repr(list(self.variables.keys()), max_items=10)}"

best_guess = utils.did_you_mean(key, self.variables.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing idea. I would print the best guess first, and then any others so that's it's easy to see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should just remove the "Variables on the dataset include ..." ? They try to do the same thing I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you could sort the whole list by similarity and then print that (truncated as above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it prioritizes best_guess. If best_guess is empty you could be working in the wrong dataset, so it's still nice to get some kind of clue which dataset you're using.

xarray/core/utils.py Outdated Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

Big +1 on this; I'd also enjoy this as a user.

Is there any concern that some processes might be running ds[foo] in a hot loop, and this would cause a performance regression?

@headtr1ck
Copy link
Collaborator

Big +1 on this; I'd also enjoy this as a user.

Is there any concern that some processes might be running ds[foo] in a hot loop, and this would cause a performance regression?

We could add an LRU cache over did_you_mean if this is an issue

@max-sixty
Copy link
Collaborator

We could add an LRU cache over did_you_mean if this is an issue

Though I'm thinking that someone could query whether different keys exist; i.e. for x in very_long_list:; try:; ds[x]

Overall I say let's go ahead and we can reassess if we hear reports of slowdowns. Folks can use in / __contains__ for a fast path...

@Illviljan
Copy link
Contributor Author

Though I'm thinking that someone could query whether different keys exist; i.e. for x in very_long_list:; try:; ds[x]

I thought about this case as well, my initial idea was to just use ds.get(x) then. But turns out .get is one of the few ones we inherit from collections.Mapping and it's pretty much try:; ds[x]. We can probably do that one better as well if it becomes a problem.

@Illviljan Illviljan added the plan to merge Final call for comments label Jan 14, 2025
@Illviljan Illviljan merged commit 70997ef into pydata:main Jan 17, 2025
23 of 29 checks passed
@max-sixty
Copy link
Collaborator

Extremely cool!


[ins] In [2]: ds['ar']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/workspace/xarray/xarray/core/dataset.py in ?(self, name)
   1512             variable = self._variables[name]
   1513         except KeyError:
-> 1514             _, name, variable = _get_virtual_variable(self._variables, name, self.sizes)
   1515

KeyError: 'ar'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/workspace/xarray/xarray/core/dataset.py in ?(self, key)
   1622                 if isinstance(key, tuple):
   1623                     message += f"\nHint: use a list to select multiple variables, for example `ds[{list(key)}]`"
-> 1624                 raise KeyError(message) from e
   1625

~/workspace/xarray/xarray/core/dataset.py in ?(self, name)
   1512             variable = self._variables[name]
   1513         except KeyError:
-> 1514             _, name, variable = _get_virtual_variable(self._variables, name, self.sizes)
   1515

~/workspace/xarray/xarray/core/dataset.py in ?(variables, key, dim_sizes)
    219     split_key = key.split(".", 1)
    220     if len(split_key) != 2:
--> 221         raise KeyError(key)
    222

KeyError: 'ar'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-2-5bf134430f5a> in ?()
----> 1 ds['ar']

~/workspace/xarray/xarray/core/dataset.py in ?(self, key)
   1620
   1621                 # If someone attempts `ds['foo' , 'bar']` instead of `ds[['foo', 'bar']]`
   1622                 if isinstance(key, tuple):
   1623                     message += f"\nHint: use a list to select multiple variables, for example `ds[{list(key)}]`"
-> 1624                 raise KeyError(message) from e
   1625
   1626         if utils.iterable_of_hashable(key):
   1627             return self._copy_listed(key)

KeyError: "No variable named 'ar'. Did you mean one of ('air',)?"

Would definitely be up for doing more like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants