Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support finding data with typos #153

Merged
merged 1 commit into from
Oct 16, 2024

Conversation

overengineered
Copy link
Contributor

What: Increases tolerance for typos in data and/or query text

Why: match sorter currently finds data with some typos, but others. E.g. "canceled" would find "cancelled", but not "cacneled".

How: getClosenessRanking calculator allows skipping one of the characters from the query. This allows finding data that contains the most common typos: swapped letters, missing letters and diverging spelling (localisation/localization). Matches that have missing letters have reduced ranking.

Checklist:

  • Documentation N/A
  • Tests
  • Ready to be merged

Since this change only affects searches where no good match (like full word, or exact start or word) gets found, I think there's no need to have add to options some flag to disable/enable this feature. If searching ua and finding United States of America is good enough right now, then my changes that improve tolerance for typos should be also acceptable.

But I can see that there could be some value in providing configuration for this new feature - and I am open to do it.

Copy link
Owner

@kentcdodds kentcdodds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like this, but worry about the impact on filtering. I worry that it may end up matching far too much and I think it may be worth adding another ranking level for this. What do you think?

@overengineered
Copy link
Contributor Author

I think ranking level is not really appropriate for controlling this functionality. As an example for query defer with proposed typo tolerance the following matches are ordered like this:

lawyer defends victims
United States of America
indestructible dog crate

All those results are not good matches for the query, but when ordering for similarity to defer in my mind this order seems correct. Only the second match would be findable without typo tolerance. Essentially, typo tolerance adds both "better quality" results and "lower quality" results. So it seems that it would be best for typo tolerance to be in MATCHES ranking and let scoring function do its work.

In my view the MATCHES ranking is for results where no good matches were found and we are scraping the bottom of the barrel to see if something might be similar. Even current approach for MATCHES ranking can find lots of weird matches - and proposed changes do not drastically change behaviour. If spurious matches are not appropriate, increasing threshold seems like the best way to turn it off.

But if this new matching should be controlled separately, I would do it like this:

  interface MatchSorterOptions<ItemType = unknown> {
    <..>
    sorter?: Sorter<ItemType>
+   fuzzy?: 'scattered' | 'partial'
  }

I think true / false do work here as current MATCHES behaviour is an already little bit fuzzy. I think these constants fittingly describe behaviour differences.

@overengineered
Copy link
Contributor Author

overengineered commented Oct 11, 2024

I just had an idea to allow typo tolerance with ranking like this:

- MATCHES
+ TIGHT_PARTIAL
+ SCATTERED

Basically typo tolerant matching would be higher than the current MATCHES matching. However my proposed code change would have to be modified so that it allowed skipping letters only if matching letters are packed tightly. I would imagine restricting finding all the letters except one within a span that is just a few letters longer than initial query would almost always be higher quality match than finding all letters, but widely scattered.

@kentcdodds
Copy link
Owner

I think you're right. This is fine. If people find it's problematic then we can ship a breaking change with new levels.

@kentcdodds kentcdodds merged commit 8fc0645 into kentcdodds:main Oct 16, 2024
4 checks passed
Copy link

🎉 This PR is included in version 6.4.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@diegohaz
Copy link

FYI, this change caused one of our automated tests to fail: https://github.com/ariakit/ariakit/pull/4218/files#diff-ffb64734a719f82219f5f82d3ee8c52e111c4a560940b6bb14b60ab6d3e74de0

This isn't a problem. I'll update the test. The UI works well, and I believe it's a good feature. I'm sharing this so you might consider making it optional to avoid any friction until the next major.

@kentcdodds
Copy link
Owner

@diegohaz, I think I'll just do a major version bump and deprecate 6.4.0

kentcdodds added a commit that referenced this pull request Oct 17, 2024
#153 (comment)

BREAKING CHANGE: The last release was arguably a breaking change so we're going to trigger this major version and deprecate the 6.4.0.
@kentcdodds
Copy link
Owner

I've deprecated 6.4.0 and released 7.0.0: https://github.com/kentcdodds/match-sorter/releases/tag/v7.0.0

renovate bot added a commit to ariakit/ariakit that referenced this pull request Oct 17, 2024
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [match-sorter](https://redirect.github.com/kentcdodds/match-sorter) |
[`6.3.4` ->
`7.0.0`](https://renovatebot.com/diffs/npm/match-sorter/6.3.4/7.0.0) |
[![age](https://developer.mend.io/api/mc/badges/age/npm/match-sorter/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/match-sorter/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/match-sorter/6.3.4/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/match-sorter/6.3.4/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>kentcdodds/match-sorter (match-sorter)</summary>

###
[`v7.0.0`](https://redirect.github.com/kentcdodds/match-sorter/releases/tag/v7.0.0)

[Compare
Source](https://redirect.github.com/kentcdodds/match-sorter/compare/v6.4.0...v7.0.0)

This has the same contents as v6.4.0:

##### Features

- support finding data with typos
([#&#8203;153](https://redirect.github.com/kentcdodds/match-sorter/issues/153))
([8fc0645](https://redirect.github.com/kentcdodds/match-sorter/commit/8fc0645af4b2dfdbb53dd9c1c088ab52cd997f5f))

##### Bug Fixes

- **release:** manually release a major version
([d9b7dab](https://redirect.github.com/kentcdodds/match-sorter/commit/d9b7dab7d10f65db0dbc7df5788ee6e81eb26377)),
closes
[/github.com/kentcdodds/match-sorter/pull/153#issuecomment-2417996730](https://redirect.github.com//github.com/kentcdodds/match-sorter/pull/153/issues/issuecomment-2417996730)

##### BREAKING CHANGES

- **release:** The last release was arguably a breaking change so we're
going to trigger this major version and deprecate the 6.4.0.

###
[`v6.4.0`](https://redirect.github.com/kentcdodds/match-sorter/releases/tag/v6.4.0)

[Compare
Source](https://redirect.github.com/kentcdodds/match-sorter/compare/v6.3.4...v6.4.0)

##### Features

- support finding data with typos
([#&#8203;153](https://redirect.github.com/kentcdodds/match-sorter/issues/153))
([8fc0645](https://redirect.github.com/kentcdodds/match-sorter/commit/8fc0645af4b2dfdbb53dd9c1c088ab52cd997f5f))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/ariakit/ariakit).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC4xMjAuMSIsInVwZGF0ZWRJblZlciI6IjM4LjEyMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Haz <[email protected]>
kodiakhq bot pushed a commit to timelessco/recollect that referenced this pull request Oct 20, 2024
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| [match-sorter](https://redirect.github.com/kentcdodds/match-sorter) | [`^6.3.1` -> `^7.0.0`](https://renovatebot.com/diffs/npm/match-sorter/6.3.4/7.0.0) | [![age](https://developer.mend.io/api/mc/badges/age/npm/match-sorter/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/match-sorter/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/match-sorter/6.3.4/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/match-sorter/6.3.4/7.0.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) |

---

### Release Notes

<details>
<summary>kentcdodds/match-sorter (match-sorter)</summary>

### [`v7.0.0`](https://redirect.github.com/kentcdodds/match-sorter/releases/tag/v7.0.0)

[Compare Source](https://redirect.github.com/kentcdodds/match-sorter/compare/v6.4.0...v7.0.0)

This has the same contents as v6.4.0:

##### Features

-   support finding data with typos ([#&#8203;153](https://redirect.github.com/kentcdodds/match-sorter/issues/153)) ([8fc0645](https://redirect.github.com/kentcdodds/match-sorter/commit/8fc0645af4b2dfdbb53dd9c1c088ab52cd997f5f))

##### Bug Fixes

-   **release:** manually release a major version ([d9b7dab](https://redirect.github.com/kentcdodds/match-sorter/commit/d9b7dab7d10f65db0dbc7df5788ee6e81eb26377)), closes [/github.com/kentcdodds/match-sorter/pull/153#issuecomment-2417996730](https://redirect.github.com//github.com/kentcdodds/match-sorter/pull/153/issues/issuecomment-2417996730)

##### BREAKING CHANGES

-   **release:** The last release was arguably a breaking change so we're going to trigger this major version and deprecate the 6.4.0.

### [`v6.4.0`](https://redirect.github.com/kentcdodds/match-sorter/releases/tag/v6.4.0)

[Compare Source](https://redirect.github.com/kentcdodds/match-sorter/compare/v6.3.4...v6.4.0)

**DEPRECATED**: This was arguably a breaking change so we've deprecated this version and released a major version.

##### Features

-   support finding data with typos ([#&#8203;153](https://redirect.github.com/kentcdodds/match-sorter/issues/153)) ([8fc0645](https://redirect.github.com/kentcdodds/match-sorter/commit/8fc0645af4b2dfdbb53dd9c1c088ab52cd997f5f))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 4am on Monday" in timezone Asia/Kolkata, Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] If you want to rebase/retry this PR, check this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/timelessco/recollect).
@shshaw
Copy link
Contributor

shshaw commented Nov 4, 2024

This new typo matching definitely seems too broad. Before my search setup was returning relevant results, but the typo inclusion now brings up many irrelevant results.

Search:
'index'

Results Before Typo Inclusion:

  • index.html

Results After Typo Inclusion:

  • index.html
  • Show Pinned Items
  • Change Indentation Depth
  • Change Indentation Type
  • Edit Pen Description

Most of those extra results are not really logical. While it is great that a search for idnex will still match index.html, I'd like more control to ensure the results feel accurate to the search.

Are there any ways to control the typo matching in the new version?

@kentcdodds
Copy link
Owner

Yeah, that's really really bad. I'm open to anyone who wants to make a PR to simply revert that change and we'll release another major version bump to remove this functionality. I never should have merged that.

@overengineered
Copy link
Contributor Author

I think I have an idea how to improve typo tolerance: skipping letters can be allowed only if all other letters from query are found and those found letters are clustered in a short substring. I can make such PR. This would make getClosenessRanking function code a fair bit more complicated though. Are you willing to salvage this feature by having to maintain more complicated code, or do you want to look at PR before deciding, or do you think it's not worth the effort?

@shshaw
Copy link
Contributor

shshaw commented Nov 4, 2024

@overengineered If typo tolerance is going to be added, it's crucial that it's optional/configurable. Should the detection be complicated and potentially weighty, then it should be written in a way that lets it get tree-shaken out if typo tolerance isn't enabled.

Probably best to spec things out further before getting deep into any implementation.

shshaw added a commit to shshaw/match-sorter that referenced this pull request Nov 4, 2024
@overengineered
Copy link
Contributor Author

I agree that typo tolerance should be optional. Complexity would be ~30 additional lines in 560 line file - not worth the effort to support tree shaking IMO.

@shshaw
Copy link
Contributor

shshaw commented Nov 4, 2024

I'm open to anyone who wants to make a PR to simply revert that change and we'll release another major version bump to remove this functionality. I never should have merged that.

PR for revert. Should meet all the contribution guidelines. #155

kentcdodds pushed a commit that referenced this pull request Nov 4, 2024
This reverts commit 8fc0645.

BREAKING CHANGE: We've reverted the last release which added support for typos because it was too generous
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants