-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitious strings #69578
Comments
Consider the following snippet: import difflib
first = u'location,location,location'
for second in (
u'location.location.location', # two periods (no commas)
u'location.location,location', # period after first
u'location,location.location', # period after second
u'location,location,location', # perfect match
):
edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist)) I would expect the second and third tests to give the same result, but in reality: comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077 The same results are received from Python 3.4. From experimenting, it seems that when the period comes after the first "location", the longest match found is the final two "locations" from the first string against the first two "locations" from the second string. In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio() In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks() In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()Out[33]: 0.9615384615384616 In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks() Using |
Do note that this is not an "edit distance" (like Levenshtein) algorithm. It works as documented instead ;-) , searching (in effect recursively) for the leftmost longest contiguous matching blocks. Both "leftmost" and "contiguous" are crucial to understanding what it does. I expect you're most surprised by the 2nd example, comparing: location,location,location The longest contiguous matching block is "location,location", in the first string at slice 0:17 (leftmost!) and in the second string at slice 9:26. That leaves a wholly unmatched ",location" at the end of the first string and a wholly unmatched "location." at the start of the second string. That's why the ratio is so low. We have a total of 17*2 = 34 matching characters (in the single matching block) out of 2*26 = 52 characters total, so the ratio is 34/52 ~= 0.65. Had it searched for the _rightmost_ longest matching blocks instead, then the trailing "location,location" pieces of both strings would have matched first, and then the leading "location" pieces of both strings, giving a ratio of about 0.96 instead. Indeed, that's essentially what happens in your 3rd example. .quick_ratio() and .real_quick_ratio() use entirely different algorithms, and - again as documented - their only purpose is to give an upper bound on what .ratio() returns (but do so faster). Anyway, a couple things to take from this:
Since it's functioning as designed & documented, I'm closing this as "not a bug". It may make sense to open a different "enhancement" issue instead asking for a different algorithm(s). |
BTW, the "leftmost longest contiguous" bit is messy to explain, so the main part of the docs don't explain it all (it's of no interest to 99.9% of users). Instead it's formally defined in the .find_longest_match() docs: """ |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: