Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Difflib and python-Levenshtein give different ratios in some cases #128

Closed
theodickson opened this issue Aug 12, 2016 · 2 comments
Closed

Comments

@theodickson
Copy link

This is probably known and accepted, but since it's not mentioned in the docs I'll raise this anyway. In certain edge cases, difflib and python-Levenshtein give different ratios.
E.g.:

>>> from fuzzywuzzy import fuzz, StringMatcher
>>> import difflib

#As long as python-Levenshtein is available, that will be used for the following:
>>> fuzz.ratio("ababab", "aaaaab")
67

#Switch to difflib:
>>> fuzz.SequenceMatcher = difflib.SequenceMatcher
>>> fuzz.ratio("ababab", "aaaaab")
33

I don't know how python-Levenshtein works, but difflib first chooses the left-most longest block in the first sequence that matches any block in the second sequence. Then among the longest matches in the second sequence to this chosen block, it will choose the left-most.

What this means is that it is forced to choose the first "ab" in the first string, and then matches it to the only "ab" in the second, at the end. This means it cannot recurse either to the left or the right to look for more matching blocks, since this match is at the very beginning of the first string and at the very end of the second. So the ratio calculated is 33, since it has matched one third (4/12) of the total characters.

I'm assuming that in python-Levenshtein, it makes a decision about which matches to choose based not just on maximality and left-most-ness, but also on whether the pair of matches chosen allows subsequent recursion. This is because it scores 67, so must have matched eight of 12 characters, i.e. it must have matched "ab" and then two subsequent "a"s in each sequence, meaning it must have chosen the last "ab" in the first string. Either this or it can recurse to opposite sides in the two sequences but that surely isn't happening.

To show this, if we change the second sequence to "abaaaa", difflib will also score 67 (since it matches the first two characters of each sequence then recurses to the right). See as follows:

>>> fuzz.ratio("ababab", "abaaaa")
67

#And switching pack to python-Levenshtein, no change:
>>> fuzz.SequenceMatcher = fuzzywuzzy.StringMatcher.StringMatcher
>>> fuzz.ratio("ababab", "abaaaa")
67
@josegonzalez
Copy link
Contributor

Yeah, I guess they are slightly different sometimes. For our use-case, python-Levenshtein works well, and it's what we suggest people use in production.

@siulkilulki
Copy link

The difference is because difflib uses the Ratcliff-Obershelp algorithm, not the Levenshtein distance. Probably the readme should be updated, because not always Levenshtein distance is used.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants