Difflib and python-Levenshtein give different ratios in some cases #128

theodickson · 2016-08-12T10:18:51Z

This is probably known and accepted, but since it's not mentioned in the docs I'll raise this anyway. In certain edge cases, difflib and python-Levenshtein give different ratios.
E.g.:

>>> from fuzzywuzzy import fuzz, StringMatcher
>>> import difflib

# As long as python-Levenshtein is available, it will be used for the following:
>>> fuzz.ratio("ababab", "aaaaab")
67

# Switch to difflib:
>>> fuzz.SequenceMatcher = difflib.SequenceMatcher
>>> fuzz.ratio("ababab", "aaaaab")
33

I don't know how python-Levenshtein works, but difflib first chooses the left-most longest block in the first sequence that matches any block in the second sequence. Then among the longest matches in the second sequence to this chosen block, it will choose the left-most.

What this means is that it is forced to choose the first "ab" in the first string, and then matches it to the only "ab" in the second, at the end. This means it cannot recurse either to the left or the right to look for more matching blocks, since this match is at the very beginning of the first string and at the very end of the second. So the ratio calculated is 33, since the two matched characters are counted in both strings, giving 2 × 2 / 12 = one third of the total characters.
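The first step described above can be checked directly against difflib. This is a minimal sketch using only the standard library; the variable names are illustrative, and the score is rounded to a whole percentage on the assumption that this mirrors how fuzz.ratio reports it.

```python
import difflib

a, b = "ababab", "aaaaab"
sm = difflib.SequenceMatcher(None, a, b)

# Step 1: the longest matching block over the full range of both strings.
first = sm.find_longest_match(0, len(a), 0, len(b))
print(first)  # Match(a=0, b=4, size=2)

# What remains on each side of that block for the recursion to work with.
left = (a[:first.a], b[:first.b])
right = (a[first.a + first.size:], b[first.b + first.size:])
print(left)   # ('', 'aaaa')
print(right)  # ('abab', '')

# One string is empty on each side, so no further blocks can be found.
matched = sum(m.size for m in sm.get_matching_blocks())
score = round(100 * 2 * matched / (len(a) + len(b)))
print(score)  # 33
```

Because the left remainder of the first string and the right remainder of the second are both empty, the recursion stops after a single two-character block.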

I'm assuming that python-Levenshtein decides which matches to choose based not just on maximality and left-most-ness, but also on whether the chosen pair of matches allows subsequent recursion. It scores 67, so it must have matched eight of the 12 characters, i.e. "ab" plus two further "a"s in each sequence, which means it must have chosen the last "ab" in the first string. Either that, or it can recurse to opposite sides in the two sequences, but that surely isn't happening.
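One way to arrive at eight matched characters without any block-based recursion is to count a longest common subsequence. The sketch below is pure Python and does not call python-Levenshtein. It rests on the assumption that the library's ratio weights a substitution as a deletion plus an insertion, in which case the score reduces to 2 × LCS / (total length). The helper names lcs_length and indel_ratio are made up for illustration.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, tch in enumerate(t, 1):
            if ch == tch:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_ratio(s, t):
    """Hypothetical stand-in for a Levenshtein-style ratio (assumption, see above)."""
    total = len(s) + len(t)
    if total == 0:
        return 100
    return round(100 * 2 * lcs_length(s, t) / total)


print(lcs_length("ababab", "aaaaab"))   # 4, e.g. the subsequence "aaab"
print(indel_ratio("ababab", "aaaaab"))  # 67
print(indel_ratio("ababab", "abaaaa"))  # 67
```

Under that assumption the matched characters do not need to form contiguous blocks, which would explain why the score does not depend on which "ab" is picked first.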

To show this, if we change the second sequence to "abaaaa", difflib will also score 67 (since it matches the first three characters of each sequence, "aba", then recurses to the right and matches one more "a"). See as follows:

>>> fuzz.ratio("ababab", "abaaaa")
67

# And switching back to python-Levenshtein, no change:
>>> fuzz.SequenceMatcher = StringMatcher.StringMatcher
>>> fuzz.ratio("ababab", "abaaaa")
67
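For this second pair, the blocks difflib ends up with can be listed with get_matching_blocks. This is a small standard-library sketch; the rounding to a whole percentage is again an assumption about how fuzz.ratio reports the score.

```python
import difflib

a, b = "ababab", "abaaaa"
sm = difflib.SequenceMatcher(None, a, b)

# (start in a, start in b, matched text) for every non-empty block.
blocks = [(m.a, m.b, a[m.a:m.a + m.size])
          for m in sm.get_matching_blocks() if m.size]
print(blocks)  # [(0, 0, 'aba'), (4, 3, 'a')]

matched = sum(len(text) for _, _, text in blocks)
score = round(100 * 2 * matched / (len(a) + len(b)))
print(score)   # 67
```

Here the first block is "aba" at the start of both strings, which leaves non-empty remainders on the right of each, so the recursion finds one more "a" and reaches four matched characters.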


josegonzalez · 2016-09-14T15:53:25Z

Yeah, I guess they are slightly different sometimes. For our use-case, python-Levenshtein works well, and it's what we suggest people use in production.

Closes #128

siulkilulki · 2021-01-23T23:40:21Z

The difference arises because difflib uses the Ratcliff-Obershelp algorithm, not the Levenshtein distance. The readme should probably be updated, because the Levenshtein distance is not always used.

josegonzalez closed this as completed Sep 14, 2016

josegonzalez added a commit that referenced this issue Sep 14, 2016

Mention that difflib and levenshtein results may differ

6353e25

Closes #128

tmplt mentioned this issue Apr 6, 2017: Progress and Todo tmplt/fuzzywuzzy#1 (Closed, 27 tasks)

This was referenced Nov 6, 2019:
Index error in 1.1 ONSBigData/labelpropagation_clothing#4 (Closed)
python levenshtien ONSBigData/labelpropagation_clothing#2 (Closed)

molkoback mentioned this issue Jan 4, 2020: Wrong detection for local anime z411/trackma#469 (Closed)

maxbachmann mentioned this issue Mar 28, 2020: Inconsistent score from fuzz.ratio() between linux and windows #271 (Closed)

rnegron mentioned this issue Nov 30, 2021: Wrong (non symmetric) results for some strings if Levenshtein module is not present seatgeek/thefuzz#5 (Open)

abubelinha mentioned this issue May 10, 2022: wrong link in main page seatgeek/thefuzz#26 (Open)

R-N mentioned this issue Jun 5, 2023: The library doesn't use Levenshtein at all, or at least it's inconsistent seatgeek/thefuzz#53 (Open)
