This repository has been archived by the owner on Aug 26, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 874
Difflib and python-Levenshtein give different ratios in some cases #128
Comments
Yeah, I guess they are slightly different sometimes. For our use-case, python-Levenshtein works well, and it's what we suggest people use in production. |
The difference is because |
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
This is probably known and accepted, but since it's not mentioned in the docs I'll raise this anyway. In certain edge cases, difflib and python-Levenshtein give different ratios.
E.g.:
I don't know how python-Levenshtein works, but difflib first chooses the left-most longest block in the first sequence that matches any block in the second sequence. Then among the longest matches in the second sequence to this chosen block, it will choose the left-most.
What this means is that it is forced to choose the first "ab" in the first string, and then matches it to the only "ab" in the second, at the end. This means it cannot recurse either to the left or the right to look for more matching blocks, since this match is at the very beginning of the first string and at the very end of the second. So the ratio calculated is 33, since it has matched one third (4/12) of the total characters.
I'm assuming that in python-Levenshtein, it makes a decision about which matches to choose based not just on maximality and left-most-ness, but also on whether the pair of matches chosen allows subsequent recursion. This is because it scores 67, so must have matched eight of 12 characters, i.e. it must have matched "ab" and then two subsequent "a"s in each sequence, meaning it must have chosen the last "ab" in the first string. Either this or it can recurse to opposite sides in the two sequences but that surely isn't happening.
To show this, if we change the second sequence to "abaaaa", difflib will also score 67 (since it matches the first two characters of each sequence then recurses to the right). See as follows:
The text was updated successfully, but these errors were encountered: