
difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitious strings #69578

Closed
LewisHaley mannequin opened this issue Oct 13, 2015 · 3 comments
Labels
stdlib Python modules in the Lib dir

Comments


LewisHaley mannequin commented Oct 13, 2015

BPO 25391
Nosy @tim-one
Files
  • test.py: Script which demonstrates the inconsistency with repetitious strings
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-10-13.18:01:47.719>
    created_at = <Date 2015-10-13.11:07:40.504>
    labels = ['invalid', 'library']
    title = 'difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitious strings'
    updated_at = <Date 2015-10-13.18:19:45.290>
    user = 'https://bugs.python.org/LewisHaley'

    bugs.python.org fields:

    activity = <Date 2015-10-13.18:19:45.290>
    actor = 'tim.peters'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-10-13.18:01:47.719>
    closer = 'tim.peters'
    components = ['Library (Lib)']
    creation = <Date 2015-10-13.11:07:40.504>
    creator = 'Lewis Haley'
    dependencies = []
    files = ['40767']
    hgrepos = []
    issue_num = 25391
    keywords = []
    message_count = 3.0
    messages = ['252925', '252944', '252948']
    nosy_count = 3.0
    nosy_names = ['tim.peters', 'Lewis Haley', 'Allan Lewis']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue25391'
    versions = ['Python 2.7', 'Python 3.4']


    LewisHaley mannequin commented Oct 13, 2015

    Consider the following snippet:

    import difflib

    first = u'location,location,location'
    for second in (
        u'location.location.location',  # two periods (no commas)
        u'location.location,location',  # period after first
        u'location,location.location',  # period after second
        u'location,location,location',  # perfect match
    ):
        edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
        print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist))

    I would expect the second and third tests to give the same result, but in reality:

    comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
    comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
    comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
    comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

    The same results are produced on Python 3.4.

    From experimenting, it seems that when the period comes after the first "location", the longest match found is the first two "locations" from the first string against the final two "locations" from the second string.

    In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio()
    Out[31]: 0.6538461538461539

    In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks()
    Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

    In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()
    Out[33]: 0.9615384615384616

    In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks()
    Out[34]:
    [Match(a=0, b=0, size=17),
     Match(a=18, b=18, size=8),
     Match(a=26, b=26, size=0)]

    Using quick_ratio instead of ratio gives (what I consider to be) the correct result.
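    A minimal sketch of that comparison (not part of the original script; note that quick_ratio() is documented only as an upper bound on ratio(), so its agreement with the expected value here is incidental):

    import difflib

    a = u'location,location,location'
    b = u'location.location,location'  # period after first
    sm = difflib.SequenceMatcher(None, a, b)
    # ratio() works from the longest matching blocks; quick_ratio() only
    # compares character multisets, so ordering is ignored entirely.
    print(sm.ratio())        # ~0.6538
    print(sm.quick_ratio())  # ~0.9615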

    LewisHaley (mannequin) added the stdlib (Python modules in the Lib dir) label on Oct 13, 2015
    LewisHaley (mannequin) changed the title from "difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitous strings" to "difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitious strings" on Oct 13, 2015

    tim-one commented Oct 13, 2015

    Do note that this is not an "edit distance" (like Levenshtein) algorithm. It works as documented instead ;-) , searching (in effect recursively) for the leftmost longest contiguous matching blocks. Both "leftmost" and "contiguous" are crucial to understanding what it does.

    I expect you're most surprised by the 2nd example, comparing:

    location,location,location
    location.location,location

    The longest contiguous matching block is "location,location", in the first string at slice 0:17 (leftmost!) and in the second string at slice 9:26. That leaves a wholly unmatched ",location" at the end of the first string and a wholly unmatched "location." at the start of the second string. That's why the ratio is so low. We have a total of 17*2 = 34 matching characters (in the single matching block) out of 2*26 = 52 characters total, so the ratio is 34/52 ~= 0.65.
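    The same arithmetic can be reproduced directly from get_matching_blocks() (a minimal sketch using the strings from this issue):

    import difflib

    a = u'location,location,location'
    b = u'location.location,location'
    sm = difflib.SequenceMatcher(None, a, b)
    # Total size of the matched blocks: here just the single 17-character block.
    matched = sum(block.size for block in sm.get_matching_blocks())
    # ratio() is defined as 2.0 * M / T, with M the matched total and T the
    # combined length of both strings: 2 * 17 / 52 ~= 0.65.
    print(2.0 * matched / (len(a) + len(b)))
    print(sm.ratio())  # same value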

    Had it searched for the _rightmost_ longest matching blocks instead, then the trailing "location,location" pieces of both strings would have matched first, and then the leading "location" pieces of both strings, giving a ratio of about 0.96 instead. Indeed, that's essentially what happens in your 3rd example.

    .quick_ratio() and .real_quick_ratio() use entirely different algorithms, and - again as documented - their only purpose is to give an upper bound on what .ratio() returns (but do so faster).
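    A quick way to see that upper-bound relationship on these strings (a sketch, not from the original messages):

    import difflib

    sm = difflib.SequenceMatcher(None, u'location,location,location',
                                 u'location.location,location')
    # Both quick_ratio() and real_quick_ratio() are documented upper bounds
    # on ratio(); neither is meant to agree with it.
    print(sm.ratio(), sm.quick_ratio(), sm.real_quick_ratio())
    # -> 0.6538..., 0.9615..., 1.0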

    Anyway, a couple things to take from this:

    1. Some apps really want an "edit distance" kind of algorithm instead. I would welcome one, but nobody so far has volunteered to implement one. "A problem" is that there are many such algorithms (e.g., computational biology has driven many developments in this area). A toy edit-distance sketch follows this list, purely for illustration.

    2. It's far too late to change what current difflib functions implement. The primary use case for the "leftmost longest contiguous matches" design was to improve the quality (as perceived by human eyes) of diffs generated for text files containing code (like C or Python source code), and it works very well for that purpose most times. It doesn't work well for all purposes at all times. Note that "works well (as perceived by human eyes)" is largely subjective, so arguing about that won't get far ;-)
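    For illustration of point 1 only (this is not in difflib, and not a proposal): a plain Levenshtein distance charges each substitution the same cost regardless of position, so both single-period variants from the report score identically:

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance: insertions, deletions
        # and substitutions each cost 1.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    first = u'location,location,location'
    print(levenshtein(first, u'location.location,location'))  # 1
    print(levenshtein(first, u'location,location.location'))  # 1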

    Since it's functioning as designed & documented, I'm closing this as "not a bug". It may make sense to open a different "enhancement" issue instead asking for a different algorithm(s).


    tim-one commented Oct 13, 2015

    BTW, the "leftmost longest contiguous" bit is messy to explain, so the main part of the docs don't explain it all (it's of no interest to 99.9% of users). Instead it's formally defined in the .find_longest_match() docs:

    """
    If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.
    """
