How to normalize by length? #10

ericphanson · 2020-10-13T13:21:43Z

In RegistryCI, we currently use 5 + sqrt(max(len1, len2)) as the length normalization, but that was chosen pretty arbitrarily. I'm not sure sqrt is the right scaling.

The optimal transport method should be basically additive: if there's double the mass that needs to be moved around, it will say the strings are twice as far apart. But b vs d is more obvious than doubled vs doudled; that is, when the swap is hidden in a bigger word, you notice it less. Yet visual_distance without normalization doesn't really care:

julia> using VisualStringDistances

julia> visual_distance("b", "d")
10.183114073458526

julia> visual_distance("doubled", "doudled")
10.183870016950062

That's because it's the same amount of mass to move around either way. So IMO it does make sense to account somehow for the length of the string.

However, I don't think it necessarily makes sense to just divide by the length (say, the max length) of the strings. For example, bb vs dd does seem visually more different than b vs d, but dividing by the length says otherwise:

julia> visual_distance("b", "d")
10.183114073458526

julia> visual_distance("bb", "dd")/2
9.953943822450068

(Why is it not exactly the same? Well, the regularization might mess things up a bit, and maybe having two characters allows a more clever rearrangement of mass than just doing the b -> d move twice.)
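To make the comparison concrete, here is a rough sketch that applies two scalings to the raw distances reported above. It assumes the normalization works by dividing the raw distance by a function of the max string length. The helper names (normalized, sqrt_offset, linear) are made up for illustration and are not part of VisualStringDistances; the distances are hard-coded from the REPL output above so that the snippet runs without the package.

```julia
# Raw distances as reported in this issue. The "bb"/"dd" value is recovered
# by undoing the division by 2 shown above.
reported = [
    ("b", "d", 10.183114073458526),
    ("bb", "dd", 2 * 9.953943822450068),
    ("doubled", "doudled", 10.183870016950062),
]

# Hypothetical helper (an assumption, not the package API):
# divide a raw distance by f(max length of the two strings).
normalized(d, s1, s2, f) = d / f(max(length(s1), length(s2)))

sqrt_offset(n) = 5 + sqrt(n)  # the scaling mentioned for RegistryCI
linear(n) = n                 # plain division by the max length

for (s1, s2, d) in reported
    println(rpad("$s1 vs $s2", 20),
            " sqrt_offset: ", round(normalized(d, s1, s2, sqrt_offset); digits=3),
            "   linear: ", round(normalized(d, s1, s2, linear); digits=3))
end
```

Under these assumptions, the sqrt-with-offset scaling ranks bb vs dd above b vs d, and b vs d above doubled vs doudled, while plain division by length puts bb vs dd slightly below b vs d.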

So then what is the "right" scaling? I think it depends a lot on how long the strings of interest are. For package names, we definitely aren't in the "asymptotic" regime of infinitely long strings (besides the fact that optimal transport isn't tractable if the problem is too big). So I think it might not actually matter whether the scaling is sqrt, linear with a small constant, or something else. What matters is getting something that seems useful in the regime of length 5 to 20.
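As a quick way to see how much the choice matters in that regime, the following sketch tabulates a few candidate divisors for lengths 5 to 20. Only 5 + sqrt(n) comes from the discussion above; the linear variant with slope 0.2 and the plain n divisor are arbitrary illustrations, not proposals from RegistryCI or VisualStringDistances.

```julia
# Candidate divisors as functions of the max string length n.
# Only the first comes from the discussion; the others are illustrative assumptions.
candidates = (
    ("5 + sqrt(n)", n -> 5 + sqrt(n)),
    ("5 + 0.2n",    n -> 5 + 0.2n),    # linear with a small, arbitrarily chosen slope
    ("n",           n -> float(n)),    # plain division by length
)

for n in (5, 10, 15, 20)
    # Build one row of "name = value" entries for this length.
    row = join((string(name, " = ", round(f(n); digits=2)) for (name, f) in candidates), "   ")
    println("n = ", lpad(n, 2), ":   ", row)
end
```

For this particular slope, the sqrt and linear-with-offset divisors stay within roughly 20% of each other over lengths 5 to 20, whereas plain division by n grows fourfold over the same range.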
