How to normalize by length? #10

Open
ericphanson opened this issue Oct 13, 2020 · 0 comments

In RegistryCI, we currently normalize by 5 + sqrt(max(len1, len2)), but that was chosen pretty arbitrarily. I'm not sure sqrt is the right scaling.

The optimal transport method should be basically additive: if there's double the mass that needs to be moved around, it will say it's twice as far. But, say, b vs d is more obvious than doubled vs doudled; when the swap is hidden in a bigger word, you see it less. visual_distance without normalization doesn't really care, though:

```julia-repl
julia> using VisualStringDistances

julia> visual_distance("b", "d")
10.183114073458526

julia> visual_distance("doubled", "doudled")
10.183870016950062
```

That's because it's the same amount of mass to move around either way. So IMO it does make sense to account somehow for the length of the string.

However, I don't think it necessarily makes sense to just divide by the (say, max) length of the strings. For example, bb vs dd does seem visually more different than b vs d, but dividing by the length makes it come out slightly closer:

```julia-repl
julia> visual_distance("b", "d")
10.183114073458526

julia> visual_distance("bb", "dd")/2
9.953943822450068
```

(Why is it not exactly the same? Well, the regularization might mess things up a bit, and maybe having two copies allows some more clever rearrangement of mass than just doing the b -> d move twice.)
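To make the comparison concrete, here is a rough sketch that applies a few candidate normalizers to the raw distances reported above. It assumes the normalizer divides the raw distance, which is an assumption on my part. The names sqrt_shift, linear, linear_shift and normalized are made up for illustration, and the constants in linear_shift are arbitrary. It does not call VisualStringDistances; it only reuses the numbers already shown.

```julia
# Raw (unnormalized) distances reported above, with max(len1, len2) for each pair.
# The bb vs dd value is recovered by undoing the division by 2 shown above.
raw = [
    (pair = "b vs d",             len = 1, dist = 10.183114073458526),
    (pair = "bb vs dd",           len = 2, dist = 2 * 9.953943822450068),
    (pair = "doubled vs doudled", len = 7, dist = 10.183870016950062),
]

# Candidate normalizers (hypothetical names, for illustration only).
sqrt_shift(n) = 5 + sqrt(n)    # the scaling mentioned for RegistryCI
linear(n) = n                  # plain division by length
linear_shift(n) = 5 + n / 4    # "linear with a small constant"; constants are arbitrary

# Assumption: the normalizer divides the raw distance.
normalized(dist, len, f) = dist / f(len)

for f in (sqrt_shift, linear, linear_shift)
    println(f)
    for r in raw
        println("  ", rpad(r.pair, 20), round(normalized(r.dist, r.len, f); digits = 3))
    end
end
```

Under these assumptions, both shifted variants rank bb vs dd as farthest, then b vs d, then doubled vs doudled, which matches the intuition described above. Plain division by length is the only one of the three that puts bb vs dd below b vs d.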

So then what is the "right" scaling? I think it depends a lot on how long the strings of interest are. For package names we definitely aren't in the "asymptotic" regime of infinitely long strings (and optimal transport isn't tractable if the problem is too big anyway). So it might not actually matter whether it's sqrt, linear with a small constant, or something else; what matters is getting something that seems useful in the regime of length 5 to 20.
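One quick way to see how much the choice matters in that regime is to check how much each candidate normalizer grows between length 5 and length 20. This is only a sketch: the function names are hypothetical, and the constants in linear_shift are picked arbitrarily to stand in for "linear with a small constant".

```julia
# Candidate normalizers; names and the linear_shift constants are hypothetical.
sqrt_shift(n) = 5 + sqrt(n)
linear(n) = n
linear_shift(n) = 5 + n / 4

# Range of name lengths of interest, as discussed above.
lengths = 5:20

# How much a normalizer grows from the shortest to the longest length of interest.
growth(f) = f(last(lengths)) / f(first(lengths))

for f in (sqrt_shift, linear, linear_shift)
    println(rpad(string(f), 14), "grows by a factor of ", round(growth(f); digits = 2))
end
```

With these choices, sqrt_shift grows by roughly 1.31x over the range and linear_shift by 1.6x, while plain linear grows by 4x. So the shifted variants behave fairly similarly for lengths 5 to 20, and plain division by length is the outlier.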
