In RegistryCI, we currently do `5 + sqrt(max(len1, len2))`, but that was pretty arbitrarily chosen. I'm not sure if `sqrt` is the right scaling.
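For concreteness, here is a minimal Python sketch of that expression. The function name `distance_threshold` is hypothetical, and reading the value as a cutoff that a distance is compared against is an assumption; RegistryCI itself is written in Julia, so this is only an illustration of the formula, not its actual code.

```python
import math


def distance_threshold(len1: int, len2: int) -> float:
    """Hypothetical helper: 5 + sqrt of the longer name's length."""
    return 5 + math.sqrt(max(len1, len2))


# Show how the value moves between a very short and a longer pair of names.
for name_a, name_b in [("b", "d"), ("doubled", "doudled")]:
    cutoff = distance_threshold(len(name_a), len(name_b))
    print(f"{name_a!r} vs {name_b!r}: {cutoff:.2f}")
```

Only the longer of the two lengths matters here, so a short name compared against a long one gets the same value as two long names.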
The optimal transport method should be basically additive: if there's double the mass that needs to be moved around, it will say it's twice as far. But, say, `b` vs `d` is more obvious than `doubled` vs `doudled`, i.e. when the swap is hidden in a bigger word, you see it less. `visual_distance` without normalization doesn't really care, though, and rates both pairs about the same.
That's because it's the same amount of mass to move around either way. So IMO it does make sense to account somehow for the length of the string.
However, I don't think it necessarily makes sense to just divide by the (say, max) length of the strings. `bb` vs `dd` does seem visually more different than `b` vs `d`, but the raw distance for `bb` vs `dd` comes out at almost twice that of `b` vs `d`, so dividing by length would rate the two pairs as nearly the same.
(Why is it not exactly the same? Well, the regularization might mess things up a bit, and maybe you can use the fact that you have two to do some more clever rearrangement of mass than just doing the `b` -> `d` one twice.)
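To make the trade-off concrete, here is a small Python sketch of what dividing by the max length does under an idealized, exactly additive distance. `UNIT_COST` is a placeholder for the cost of one `b` -> `d` swap, not a measured `visual_distance` value, and `divide_by_max_length` is a hypothetical helper.

```python
UNIT_COST = 1.0  # placeholder cost of one b -> d swap (assumption, not measured)


def divide_by_max_length(raw_distance: float, a: str, b: str) -> float:
    """Hypothetical normalization: raw distance over the longer string's length."""
    return raw_distance / max(len(a), len(b))


# Idealized raw distances, assuming exact additivity (one swap vs two swaps).
raw = {
    ("b", "d"): UNIT_COST,
    ("doubled", "doudled"): UNIT_COST,
    ("bb", "dd"): 2 * UNIT_COST,
}

normalized = {pair: divide_by_max_length(d, *pair) for pair, d in raw.items()}

for pair, value in normalized.items():
    print(pair, round(value, 3))
```

Under these assumptions the swap hidden in `doubled` is discounted by a factor of 7, which matches intuition, but `bb` vs `dd` ends up exactly as distant as `b` vs `d`, which does not.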
So then what is the "right" scaling? I think it depends a lot on how long the strings of interest are. For package names we definitely aren't in the "asymptotic" regime of infinitely long strings (besides the fact that optimal transport isn't tractable if the problem is too big). So I think it might not actually matter whether it's `sqrt` vs linear with a small constant vs something else; what matters is getting something that seems useful in the regime of length 5 to 20.
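As a rough check of how much the choice matters in that regime, the Python sketch below compares `sqrt` against a straight line drawn through its values at lengths 5 and 20. The line is just one illustrative choice of "linear with a small constant"; the function names and the endpoint fit are assumptions, not anything RegistryCI does.

```python
import math

LENGTHS = range(5, 21)  # the regime of interest: lengths 5 to 20 inclusive


def sqrt_scaling(n: int) -> float:
    return math.sqrt(n)


def linear_scaling(n: int) -> float:
    """Straight line through sqrt(5) at n=5 and sqrt(20) at n=20 (illustrative)."""
    slope = (math.sqrt(20) - math.sqrt(5)) / 15
    return math.sqrt(5) + slope * (n - 5)


# Largest disagreement between the two scalings anywhere in the regime.
max_gap = max(abs(sqrt_scaling(n) - linear_scaling(n)) for n in LENGTHS)
print(f"largest gap for lengths 5 to 20: {max_gap:.3f}")
```

The gap stays below 0.2 across the whole range, which is small next to the constant 5 in the current expression, so on these lengths the two scalings are hard to tell apart.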