How can I take the distance and compute a percentage of similarity? For instance, if given this example
a := []byte("this is a test for results")
aHash := simhash.Simhash(simhash.NewWordFeatureSet(a))
b := []byte("this is a test for cats")
bHash := simhash.Simhash(simhash.NewWordFeatureSet(b))
c := simhash.Compare(aHash, bHash)
fmt.Println(c)
I get an output of 7. But I would like to see that these are 90% similar (or whatever the exact amount is). Thank you.
SimHash is used to compute a distance between two texts. The smaller the distance, the more similar the two texts are, and a distance of zero means their hashes are identical. The output of the Compare function is this distance. There is a complete example in the README:
Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29
If you want to calculate a percentage of similarity, you have to choose a MAXIMUM distance value and use it in a formula such as:
100 - ((distance / MAXIMUM) * 100)
Let's take 1000 for MAXIMUM's value and apply the formula to the two examples of the README:
100 - ((2 / 1000) * 100) = 99.8
100 - ((29 / 1000) * 100) = 97.1
So you can say that:
the percentage of similarity between `this is a test phrase` and `this is a test phrass` is 99.8%
the percentage of similarity between `this is a test phrase` and `foo bar` is 97.1%
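The formula above can be sketched in Go as follows. This is only a minimal sketch under stated assumptions: `similarityPercent` is a hypothetical helper, not part of the simhash package, and the distances 2 and 29 are the README values quoted above. The clamping of the distance is an added assumption so that a poorly chosen MAXIMUM never yields a negative percentage.

```go
package main

import "fmt"

// similarityPercent is a hypothetical helper (not part of the simhash package).
// It applies 100 - ((distance / maximum) * 100) using floating-point math.
func similarityPercent(distance, maximum float64) float64 {
	if maximum <= 0 {
		return 0 // no meaningful scale to normalize against
	}
	if distance > maximum {
		distance = maximum // assumption: clamp so the result never drops below 0
	}
	return 100 - (distance/maximum)*100
}

func main() {
	const maximum = 1000 // the rough normalization value chosen in this thread

	// Distances taken from the README example quoted above.
	fmt.Printf("%.1f\n", similarityPercent(2, maximum))  // prints 99.8
	fmt.Printf("%.1f\n", similarityPercent(29, maximum)) // prints 97.1
}
```

The value passed as `maximum` controls how sharply the percentage falls as the distance grows, so it is worth choosing with the realistic range of distances in mind.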
Now you may ask: why 1000, and not the maximum value that could be returned by the Compare function? Compare returns a uint64, which ranges from 0 up to 2^64 - 1 (or 18446744073709551615). That was a rough normalization (adjusting the scale to get values that make sense given the set of values to which they belong), because:
And depending on what you want to do with this percentage (e.g. display it in a report or pass it to a machine), you have to take into account the variable type that will store or display this value.
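To illustrate the point about variable types, here is a minimal sketch of how the numeric type changes the result in Go. It assumes the distance arrives as an unsigned integer, as described above; the function names `percentInt` and `percentFloat` are hypothetical and exist only for this comparison.

```go
package main

import "fmt"

// percentInt applies the formula with integer arithmetic only.
// distance/maximum truncates to 0 whenever distance < maximum,
// so the result is 100 for every such distance. maximum must be non-zero.
func percentInt(distance, maximum uint64) uint64 {
	return 100 - (distance/maximum)*100
}

// percentFloat converts both operands to float64 before dividing,
// which keeps the fractional part. maximum must be non-zero.
func percentFloat(distance, maximum uint64) float64 {
	return 100 - (float64(distance)/float64(maximum))*100
}

func main() {
	fmt.Println(percentInt(29, 1000))            // prints 100
	fmt.Printf("%.1f\n", percentFloat(29, 1000)) // prints 97.1
}
```

If the percentage has to be stored as an integer, converting to floating point first and rounding at the very end avoids losing the whole result to truncation.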