Calculate percentage similar #7

Open
vbisbest opened this issue May 26, 2020 · 1 comment

Comments

@vbisbest

How can I take the distance and compute a percentage of similarity? For instance, if given this example

	a := []byte("this is a test for results")
	aHash := simhash.Simhash(simhash.NewWordFeatureSet(a))

	b := []byte("this is a test for cats")
	bHash := simhash.Simhash(simhash.NewWordFeatureSet(b))

	c := simhash.Compare(aHash, bHash)
	fmt.Println(c)

I get an output of 7. But I would like to see that these are 90% similar (or whatever the exact amount is). Thank you.

@bbalet

bbalet commented May 26, 2020

SimHash is used to compute a distance between two texts. When the distance equals zero, the two texts are considered identical; the larger the distance, the less alike they are. The Compare function returns this distance. There is a complete example in the README:

Comparison of `this is a test phrase` and `this is a test phrass`: 2
Comparison of `this is a test phrase` and `foo bar`: 29

If you want to calculate a percentage of similarity, you have to find out a MAXIMUM distance value and use it in a formula such as:

100 - ((distance / MAXIMUM) * 100)

Let's take 1000 for MAXIMUM's value and apply the formula to the two examples of the README:

  • 100 - ((2 / 1000) * 100) = 99.8
  • 100 - ((29 / 1000) * 100) = 97.1

So you can say that:

  • the percentage of similarity between `this is a test phrase` and `this is a test phrass` is 99.8%
  • the percentage of similarity between `this is a test phrase` and `foo bar` is 97.1%

Now you may ask why 1000 and not the maximum value that could be returned by the Compare function. Compare returns a uint64, which ranges from 0 up to 2^64 - 1 (18446744073709551615). Using 1000 was a rough normalization (adjusting the scale to get values that make sense given the set of values they belong to), because with the full uint64 range:

  • 100 - ((2 / 2^64) * 100) ≈ 99.99999999999999999 % (about 1.1e-17 below 100)
  • 100 - ((29 / 2^64) * 100) ≈ 99.9999999999999998 % (about 1.6e-16 below 100)

Depending on what you want to do with this percentage (e.g. display it in a report or pass it to another program), you have to take into account the type of the variable that will store or display this value.
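
To illustrate why the variable type matters, here is a small sketch that evaluates the formula with 2^64 as MAXIMUM using Go's `float64`. The helper name `percentOfUint64Range` is made up for this example. A `float64` near 100 cannot represent differences this small, so both README distances come out as exactly 100.

```go
package main

import (
	"fmt"
	"math"
)

// percentOfUint64Range applies the formula with MAXIMUM = 2^64.
// Hypothetical helper, for illustration only.
func percentOfUint64Range(distance float64) float64 {
	maximum := math.Pow(2, 64)
	return 100 - (distance/maximum)*100
}

func main() {
	for _, d := range []float64{2, 29} {
		p := percentOfUint64Range(d)
		// The tiny difference is lost to rounding: p compares equal to 100.
		fmt.Println(d, p, p == 100)
	}
}
```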

Now you have to find the MAXIMUM value that suits your use case. See https://en.wikipedia.org/wiki/Normalization_(statistics)
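
One candidate for MAXIMUM, offered here as an assumption rather than something stated in this thread: if the distance is a Hamming distance (the number of differing bits) between two 64-bit fingerprints, which is how SimHash comparisons are commonly defined, then it can never exceed 64. The sketch below uses only the Go standard library; `hammingDistance` and `similarityFromBits` are hypothetical names, and whether this package's Compare behaves this way should be checked against its source.

```go
package main

import (
	"fmt"
	"math/bits"
)

// fingerprintBits is the assumed width of a fingerprint.
const fingerprintBits = 64

// hammingDistance counts the bits that differ between two fingerprints.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// similarityFromBits normalizes a bit distance against the fingerprint width.
func similarityFromBits(distance int) float64 {
	return 100 - (float64(distance)/fingerprintBits)*100
}

func main() {
	// Two made-up fingerprints that differ in their 4 lowest bits.
	d := hammingDistance(0xF0, 0xFF)
	fmt.Println(d, similarityFromBits(d))

	// The distance of 7 reported in the question, under this assumption.
	fmt.Println(similarityFromBits(7))
}
```

Under this assumption, the distance of 7 from the question maps to 89.0625%, which is close to the roughly 90% the question expected.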
