Replace DiceCoefficient implementation
I've been using it for almost a year now, and in my benchmarks it performs much better than the original implementation (at least for shorter strings; I'm limited to <2048 characters). Also, as pointed out in tylerjensen#8, the current implementation may produce an incorrect score for repeating bigrams.

The downside is that this is an O(n*m) implementation, so it will perform much worse on long strings.

Fixes tylerjensen#8
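
To illustrate the repeating-bigram problem described above, here is a minimal C# sketch that runs both counting strategies on a string made of one repeated bigram. The names `DiceSketch`, `Bigrams`, `SetBased`, and `NestedLoop` are hypothetical and exist only for this example. `Bigrams` assumes plain adjacent character pairs with no padding, which may differ from what the library's `ToBiGrams` produces.

```csharp
using System;
using System.Linq;

public static class DiceSketch
{
    // Hypothetical helper: adjacent character pairs, no padding.
    // The library's own ToBiGrams may behave differently.
    public static string[] Bigrams(string s) =>
        s.Length < 2
            ? Array.Empty<string>()
            : Enumerable.Range(0, s.Length - 1).Select(i => s.Substring(i, 2)).ToArray();

    // Set-based counting, mirroring the removed lines:
    // Intersect yields each distinct bigram at most once.
    public static double SetBased(string a, string b)
    {
        var x = Bigrams(a);
        var y = Bigrams(b);
        int matches = x.Intersect(y).Count();
        if (matches == 0) return 0.0;
        return 2.0 * matches / (x.Length + y.Length);
    }

    // Nested-loop counting, mirroring the added lines: each bigram position
    // in the shorter string counts once if that bigram occurs anywhere in the longer one.
    public static double NestedLoop(string a, string b)
    {
        var total = (a.Length - 1) + (b.Length - 1);
        if (b.Length < a.Length) (a, b) = (b, a); // make a the shorter string
        var matches = 0;
        for (var i = 0; i < a.Length - 1; i++)
            for (var j = 0; j < b.Length - 1; j++)
                if (a[i] == b[j] && a[i + 1] == b[j + 1])
                {
                    matches++;
                    break; // stop at the first occurrence in the longer string
                }
        return matches == 0 ? 0.0 : 2.0 * matches / total;
    }

    public static void Main()
    {
        // "aaaa" has three bigrams, all "aa".
        Console.WriteLine(SetBased("aaaa", "aaaa"));   // 2*1/6: the three copies collapse to one match
        Console.WriteLine(NestedLoop("aaaa", "aaaa")); // 2*3/6: identical strings score 1
    }
}
```

For inputs without repeated bigrams, such as "night" and "nacht", both variants in this sketch return 0.25. In the worst case the inner loop scans the whole longer string for every position in the shorter one, which is where the O(n*m) cost comes from.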
13xforever authored Mar 2, 2021
1 parent 6399781 commit 5ddc33e
Showing 1 changed file with 22 additions and 4 deletions.
@@ -33,10 +33,28 @@ public static double DiceCoefficient(this string input, string comparedTo)
         /// <returns></returns>
         public static double DiceCoefficient(this string[] nGrams, string[] compareToNGrams)
         {
-            int matches = nGrams.Intersect(compareToNGrams).Count();
-            if (matches == 0) return 0.0d;
-            double totalBigrams = nGrams.Length + compareToNGrams.Length;
-            return (2 * matches) / totalBigrams;
+            var bgCount1 = input.Length - 1;
+            var bgCount2 = comparedTo.Length - 1;
+            if (comparedTo.Length < input.Length)
+            {
+                var tmp = input;
+                input = comparedTo;
+                comparedTo = tmp;
+            }
+            var matches = 0;
+            for (var i = 0; i < input.Length - 1; i++)
+                for (var j = 0; j < comparedTo.Length - 1; j++)
+                {
+                    if (input[i] == comparedTo[j] && input[i + 1] == comparedTo[j + 1])
+                    {
+                        matches++;
+                        break;
+                    }
+                }
+            if (matches == 0)
+                return 0.0d;
+
+            return 2.0 * matches / (bgCount1 + bgCount2);
         }

         public static string[] ToBiGrams(this string input)