Merge pull request #12 from lewinfox/feature/weights
Feature/weights
lewinfox authored Sep 30, 2023
2 parents c733fce + aeb2c8f commit 8266550
Showing 17 changed files with 722 additions and 23 deletions.
22 changes: 9 additions & 13 deletions DESCRIPTION
@@ -1,17 +1,14 @@
Type: Package
Package: levitate
Title: Fuzzy String Comparison
Version: 0.1.0.9000
Version: 0.2.0
Authors@R:
person(given = "Lewin",
family = "Appleton-Fox",
role = c("aut", "cre", "cph"),
email = "[email protected]")
person("Lewin", "Appleton-Fox", , "[email protected]", role = c("aut", "cre", "cph"))
Description: Provides string similarity calculations inspired by the
Python 'thefuzz' package. Compare strings by edit distance,
similarity ratio, best matching substring, ordered token matching and
set-based token matching. A range of edit distance measures are
available thanks to the 'stringdist' package.
Python 'thefuzz' package. Compare strings by edit distance, similarity
ratio, best matching substring, ordered token matching and set-based
token matching. A range of edit distance measures are available thanks
to the 'stringdist' package.
License: GPL-3
URL: https://lewinfox.github.io/levitate/,
https://github.com/lewinfox/levitate/,
@@ -20,15 +17,14 @@ BugReports: https://github.com/lewinfox/levitate/issues
Depends:
R (>= 2.10)
Imports:
cli,
glue,
rlang,
stringdist,
stringr
stringdist
Suggests:
glue,
knitr,
pkgdown,
rmarkdown,
styler,
testthat
VignetteBuilder:
knitr
5 changes: 5 additions & 0 deletions NAMESPACE
@@ -1,7 +1,12 @@
# Generated by roxygen2: do not edit by hand

export(lev_best_match)
export(lev_distance)
export(lev_partial_ratio)
export(lev_ratio)
export(lev_score_multiple)
export(lev_token_set_ratio)
export(lev_token_sort_ratio)
export(lev_weighted_token_ratio)
export(lev_weighted_token_set_ratio)
export(lev_weighted_token_sort_ratio)
5 changes: 4 additions & 1 deletion NEWS.md
@@ -1,6 +1,9 @@
# Development version
# levitate 0.2.0

* New ranking functions `lev_score_multiple()` and `lev_best_match()` (#5)
* New package logo added by [`@zatch3301`](https://github.com/zatch3301) (#6)
* New `lev_weighted_*()` functions (#10)


# levitate 0.1.0

1 change: 0 additions & 1 deletion R/lev-distance.R
@@ -291,7 +291,6 @@ NULL

#' @describeIn internal-functions See [lev_token_set_ratio()].
internal_lev_token_set_ratio <- function(a, b, pairwise = TRUE, useNames = !pairwise, ...) {

token_a <- unlist(str_tokenise(a))
token_b <- unlist(str_tokenise(b))
common_tokens <- sort(intersect(token_a, token_b))
42 changes: 42 additions & 0 deletions R/ranking.R
@@ -0,0 +1,42 @@
#' Score multiple candidate strings against a single input
#'
#' Given a single `input` string and multiple `candidates`, compute scores for each candidate.
#'
#' @param input A single string
#' @param candidates One or more candidate strings to score
#' @param .fn The scoring function to use, as a string or function object. Defaults to
#' [lev_ratio()].
#' @param ... Additional arguments to pass to `.fn`.
#' @param decreasing If `TRUE` (the default), the candidate with the highest score is ranked first.
#' If using a comparison `.fn` that computes _distance_ rather than similarity, or if you want the
#' worst match to be returned first, set this to `FALSE`.
#'
#' @return A named list whose names are the `candidates` and whose values are their scores. The
#' list is sorted according to the `decreasing` parameter, so by default higher scores come first.
#'
#' @examples
#' lev_score_multiple("bilbo", c("frodo", "gandalf", "legolas"))
#' @export
#' @seealso [lev_best_match()]
lev_score_multiple <- function(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE) {
if (length(input) > 1) rlang::abort(glue::glue("`input` must be length 1, not {length(input)}"))
.fn <- match.fun(.fn)
scores <- sort(sapply(candidates, .fn, input, ...), decreasing = decreasing)
as.list(scores)
}

#' Get the best matched string from a list of candidates
#'
#' Given an `input` string and multiple `candidates`, return the candidate with the best score as
#' calculated by `.fn`.
#'
#' @inheritParams lev_score_multiple
#' @return A string
#' @seealso [lev_score_multiple()]
#' @examples
#' lev_best_match("bilbo", c("frodo", "gandalf", "legolas"))
#' @export
lev_best_match <- function(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE) {
scores <- lev_score_multiple(input = input, candidates = candidates, .fn = .fn, ..., decreasing = decreasing)
names(scores)[[1]]
}
129 changes: 129 additions & 0 deletions R/weighted.R
@@ -0,0 +1,129 @@
#' Weighted token similarity measure
#'
#' Computes similarity but allows you to assign weights to specific tokens. This is useful, for
#' example, when you have a frequently-occurring string that doesn't contain useful information. See
#' examples.
#'
#' # Details
#'
#' The algorithm used here is as follows:
#'
#' * Tokenise the input strings
#' * Compute the edit distance between each pair of tokens
#' * Compute the maximum possible edit distance for each pair of tokens (the length of the longer token)
#' * Apply any weights from the `weights` argument
#' * Return `1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distance))`
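#'
#' As a worked example (the numbers follow directly from the steps above):
#' `lev_weighted_token_ratio("jim ltd", "tim ltd", weights = list(ltd = 0.1))` gives token edit
#' distances of `c(1, 0)`, maximum edit distances of `c(3, 3)` and weights of `c(1, 0.1)`, so the
#' result is `1 - (1 * 1 + 0 * 0.1) / (3 * 1 + 3 * 0.1)`, i.e. `1 - 1 / 3.3`, about 0.7.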
#'
#' @inheritParams default-params
#' @param weights List of token weights. For example, `weights = list(foo = 0.9, bar = 0.1)`. Any
#' tokens omitted from `weights` will be given a weight of 1.
#'
#' @return A float
#' @export
#'
#' @family weighted token functions
#'
#' @examples
#' lev_weighted_token_ratio("jim ltd", "tim ltd")
#'
#' lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))
lev_weighted_token_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}
a_tokens <- unlist(str_tokenise(a))
b_tokens <- unlist(str_tokenise(b))

# If the token lists aren't the same length we will pad the shorter list with empty strings
if (length(a_tokens) > length(b_tokens)) {
b_tokens <- c(b_tokens, rep("", length(a_tokens) - length(b_tokens)))
} else if (length(a_tokens) < length(b_tokens)) {
a_tokens <- c(a_tokens, rep("", length(b_tokens) - length(a_tokens)))
}

token_lev_distances <- mapply(lev_distance, a_tokens, b_tokens, MoreArgs = list(...))

# Weights are applied where
#
# * a token is in the `weights` list
# * AND the token appears in the same position in a and b.
# * OR the token appears in a OR b and the corresponding token is missing (which has the effect
# of reducing the impact of tokens that appear in one string but not the other).
weights_to_apply <- mapply(
function(token_a, token_b) {
if (token_a == token_b && token_a %in% names(weights)) {
weights[[token_a]]
} else if (token_a == "" && token_b %in% names(weights)) {
weights[[token_b]]
} else if (token_b == "" && token_a %in% names(weights)) {
weights[[token_a]]
} else {
1
}
},
a_tokens,
b_tokens
)

# The similarity score is (1 - (edit_distance / max_edit_distance)), after weighting.
weighted_edit_distances <- token_lev_distances * weights_to_apply
weighted_max_edit_distances <- mapply(function(a, b) max(nchar(a), nchar(b)), a_tokens, b_tokens) * weights_to_apply

1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distances))
}

#' Weighted version of lev_token_sort_ratio()
#'
#' This function tokenises inputs, sorts tokens and computes similarities for each pair of tokens.
#' Similarity scores are weighted based on the `weights` argument, and a total similarity score is
#' returned in the same manner as [lev_weighted_token_ratio()].
#'
#' @inheritParams default-params
#' @inheritParams lev_weighted_token_ratio
#'
#' @return Float
#' @export
#' @family weighted token functions
#' @seealso [lev_token_sort_ratio()]
lev_weighted_token_sort_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}
lev_weighted_token_ratio(str_token_sort(a), str_token_sort(b), weights = weights, ...)
}

#' Weighted version of `lev_token_set_ratio()`
#'
#' @inheritParams default-params
#' @inheritParams lev_weighted_token_ratio
#' @return Float
#' @family weighted token functions
#' @seealso [lev_token_set_ratio()]
#' @export
lev_weighted_token_set_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}

token_a <- unlist(str_tokenise(a))
token_b <- unlist(str_tokenise(b))
common_tokens <- sort(intersect(token_a, token_b))
unique_token_a <- sort(setdiff(token_a, token_b))
unique_token_b <- sort(setdiff(token_b, token_a))

# Construct two new strings of the form {sorted_common_tokens}{sorted_remainder_a/b} and return
# a lev_weighted_token_ratio() on those
new_a <- paste(c(common_tokens, unique_token_a), collapse = " ")
new_b <- paste(c(common_tokens, unique_token_b), collapse = " ")

# We want the max of the three pairwise comparisons between `common_tokens`, `new_a` and `new_b`.
# For this to work properly we need to stick `common_tokens` back together into a single string.
common_tokens <- paste(common_tokens, collapse = " ")
res <- max(
lev_weighted_token_ratio(common_tokens, new_a, weights = weights, ...),
lev_weighted_token_ratio(common_tokens, new_b, weights = weights, ...),
lev_weighted_token_ratio(new_a, new_b, weights = weights, ...)
)

res
}
120 changes: 120 additions & 0 deletions README.Rmd
@@ -143,6 +143,126 @@ lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently from the others. These
functions always tokenise their input and let you assign different weights to specific tokens,
giving you some influence over the parts of the input strings that are more or less interesting to
you.

For example, maybe you're comparing company names from different sources, trying to match them up.

``` {r weighted-tokens-1}
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
```

These strings score quite highly already, but the `"ltd"` in each name isn't very helpful. We can
use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.

**NOTE** Because the tokenisation affects the score, we can't compare the output of the
`lev_weighted_*()` functions with the non-weighted versions. To get a baseline, call the weighted
function without supplying a `weights` argument.

``` {r weighted-tokens-2}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
```

De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate
impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to zero.

``` {r weighted-tokens-3}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
lev_weighted_token_ratio("united widgets", "utd widgets")
```

De-weighting also works the other way - if the token to be weighted appears in one string but not
the other, then de-weighting it _increases_ the similarity score:

``` {r weighted-token-4}
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: tokens will only be weighted if:

* The token appears in the same position in both strings (i.e. it's the first/second/third, etc.
token in both)
* OR the strings contain different numbers of tokens, and the corresponding token position in the
other string is empty.

This is probably easiest to see by example.

``` {r weighted-token-5}
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
```

In this case the weighting has had no effect. Why not? Internally, the function has tokenised the
strings as follows:

| token_1 | token_2 | token_3 |
|-----------|-----------|-----------|
| "utd" | "widgets" | "limited" |
| "united" | "widgets" | "ltd" |

Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't
apply any weights.

This is a deliberate decision; while in the example above it's easy to say "well, clearly ltd and
limited are the same thing so we ought to weight them", how should we handle a less clear example?

``` {r weighted-token-6}
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
```

In this case it's hard to say what the "correct" approach would be. There isn't a meaningful way of
applying weights to dissimilar tokens. In situations like "ltd"/"limited", a pre-cleaning or
standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
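One pragmatic approach is to standardise known abbreviations yourself before comparing. A minimal
sketch (the `standardise()` helper is hypothetical, not part of `levitate`): after mapping
`"limited"` onto `"ltd"`, the token should align in both strings, so the weight can apply.

``` {r weighted-token-7}
# Hypothetical pre-cleaning helper: map "limited" onto "ltd" before comparing
standardise <- function(x) gsub("\\blimited\\b", "ltd", x, perl = TRUE)

lev_weighted_token_ratio(
  standardise("utd widgets limited"),
  standardise("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```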

I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()` as
they may give more useful results for some problems. Remember, **weighting is most useful when the
result is compared to the unweighted output of the same function**.


## Ranking functions

A common problem in this area is: given a string `x` and a set of strings `y`, which element of `y`
is most (or least) similar to `x`? `levitate` provides two functions to help with this:
`lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring is first.

``` {r score-multiple}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
```

`lev_best_match()` returns the best matched string without any score information.

``` {r best-match}
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
```

Both functions take a `.fn` argument which allows you to select a different ranking function. The
default is `lev_ratio()` but you can pick another or write your own. See `?lev_score_multiple` for
details.
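For instance, you can swap in another `levitate` scorer, or wrap an external distance function. A
sketch (assuming only functions exported by `levitate` and `stringdist`):

``` {r custom-fn}
# Use another built-in scorer
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or wrap a plain Levenshtein distance from stringdist. Lower is better,
# so pair it with `decreasing = FALSE` to rank the closest match first.
lev_score_multiple(
  "bilbo", c("gandalf", "frodo", "legolas"),
  .fn = function(a, b) stringdist::stringdist(a, b, method = "lv"),
  decreasing = FALSE
)
```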

You can also reverse the sort direction by passing `decreasing = FALSE`, so that _lower_-scoring
items are ranked first. This may be helpful if you're using a distance rather than a similarity
measure, or if you want to return the least similar strings.

``` {r best-match-reverse}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because