Merge pull request #12 from lewinfox/feature/weights
Feature/weights
lewinfox authored Sep 30, 2023
2 parents c733fce + aeb2c8f commit 8266550
Showing 17 changed files with 722 additions and 23 deletions.
22 changes: 9 additions & 13 deletions DESCRIPTION
@@ -1,17 +1,14 @@
Type: Package
Package: levitate
Title: Fuzzy String Comparison
Version: 0.1.0.9000
Version: 0.2.0
Authors@R:
person(given = "Lewin",
family = "Appleton-Fox",
role = c("aut", "cre", "cph"),
email = "[email protected]")
person("Lewin", "Appleton-Fox", , "[email protected]", role = c("aut", "cre", "cph"))
Description: Provides string similarity calculations inspired by the
Python 'thefuzz' package. Compare strings by edit distance,
similarity ratio, best matching substring, ordered token matching and
set-based token matching. A range of edit distance measures are
available thanks to the 'stringdist' package.
Python 'thefuzz' package. Compare strings by edit distance, similarity
ratio, best matching substring, ordered token matching and set-based
token matching. A range of edit distance measures are available thanks
to the 'stringdist' package.
License: GPL-3
URL: https://lewinfox.github.io/levitate/,
https://github.com/lewinfox/levitate/,
@@ -20,15 +17,14 @@ BugReports: https://github.com/lewinfox/levitate/issues
Depends:
R (>= 2.10)
Imports:
cli,
glue,
rlang,
stringdist,
stringr
stringdist
Suggests:
glue,
knitr,
pkgdown,
rmarkdown,
styler,
testthat
VignetteBuilder:
knitr
5 changes: 5 additions & 0 deletions NAMESPACE
@@ -1,7 +1,12 @@
# Generated by roxygen2: do not edit by hand

export(lev_best_match)
export(lev_distance)
export(lev_partial_ratio)
export(lev_ratio)
export(lev_score_multiple)
export(lev_token_set_ratio)
export(lev_token_sort_ratio)
export(lev_weighted_token_ratio)
export(lev_weighted_token_set_ratio)
export(lev_weighted_token_sort_ratio)
5 changes: 4 additions & 1 deletion NEWS.md
@@ -1,6 +1,9 @@
# Development version
# levitate 0.2.0

* New ranking functions `lev_score_multiple()` and `lev_best_match()` (#5)
* New package logo added by [`@zatch3301`](https://github.com/zatch3301) (#6)
* New `lev_weighted_*()` functions (#10)


# levitate 0.1.0

1 change: 0 additions & 1 deletion R/lev-distance.R
@@ -291,7 +291,6 @@ NULL

#' @describeIn internal-functions See [lev_token_set_ratio()].
internal_lev_token_set_ratio <- function(a, b, pairwise = TRUE, useNames = !pairwise, ...) {

token_a <- unlist(str_tokenise(a))
token_b <- unlist(str_tokenise(b))
common_tokens <- sort(intersect(token_a, token_b))
42 changes: 42 additions & 0 deletions R/ranking.R
@@ -0,0 +1,42 @@
#' Score multiple candidate strings against a single input
#'
#' Given a single `input` string and multiple `candidates`, compute scores for each candidate.
#'
#' @param input A single string
#' @param candidates One or more candidate strings to score
#' @param .fn The scoring function to use, as a string or function object. Defaults to
#' [lev_ratio()].
#' @param ... Additional arguments to pass to `.fn`.
#' @param decreasing If `TRUE` (the default), the candidate with the highest score is ranked first.
#' If using a comparison `.fn` that computes _distance_ rather than similarity, or if you want the
#' worst match to be returned first, set this to `FALSE`.
#'
#' @return A named list whose names are the `candidates` and whose values are their scores. The
#' list is sorted according to the `decreasing` parameter, so by default higher scores come first.
#'
#' @examples
#' lev_score_multiple("bilbo", c("frodo", "gandalf", "legolas"))
#' @export
#' @seealso [lev_best_match()]
lev_score_multiple <- function(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE) {
if (length(input) > 1) rlang::abort(glue::glue("`input` must be length 1, not {length(input)}"))
.fn <- match.fun(.fn)
scores <- sort(sapply(candidates, .fn, input, ...), decreasing = decreasing)
as.list(scores)
}

#' Get the best matched string from a list of candidates
#'
#' Given an `input` string and multiple `candidates`, return the candidate with the best score as
#' calculated by `.fn`.
#'
#' @inheritParams lev_score_multiple
#' @return A string
#' @seealso [lev_score_multiple()]
#' @examples
#' lev_best_match("bilbo", c("frodo", "gandalf", "legolas"))
#' @export
lev_best_match <- function(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE) {
scores <- lev_score_multiple(input = input, candidates = candidates, .fn = .fn, ..., decreasing = decreasing)
names(scores)[[1]]
}
129 changes: 129 additions & 0 deletions R/weighted.R
@@ -0,0 +1,129 @@
#' Weighted token similarity measure
#'
#' Computes similarity but allows you to assign weights to specific tokens. This is useful, for
#' example, when you have a frequently-occurring string that doesn't contain useful information. See
#' examples.
#'
#' # Details
#'
#' The algorithm used here is as follows:
#'
#' * Tokenise the input strings
#' * Compute the edit distance between each pair of tokens
#' * Compute the maximum possible edit distance for each pair of tokens (the length of the longer token)
#' * Apply any weights from the `weights` argument
#' * Return `1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distance))`
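#'
#' As a worked example (the numbers follow directly from the steps above):
#' `lev_weighted_token_ratio("jim ltd", "tim ltd", weights = list(ltd = 0.1))` gives token edit
#' distances of `c(1, 0)`, maximum edit distances of `c(3, 3)` and weights of `c(1, 0.1)`, so the
#' result is `1 - (1 * 1 + 0 * 0.1) / (3 * 1 + 3 * 0.1)`, i.e. `1 - 1 / 3.3`, about 0.7.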
#'
#' @inheritParams default-params
#' @param weights List of token weights. For example, `weights = list(foo = 0.9, bar = 0.1)`. Any
#' tokens omitted from `weights` will be given a weight of 1.
#'
#' @return A float
#' @export
#'
#' @family weighted token functions
#'
#' @examples
#' lev_weighted_token_ratio("jim ltd", "tim ltd")
#'
#' lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))
lev_weighted_token_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}
a_tokens <- unlist(str_tokenise(a))
b_tokens <- unlist(str_tokenise(b))

# If the token lists aren't the same length we will pad the shorter list with empty strings
if (length(a_tokens) > length(b_tokens)) {
b_tokens <- c(b_tokens, rep("", length(a_tokens) - length(b_tokens)))
} else if (length(a_tokens) < length(b_tokens)) {
a_tokens <- c(a_tokens, rep("", length(b_tokens) - length(a_tokens)))
}

token_lev_distances <- mapply(lev_distance, a_tokens, b_tokens, MoreArgs = list(...))

# Weights are applied where
#
# * a token is in the `weights` list
# * AND the token appears in the same position in a and b.
# * OR the token appears in a OR b and the corresponding token is missing (which has the effect
# of reducing the impact of tokens that appear in one string but not the other).
weights_to_apply <- mapply(
function(token_a, token_b) {
if (token_a == token_b && token_a %in% names(weights)) {
weights[[token_a]]
} else if (token_a == "" && token_b %in% names(weights)) {
weights[[token_b]]
} else if (token_b == "" && token_a %in% names(weights)) {
weights[[token_a]]
} else {
1
}
},
a_tokens,
b_tokens
)

# The similarity score is (1 - (edit_distance / max_edit_distance)), after weighting.
weighted_edit_distances <- token_lev_distances * weights_to_apply
weighted_max_edit_distances <- mapply(function(a, b) max(nchar(a), nchar(b)), a_tokens, b_tokens) * weights_to_apply

1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distances))
}

#' Weighted version of lev_token_sort_ratio()
#'
#' This function tokenises inputs, sorts tokens and computes similarities for each pair of tokens.
#' Similarity scores are weighted based on the `weights` argument, and a total similarity score is
#' returned in the same manner as [lev_weighted_token_ratio()].
#'
#' @inheritParams default-params
#' @inheritParams lev_weighted_token_ratio
#'
#' @return Float
#' @export
#' @family weighted token functions
#' @seealso [lev_token_sort_ratio()]
lev_weighted_token_sort_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}
lev_weighted_token_ratio(str_token_sort(a), str_token_sort(b), weights = weights, ...)
}

#' Weighted version of `lev_token_set_ratio()`
#'
#' @inheritParams default-params
#' @inheritParams lev_weighted_token_ratio
#' @return Float
#' @family weighted token functions
#' @seealso [lev_token_set_ratio()]
#' @export
lev_weighted_token_set_ratio <- function(a, b, weights = list(), ...) {
if (length(a) != 1 || length(b) != 1) {
rlang::abort("`a` and `b` must be length 1")
}

token_a <- unlist(str_tokenise(a))
token_b <- unlist(str_tokenise(b))
common_tokens <- sort(intersect(token_a, token_b))
unique_token_a <- sort(setdiff(token_a, token_b))
unique_token_b <- sort(setdiff(token_b, token_a))

# Construct two new strings of the form {sorted_common_tokens}{sorted_remainder_a/b} and return
# a lev_weighted_token_ratio() on those
new_a <- paste(c(common_tokens, unique_token_a), collapse = " ")
new_b <- paste(c(common_tokens, unique_token_b), collapse = " ")

# We want the max of the three pairwise comparisons between `common_tokens`, `new_a` and `new_b`.
# For this to work properly we need to stick `common_tokens` back together into a single string.
common_tokens <- paste(common_tokens, collapse = " ")
res <- max(
lev_weighted_token_ratio(common_tokens, new_a, weights = weights, ...),
lev_weighted_token_ratio(common_tokens, new_b, weights = weights, ...),
lev_weighted_token_ratio(new_a, new_b, weights = weights, ...)
)

res
}
120 changes: 120 additions & 0 deletions README.Rmd
@@ -143,6 +143,126 @@ lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
```

### `lev_weighted_token_ratio()`

The `lev_weighted_*()` family of functions works slightly differently from the others. These
functions always tokenise their input and let you assign different weights to specific tokens,
giving you some influence over the parts of the input strings that are more or less interesting to
you.

For example, maybe you're comparing company names from different sources, trying to match them up.

``` {r weighted-tokens-1}
lev_ratio("united widgets, ltd", "utd widgets, ltd") # Note the typos
```

These strings score quite highly already, but the `"ltd"` in each name isn't very helpful. We can
use `lev_weighted_token_ratio()` to reduce the impact of `"ltd"`.

**NOTE** Because the tokenisation affects the score, we can't compare the output of the
`lev_weighted_*()` functions with the non-weighted versions. To get a baseline, call the weighted
function without supplying a `weights` argument.

``` {r weighted-tokens-2}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd")
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0.1))
```

De-weighting `"ltd"` has reduced the similarity score of the strings, which gives a more accurate
impression of their similarity.

We can remove the effect of `"ltd"` altogether by setting its weight to zero.

``` {r weighted-tokens-3}
lev_weighted_token_ratio("united widgets, ltd", "utd widgets, ltd", weights = list(ltd = 0))
lev_weighted_token_ratio("united widgets", "utd widgets")
```

De-weighting also works the other way - if the token to be weighted appears in one string but not
the other, then de-weighting it _increases_ the similarity score:

``` {r weighted-token-4}
lev_weighted_token_ratio("utd widgets", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets", "united widgets, ltd", weights = list(ltd = 0.1))
```

#### Limitations of token weighting

`lev_weighted_token_ratio()` has a key limitation: tokens will only be weighted if:

* The token appears in the same position in both strings (i.e. it's the first/second/third, etc.
token in both)
* OR the strings contain different numbers of tokens, and the corresponding token position in the
other string is empty.

This is probably easiest to see by example.

``` {r weighted-token-5}
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd")
lev_weighted_token_ratio("utd widgets limited", "united widgets, ltd", weights = list(ltd = 0.1, limited = 0.1))
```

In this case the weighting has had no effect. Why not? Internally, the function has tokenised the
strings as follows:

| token_1 | token_2 | token_3 |
|-----------|-----------|-----------|
| "utd" | "widgets" | "limited" |
| "united" | "widgets" | "ltd" |

Because the token `"ltd"` doesn't appear in the same position in both strings, the function doesn't
apply any weights.

This is a deliberate decision; while in the example above it's easy to say "well, clearly ltd and
limited are the same thing so we ought to weight them", how should we handle a less clear example?

``` {r weighted-token-6}
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam")
lev_weighted_token_ratio("green eggs and ham", "spam spam spam spam", weights = list(spam = 0.1, eggs = 0.5))
```

In this case it's hard to say what the "correct" approach would be. There isn't a meaningful way of
applying weights to dissimilar tokens. In situations like "ltd"/"limited", a pre-cleaning or
standardisation process might be helpful, but that is outside the scope of what `levitate` offers.
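One pragmatic approach is to standardise known abbreviations yourself before comparing. A minimal
sketch (the `standardise()` helper is hypothetical, not part of `levitate`): after mapping
`"limited"` onto `"ltd"`, the token should align in both strings, so the weight can apply.

``` {r weighted-token-7}
# Hypothetical pre-cleaning helper: map "limited" onto "ltd" before comparing
standardise <- function(x) gsub("\\blimited\\b", "ltd", x, perl = TRUE)

lev_weighted_token_ratio(
  standardise("utd widgets limited"),
  standardise("united widgets, ltd"),
  weights = list(ltd = 0.1)
)
```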

I recommend exploring `lev_weighted_token_sort_ratio()` and `lev_weighted_token_set_ratio()` as
they may give more useful results for some problems. Remember, **weighting is most useful when the
result is compared to the unweighted output of the same function**.


## Ranking functions

A common problem in this area is: given a string `x` and a set of strings `y`, which element of `y`
is most (or least) similar to `x`? `levitate` provides two functions to help with this:
`lev_score_multiple()` and `lev_best_match()`.

`lev_score_multiple()` returns a ranked list of candidates. By default the highest-scoring is first.

``` {r score-multiple}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"))
```

`lev_best_match()` returns the best matched string without any score information.

``` {r best-match}
lev_best_match("bilbo", c("gandalf", "frodo", "legolas"))
```

Both functions take a `.fn` argument which allows you to select a different ranking function. The
default is `lev_ratio()` but you can pick another or write your own. See `?lev_score_multiple` for
details.
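For instance, you can swap in another `levitate` scorer, or wrap an external distance function. A
sketch (assuming only functions exported by `levitate` and `stringdist`):

``` {r custom-fn}
# Use another built-in scorer
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), .fn = lev_token_set_ratio)

# Or wrap a plain Levenshtein distance from stringdist. Lower is better,
# so pair it with `decreasing = FALSE` to rank the closest match first.
lev_score_multiple(
  "bilbo", c("gandalf", "frodo", "legolas"),
  .fn = function(a, b) stringdist::stringdist(a, b, method = "lv"),
  decreasing = FALSE
)
```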

You can also reverse the sort direction by passing `decreasing = FALSE`, so that _lower_-scoring
items are ranked first. This may be helpful if you're using a distance rather than a similarity
measure, or if you want to return the least similar strings.

``` {r best-match-reverse}
lev_score_multiple("bilbo", c("gandalf", "frodo", "legolas"), decreasing = FALSE)
```

## Porting code from `thefuzz` or `fuzzywuzzyR`

Results differ between `levitate` and `thefuzz`, not least because