Fuzzy left-join using minhashing
Usage
lsh_left_join(
  a,
  b,
  by = NULL,
  block_by = NULL,
  n_gram_width = 2,
  n_bands = 45,
  band_width = 8,
  threshold = 0.7,
  clean = FALSE
)
Arguments
- a
the first dataframe you wish to join.
- b
the second dataframe you wish to join.
- by
a named vector indicating which columns to join on. Format should be the same as dplyr:
by = c("column_name_in_df_a" = "column_name_in_df_b")
- block_by
a named vector indicating which column to block on, such that rows that disagree on this field cannot be considered a match (a toy sketch appears just before the Examples). Format should be the same as dplyr:
block_by = c("column_name_in_df_a" = "column_name_in_df_b")
- n_gram_width
the length of the n-grams used in calculating the Jaccard similarity. For best performance, set this large enough that the chance any string contains a specific n-gram is low (i.e. n_gram_width = 2 or 3 when matching on first names, 5 or 6 when matching on entire sentences).
- n_bands
the number of bands used in the minhash algorithm (default is 45). Use this in conjunction with band_width to tune the performance of the hashing; the sketch after this argument list shows how the two settings interact. The default settings give a (.2, .8, .001, .999)-sensitive hash, i.e. pairs with a similarity of less than .2 have less than a .1% chance of being compared, while pairs with a similarity of greater than .8 have a greater than 99.9% chance of being compared.
- band_width
the length of each band used in the minhashing algorithm (default is 8). Use this in conjunction with n_bands to tune the performance of the hashing (see n_bands above).
- threshold
the Jaccard similarity threshold above which two strings should be considered a match (default is .7). The similarity is equal to 1 minus the Jaccard distance between the two strings, so a similarity of 1 implies the strings are identical, while a similarity of 0 implies the strings are completely dissimilar (a worked sketch of this similarity follows the Examples).
- clean
should the strings you fuzzy-join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is FALSE.
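
The sensitivity figures quoted above for n_bands and band_width follow from the standard banding construction: a pair with Jaccard similarity s lands in the same bucket in at least one band with probability 1 - (1 - s^band_width)^n_bands. The sketch below is a minimal base-R illustration of that curve; the formula is the textbook banded-MinHash result, assumed here rather than taken from this function's internals, though it does reproduce the lsh_probability() value shown in the Examples.

# Minimal sketch (assumption: standard banded MinHash): a pair with Jaccard
# similarity s collides in at least one of n_bands bands of band_width hashes
# with probability 1 - (1 - s^band_width)^n_bands.
collision_prob <- function(s, n_bands, band_width) {
  1 - (1 - s^band_width)^n_bands
}

# Default settings (n_bands = 45, band_width = 8):
collision_prob(0.2, 45, 8) # ~0.0001: similarity-.2 pairs are almost never compared
collision_prob(0.8, 45, 8) # ~0.9997: similarity-.8 pairs are almost always compared

# Reproduces the lsh_probability() value shown in the Examples below:
collision_prob(0.8, 30, 8) # 0.9959518

Raising band_width makes the curve steeper (fewer false candidates, at the risk of missing true matches), while raising n_bands shifts it toward catching more candidate pairs at the cost of more comparisons.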
Value
a tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr joins, and uses the same logical joining patterns (i.e. left_join keeps all observations in the first dataset, a, matching them to b where possible).
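
The Examples below exercise the default single-column fuzzy join. The block_by argument described above additionally requires exact agreement on a field before a candidate pair is considered; the sketch that follows uses hypothetical toy data (the column names and values are invented purely for illustration), and the expected result is inferred from the left-join semantics described under Value rather than quoted from a real run.

# Hypothetical toy data: similar names in two different cities.
addresses_a <- data.frame(
  name = c("john smith", "jon smith"),
  city = c("boston", "chicago")
)
addresses_b <- data.frame(
  nm   = "jhn smith",
  town = "boston"
)

lsh_left_join(
  addresses_a,
  addresses_b,
  by = c("name" = "nm"),
  block_by = c("city" = "town"),
  threshold = .6,
  n_gram_width = 1
)
# Only the "boston" row can be matched: even though "jon smith" is similar to
# "jhn smith", its city disagrees, so it should come back unmatched (NA in the
# columns from addresses_b), as with an unmatched row in a dplyr left join.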
Examples
# load baby names data
library(babynames)
baby_names <- data.frame(name = tolower(unique(babynames$name)))
baby_names_sans_vowels <- data.frame(
  name_wo_vowels = gsub("[aeiouy]", "", baby_names$name)
)
# Check the probability that a pair of strings with
# similarity .8 will be compared, using 30 bands with
# a band width of 8, via the `lsh_probability` function:
lsh_probability(.8, 30, 8)
#> [1] 0.9959518
# Run the join:
joined_names <- lsh_left_join(
  baby_names,
  baby_names_sans_vowels,
  by = c("name" = "name_wo_vowels"),
  threshold = .8,
  n_bands = 20,
  band_width = 6,
  n_gram_width = 1,
  clean = FALSE # default
)
joined_names
#> # A tibble: 297,133 × 2
#> name name_wo_vowels
#> <chr> <chr>
#> 1 vernell vrln
#> 2 kristn krstn
#> 3 shalan shln
#> 4 arshan shrn
#> 5 rolondo drln
#> 6 eschelle chls
#> 7 cherrel rchl
#> 8 alastar sttlr
#> 9 ronold drln
#> 10 shatana tshn
#> # ℹ 297,123 more rows
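
The pairs returned above sit at or above the .8 threshold. A quick way to see what that threshold measures is to compute the n-gram Jaccard similarity by hand; the sketch below treats the n-grams as a set, which is an assumption made for illustration rather than a description of this function's internal shingling.

# Minimal sketch: Jaccard similarity over sets of contiguous character n-grams.
jaccard_ngram <- function(x, y, n_gram_width = 2) {
  ngrams <- function(s) {
    unique(substring(s, 1:(nchar(s) - n_gram_width + 1), n_gram_width:nchar(s)))
  }
  a <- ngrams(x)
  b <- ngrams(y)
  length(intersect(a, b)) / length(union(a, b))
}

# The first matched pair from the output above, with n_gram_width = 1:
jaccard_ngram("vernell", "vrln", n_gram_width = 1)   # 0.8: meets the .8 threshold
# A dissimilar pair falls well below it:
jaccard_ngram("vernell", "brenda", n_gram_width = 1) # 0.375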