Fuzzy joins for Hamming distance using Locality Sensitive Hashing

Find similar rows between two tables using the hamming distance. The hamming distance is equal to the number characters two strings differ by, or is equal to infinity if two strings are of different lengths

Usage

hamming_inner_join(
  a,
  b,
  by = NULL,
  n_bands = 100,
  band_width = 8,
  threshold = 2,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

hamming_anti_join(
  a,
  b,
  by = NULL,
  n_bands = 100,
  band_width = 100,
  threshold = 2,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

hamming_left_join(
  a,
  b,
  by = NULL,
  n_bands = 100,
  band_width = 100,
  threshold = 2,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

hamming_right_join(
  a,
  b,
  by = NULL,
  n_bands = 100,
  band_width = 100,
  threshold = 2,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

hamming_full_join(
  a,
  b,
  by = NULL,
  n_bands = 100,
  band_width = 100,
  threshold = 2,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

Arguments

a, b: The two dataframes to join.
by: A named vector indicating which columns to join on. Format should be the same as dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b"), but two columns must be specified in each dataset (x column and y column). Specification made with dplyr::join_by() are also accepted.
n_bands: The number of bands used in the locality sensitive hashing algorithm (default is 100). Use this in conjunction with the band_width to determine the performance of the hashing. Generally speaking, a higher number of bands leads to greater recall at the cost of higher runtime.
band_width: The length of each band used in the minihashing algorithm (default is 8). Use this in conjunction with the n_bands to determine the performance of the hashing. Generally speaking a wider number of bands decreases the number of false positives, decreasing runtime at the cost of lower sensitivity (true matches are less likely to be found).
threshold: The Hamming distance threshold below which two strings should be considered a match. A distance of zero corresponds to complete equality between strings, while a distance of 'x' between two strings means that 'x' substitutions must be made to transform one string into the other.
progress: Set to TRUE to print progress.
clean: Should the strings that you fuzzy join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is FALSE.
similarity_column: An optional character vector. If provided, the data frame will contain a column with this name giving the Hamming distance between the two fields. Extra column will not be present if anti-joining.

Value

A tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr-joins, and uses the same logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).

Examples

# load baby names data
# install.packages("babynames")
library(babynames)

baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_mispelled <- data.frame(
  name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
)

# Run the join and only keep rows that have a match:
hamming_inner_join(
  baby_names,
  baby_names_mispelled,
  by = c("name" = "name_mispelled"),
  threshold = 3,
  n_bands = 150,
  band_width = 10,
  clean = FALSE # default
)
#> # A tibble: 2,664 × 2
#>    name    name_mispelled
#>    <chr>   <chr>         
#>  1 vina    rxnx          
#>  2 nina    xnnx          
#>  3 lola    lxxs          
#>  4 adda    xddx          
#>  5 nan     zxx           
#>  6 malinda mxlxndx       
#>  7 zada    mxdx          
#>  8 della   rxllx         
#>  9 idella  xdxllx        
#> 10 lily    lxdx          
#> # ℹ 2,654 more rows

# Run the join and keep all rows from the first dataset, regardless of whether
# they have a match:
hamming_left_join(
  baby_names,
  baby_names_mispelled,
  by = c("name" = "name_mispelled"),
  threshold = 3,
  n_bands = 150,
  band_width = 10,
)
#> # A tibble: 2,746 × 2
#>    name    name_mispelled
#>    <chr>   <chr>         
#>  1 ora     xrx           
#>  2 mina    nxnx          
#>  3 melissa mxlxssx       
#>  4 tina    nxnx          
#>  5 lura    lxnx          
#>  6 dolly   bxllx         
#>  7 ona     xvx           
#>  8 mae     xmx           
#>  9 oma     mxx           
#> 10 daisy   dxssx         
#> # ℹ 2,736 more rows