Fuzzy joins for Hamming distance using Locality Sensitive Hashing
Source:R/hamming_logical_joins.R
hamming-joins.Rd
Find similar rows between two tables using the hamming distance. The hamming distance is equal to the number characters two strings differ by, or is equal to infinity if two strings are of different lengths
Usage
hamming_inner_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 8,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_anti_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_left_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_right_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_full_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
Arguments
- a, b
The two dataframes to join.
- by
A named vector indicating which columns to join on. Format should be the same as dplyr:
by = c("column_name_in_df_a" = "column_name_in_df_b")
, but two columns must be specified in each dataset (x column and y column). Specification made withdplyr::join_by()
are also accepted.- n_bands
The number of bands used in the locality sensitive hashing algorithm (default is 100). Use this in conjunction with the
band_width
to determine the performance of the hashing. Generally speaking, a higher number of bands leads to greater recall at the cost of higher runtime.- band_width
The length of each band used in the minihashing algorithm (default is 8). Use this in conjunction with the
n_bands
to determine the performance of the hashing. Generally speaking a wider number of bands decreases the number of false positives, decreasing runtime at the cost of lower sensitivity (true matches are less likely to be found).- threshold
The Hamming distance threshold below which two strings should be considered a match. A distance of zero corresponds to complete equality between strings, while a distance of 'x' between two strings means that 'x' substitutions must be made to transform one string into the other.
- progress
Set to
TRUE
to print progress.- clean
Should the strings that you fuzzy join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is
FALSE
.- similarity_column
An optional character vector. If provided, the data frame will contain a column with this name giving the Hamming distance between the two fields. Extra column will not be present if anti-joining.
Value
A tibble fuzzily-joined on the basis of the variables in by.
Tries
to adhere to the same standards as the dplyr-joins, and uses the same
logical joining patterns (i.e. inner-join joins and keeps only observations
in both datasets).
Examples
# load baby names data
# install.packages("babynames")
library(babynames)
baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_mispelled <- data.frame(
name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
)
# Run the join and only keep rows that have a match:
hamming_inner_join(
baby_names,
baby_names_mispelled,
by = c("name" = "name_mispelled"),
threshold = 3,
n_bands = 150,
band_width = 10,
clean = FALSE # default
)
#> # A tibble: 2,664 × 2
#> name name_mispelled
#> <chr> <chr>
#> 1 ada xmx
#> 2 ivy xdx
#> 3 leila lxxlx
#> 4 anna txnx
#> 5 lina lxxh
#> 6 lily lxxs
#> 7 helen hxldx
#> 8 ida xrx
#> 9 nell xllx
#> 10 mina lxnx
#> # ℹ 2,654 more rows
# Run the join and keep all rows from the first dataset, regardless of whether
# they have a match:
hamming_left_join(
baby_names,
baby_names_mispelled,
by = c("name" = "name_mispelled"),
threshold = 3,
n_bands = 150,
band_width = 10,
)
#> # A tibble: 2,746 × 2
#> name name_mispelled
#> <chr> <chr>
#> 1 ora sxx
#> 2 emma xrmx
#> 3 mabelle mxbxllx
#> 4 nan mxx
#> 5 martha mxrtxx
#> 6 prudence prxdxncx
#> 7 ona xvx
#> 8 nelle rxllx
#> 9 luna txnx
#> 10 lena lxnx
#> # ℹ 2,736 more rows