zoomerjoin is an R package that empowers you to fuzzy-join massive datasets rapidly, and with little memory consumption. It is powered by high-performance implementations of Locality Sensitive Hashing, an algorithm that finds the matches records between two datasets without having to compare all possible pairs of observations. In practice, this means zoomerjoin can fuzzily-join datasets days, or even years faster than other matching packages. zoomerjoin has been used in-production to join datasets of hundreds of millions of names or vectors in a matter of hours.
Installation
Installing from CRAN:
You can install from the CRAN as you would with any other package. Please be aware that you will have to have Cargo (the rust toolchain and compiler) installed to build the package from source.
install.packages(zoomerjoin)
Installing from R-Universe:
This package is distributed using r-universe, which provides pre-compiled binaries for common operating systems and recent versions of R. To install with r-universe, you can use the following command in R:
install.packages(
'zoomerjoin',
repos = c('https://beniaminogreen.r-universe.dev', getOption("repos"))
)
Installing Rust
If your operating system or version of R is not installed, you must have the Rust compiler installed to compile this package from sources. After the package is compiled, Rust is no longer required, and can be safely uninstalled.
Installing Rust on Linux or Mac:
To install Rust on Linux or Mac, you can simply run the following snippet in your terminal.
Installing Rust on Windows:
To install Rust on windows, you can use the Rust installation wizard, rustup-init.exe
, found at this site. Depending on your version of Windows, you may see an error that looks something like this:
In this case, you should run rustup install stable-x86_64-pc_windows-gnu
to install the missing toolchain. If you’re missing another toolchain, simply type this in the place of stable-x86_64-pc_windows-gnu
in the command above.
Installing Package from Github:
Once you have rust installed Rust, you should be able to install the package with either the install.packages function as above, or using the install_github
function from the devtools
package or with the pkg_install
function from the pak
package.
## Install with devtools
# install.packages("devtools")
devtools::install_github("beniaminogreen/zoomerjoin")
## Install with pak
# install.packages("pak")
pak::pkg_install("beniaminogreen/zoomerjoin")
Usage:
The flagship feature of zoomerjoins are the jaccard_join and euclidean family of functions, which are designed to be near drop-ins for the corresponding dplyr/fuzzyjoin commands:
jaccard_left_join()
jaccard_right_join()
jaccard_inner_join()
jaccard_full_join()
euclidean_left_join()
euclidean_right_join()
euclidean_inner_join()
euclidean_full_join()
The jaccard_join
family of functions provide fast fuzzy-joins for strings using the Jaccard distance while the euclidean_join
family provides fuzzy-joins for points or vectors using the Euclidean distance.
Example: Joining rows of the Database on Ideology, Money in Politics, and Elections
(DIME)
Here’s a snippet showing off how to use the jaccard_inner_join()
merge two lists of political donors in the Database on Ideology, Money in Politics, and Elections (DIME). You can see a more detailed example of this vignette in the introductory vignette.
I start with two corpuses I would like to combine, corpus_1
:
## # A tibble: 500 × 2
## a field
## <dbl> <chr>
## 1 1 ufwa cope committee
## 2 2 committee to re elect charles e. bennett
## 3 3 montana democratic party non federal account
## 4 4 mississippi power & light company management political action and educ…
## 5 5 napus pac for postmasters
## 6 6 aminoil good government fund
## 7 7 national women's political caucus of california
## 8 8 minnesota gun owners' political victory fund
## 9 9 metropolitan detroit afl cio cope committee
## 10 10 carpenters legislative improvement committee united brotherhood of car…
## # ℹ 490 more rows
And corpus_2
:
## # A tibble: 500 × 2
## b field
## <dbl> <chr>
## 1 501 citizens for derwinski
## 2 502 progressive victory fund greater washington americans for democratic a…
## 3 503 ingham county democratic party federal campaign fund
## 4 504 committee for a stronger future
## 5 505 atoka country supper committee
## 6 506 friends of democracy pac inc
## 7 507 baypac
## 8 508 international brotherhood of electrical workers local union 278 cope/p…
## 9 509 louisville & jefferson county republican executive committee
## 10 510 democratic party of virginia
## # ℹ 490 more rows
Both corpuses have an observation ID column, and a donor name column. We would like to join the two datasets on the donor names column, but the two can’t be directly joined because of misspellings. Because of this, we will use the jaccard_inner_join function to fuzzily join the two on the donor name column.
Importantly, Locality Sensitive Hashing is a probabilistic algorithm, so it may fail to identify some matches by random chance. I adjust the hyperparameters n_bands
and band_width
until the chance of true matches being dropped is negligible. By default, the package will issue a warning if the chance of a true match being discovered is less than 95%. You can use the jaccard_probability
and jaccard_hyper_grid_search
to help understand the probability any true matches will be discarded by the algorithm.
More details and a more thorough description of how to tune the hyperparameters can be can be found in the guided tour vignette.
set.seed(1)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)
## Warning in jaccard_join(a, b, mode = "inner", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 92% chance of being compared.
## Please consider changing `n_bands` and `band_width`.
## Joining by 'field'
print(join_out)
## # A tibble: 19 × 4
## a field.x b field.y
## <dbl> <chr> <dbl> <chr>
## 1 88 scheuer for congress 1980 667 scheuer…
## 2 35 solarz for congress 82 671 solarz …
## 3 378 guarini for congress 1982 883 guarini…
## 4 163 davies county republican executive committee 852 warren …
## 5 87 kentucky state democratic central executive committee 639 arizona…
## 6 302 americans for good government inc 910 america…
## 7 216 kent county republican finance committee 719 harford…
## 8 319 7th congressional district democratic party of wisconsin 792 8th con…
## 9 122 tarrant county republican victory fund 761 lake co…
## 10 238 4th congressional district democratic party 792 8th con…
## 11 387 committee to re elect congressman staton 805 committ…
## 12 478 united democrats for better government 642 democra…
## 13 45 dole for senate committee 623 riegle …
## 14 216 kent county republican finance committee 607 lake co…
## 15 230 pipefitters local union 524 998 pipefit…
## 16 232 republican county committee of chester county 710 republi…
## 17 292 bill bradley for u s senate '84 913 bill br…
## 18 378 guarini for congress 1982 606 guarini…
## 19 238 4th congressional district democratic party 518 16th co…
Zoomerjoin is able to quickly find the matching columns without comparing all pairs of records. This saves more and more time as the size of each list increases, so it can scale to join datasets with millions or hundreds of millions of rows.
Contributing
Thanks for your interest in contributing to Zoomerjoin!
I am using a gitub-centric workflow to manage the package; You can file a bug report, request a new feature, or ask a question about the package by filing an issue on the issues page, where you will also find a range of templates to help you out. If you’d like to make changes to the code, you can write and file a pull request on this page. I’ll try to respond to all of these in a timely manner (within a week) although occasionally I may take longer to respond to a complicated question or issue.
Please also be aware of the contributor code of conduct for contributing to the repository.
Acknowledgments:
The Zoomerjoin was made using this SQL join illustration by Germanx and this speed limit sign from the Federal Highway Administration - MUTCD.
References:
Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd. ed.). Cambridge University Press, USA.
Broder, Andrei Z. (1997), “On the resemblance and containment of documents”, Compression and Complexity of Sequences: Proceedings. Positano, Salerno, Italy