Matching Vectors Based on Euclidean Distance
Source:vignettes/matching_vectors.Rmd
matching_vectors.Rmd
Introduction
The flagship feature of zoomerjoin is are the tidy joins for strings using the Jaccard distance, but zoomerjoin also allows you to join vectors using Euclidean distance. This can be useful for joining addresses or coordinates in space.
Unlike other nearest-neighbor methods such as KD-trees, the joins do not slow down as the dimension of the coordinates increases, so zoomerjoin can be used can be used to find close points in a high-dimensional space (such as word embeddings).
Demonstration
For this demonstration, I create a simulated dataset of 10^5 points distributed uniformly within a 100-dimensional hypercube. I join this to another dataset which is a copy of the first with each point shifted an tiny random amount.
n <- 10^5 # number of data points
d <- 10^2 # dimension
# Create a matrix of 10^6 observations in R^100
X <- matrix(runif(n * d), n, d)
# Second Dataset is a copy of the first with points shifted an infinitesimal
# amount
X_2 <- as.data.frame(X + matrix(rnorm(n * d, 0, .0001), n, d))
X <- as.data.frame(X)
I now want to join these two datasets together. The Euclidean joins
take 3 hyperparameters: n_bands
, band_width
,
and r
. Which all have to be chosen for the problem domain
(although the defaults are generally sensible).
I use the euclidean_probability
function in the package
to understand the probability that two observations at distance of .01
from each other are indentified as a match at a variety of
hyperparameter configurations.
euclidean_probability(.01, n_bands = 5, band_width = 8, r = .25)
#> [1] 0.9993764
euclidean_probability(.1, n_bands = 5, band_width = 8, r = .25)
#> [1] 0.2141322
euclidean_probability(.01, n_bands = 10, band_width = 4, r = .15)
#> [1] 0.9999999
euclidean_probability(.1, n_bands = 10, band_width = 4, r = .15)
#> [1] 0.4956251
euclidean_probability(.01, n_bands = 40, band_width = 8, r = .15)
#> [1] 1
euclidean_probability(.1, n_bands = 40, band_width = 8, r = .15)
#> [1] 0.16091
Using n_bands=40
, band_width=8
, and
r=.15
seems to provide a good balance between identifying
all true matches (as pairs less than .01 apart are guaranteed to be
found) with reducing the number of un-promising comparisons (as pairs
greater than .1 apart are unlikely to be compared). I then use the
euclidean_inner_join
to find all matching pairs across the
two datasets:
set.seed(1)
start <- Sys.time()
joined_out <- euclidean_inner_join(
X,
X_2,
threshold = .01,
n_bands = 40,
band_width = 8,
r = .15
)
n_matches <- nrow(joined_out)
time_taken <- Sys.time() - start
print(paste("found", n_matches, "matches in", round(time_taken), "seconds"))
#> [1] "found 100000 matches in 16 seconds"
Zoomerjoin is able to easily find all pairs in just under 30s (perhaps longer on the runner that renders the website), even though the points lie in high-dimensional (d=100) space. This makes zoomerjoin a useful tool when trying to join or find matches between datasets of word or document embeddings.