Skip to contents

This is a port of the lsh_probability function from the textreuse package, with arguments changed to reflect the hyperparameters in this package. It gives the probability that two strings of jaccard similarity similarity will be matched, given the chosen bandwidth and number of bands.

Usage

jaccard_probability(similarity, n_bands, band_width)

Arguments

similarity

the similarity of the two strings you want to compare

n_bands

The number of LSH bands used in hashing.

band_width

The number of hashes in each band.

Value

a decimal number giving the probability that the two items will be returned as a candidate pair from the minhash algorithm.

Examples

# Find the probability two pairs will be matched given they have a
# jaccard_similarity of .8, band width of 5, and 50 bands:
jaccard_probability(.8, n_bands = 50, band_width = 5)
#> [1] 1