cerebras.modelzoo.data_preparation.nlp.data_dedup.generate_duplicate_pairs

This script generates pairs of duplicate documents.

It includes functions adapted from the datasketch library that calculate the range and bands: _false_positive_probability, _false_negative_probability, and optimal_param. The original source code can be found at: https://github.com/ekzhu/datasketch/blob/master/datasketch/lsh.py#L24

Functions

custom_progress_bar

generate_pairs

get_hashes

lsh

optimal_param

Compute the optimal MinHashLSH parameters that minimize the weighted sum of the false positive and false negative probabilities.

split_files
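
The parameter search described for optimal_param can be sketched roughly as follows. This is a minimal illustration modeled on the general datasketch approach, not the module's actual code: the function names (integrate, false_positive_probability, false_negative_probability, find_optimal_param), their signatures, the default weights, and the midpoint-rule integration are all assumptions made for this example. Here b stands for the number of bands and r for the range (rows per band), and a pair with Jaccard similarity s is assumed to become a candidate with probability 1 - (1 - s^r)^b.

```python
def integrate(f, a, b, steps=200):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def false_positive_probability(threshold, b, r):
    """Area under the candidate-probability curve below the threshold."""
    return integrate(lambda s: 1 - (1 - s**r) ** b, 0.0, threshold)


def false_negative_probability(threshold, b, r):
    """Area under the miss-probability curve above the threshold."""
    return integrate(lambda s: (1 - s**r) ** b, threshold, 1.0)


def find_optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Hypothetical helper: search all (b, r) with b * r <= num_perm and
    return the pair with the smallest weighted error."""
    min_error = float("inf")
    best = (0, 0)
    for b in range(1, num_perm + 1):
        max_r = num_perm // b
        for r in range(1, max_r + 1):
            fp = false_positive_probability(threshold, b, r)
            fn = false_negative_probability(threshold, b, r)
            error = fp_weight * fp + fn_weight * fn
            if error < min_error:
                min_error = error
                best = (b, r)
    return best


if __name__ == "__main__":
    # Example: pick bands and range for 32 permutations at a 0.5 threshold.
    bands, rows = find_optimal_param(0.5, 32)
    print(bands, rows)
```

Raising fp_weight relative to fn_weight in this sketch pushes the search toward stricter settings that produce fewer spurious candidate pairs, at the cost of missing more true duplicates.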