
Alex-Weatherhead/minhash

First, run "python3 main.py" to generate the "corpus.txt" file. Then copy this corpus onto Hadoop so that the Spark jobs below can read it.

To run the MinHash algorithm:

spark-submit --class ca.uwaterloo.cs651.MinHash \
  --num-executors 4 --executor-cores 4 --executor-memory 24G \
  target/project-1.0.jar --input_path corpus.txt \
  --output_path sentenceIdPairs \
  --target_jaccard_similarity_of_pairs 0.60
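
For readers unfamiliar with the technique, here is a minimal single-machine sketch of MinHash in plain Python. It is not the code in this repository, which runs on Spark. The word-level tokenization, the MD5-based hash family, the number of hash functions, and all function names are assumptions chosen for illustration. The 0.60 threshold mirrors the --target_jaccard_similarity_of_pairs value above, on the assumption that the flag sets the minimum estimated Jaccard similarity for a reported pair.

```python
import hashlib
from itertools import combinations


def tokens(sentence):
    """Split a sentence into a set of lowercase words (an assumed tokenization)."""
    return set(sentence.lower().split())


def hash_token(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """For each seeded hash function, keep the smallest hash over the set."""
    return [min(hash_token(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def similar_pairs(sentences, threshold=0.60, num_hashes=256):
    """Return (id_a, id_b) pairs whose estimated similarity meets the threshold.

    `sentences` maps a sentence id to its text. All pairs are compared here
    for clarity; this brute-force loop is only practical for small inputs.
    """
    signatures = {
        sid: minhash_signature(tokens(text), num_hashes)
        for sid, text in sentences.items()
        if tokens(text)  # skip empty sentences, which have no minimum
    }
    return [
        (a, b)
        for a, b in combinations(sorted(signatures), 2)
        if estimate_jaccard(signatures[a], signatures[b]) >= threshold
    ]
```

The probability that two sets share the same minimum under one hash function equals their Jaccard similarity, so the fraction of matching signature positions approximates it. Implementations that target large corpora usually avoid the all-pairs loop, for example by grouping signatures into bands so that only sentences sharing a band are compared.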

Then, to join each sentence ID pair with the corresponding pair of sentences:

spark-submit --class ca.uwaterloo.cs651.InspectPairs \
  --num-executors 4 --executor-cores 4 --executor-memory 24G \
  target/project-1.0.jar --corpus_input_path corpus.txt \
  --sentence_id_pairs_input_path sentenceIdPairs \
  --sentence_id_and_sentence_pairs_output_path sentenceIdAndSentencePairs
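
Conceptually, this step is a lookup of both IDs of each pair in the corpus. The sketch below shows that join in plain Python. It is an illustration only and not the InspectPairs job itself. The in-memory data shapes (a dict from sentence ID to text, and a list of ID tuples) and the function name attach_sentences are assumptions, since the on-disk formats of "corpus.txt" and "sentenceIdPairs" are not described here.

```python
def attach_sentences(corpus, id_pairs):
    """Pair each sentence id with its text for every (id_a, id_b) pair.

    `corpus` maps a sentence id to its sentence (assumed shape).
    Pairs that reference an id missing from the corpus are skipped.
    """
    return [
        ((a, corpus[a]), (b, corpus[b]))
        for a, b in id_pairs
        if a in corpus and b in corpus
    ]


if __name__ == "__main__":
    corpus = {1: "the cat sat on the mat", 2: "a cat sat on the mat"}
    for left, right in attach_sentences(corpus, [(1, 2)]):
        print(left, right)
```

Printing the sentences side by side makes it easy to check by eye whether the pairs found at the chosen similarity threshold are really near-duplicates.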
