GitHub - Alex-Weatherhead/minhash

First run "python3 main.py" to get the "corpus.txt" file. Then move this corpus onto hadoop.

To run the MinHash algorithm:

spark-submit --class ca.uwaterloo.cs651.MinHash
--num-executors 4 --executor-cores 4 --executor-memory 24G
target/project-1.0.jar --input_path corpus.txt
--output_path sentenceIdPairs
--target_jaccard_similarity_of_pairs 0.60

Then, to combine each sentence id pair with its sentence pair:

spark-submit --class ca.uwaterloo.cs651.InspectPairs
--num-executors 4 --executor-cores 4 --executor-memory 24G
target/project-1.0.jar --corpus_input_path corpus.txt
--sentence_id_pairs_input_path sentenceIdPairs
--sentence_id_and_sentence_pairs_output_path sentenceIdAndSentencePairs

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
src/main/scala/ca/uwaterloo/cs651		src/main/scala/ca/uwaterloo/cs651
CS651.pdf		CS651.pdf
README.md		README.md
clean.py		clean.py
fetch.py		fetch.py
main.py		main.py
pom.xml		pom.xml
scrape.py		scrape.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

Alex-Weatherhead/minhash

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages