This repo contains my projects developed during the "Natural Language Processing" course at university

EmilioGarzia/NLP-Search-Engine

NLP-Search-Engine

This repo contains my project developed during the "Natural Language Processing" course at university.

The aim of the project is to develop an NLP search engine using NLTK (Natural Language Toolkit): given a query string, the engine retrieves the top $k$ documents in the corpus that are most similar to the query. In this project I explored the main tools used in NLP, such as:

  • Corpus loading
  • Preprocessing on text data
    • Stopwords removal
    • Lemmatization
    • Tokenization
    • Punctuation removal
    • Part-of-speech (POS) tagging
    • Data cleaning in general
  • Word representation
    • Continuous Bag of Words (CBOW)
    • Word embeddings (Word2Vec)
  • Document representation
    • Embedding averaging for document representation
    • Doc2Vec model
    • TF-IDF
  • Cosine similarity
  • K-means algorithm
  • t-SNE dimensionality reduction
  • Evaluation of the model (Precision, Recall, F1)
  • Spelling correction (using Levenshtein edit distance)
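
To make the retrieval idea concrete, here is a minimal sketch of how several of the items above (tokenization, stopword removal, spelling correction with Levenshtein distance, TF-IDF, and cosine similarity) could fit together to return the top $k$ documents for a query. It is not the code from this repo: it uses only the Python standard library instead of NLTK, and the function names (`tokenize`, `levenshtein`, `correct`, `tfidf`, `cosine`, `search`), the tiny stopword list, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; NLTK ships a much larger one).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "to"}


def tokenize(text):
    """Lowercase, drop punctuation, and remove stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def correct(token, vocabulary):
    """Keep known tokens; otherwise map to the closest vocabulary word."""
    if token in vocabulary or not vocabulary:
        return token
    return min(sorted(vocabulary), key=lambda w: levenshtein(token, w))


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, corpus, k=2):
    """Return up to k (document index, score) pairs, best match first."""
    docs = [tokenize(d) for d in corpus]
    vocabulary = {t for d in docs for t in d}
    n = len(docs)
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
           for t in vocabulary}
    doc_vecs = [tfidf(d, idf) for d in docs]
    q_tokens = [correct(t, vocabulary) for t in tokenize(query)]
    q_vec = tfidf(q_tokens, idf)
    scores = [(cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in ranked[:k]]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "Dogs chase cats in the garden.",
        "Stock markets fell sharply today.",
    ]
    # The misspelled query tokens are corrected against the corpus vocabulary.
    for index, score in search("stok markts", corpus, k=2):
        print(index, round(score, 3), corpus[index])
```

The sketch skips lemmatization and POS tagging, so "cat" and "cats" are treated as different terms; a real pipeline would normalize them before building the vectors.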

Dependencies

You can install the dependencies listed in requirements.txt with pip in your environment, as shown below:

pip install -r requirements.txt

Author

Emilio Garzia, 2024
