This repo contains my project developed during the "Natural Language Processing" course at university.
The aim of the project is to develop an NLP search engine using NLTK (Natural Language Toolkit) that, given a query string, retrieves the first
- Corpus loading
- Preprocessing on text data
- Stopwords removal
- Lemmatization
- Tokenization
- Punctuation removal
- Part-of-speech tagging
- Data cleaning in general
- Document representation
- Continuous Bag of Words (CBOW)
- Word embeddings (Word2Vec)
- Document representation
- Embedding average for document representation
- Doc2Vec model
- TF-IDF
- Cosine similarity
- K-means algorithm
- t-SNE dimensionality reduction
- Evaluation of the model (Precision, Recall, F1)
- Spelling correction (using Levenshtein edit distance)
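
As a rough illustration of how the TF-IDF and cosine similarity steps listed above can combine to rank documents against a query, here is a minimal sketch. It is not the code from this repo: it uses only the Python standard library instead of NLTK, and the names `tokenize`, `build_index`, `tfidf_vector`, `cosine`, `search`, and the toy `DOCS` corpus are all assumptions made for the example. The smoothed IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Build a sparse {term: tf * idf} dict; terms missing from idf are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def build_index(docs):
    """Return smoothed IDF weights and one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, idf, vectors, top_k=3):
    """Return up to top_k (document, score) pairs with a non-zero score."""
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k] if score > 0.0]


# Toy corpus, purely illustrative.
DOCS = [
    "The cat sat on the mat",
    "Dogs chase cats in the park",
    "Stock markets fell sharply today",
]
IDF, VECTORS = build_index(DOCS)
results = search("cat mat", DOCS, IDF, VECTORS, top_k=2)
```

Because this sketch skips lemmatization, the query term "cat" does not match "cats" in the second document, which hints at why the preprocessing steps in the list matter for retrieval quality.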
You can install the project's dependencies from requirements.txt using the pip package manager in your environment, as shown below:
pip install -r requirements.txt
Emilio Garzia, 2024