This project aims to implement and train a sentiment analysis model specifically for the Hungarian language, based on the architecture proposed in this research paper. Initially presented at the CINTI 2023 conference, the model has been trained on a dataset of over 300,000 reviews. This README provides a comprehensive guide to the project's structure, setup, and usage.
The model architecture is adapted from the referenced paper, with adjustments specific to the Hungarian language and to the dataset.
The mock dataset used for demonstration purposes is a scaled-down version of the original dataset. It has gone through the essential preprocessing steps: duplicate removal, NaN handling, and character count restrictions. While not as extensive as the full dataset, it serves as a representative sample for testing and development.
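The preprocessing steps named above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `clean_reviews`, the `min_chars` and `max_chars` limits, and the use of a plain list instead of a data-frame library are hypothetical and not taken from the project's code.

```python
def clean_reviews(reviews, min_chars=5, max_chars=500):
    """Drop missing values, out-of-range reviews, and duplicates.

    min_chars and max_chars are hypothetical limits chosen for illustration.
    """
    seen = set()
    cleaned = []
    for text in reviews:
        # NaN handling: skip anything that is not a string (None, float NaN).
        if not isinstance(text, str):
            continue
        text = text.strip()
        # Character count restriction.
        if not (min_chars <= len(text) <= max_chars):
            continue
        # Duplicate removal: keep only the first occurrence.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Calling `clean_reviews` on a list of raw review strings returns the surviving reviews in their original order.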
First, clone the repository
git clone
Enter the directory
cd hu_sentiment_analyser
Install dependencies
poetry install
Run the training
poetry run python .\src\main.py
Start the mlflow UI to see the metrics, or go to your DagsHub account.
mlflow ui
- Data Ingestion: It downloads the dataset.
- Data Preprocess: It separates the dataset into 3 parts: training, validation and test data.
- Prepare Base Model: It contains the base model architecture.
- Training: It trains the model on the training dataset. Check the config.yaml file to see where its output is written.
- Evaluation: It evaluates the model and uses mlflow to store the experiments and models.
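The Data Preprocess stage listed above splits the dataset into training, validation, and test parts. A minimal sketch of such a three-way split follows; the function name `split_dataset`, the default fractions, and the fixed seed are assumptions for illustration, and the project's own split may differ.

```python
import random


def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the items and split them into (train, val, test) lists.

    The fractions and the seed are hypothetical defaults.
    """
    items = list(items)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test
```

Every item ends up in exactly one of the three parts, so the evaluation data never overlaps with the training data.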
Follow these steps to run each stage of the pipeline:
- Data Ingestion: Run data_ingest.py to download and preprocess the dataset.
- Data Preprocess: Execute data_preprocess.py to prepare the data.
- Prepare Base Model: Use model_init.py to initialize the model architecture.
- Training: Run train_model.py to train the model on the dataset.
- Evaluation: Execute evaluate_model.py to generate the evaluation report.
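If you prefer to run the stages one after another from a single script, something like the following sketch could work. The stage file names come from the list above; the `stage_dir` argument and the helper names `build_commands` and `run_pipeline` are hypothetical, and the directory that actually holds the stage scripts is an assumption you need to adjust.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in the order the pipeline describes them.
STAGES = [
    "data_ingest.py",
    "data_preprocess.py",
    "model_init.py",
    "train_model.py",
    "evaluate_model.py",
]


def build_commands(stage_dir="."):
    """Return one command per stage, using the current Python interpreter."""
    return [[sys.executable, str(Path(stage_dir) / name)] for name in STAGES]


def run_pipeline(stage_dir="."):
    """Run the stages in order and stop at the first one that fails."""
    for cmd in build_commands(stage_dir):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` with the directory that contains the stage scripts executes all five stages in sequence.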
Refer to .env.example for the environment variable setup. There you can configure your DagsHub account. You can also leave it empty, in which case the evaluation output is stored in the local mlflow directory.
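The fallback described above can be sketched as follows. `MLFLOW_TRACKING_URI` is the environment variable that mlflow itself reads, but whether .env.example uses this exact name is an assumption, and the helper `resolve_tracking_uri` is hypothetical.

```python
import os


def resolve_tracking_uri(env=None):
    """Return the remote tracking URI if one is set, else a local directory."""
    env = os.environ if env is None else env
    # An empty or whitespace-only value counts as "not configured".
    uri = (env.get("MLFLOW_TRACKING_URI") or "").strip()
    # ./mlruns is the directory mlflow uses for local tracking by default.
    return uri or "file:./mlruns"
```

The returned value could then be passed to `mlflow.set_tracking_uri` before logging any metrics.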
You can change the model's parameters and the training settings in the params.yaml file.
The model is evaluated on accuracy, precision, and recall. A summary of these metrics, along with a confusion matrix, is stored either in the DagsHub location configured in the environment file or in the local mlflow directory.
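For reference, the three metrics and the confusion matrix can be computed as in the sketch below. It assumes binary labels (0 for negative, 1 for positive), which the text above does not state, and the function name `binary_metrics` is hypothetical. The project's own evaluation code may use a library instead.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and a 2x2 confusion matrix.

    Assumes labels are 0 (negative) and 1 (positive).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Rows are true labels, columns are predicted labels.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }
```

The denominators are guarded so that an empty input or a model that never predicts the positive class yields 0.0 instead of a division error.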
- Original research paper: Link to Paper
- Additional materials and resources used in this project.