Welcome to the Algolia_test project! This project demonstrates how to integrate Algolia search into a web application, leveraging the Algolia API to provide fast and efficient search over a specific dataset.
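For context, below is a minimal sketch of what an Algolia search integration looks like with the official Python client; the application ID, API key, index name, and records are all placeholders, and this README does not claim the project uses exactly this code:

```python
# Minimal Algolia integration sketch (placeholders throughout);
# uses the algoliasearch Python client, v2-style API.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("demo_index")  # assumed index name

# Push a small dataset to the index.
records = [
    {"objectID": "1", "name": "Shop A", "city": "Paris"},
    {"objectID": "2", "name": "Shop B", "city": "Berlin"},
]
index.save_objects(records)

# Search it back.
results = index.search("Shop")
print(results["hits"])
```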
The main objectives of this project are:
- To demonstrate how to set up and configure Algolia for search functionality.
- To provide an answer to the data engineer test.
Follow these steps to install and set up the Algolia_test project:

- Clone the repository: `git clone [email protected]:Anfor/Algolia_test.git`
- Change to the project directory: `cd Algolia_test`
- Start the project using Docker Compose: `docker-compose up`
- Access the Airflow web UI in your browser at `http://localhost:8080`.
- Use the following credentials to log in:
  - User: `airflow`
  - Password: `airflow`
- Unpause the DAG named `algolia_data_pipeline` in the Airflow web UI.

Hints:

- To trigger the DAG manually in Airflow, comment out the `end_date` parameter in the file `.../algolia_test/dags/algolia_data_pipeline.py` (a sketch follows below).
- If running in test mode, confirm that the variable `IS_TEST = True` is set in the file `.../algolia_test/dags/python_files/algolia_configuration.py`.
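To illustrate the `end_date` hint, here is a minimal sketch of what the DAG header in `algolia_data_pipeline.py` might look like with `end_date` commented out; only the DAG id comes from this README, all other arguments are assumptions:

```python
# Hypothetical sketch of the DAG definition; the project's actual
# arguments (dates, schedule, defaults) may differ.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="algolia_data_pipeline",
    start_date=datetime(2023, 1, 1),   # assumed value
    # end_date=datetime(2023, 1, 7),   # comment this out to allow manual triggers
    schedule_interval="@daily",        # assumed schedule
    catchup=True,                      # assumed backfill behavior
)
```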
Once the Algolia_test project is installed and running, you can use it as follows:
- Configure the necessary variables in the `python_files/algolia_configuration.py` file (see the sketch after this list).
- Update the variables according to your specific setup and requirements.
- Run the data pipeline:
  - To trigger the pipeline manually in Airflow, comment out the `end_date` parameter in the `algolia_data_pipeline.py` file.
  - If running in test mode, ensure that the variable `IS_TEST` in the `python_files/algolia_configuration.py` file is set to `True`.
  - If running in production mode, set `IS_TEST` to `False` and execute the pipeline.
- Monitor the execution of the data pipeline in the Airflow web UI.
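As a rough illustration of the configuration step, the file might hold variables along these lines; only `IS_TEST` is named in this README, the other variable names are assumptions:

```python
# Hypothetical sketch of python_files/algolia_configuration.py;
# only IS_TEST is confirmed above, the remaining names are illustrative.
IS_TEST = True  # True = test mode (adds the test_ prefix), False = production

# Assumed pipeline settings:
S3_BUCKET = "my-source-bucket"          # assumed: source bucket for the extract step
POSTGRES_CONN_ID = "postgres_default"   # assumed: Airflow connection for the load step
STAGING_DIR = "dags/staging"            # assumed: output folder of the staging step
FINAL_DIR = "dags/final"                # assumed: output folder of the transform step
```

Once the variables are set, the DAG can also be triggered from inside the webserver container with `airflow dags trigger algolia_data_pipeline` (Airflow 2.x syntax; Airflow 1.x uses `airflow trigger_dag algolia_data_pipeline`).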
To run tests in the Algolia_test project, follow these steps:
- Open a terminal.
- Run the following command to list the running Docker containers and identify the `container_id` of the `airflow_webserver` container: `docker ps`
- Run the following command to enter the Docker container's terminal: `docker exec -it <container_id> bash`
- Change to the `dags/python_files/test/` directory: `cd dags/python_files/test/`
- Run the following command to test the extraction from S3: `pytest extract_test.py`
- Run the following command to test the transformation step in PySpark: `pytest test_transform.py`
- Run the following command to test the PostgreSQL load step using data from the `.../test_data.csv` file: `pytest test_load.py`
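For orientation, a PySpark transformation test in this layout could look roughly like the sketch below; the `transform` module and function, the schema, and the assertions are all assumptions, not the project's actual test code:

```python
# Hypothetical sketch in the spirit of test_transform.py;
# the real test and transform code in the project may differ.
import pytest
from pyspark.sql import SparkSession

from transform import transform  # assumed module and entry point


@pytest.fixture(scope="session")
def spark():
    # A local single-threaded session is enough for unit tests.
    return (
        SparkSession.builder.master("local[1]")
        .appName("transform-test")
        .getOrCreate()
    )


def test_transform_keeps_expected_columns(spark):
    df = spark.createDataFrame(
        [("shop-1", "2023-01-01"), ("shop-2", "2023-01-02")],
        ["shop_id", "created_at"],
    )
    result = transform(df)  # assumed signature: DataFrame -> DataFrame
    assert "shop_id" in result.columns
    assert result.count() <= df.count()
```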
In `.../dags/final/test_[YYYY-MM-DD].csv/*.csv` you can see the result of running the transform step. The source data before the PySpark transformation comes from the folder `.../dags/staging/test_[YYYY-MM-DD].csv`, which is produced by the `create_test_csv` function in the file `.../algolia_test/dags/algolia_data_pipeline.py`. The `test_` prefix is added only when the pipeline runs in test mode.
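To make the naming convention concrete, here is a minimal sketch of how a function like `create_test_csv` could write the staging file; apart from the function name, the `test_` prefix, and the staging folder, everything in it is an assumption:

```python
# Hypothetical sketch inspired by create_test_csv in algolia_data_pipeline.py;
# the project's real implementation may differ.
from datetime import date

import pandas as pd

IS_TEST = True  # assumed to be imported from python_files/algolia_configuration.py


def create_test_csv(df: pd.DataFrame, run_date: date, staging_dir: str = "dags/staging") -> str:
    """Write the day's raw data to staging as [test_]YYYY-MM-DD.csv."""
    prefix = "test_" if IS_TEST else ""
    path = f"{staging_dir}/{prefix}{run_date.isoformat()}.csv"
    df.to_csv(path, index=False)
    return path
```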