Project Credit: João Pedro
Tools to be used for the demo
- S3 to upload the data and organize it into separate folders for the different stages of the pipeline
- Lambda to extract the data from the PDFs into raw JSON format
- Glue to process the raw data and extract the exam questions
- Airflow: a workflow orchestrator. It is a tool to develop, organize, order, schedule, and monitor tasks using a structure called a DAG (Directed Acyclic Graph). DAGs are written as Python code.
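To make the DAG idea concrete, here is a minimal, illustrative sketch; the DAG id, schedule, and task names are placeholders and not part of this project's code:

```python
# Minimal illustrative DAG: two placeholder tasks chained in order.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract step")


def process():
    print("process step")


with DAG(
    dag_id="example_minimal_dag",   # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,         # trigger manually
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    process_task = PythonOperator(task_id="process", python_callable=process)

    extract_task >> process_task  # process runs only after extract succeeds
```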
The data: the data comes from the Brazilian ENEM (National High School Exam, in literal translation). The exam is held yearly and is the main entrance door to most public and private Brazilian universities. We will use this data to do some data extraction and pull the questions out of the exam.
Steps:
- Create the Airflow environment by running docker compose up (make sure you run it from the directory containing the docker compose file). Access Airflow at localhost:8080.
- Create an S3 bucket called primuslearning-enem-bucket (choose a name suitable for your use case).
- Create an IAM user called primuslearning-enem, grant it admin permissions, and save its access keys.
- In the Airflow UI (localhost:8080), under Admin -> Connections, create a new AWS connection named AWSConnection using the previously created access key pair.
- Upload files to AWS using Airflow: create a Python file inside the /dags folder; I named mine primuslearning_process_enem_pdf.py.
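A minimal sketch of what such a DAG could look like, assuming the ENEM PDF is downloaded with requests and uploaded through the AWSConnection connection via S3Hook. The download URL and the S3 key layout are placeholders, not the project's actual values:

```python
# Hypothetical sketch of primuslearning_process_enem_pdf.py: download the
# ENEM PDF for a given year and upload it to the project bucket.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "primuslearning-enem-bucket"


def download_and_upload_pdf(**context):
    year = 2010  # in the real DAG this comes from the 'year' Airflow Variable (next step)
    url = f"https://example.com/enem/{year}/exam.pdf"  # placeholder URL
    local_path = f"/tmp/enem_{year}.pdf"

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(local_path, "wb") as f:
        f.write(response.content)

    # Upload using the AWSConnection credentials configured in the Airflow UI
    s3 = S3Hook(aws_conn_id="AWSConnection")
    s3.load_file(
        filename=local_path,
        key=f"pdf/{year}.pdf",  # placeholder folder layout
        bucket_name=BUCKET,
        replace=True,
    )


with DAG(
    dag_id="primuslearning_process_enem_pdf",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    upload_pdf = PythonOperator(
        task_id="download_and_upload_pdf",
        python_callable=download_and_upload_pdf,
    )
```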
- Create a ‘year’ variable in the Airflow UI (Admin -> Variables). This variable simulates the ‘year’ for which the scraping script should execute, starting at 2010 and being automatically incremented (+1) at the end of the task execution.
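A short sketch of how a task can read and increment this Variable; the helper function names are illustrative, only the Variable name ‘year’ comes from the step above:

```python
# Sketch: read the 'year' Airflow Variable, then bump it at the end of the
# task so the next run processes the following year.
from airflow.models import Variable


def get_current_year() -> int:
    # default_var guards the very first run if the Variable is missing
    return int(Variable.get("year", default_var=2010))


def increment_year() -> None:
    year = get_current_year()
    Variable.set("year", year + 1)  # the next execution will use year + 1
```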
- Create a new Lambda function from scratch, name it process-enem-pdf, and choose the Python 3.9 runtime. Lambda will automatically create an IAM role; make sure this role has read and write permissions on the primuslearning-enem-bucket S3 bucket. Increase the Lambda's timeout to about 4 minutes.
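A sketch of what the process-enem-pdf handler could look like, assuming a PyPDF2 version that provides PdfReader (2.x or later) and that the extracted text is written back to the same bucket as raw JSON. The output prefix is an assumption, adjust it to your folder layout:

```python
# Hypothetical sketch of the process-enem-pdf Lambda: triggered by an S3
# "object created" event for a .pdf key, it extracts the text page by page
# with PyPDF2 and writes a raw JSON file back to the bucket.
import io
import json
import urllib.parse

import boto3
from PyPDF2 import PdfReader

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Bucket and key of the PDF that triggered the event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Download the PDF into memory and extract the text of every page
    pdf_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    reader = PdfReader(io.BytesIO(pdf_bytes))
    pages = [page.extract_text() or "" for page in reader.pages]

    # Write the raw extraction as JSON ("json/" prefix is an assumption)
    output_key = "json/" + key.rsplit("/", 1)[-1].replace(".pdf", ".json")
    s3.put_object(
        Bucket=bucket,
        Key=output_key,
        Body=json.dumps({"source_pdf": key, "pages": pages}),
        ContentType="application/json",
    )

    return {"statusCode": 200, "output_key": output_key}
```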
- Create a Python virtual environment with venv: python3 -m venv pdfextractor
- Activate the environment and install the dependencies: source pdfextractor/bin/activate, then pip3 install pypdf2 typing_extensions
- Create a Lambda layer and upload it to Lambda by running bash prepare_lambda_package.sh (this has already been done to ease your work; just upload the archive.zip file as a layer to AWS).
- Add an S3 trigger to the Lambda function; set the suffix to .pdf and the event type to "All object create events".
- Create a Glue crawler to build a catalog of the dataset. Name it primuslearning-enem-crawler and make sure to select the bucket path up to the content folder. Make sure an IAM role is created, and also create a database named enem_pdf_project.
- Create a Glue job named Spark_EnemExtractQuestionsJSON, paste in the code from process_pdf_glue_job.py, and execute it from Airflow so the complete pipeline is in action.
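One way to trigger the Glue job from Airflow is the GlueJobOperator from the Amazon provider. This is a sketch, the DAG wiring shown here is illustrative and assumes the job already exists in Glue:

```python
# Sketch: a task that triggers the existing Glue job from Airflow so the full
# pipeline (upload -> Lambda extraction -> Glue processing) runs end to end.
# In practice it would live inside the project DAG; shown standalone here.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="primuslearning_run_glue_job",  # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_glue_job = GlueJobOperator(
        task_id="run_enem_glue_job",
        job_name="Spark_EnemExtractQuestionsJSON",  # the job created above
        aws_conn_id="AWSConnection",
        wait_for_completion=True,  # keep the task running until the job ends
    )
```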
- Make sure to delete all the resources you created afterwards to avoid unexpected bills.