This project is an educational exercise in hands-on big data gathering and processing (ETL) using common big data tools: Kafka, Spark, Airflow (both on premise and in AWS MWAA), AWS Kinesis and DataBricks.
The Pinterest Data Pipeline takes example Pinterest user-generated data through an ETL process. This is handled in two ways: first, "legacy" data is moved from an external database to an S3 data store for cleansing and ad-hoc reporting; second, data is handled as data streams.
- Using an EC2 instance via SSH, SSHFS and SCP.
- Local configuration of Kafka topics via command line.
- Basic MSK configuration in the AWS console.
- Kinesis stream connectivity with API Gateway
- Airflow to schedule jobs both on prem and in AWS (MWAA)
- Use of Databricks to:
- Connect to AWS S3 bucket
- Load a series of files into a Spark dataframe
- Connect to an AWS Kinesis data stream
- Using PySpark to clean the data/streams and provide basic query results for static data (a brief sketch follows this list)
- Store the transformed data in the DataBricks data lake
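As an illustration of the PySpark cleaning mentioned above, the sketch below shows the general shape of such a transformation. The column names, the error placeholder and the suffix handling are assumptions for this example, not the exact logic in PinterestTransformations.py.

```python
# Minimal PySpark cleaning sketch (column names and values are assumed for illustration).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pin-cleaning-sketch").getOrCreate()

df = spark.read.json("s3a://example-bucket/topics/pin/")  # hypothetical S3 path

cleaned = (
    df.withColumn(
        "follower_count",
        F.when(F.col("follower_count") == "User Info Error", F.lit(None))  # normalise error placeholders
         .otherwise(F.col("follower_count")),
    )
    .withColumn("follower_count", F.regexp_replace("follower_count", "k", "000"))  # expand "10k"-style counts
    .withColumn("follower_count", F.regexp_replace("follower_count", "M", "000000"))
    .withColumn("follower_count", F.col("follower_count").cast("int"))
    .dropDuplicates(["unique_id"])  # assumed unique key column
)
cleaned.show(5)
```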
Since this project is connected directly to GitHub via the DataBricks Repos feature, DataBricks removes all notebook output before committing to GitHub. The reporting output can instead be seen in [this snapshot of the notebook](documentation_resources/AiCore%20Pinterest%20project.ipynb), taken locally from DataBricks.
While a good portion of the system could be handled with SQL, the focus was to maximise the use of Python on top of Spark to demonstrate the concepts learned. Only Python and some Bash scripts are used for development of this project.
The project has the following folder structure, with the naming rule that any script with a 0 as the first character of the file name is for initial configuration and is only to be run once. More details are given in the "Set-up & usage instructions" section of this file. Note: the 0-prefixed set-up file naming convention excludes files starting with 0x115306525, which is a naming requirement for this project's AWS access.
- DataBricks
  - PinterestTransformations.py Data cleaning/transformation functions used in the Jupyter notebooks
  - AiCore Pinterest project.ipynb First set of milestones simulating the legacy data handling and report requests
  - AiCore Read Kinesis.ipynb Cloud data stream handling using DataBricks and AWS technologies
- MWAA_S3_bucket
  - 0x115306525_dag.py DAG file for Airflow in AWS (MWAA) to execute "AiCore Pinterest project.ipynb"
- documentation_resources Notebooks with the reporting output & documentation support files
- Kineses
  - 0create_kinesis_streams.py Code to be run once to create the Kinesis data streams in AWS (a sketch of this step follows the folder structure)
  - test_stream_api.py Simple tests to check the APIs are in place
  - user_posting_emulation_streaming.py Simulates user interaction with results going to Kinesis streams
- legacy_simulation See the "Set-up & usage instructions" for more information on these files
  - run_this_to_send_messages_after_ec2_kafka_started.sh
  - run_on_ec2
    - 0setup_topics_and_connectors.sh
    - 1run_this_on_kafka_ec2.sh
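As a rough illustration of what 0create_kinesis_streams.py does, the one-off stream creation can be sketched with boto3 as below; the stream names, shard count and region are assumptions for this example, not necessarily the values used by the project.

```python
# One-off Kinesis stream creation sketch (stream names, shard count and region are assumptions).
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

for stream_name in ("streaming-pin", "streaming-geo", "streaming-user"):  # hypothetical names
    kinesis.create_stream(StreamName=stream_name, ShardCount=1)
    # Block until the stream reaches ACTIVE before creating the next one.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
```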
The on-premise processing of legacy data, plus initial exploratory testing, was done on Ubuntu Linux, using Conda to manage dependencies.
The main components are:
- Java JRE 8
- Apache Airflow
- Apache Spark
- Apache Kafka via Confluent
With the software security credentials and other minor configurations set as per the AiCore Data Engineering notebook instructions, the topics for the on-premise legacy data can be created by running the following script in the legacy_simulation/run_on_ec2 folder:
0setup_topics_and_connectors.sh
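The script itself relies on the Kafka command-line tools on the EC2 client. Purely as an illustration, the topic-creation part could also be expressed in Python with the kafka-python admin client; the topic names and broker address below are placeholders, and the MSK authentication settings are omitted.

```python
# Rough Python illustration of the topic-creation step (the real script uses the Kafka CLI).
# Topic names and the bootstrap server are placeholders; MSK IAM authentication is omitted.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="b-1.example-msk.amazonaws.com:9092")

topics = [
    NewTopic(name=name, num_partitions=1, replication_factor=2)
    for name in ("example.pin", "example.geo", "example.user")  # hypothetical topic names
]
admin.create_topics(new_topics=topics)
admin.close()
```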
Shell scripts are provided to prevent errors and to aid in returning to this process in the future. To load the data from the legacy database into an S3 bucket using Kafka, execute the following on the EC2 instance within the VPC:
1run_this_on_kafka_ec2.sh
This will execute the following commands:
cd ~/confluent-7.2.0/bin/
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
The next script can be run on the same computer, though it is recommended to run it on another box:
2run_this_to_send_messages.sh
This will execute the following:
python3 user_posting_emulation.py
Data will then be sent at random to the Kafka topics for processing.
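To give a feel for what user_posting_emulation.py does, the sketch below shows the general pattern: build a record, POST it to the API Gateway invoke URL that fronts the Kafka REST proxy, then wait a random interval. The invoke URL, topic name and record fields are placeholders, not the project's actual values.

```python
# Sketch of the posting-emulation loop (invoke URL, topic and fields are placeholders).
import json
import random
import time

import requests

INVOKE_URL = "https://example-id.execute-api.us-east-1.amazonaws.com/prod"  # hypothetical
HEADERS = {"Content-Type": "application/vnd.kafka.json.v2+json"}  # Confluent REST proxy JSON format

while True:
    record = {"index": random.randint(0, 1000), "title": "example pin"}  # stand-in for a DB row
    payload = json.dumps({"records": [{"value": record}]})
    response = requests.post(f"{INVOKE_URL}/topics/example.pin", headers=HEADERS, data=payload)
    response.raise_for_status()
    time.sleep(random.uniform(0.5, 2.0))  # random delay between messages
```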
This was the final project in the AiCore Data Engineering path, and it was certainly closer to what we would have called "tech ops" in my previous organisation. Big data tools were entirely new to me. Now, with some basic hands-on experience, I know some of their capabilities and feel confident that, with these notebooks to hand, I could get these tools up and running.
Some research would still be required before making a professional recommendation on the use of these technologies, largely in terms of execution performance. That said, even without further study I can say that much of what has been taught and used here would allow a fairly rapid prototype for data handling. I'm grateful for the experience and look forward to applying these skills and learning more.