Create a batch data pipeline to ingest Reddit data from its API (for any chosen subreddit). The end goal is to build a dashboard/report providing insight into the user engagement metrics of the selected subreddits.
- GCP account & Terraform (Cloud Storage & BigQuery will be used).
- Make sure Docker is installed & you are logged in to the Docker CLI using `docker login`.
- The Reddit API requires authentication: log in & create an app to get credentials (you can skip this step, as the pipeline has already been migrated to use the Pushshift API).
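  Since Pushshift needs no credentials, extraction comes down to plain HTTP requests. Below is a minimal sketch of such a query, assuming the standard Pushshift submission-search endpoint; the exact parameters and fields used by this repo's scripts may differ.
  ```python
  # Sketch only: query the Pushshift submission-search API for posts from a subreddit.
  # The repo's actual extraction logic lives in airflow/dags and scripts/test_pushshift.py.
  import requests

  resp = requests.get(
      "https://api.pushshift.io/reddit/search/submission/",
      params={
          "subreddit": "ipl",    # example subreddit used elsewhere in this README
          "after": 1640995200,   # 2022-01-01 (epoch seconds)
          "before": 1643673600,  # 2022-02-01 (epoch seconds)
          "size": 100,           # max results per request
      },
      timeout=30,
  )
  resp.raise_for_status()
  posts = resp.json()["data"]    # list of submission dicts (title, author, score, ...)
  print(len(posts), "posts fetched")
  ```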
- Sign up for a free dbt Cloud developer account & connect it with GitHub to read/write to this git repo.
- Create a virtual environment & activate it.
  ```sh
  # create env
  python -m venv <env-name>
  # activate env
  source <env-name>/bin/activate
  ```
- Install the requirements using the `requirements.txt` file.
  ```sh
  pip3 install -r requirements.txt
  ```
- Clone this repo & change the directory.
  ```sh
  git clone https://github.com/Asrst/reddit-api-data-pipeline.git
  cd reddit-api-data-pipeline
  ```
- Set up the GCP project, enable the required APIs and download the credentials JSON - refer to `gcp_setup.md`.
- Navigate to the `terraform` directory & refer to that folder for the commands to run.
  ```sh
  cd terraform
  ```
Ideal Option: Using Airflow as Orchestrator
- Navigate to the `airflow` directory & create two new directories for airflow logs & plugins.
  ```sh
  cd airflow
  mkdir logs plugins
  ```
- Set up Airflow using Docker & docker-compose. Refer to the `airflow` folder for detailed steps.
- Refer to the DAGs in `airflow/dags`, which contain the code for querying the API & loading data into GCS/BigQuery (a minimal DAG sketch is shown after this list).
- In the UI, there will be two DAGs:
  - `reddit_etl` - This DAG extracts data for the last month and is scheduled to run on a monthly basis. You can also trigger it manually from the UI for a test run.
  - `reddit_etl_historical` - This DAG extracts historical data and is not scheduled to run. Trigger it manually to extract, save & load the 2022 data into GCS and BigQuery.
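To make the structure of these DAGs concrete, here is a minimal sketch of how a monthly DAG like `reddit_etl` could be wired up. It is an illustration only - the task names and extract/load callables are assumptions; the real definitions live in `airflow/dags`.

```python
# Minimal sketch of a monthly extract-and-load DAG (not the repo's actual DAG).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_gcs(**context):
    """Query the Pushshift API for last month's posts and write them to GCS."""
    ...


def load_to_bigquery(**context):
    """Load/expose the GCS files in BigQuery."""
    ...


with DAG(
    dag_id="reddit_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",  # matches the "runs monthly" note above
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_gcs", python_callable=extract_to_gcs)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract >> load
```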
Alternative (only to populate historical data): As the data is small, we can use Python scripts to extract & load the data into GCP.
- To extract and store the historical data (2022) in GCS & create BigQuery external tables, run the scripts in the `scripts/` folder.
- Run the commands below for each subreddit whose data needs to be collected (the GCS bucket & BQ dataset remain the same); a rough sketch of what the two scripts do follows the command block.
- This step can be improved by moving it into a separate Airflow DAG.
  ```sh
  # the next 2 commands assume you are in the airflow folder; if not, run the cd command first.
  cd airflow
  # extract data from the api & store it in gcs
  python3 ../scripts/test_pushshift.py --gcs_bucket="dl-reddit-api-404" --sub_reddit="ipl" --year=2022
  # create a big query external table
  python3 ../scripts/test_bq_load.py --bq_dataset="reddit_api" --table_name="ext_ipl" --gcs_uri="gs://dl-reddit-api-404/ipl/posts-2022-*.csv"
  ```
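For orientation, here is a rough sketch of what the two load steps amount to, using the google-cloud-storage and google-cloud-bigquery clients. The bucket, dataset and table names mirror the example commands above; the real logic (CLI arguments, schema handling, file layout) lives in `scripts/test_pushshift.py` and `scripts/test_bq_load.py`.

```python
# Sketch only: write extracted posts to the GCS data lake as CSV, then expose the
# files as a BigQuery external table. Names are taken from the example commands above.
import csv
import io

from google.cloud import bigquery, storage

GCS_BUCKET = "dl-reddit-api-404"
SUBREDDIT = "ipl"
BQ_DATASET, BQ_TABLE = "reddit_api", "ext_ipl"

# 1. Load to GCS: serialise the extracted posts (list of dicts) to a CSV blob.
posts = [{"id": "abc123", "title": "example", "score": 1}]  # placeholder for API output
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(posts[0].keys()))
writer.writeheader()
writer.writerows(posts)
storage.Client().bucket(GCS_BUCKET).blob(
    f"{SUBREDDIT}/posts-2022-01.csv"
).upload_from_string(buf.getvalue(), content_type="text/csv")

# 2. Expose in BigQuery: create an external table over the GCS CSV files.
bq = bigquery.Client()
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = [f"gs://{GCS_BUCKET}/{SUBREDDIT}/posts-2022-*.csv"]
ext.autodetect = True               # infer the schema from the files
ext.options.skip_leading_rows = 1   # skip the CSV header row
table = bigquery.Table(f"{bq.project}.{BQ_DATASET}.{BQ_TABLE}")
table.external_data_configuration = ext
bq.create_table(table, exists_ok=True)
```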
- Fork this repo into your git account & log in to your dbt Cloud account.
- Set up a new project by linking your newly forked git repo & providing the GCP credentials JSON (if not already done).
- Transform the data using dbt & create a data mart. Refer to the `dbt` folder for detailed steps.
- Schedule the runs on a monthly basis to update the data marts regularly (as the Airflow pipeline also runs monthly).
Alternatively, you can run the dbt commands locally, but scheduled runs won't be possible without an orchestrator.
- Create an account at preset.io & add BigQuery as a new data source by providing the GCP credentials JSON.
- Create datasets from the newly created data mart to visualise/report on, or to build dashboards.
Important Note:
- preset.io (or Superset) doesn't allow public dashboard sharing outside of a workspace (users need to be in the same workspace for sharing to work), hence this dashboard is not publicly available. Alternatively, I created an identical Looker dashboard with public access.
Future improvements:
- Move dbt from cloud to airflow.
- Try using Google Cloud Composer.
- Unit testing.
- Full CI/CD & GitHub Actions.
References to tools & technologies used:
- Programming Language - Python
- Infrastructure as Code software - Terraform
- Containerization - Docker, Docker Compose
- Orchestration - Airflow
- Transformation - dbt
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Preset.io
Learned a lot! All thanks to this amazing course at https://github.com/DataTalksClub/data-engineering-zoomcamp