Create a batch data pipeline to ingest Reddit data from its API (for any chosen subreddit). The end goal is to build a dashboard/report providing insight into the user engagement metrics of the selected subreddits.
- GCP account & Terraform (Cloud Storage & BigQuery will be used).
- Make sure Docker is installed & you are logged in to the Docker CLI using `docker login`.
- The Reddit API requires authentication: log in & create an app to get credentials (you can skip this step, as the pipeline has already been migrated to use the Pushshift API).
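  Since Pushshift needs no credentials, extraction comes down to plain HTTP requests. Below is a minimal sketch of such a query, assuming the standard Pushshift submission-search endpoint; the exact parameters and fields used by this repo's scripts may differ.
  ```python
  # Sketch only: query the Pushshift submission-search API for posts from a subreddit.
  # The repo's actual extraction logic lives in airflow/dags and scripts/test_pushshift.py.
  import requests

  resp = requests.get(
      "https://api.pushshift.io/reddit/search/submission/",
      params={
          "subreddit": "ipl",    # example subreddit used elsewhere in this README
          "after": 1640995200,   # 2022-01-01 (epoch seconds)
          "before": 1643673600,  # 2022-02-01 (epoch seconds)
          "size": 100,           # max results per request
      },
      timeout=30,
  )
  resp.raise_for_status()
  posts = resp.json()["data"]    # list of submission dicts (title, author, score, ...)
  print(len(posts), "posts fetched")
  ```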
- Sign up for a free dbt Cloud developer account & connect it with GitHub to read/write to this git repo.
- Create a virtual environment & activate it.
  ```sh
  # create env
  python -m venv <env-name>
  # activate env
  source <env-name>/bin/activate
  ```
- Install the requirements using the `requirements.txt` file.
  ```sh
  pip3 install -r requirements.txt
  ```
- Clone this repo & change the directory.
  ```sh
  git clone https://github.com/Asrst/reddit-api-data-pipeline.git
  cd reddit-api-data-pipeline
  ```
- Set up the GCP project, enable the required APIs and download the credentials JSON - refer to `gcp_setup.md`.
- Navigate to the `terraform` directory & refer to that folder for the commands to run.
  ```sh
  cd terraform
  ```
Ideal Option: Using Airflow as Orchestrator
- Navigate to the `airflow` directory & create two new directories for airflow logs & plugins.
  ```sh
  cd airflow
  mkdir logs plugins
  ```
- Set up Airflow using Docker & docker-compose. Refer to the `airflow` folder for detailed steps.
- Refer to the DAGs in `airflow/dags`, which contain the code for querying the API & loading data into GCS/BigQuery (a minimal DAG sketch is shown after this list).
- In the UI, there will be two DAGs:
  - `reddit_etl` - This DAG extracts data for the last month and is scheduled to run on a monthly basis. You can also trigger it manually from the UI for a test run.
  - `reddit_etl_historical` - This DAG extracts historical data and is not scheduled to run. Trigger it manually to extract, save & load the 2022 data into GCS and BigQuery.
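To make the structure of these DAGs concrete, here is a minimal sketch of how a monthly DAG like `reddit_etl` could be wired up. It is an illustration only - the task names and extract/load callables are assumptions; the real definitions live in `airflow/dags`.

```python
# Minimal sketch of a monthly extract-and-load DAG (not the repo's actual DAG).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_gcs(**context):
    """Query the Pushshift API for last month's posts and write them to GCS."""
    ...


def load_to_bigquery(**context):
    """Load/expose the GCS files in BigQuery."""
    ...


with DAG(
    dag_id="reddit_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",  # matches the "runs monthly" note above
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_gcs", python_callable=extract_to_gcs)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    extract >> load
```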
Alternative (only to populate historical data): As the data is small, we can use Python scripts to extract & load the data into GCP.
- To extract and store the historical data (2022) in GCS & create BigQuery external tables, run the scripts in the `scripts/` folder.
- Run the commands below for each subreddit whose data needs to be collected (the GCS bucket & BQ dataset remain the same); a rough sketch of what the two scripts do follows the command block.
- This step can be improved by moving it into a separate Airflow DAG.
  ```sh
  # the next 2 commands assume you are in the airflow folder; if not, run the cd command first.
  cd airflow
  # extract data from the api & store it in gcs
  python3 ../scripts/test_pushshift.py --gcs_bucket="dl-reddit-api-404" --sub_reddit="ipl" --year=2022
  # create a big query external table
  python3 ../scripts/test_bq_load.py --bq_dataset="reddit_api" --table_name="ext_ipl" --gcs_uri="gs://dl-reddit-api-404/ipl/posts-2022-*.csv"
  ```
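For orientation, here is a rough sketch of what the two load steps amount to, using the google-cloud-storage and google-cloud-bigquery clients. The bucket, dataset and table names mirror the example commands above; the real logic (CLI arguments, schema handling, file layout) lives in `scripts/test_pushshift.py` and `scripts/test_bq_load.py`.

```python
# Sketch only: write extracted posts to the GCS data lake as CSV, then expose the
# files as a BigQuery external table. Names are taken from the example commands above.
import csv
import io

from google.cloud import bigquery, storage

GCS_BUCKET = "dl-reddit-api-404"
SUBREDDIT = "ipl"
BQ_DATASET, BQ_TABLE = "reddit_api", "ext_ipl"

# 1. Load to GCS: serialise the extracted posts (list of dicts) to a CSV blob.
posts = [{"id": "abc123", "title": "example", "score": 1}]  # placeholder for API output
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(posts[0].keys()))
writer.writeheader()
writer.writerows(posts)
storage.Client().bucket(GCS_BUCKET).blob(
    f"{SUBREDDIT}/posts-2022-01.csv"
).upload_from_string(buf.getvalue(), content_type="text/csv")

# 2. Expose in BigQuery: create an external table over the GCS CSV files.
bq = bigquery.Client()
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = [f"gs://{GCS_BUCKET}/{SUBREDDIT}/posts-2022-*.csv"]
ext.autodetect = True               # infer the schema from the files
ext.options.skip_leading_rows = 1   # skip the CSV header row
table = bigquery.Table(f"{bq.project}.{BQ_DATASET}.{BQ_TABLE}")
table.external_data_configuration = ext
bq.create_table(table, exists_ok=True)
```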
- Fork this repo into your git account & log in to your dbt Cloud account.
- Set up a new project by linking your newly forked git repo & providing the GCP credentials JSON (if not already done).
- Transform the data using dbt & create a data mart. Refer to the `dbt` folder for detailed steps.
- Schedule the runs on a monthly basis to update the data marts regularly (as the Airflow pipeline also runs monthly).
Alternatively, you can run the dbt commands locally, but scheduled runs won't be possible without an orchestrator.
- Create an account at preset.io & add BigQuery as a new data source by providing the GCP credentials JSON.
- Create datasets from the newly created data mart to visualise/report on, or to build dashboards.
Important Note:
- preset.io (or Superset) doesn't allow public dashboard sharing outside of a workspace (users need to be in the same workspace for sharing to work), hence this dashboard is not publicly available. Alternatively, I created an identical Looker dashboard with public access.
Future improvements:
- Move dbt from cloud to airflow.
- Try using Google Cloud Composer.
- Unit testing.
- Full CI/CD & GitHub Actions.
References to tools & technologies used:
- Programming Language - Python
- Infrastructure as Code software - Terraform
- Containerization - Docker, Docker Compose
- Orchestration - Airflow
- Transformation - dbt
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Preset.io
Learned a lot! All thanks to this amazing course at https://github.com/DataTalksClub/data-engineering-zoomcamp