Reddit-API-data-pipeline

Objective:

Create a batch data pipeline to ingest Reddit data from its API (for any chosen subreddit). The end goal is to build a dashboard/report providing insight into the user engagement metrics of the selected subreddits.

Architecture/Flow:

Setup/Instructions:

1. Pre-requisites

  • GCP Account & Terraform (Cloud Storage & BigQuery will be used).
  • Make sure Docker is installed & you are logged in to the Docker CLI using docker login (see the snippet after this list).
  • The Reddit API requires authentication: log in & create an app to get credentials (you can skip this step, as the pipeline has already been migrated to the Pushshift API).
  • Sign up for a free dbt Cloud developer account & connect it with GitHub to read/write to this git repo.
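As a quick check before proceeding, the snippet below logs in to Docker and points the Google client libraries (used by the scripts & Airflow) at the service-account key downloaded during GCP setup. This is a minimal sketch - the key path ~/gcp-credentials.json is only an example and should match wherever you saved the JSON.

    # authenticate the Docker CLI
    docker login

    # point GCP client libraries to your service-account key (path is illustrative)
    export GOOGLE_APPLICATION_CREDENTIALS=~/gcp-credentials.json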

2. Clone the repo to get started

  • Clone this repo & change into the directory.

    git clone https://github.com/Asrst/reddit-api-data-pipeline.git

    cd reddit-api-data-pipeline
  • Create a virtual environment & activate it.

    # create env
    python -m venv <env-name>

    # activate env
    source <env-name>/bin/activate
  • Install the requirements from the requirements.txt file.

    pip3 install -r requirements.txt

3. Set up GCP infra using Terraform

  • Set up a GCP project, enable the APIs and download the credentials JSON - refer to gcp_setup.md.
  • Navigate to the terraform directory & refer to the terraform folder for the commands to run (a typical sequence is sketched below).
    cd terraform
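For reference, a minimal sketch of the usual Terraform workflow; the project variable name below is an assumption and may differ from the variables actually defined in the terraform folder.

    # download the required providers
    terraform init

    # preview the resources (GCS bucket & BigQuery dataset) to be created
    terraform plan -var="project=<your-gcp-project-id>"

    # create the resources
    terraform apply -var="project=<your-gcp-project-id>"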

4. Extract and load data into GCS & BigQuery external tables

Ideal Option: Using Airflow as the Orchestrator

  • Navigate to the airflow directory & create two new directories for Airflow logs & plugins.
    cd airflow

    mkdir logs plugins
  • Set up Airflow using Docker & docker-compose - refer to airflow for detailed steps, and see the sketch after this list for typical commands. The DAGs in airflow/dags contain the code for querying the API & loading data into GCS/BigQuery.
  • In the UI, there will be two DAGs:
  1. reddit_etl - This DAG extracts data for the last month and is scheduled to run on a monthly basis. You can also trigger it manually from the UI for a test run.
  2. reddit_etl_historical - This DAG extracts historical data and is not scheduled. Trigger it manually to extract, save & load the 2022 data into GCS & BigQuery.
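The exact commands depend on the docker-compose.yml provided in the airflow folder, but a minimal sketch of a standard docker-compose based Airflow setup (assuming the official airflow-init pattern) looks like this:

    # initialise the Airflow metadata database & create the default admin user
    docker-compose up airflow-init

    # start the webserver, scheduler & workers in the background
    docker-compose up -d

    # the UI is then typically available at http://localhost:8080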

Alternative (only to populate historical data): As the data is small, we can use Python scripts to extract & load the data into GCP.

  • To extract and store the historical (2022) data in GCS & create BigQuery external tables, run the scripts in the scripts/ folder.

  • Run the commands for each subreddit for which data needs to be collected (the GCS bucket & BQ dataset remain the same).

  • This step can be improved by moving it into a separate Airflow DAG.

    # the next 2 commands assume you are in the airflow folder; if not, run the cd command first.
    cd airflow
    # extracts data from api & store to gcs
    python3 ../scripts/test_pushshift.py --gcs_bucket="dl-reddit-api-404" --sub_reddit="ipl" --year=2022
    # creates a big query external table
    python3 ../scripts/test_bq_load.py --bq_dataset="reddit_api" --table_name="ext_ipl" --gcs_uri="gs://dl-reddit-api-404/ipl/posts-2022-*.csv"

5. Transform data in BigQuery using dbt

  • Fork this repo into your Git account & log in to your dbt Cloud account.
  • Set up a new project by linking your newly forked Git repo & providing the GCP credentials JSON (if not already done).
  • Transform the data using dbt & create a data mart. Refer to dbt for detailed steps.
  • Schedule the runs on a monthly basis to update the data marts regularly (as the Airflow pipeline also runs monthly).

Alternatively, you can run the dbt commands locally, but you won't be able to schedule runs without an orchestrator (a sketch of the local commands is given below).
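For the local option, a minimal sketch of the usual dbt CLI workflow - this assumes the dbt project lives in the repo's dbt folder and that a profiles.yml pointing at your BigQuery dataset is already configured (the dev target name is illustrative):

    # move into the dbt project
    cd dbt

    # install dbt packages, if any are declared
    dbt deps

    # build the models into the data mart
    dbt run --target dev

    # run the tests defined on the models
    dbt test --target dev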

6. Visualise the data & create the reports

  • Create an account at preset.io & add BigQuery as a new data source by providing the GCP credentials JSON.
  • Create datasets from the newly created data mart to visualise/report on, or to build dashboards.

Important Note:

  • preset.io (or Superset) doesn't allow public dashboard sharing outside of a workspace (users need to be in the same workspace for sharing to work), hence this dashboard is not publicly available. Alternatively, I created an identical Looker Dashboard with public access.

Result/Output:


Next Steps/Improvements:

  • Move dbt from dbt Cloud into Airflow.
  • Try using Google Cloud Composer.
  • Unit testing.
  • Full CI/CD with GitHub Actions.

References to tools & technologies used:

Learned a lot! All thanks to this amazing course at https://github.com/DataTalksClub/data-engineering-zoomcamp
