This project aims to create an ETL (Extract, Transform, Load) pipeline for extracting data from Reddit through its API and storing it in a MongoDB database. Reddit is a popular platform with a vast amount of user-generated content, making it an excellent source of data for various purposes, including research, analysis, and data-driven decision-making.
This repository provides the necessary tools and scripts to perform the following key tasks:
- Data Extraction: The ETL pipeline extracts data from specific subreddits on Reddit, allowing you to focus on topics of interest or relevance to your project.
- Data Transformation: The extracted data is processed and transformed to ensure consistency and relevance, including cleaning, filtering, text processing, and sentiment analysis (sketched below).
- Data Loading: The transformed data is then loaded into a MongoDB database for storage and future analysis. MongoDB is a NoSQL database that offers flexibility and scalability for handling diverse data types.
By setting up this ETL pipeline, I have automated the process of collecting and managing Reddit data, making it easier to access and analyze the information I need.
Whether conducting research, monitoring trends, or building applications that require Reddit data, this project provides a solid foundation for your data processing needs.
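To make the Transform step concrete, here is a minimal sketch of what cleaning and sentiment scoring could look like. The function, field names, and the choice of TextBlob are illustrative assumptions, not the pipeline's actual implementation.

```python
# Hypothetical sketch of the Transform step: clean a raw Reddit post and
# attach a sentiment score. Field names and the use of TextBlob are
# illustrative assumptions, not this repository's actual code.
import re

from textblob import TextBlob  # pip install textblob


def transform_post(raw_post: dict) -> dict:
    """Clean the post text and add a sentiment polarity score."""
    text = f"{raw_post.get('title', '')} {raw_post.get('selftext', '')}"
    # Basic cleaning: strip URLs and collapse whitespace.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\s+", " ", text).strip()

    return {
        "id": raw_post.get("id"),
        "title": raw_post.get("title"),
        "text": text,
        "created_utc": raw_post.get("created_utc"),
        "score": raw_post.get("score"),
        "sentiment": TextBlob(text).sentiment.polarity,
    }
```

TextBlob's polarity score ranges from -1.0 (negative) to 1.0 (positive), which is convenient to store alongside each post for later analysis.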
You need a Reddit account to obtain Reddit API keys, and you also need to set up a MongoDB cluster on MongoDB Atlas. Then you can get to work.
- Create a virtualenv (I used Python 3.10 for this project):
  virtualenv venv --python=python3.10
  source venv/bin/activate
- (Optional: skip if you already have Kafka set up) Set up Kafka on Docker:
  bash kafka_setup.sh
- Install packages:
  pip install -r requirements.txt
- Create a .env file and set the following environment variables:
REDDIT_CLIENT_SECRET - Reddit API Client Secret
REDDIT_USERNAME - Your Reddit Username
REDDIT_PASSWORD - Your Reddit Password
MONGODB_USER - Your MongoDB User
MONGODB_PASSWORD - Your MongoDB password
MONGODB_CLUSTER - MongoDB cluster that will host the data
DATABASE - Database created on MongoDB to load the data into
SUBREDDIT_NAME - The name of the subreddit you want to pull data from, which will also become the collection name
KAFKA_TOPIC - Name of the Kafka topic the producer publishes to and the consumer reads from
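As a reference for how these variables might be consumed, here is a hedged sketch that loads the .env file with python-dotenv and opens the MongoDB Atlas connection with PyMongo; the repository's own scripts may structure this differently.

```python
# Illustrative sketch: load the .env values and build a MongoDB Atlas client.
# The exact connection logic in this repository's scripts may differ.
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads the .env file into environment variables

user = os.getenv("MONGODB_USER")
password = os.getenv("MONGODB_PASSWORD")
cluster = os.getenv("MONGODB_CLUSTER")

# Standard MongoDB Atlas SRV connection string.
client = MongoClient(
    f"mongodb+srv://{user}:{password}@{cluster}/?retryWrites=true&w=majority"
)

db = client[os.getenv("DATABASE")]
collection = db[os.getenv("SUBREDDIT_NAME")]  # collection named after the subreddit
```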
- Open two terminals, one for the Kafka producer and one for the consumer. Run the producer first:
  - producer:
    python3 reddit_producer.py
  - consumer:
    python3 consumer.py
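For orientation, below is a rough sketch of what a producer along the lines of reddit_producer.py could look like: it pulls recent posts from the configured subreddit with PRAW and publishes them to the Kafka topic as JSON. It assumes kafka-python, a broker on localhost:9092, and a REDDIT_CLIENT_ID variable in addition to those listed above; the actual script also prompts for a date, which this sketch omits.

```python
# Hypothetical producer sketch: fetch posts with PRAW and publish them to Kafka.
# Assumes kafka-python, a broker on localhost:9092, and a REDDIT_CLIENT_ID
# environment variable in addition to the ones listed above.
import json
import os

import praw
from dotenv import load_dotenv
from kafka import KafkaProducer

load_dotenv()

reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),  # assumed additional variable
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD"),
    user_agent="reddit-etl-pipeline",
)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

topic = os.getenv("KAFKA_TOPIC")
subreddit = reddit.subreddit(os.getenv("SUBREDDIT_NAME"))

# Stream the newest posts and publish each one as a JSON message.
for submission in subreddit.new(limit=100):
    producer.send(topic, {
        "id": submission.id,
        "title": submission.title,
        "selftext": submission.selftext,
        "created_utc": submission.created_utc,
        "score": submission.score,
        "num_comments": submission.num_comments,
    })

producer.flush()
```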
- Enter your desired date and wait for the data to be pulled and loaded into the database.
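And a correspondingly rough sketch of the consumer side, which reads messages from the Kafka topic and inserts them into the MongoDB collection named after the subreddit. This is illustrative only; consumer.py may apply additional transformation or filtering before loading.

```python
# Hypothetical consumer sketch: read post messages from Kafka and insert them
# into MongoDB. Illustrative only; the actual consumer.py may differ.
import json
import os

from dotenv import load_dotenv
from kafka import KafkaConsumer
from pymongo import MongoClient

load_dotenv()

consumer = KafkaConsumer(
    os.getenv("KAFKA_TOPIC"),
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

user = os.getenv("MONGODB_USER")
password = os.getenv("MONGODB_PASSWORD")
cluster = os.getenv("MONGODB_CLUSTER")
client = MongoClient(
    f"mongodb+srv://{user}:{password}@{cluster}/?retryWrites=true&w=majority"
)
collection = client[os.getenv("DATABASE")][os.getenv("SUBREDDIT_NAME")]

# Insert each incoming post document into the subreddit's collection.
for message in consumer:
    collection.insert_one(message.value)
```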
Tools used:
- PRAW
- Kafka
- PyMongo
- MongoDB
- Docker