Redditor's Club is a data pipeline for analyzing Reddit trends. Reddit is an interest-based social media network that is especially popular among young men in North America.
This project currently makes use of the following technologies:
- Amazon DynamoDB
- Apache Spark 1.6.1 with Hadoop 2.7
- MySQL
- Amazon S3
- Amazon Redshift
- Django 1.9.6 with the following front-end libraries: Highcharts, jQuery, Bootstrap
The figure below shows the flow of data through the pipeline:
The data used in this project is the set of Reddit comments compiled and published by Reddit user /u/Stuck_In_the_Matrix. Together, the comments from October 2007 through December 2015 total more than 1 TB.
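
To give a sense of the kind of aggregation a trend analysis over this data involves, here is a minimal sketch in plain Python. It is not the project's actual Spark job. It assumes the comments arrive as one JSON object per line and that each object carries `subreddit` and `created_utc` (epoch seconds) fields; the function name `monthly_subreddit_counts` and the sample records are made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def monthly_subreddit_counts(lines):
    """Count comments per (YYYY-MM, subreddit) from JSON-lines input.

    Assumes each line is a JSON object with `subreddit` and
    `created_utc` (epoch seconds, as int or numeric string) fields.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        created = datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)
        counts[(created.strftime("%Y-%m"), comment["subreddit"])] += 1
    return counts


# Hypothetical sample records, not taken from the real dataset.
sample = [
    '{"subreddit": "programming", "created_utc": "1193875200", "body": "first"}',
    '{"subreddit": "programming", "created_utc": 1193961600, "body": "second"}',
    '{"subreddit": "science", "created_utc": 1451606399, "body": "third"}',
]
result = monthly_subreddit_counts(sample)
for (month, subreddit), n in sorted(result.items()):
    print(month, subreddit, n)
```

At the full scale of the dataset the same group-and-count step would run on a distributed engine such as Spark rather than in a single Python process.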