Sparking Joy with NLP

Problem Statements:

Using data pulled from Reddit I have built Natural Language Processing (NLP) models to answer the following questions:

Based on text data from posts, can the model correctly classify posts from r/konmari vs. posts from r/hoarding?
Can the model also differentiate r/konmari from a larger set of posts including r/declutter vs. posts from r/organization?
What words are most important in identifying r/konmari posts?
Are there certain posts with which the model has trouble? Why?
Can the model be generalized outside Reddit?

Executive Summary:

The project primarily utilizes two classification models:

Naïve Bayes (Multinomial)
Logistic Regression

Each model has been performed alongside use of a TF-IDF vectorizer.

Additionally, a voting classifier model was run to see if, though more conceptually and computationally complicated, produced better results.

The NLP models herein have proven successful at accurately predicting whether a post or not a post from the r/konmari subreddit, though some overfitting is present.

Data Gathering Process

I pulled Reddit posts using either the Reddit API (via the Python Reddit API Wrapper (PRAW)) or the Pushshift API.

The initial dataset included 4,936 posts from r/konmari and r/hoarding with approximately 53% being from r/konmari.

Conclusions

The NLP models that sought to classify r/hoarding and r/konmari, all showed strong results with testing scores (r-squared) of 0.95 or higher.

Once other similarly 'tidy' subreddits were introduced, accuracy scores dropped and overfitting increased, as the new posts were much more similar to the positive target than the existing observations of the negative target class.

Stemming and lemmatization was performed and although not dramatically different than the base models, a stemmed corpus might be performed due to a slight reduction in variance.

A further analysis was performed by taking transcripts of the Marie Kondo's Netflix show (scraped from https://www.springfieldspringfield.co.uk) and feeding each episode into an existing base NLP model. Interestingly, each episode of the show had differing 'levels of Konmari-ness' and each were correctly identified as being a part of the 'konmari' class.

Further Improvements

In order to cut down on the variance, I would like to gather additional data to feed into the model. Additionally, given the interesting results with the Netflix transcripts, I'd like to test the improved model on additional non-reddit data for a more generalized model of 'Konmari-ness'.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
datasets		datasets
images		images
pickled-models		pickled-models
.gitignore		.gitignore
README.md		README.md
reddit-NLP.ipynb		reddit-NLP.ipynb
scrape-kondo-netflix.ipynb		scrape-kondo-netflix.ipynb
sparking-joy-withNLP.pdf		sparking-joy-withNLP.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparking Joy with NLP

Problem Statements:

Executive Summary:

Data Gathering Process

Conclusions

Further Improvements

About

Releases

Packages

Languages

wkarney/sparking-joy-NLP

Folders and files

Latest commit

History

Repository files navigation

Sparking Joy with NLP

Problem Statements:

Executive Summary:

Data Gathering Process

Conclusions

Further Improvements

About

Resources

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages