Analysed the syntax and semantics of a corpus of text documents retrieved by web scraping news articles from Inshorts, following the standard NLP workflow of the CRISP-DM model.
- Index
- About
- Usage
- Commands
- File Structure
- Brief Description
- Info Gallery
- Guidelines
- Resources
- Present Contributors
- License
An NLP-based project that scrapes news articles from three main categories:
- Technology
- Sports
- World
from Inshorts using the website's URLs. After numerous preprocessing steps such as text wrangling, removing accented characters, removing HTML tags, lemmatization, and stemming, a text normalizer is built to create the dataset for sentiment analysis.
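A few of the preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual code: the function names, the regex-based tag stripping (the project lists Beautiful Soup for HTML handling), and the tiny `CONTRACTIONS` map are all assumptions. Lemmatization and stemming are left out because they need NLTK or spaCy.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny contraction map for illustration;
# the repository's contractions.py presumably holds a fuller one.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}


def strip_html_tags(text):
    """Replace anything that looks like an HTML tag and unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def remove_accented_chars(text):
    """Map accented characters to their closest ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def expand_contractions(text):
    """Expand the contractions listed in CONTRACTIONS (case-insensitive)."""
    pattern = re.compile(
        "|".join(re.escape(key) for key in CONTRACTIONS), re.IGNORECASE
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def normalize_text(text):
    """Chain the steps: tags, accents, contractions, case, symbols, spaces."""
    text = strip_html_tags(text)
    text = remove_accented_chars(text)
    text = expand_contractions(text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


print(normalize_text("<p>Caf\u00e9 can't close &amp; reopen!</p>"))
```

Each step is a separate function so that individual steps can be switched on or off when building the dataset.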
Sentiment analysis is perhaps one of the most popular applications of NLP.
The key aspect of sentiment analysis is to analyze a body of text to understand the opinion it expresses. Typically, this sentiment is quantified with a positive or negative value, called polarity.
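To make the idea of polarity concrete, here is a minimal lexicon-based sketch. The five-word `TOY_LEXICON` and its scores are invented for illustration and are not taken from AFINN or any other real lexicon; the project itself relies on the AFINN and TextBlob libraries for scoring.

```python
import re

# Hypothetical word scores, for illustration only. Real lexicons
# contain thousands of scored entries.
TOY_LEXICON = {"good": 2, "great": 3, "win": 2, "bad": -3, "crash": -2}


def polarity_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of every word in the text (unknown words count 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def polarity_label(score):
    """Turn a numeric polarity into one of three sentiment labels."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for headline in ["A great win for the team", "Markets crash after bad news"]:
    score = polarity_score(headline)
    print(headline, score, polarity_label(score))
```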
This project can be used as a basis for the following key features:
- Building a text summarizer using RNNs and LSTMs.
- Filtering for one particular sentiment, be it positive or negative.
- Emojifier: Building appropriate reaction emojis from the extracted sentiments.
- Building a tone detector like the one Grammarly (Beta) provides.
Built this project to learn the nuances of handling text data in NLP.
- Pandas
- NumPy
- Seaborn
- NLTK
- Afinn
- TextBlob
- Beautiful Soup
- requests
- spaCy language models
Note: spaCy may produce many errors if it is not installed properly, so make sure the installation is correct. For further details, refer to requirements.txt.
To run the project on your local machine, make sure you install all the packages listed in requirements.txt.
- Clone the repository
$ git clone https://github.com/codekhal/Inshorts-NLP
- Install dependencies
$ cd Inshorts-NLP
$ pip install -r requirements.txt
- Now, in your terminal, using the appropriate conda env, start Jupyter or any other editor you prefer
$ jupyter notebook
- File structure with the basic details about files and directories.
.__Inshorts-NLP__
├── contractions.py
├── img
│ ├── scraping.png
│ ├── Sentiment_Score_News_Category.png
│ ├── sentiments.png
│ ├── stemming.png
│ ├── Visualizing_Sentiments_Box_Plot.png
│ └── workflow.png
├── LICENSE
├── news.csv
├── NLP_main.ipynb
├── __pycache__
│ └── contractions.cpython-35.pyc
├── README.md
└── requirements.txt
2 directories, 13 files
Built a web scraper that scrapes news articles from Inshorts website URLs. The data was then cleaned with numerous text-preprocessing techniques for further processing. Sentiment analysis was then performed on the cleaned data. Various popular lexicons are used for sentiment analysis, including the following.
- AFINN lexicon
- Bing Liu’s lexicon
- MPQA subjectivity lexicon
- SentiWordNet
- VADER lexicon
- TextBlob lexicon
The NLTK, AFINN, and TextBlob libraries were used. The results are presented using both data visualization tools and pandas DataFrame techniques.
The sentiment scores of the different news categories are shown in the following plots.
Lastly, the counts of the three sentiments across the news categories are shown with a factor (bar) plot.
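The tallying behind such a count plot can be sketched as follows. This is an illustration only: the `scored_articles` rows are made-up stand-ins for scored news articles, the zero threshold for "neutral" is an assumption, and the standard library is used here in place of the pandas and Seaborn calls the notebook relies on.

```python
from collections import Counter, defaultdict

# Hypothetical (category, polarity score) pairs standing in for scored articles.
scored_articles = [
    ("technology", 4.0),
    ("technology", -1.0),
    ("sports", 6.0),
    ("sports", 0.0),
    ("world", -5.0),
    ("world", -2.0),
]


def to_sentiment(score):
    """Bucket a polarity score into positive, negative, or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_counts(rows):
    """Count how many articles of each sentiment fall in each category."""
    counts = defaultdict(Counter)
    for category, score in rows:
        counts[category][to_sentiment(score)] += 1
    return counts


counts = sentiment_counts(scored_articles)
for category in sorted(counts):
    print(category, dict(counts[category]))
```

The resulting per-category counts are the values a bar plot would draw, one bar per sentiment within each category.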
- Contribution Guidelines
Future work that could be done:
- Flask App Deployment - Deploy the app so that it can be used efficiently.
- Use of Deep Learning - One may try using deep learning to build a text summarizer and tone detector.
Kindly follow the Contribution Guidelines before you create any pull requests or issues. That said, feel free to contribute in any form.
Open Source <3
Feel free to reach out to me