Developing a web crawler to create a multi-modal Covid data set

sarahogden2017/Covid-vCues

COVID vCues Dataset

This research project supports Professor Ankur Chattopadhyay's COVID vCues research by building a multi-modal dataset of images drawn from reliable and unreliable sources on COVID-19. The dataset will be used to train multiple AI models: one that classifies images as reliable or unreliable, and others that identify memes, ads, claims, fact-checks, or logos.

To-Do

  • Scrape CoAID (Sarah)
  • Scrape ReCOVery (Shreetika)
  • Scrape MM-COVID (Shreetika)
  • Scrape tweets with Twikit
  • Consolidate CoAID, ReCOVery, & MM-COVID (Sarah)
  • Remove duplicate images
  • Deep learning reliable vs unreliable model
    - [X] Small test model w/ Keras and TensorFlow (200 images each) (Sarah)
    - [X] Keras neural network using all images - Model is overfit? (Sarah)
    - [X] Small test model w/ SVM (also tried with random images on Google) (Shreetika)
  • Clean dataset
    1. Remove duplicates w/ existing script (Sarah)
    2. Figure out how to remove pixelated images (Shreetika)
    3. Use OpenCV to identify people and manually delete profile pictures (Shreetika)
    4. Remove favicon and icon type images by sorting images by size (Shreetika)
    5. Randomly select the same number of remaining images from each category
  • Redo model training with cleaned dataset (SVM model) (Shreetika)
  • Develop category-identifying models
    Method 1
    - [ ] Memes
    - [ ] Ads
    - [ ] Claims
    - [ ] Fact-checks
    - [ ] Logos
    Method 2
    - [ ] Infographics/Diagrams
    - [ ] Photographs
    - [ ] Illustrations
    - [ ] Memes
    - [ ] Advertisements
    - [ ] Misc/Logos
    - Image naming convention idea: (un)reliable.subcategory.####.jpg/png
  • Analysis of dataset breakdown
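
Some of the cleaning steps in the list above (removing duplicates, dropping icon-sized images, balanced random sampling, and the naming convention idea) can be sketched in plain Python. This is only an illustrative sketch, not the project's existing script: every function name is hypothetical, duplicates are detected by exact byte content only, "size" is read here as file size in bytes, and the 2048-byte threshold is an arbitrary assumption.

```python
import hashlib
import random
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def remove_exact_duplicates(paths):
    """Keep the first file seen for each distinct byte content."""
    seen = set()
    unique = []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def drop_small_files(paths, min_bytes=2048):
    """Drop files below min_bytes (assumed proxy for favicons and icons)."""
    return [p for p in paths if p.stat().st_size >= min_bytes]


def balanced_sample(by_category, seed=0):
    """Randomly pick the same number of images from every category."""
    if not by_category:
        return {}
    rng = random.Random(seed)
    # The smallest category sets the sample size for all categories.
    n = min(len(paths) for paths in by_category.values())
    return {cat: rng.sample(paths, n) for cat, paths in by_category.items()}


def make_name(reliable, subcategory, index, ext):
    """Build a name following (un)reliable.subcategory.####.jpg/png."""
    label = "reliable" if reliable else "unreliable"
    return f"{label}.{subcategory}.{index:04d}.{ext}"
```

Exact hashing will miss resized or re-encoded copies of the same image, so a perceptual hash may be a better fit if near-duplicates matter.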

Sources

The dataset is based on CoAID (COVID-19 Healthcare Misinformation Dataset), ReCOVery, and MM-COVID.

Citations:
@misc{cui2020coaid,
  title={CoAID: COVID-19 Healthcare Misinformation Dataset},
  author={Limeng Cui and Dongwon Lee},
  year={2020},
  eprint={2006.00885},
  archivePrefix={arXiv},
  primaryClass={cs.SI}
}
ReCOVery: https://github.com/apurvamulay/ReCOVery/tree/master
MM-COVID: https://github.com/bigheiniu/MM-COVID/blob/main/README.md

Usage

This dataset is still under development and not yet ready for use.

Authors

Sarah Ogden
Shreetika Poudel

Helpful Tutorials
