PHist

A web scraper in Python 3 that uses BeautifulSoup and writes the results into JSON format.

This program was created for the Pseudohistoria project at the University of Turku (https://sites.utu.fi/pseudohistoria/en/). It was coded by Anna Ristilä, with contributions from Mila Oiva, who helped with the processing of the Russian materials.

Required modules

The program uses the following extra modules:

  • BeautifulSoup 4 (bs4)

Usage

The program is designed to be run from the command line with the following arguments:
  • -s name_of_seedfile
  • -p prefix_tag
  • -l language_tag
  • -L level_tag
The seed file should be plain text, with one link per line. Currently, the code only understands Windows newlines ("\r\n").

The other arguments distinguish the output files and are used in filenames. The prefix and level tags are also used as names of subdirectories.
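A minimal sketch of this command-line interface follows. The option handling in scraper_bs4_MAIN.py may differ, and read_seeds is a hypothetical helper illustrating how the "\r\n"-delimited seed file might be read.

```python
# Sketch only: argument names follow the flags documented above, but the
# actual parsing in scraper_bs4_MAIN.py may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="PHist web scraper")
    parser.add_argument("-s", dest="seedfile", required=True,
                        help="plain-text seed file, one link per line")
    parser.add_argument("-p", dest="prefix_tag", required=True,
                        help="used in filenames and as a subdirectory name")
    parser.add_argument("-l", dest="language_tag", required=True,
                        help="used in filenames")
    parser.add_argument("-L", dest="level_tag", required=True,
                        help="used in filenames and as a subdirectory name")
    return parser.parse_args()

def read_seeds(path):
    # The seed file uses Windows newlines; newline="" keeps "\r\n" intact
    # instead of letting Python translate it.
    with open(path, encoding="utf-8", newline="") as f:
        return [link for link in f.read().split("\r\n") if link]
```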

The program only scrapes text inside &lt;p&gt; tags.
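A minimal sketch of the p-tag extraction and JSON output is shown below. It assumes the requests library for fetching, and the JSON field names are illustrative, not the program's actual schema.

```python
# Sketch only: fetches a page, keeps only the text inside <p> tags, and
# writes one JSON record. Field names are assumptions.
import json

import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url, outfile):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    record = {"url": url, "paragraphs": paragraphs}
    with open(outfile, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Cyrillic text readable in the output.
        json.dump(record, f, ensure_ascii=False)
```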

Running scraper_bs4_MAIN.py with valid arguments and a list of seed links will produce as many JSON files as the scraper can process from the seed list without problems. The program keeps a continuously updated log file, so if something goes wrong and the program crashes or hangs, you can check which seed link caused the problem. Links ending in .mp3 or another media format are currently a problem.
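The sketch below shows two related safeguards: logging each seed link before it is fetched, so the last log line identifies a crash or hang, and a guard that skips media links. The guard is a possible workaround, not something the current program does, and the extension list is illustrative.

```python
# Sketch only: log before fetching so the log always names the problem link;
# the media-extension guard is a suggested workaround, not current behaviour.
import logging

logging.basicConfig(filename="scrape.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

MEDIA_EXTENSIONS = (".mp3", ".mp4", ".avi", ".wav", ".jpg", ".png")

def process_link(url):
    logging.info("processing %s", url)  # written before the fetch
    if url.lower().endswith(MEDIA_EXTENSIONS):
        logging.info("skipping media link %s", url)
        return
    # ... fetch and scrape the page here ...
```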

The program also keeps track of all processed links, but writes them to files only at the end of the cycle. It does not scrape social media links but records them for possible future use.
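A sketch of how links might be sorted into the buckets listed below follows; the domain-to-bucket mapping is an assumption based on the files the program writes.

```python
# Sketch only: social media links are recorded but not scraped; everything
# else is queued as an outgoing link for the next cycle.
from urllib.parse import urlparse

SOCIAL_DOMAINS = {
    "facebook.com": "facebook",
    "twitter.com": "twitter",
    "vk.com": "vkontakte",
    "linkedin.com": "linkedin",
}

def classify_link(url, buckets):
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    bucket = SOCIAL_DOMAINS.get(host, "outgoing")
    buckets.setdefault(bucket, set()).add(url)
    return bucket
```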

The files created by the program:

  • JSON files
  • outgoing links for next scrape cycle
  • Facebook links
  • Twitter links
  • VKontakte links
  • LinkedIn links
  • all scraped links
  • links not found
  • links disallowed by robots.txt
  • links whose robots.txt file could not be found (see the robots.txt sketch after this list)
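The last two file types reflect robots.txt handling: links a site's robots.txt disallows, and links whose robots.txt could not be fetched, are set aside rather than scraped. A minimal sketch using the standard-library robotparser follows; the actual program's behaviour may differ in detail.

```python
# Sketch only: returns True/False for allowed/disallowed, or None when the
# site's robots.txt could not be fetched.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def robots_allows(url, user_agent="*"):
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    try:
        parser.read()  # fetches and parses the site's robots.txt
    except OSError:
        return None    # robots.txt could not be found
    return parser.can_fetch(user_agent, url)
```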
