This program was created for the Pseudohistoria project at the University of Turku (https://sites.utu.fi/pseudohistoria/en/). The program was coded by Anna Ristilä and contributed to by Mila Oiva, who helped with the processing of Russian materials.
The program uses the following extra modules:
- beautifulsoup4 (https://pypi.org/project/beautifulsoup4/)
- robotexclusionrulesparser (https://pypi.org/project/robotexclusionrulesparser/)
The program takes the following arguments:
- -s name_of_seedfile
- -p prefix_tag
- -l language_tag
- -L level_tag
The seed file argument names the file that lists the seed links. The other arguments distinguish the output files and are used in their filenames. The prefix and level tags are also used as names of subdirectories.
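As a rough illustration of how the four arguments could fit together, here is a minimal sketch using argparse. Only the flags -s, -p, -l and -L come from this README. The attribute names, the function names and the exact filename and directory layout are assumptions made for the example and may differ from what scraper_bs4_MAIN.py actually does.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags listed in the README."""
    parser = argparse.ArgumentParser(description="Scrape the links in a seed file.")
    # The short flags come from the README; the dest names are assumptions.
    parser.add_argument("-s", dest="seedfile", required=True, help="name of the seed file")
    parser.add_argument("-p", dest="prefix", required=True, help="prefix tag")
    parser.add_argument("-l", dest="language", required=True, help="language tag")
    parser.add_argument("-L", dest="level", required=True, help="level tag")
    return parser


def output_path(args: argparse.Namespace, index: int) -> Path:
    """Return a hypothetical output path for the index-th scraped page.

    Assumed layout: <prefix>/<level>/<prefix>_<language>_<level>_<index>.json
    """
    name = f"{args.prefix}_{args.language}_{args.level}_{index}.json"
    return Path(args.prefix) / args.level / name
```

With these assumptions, calling the parser with `-s seeds.txt -p proj -l ru -L level1` would place the fourth JSON file at proj/level1/proj_ru_level1_3.json.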
The program only scrapes text inside p-tags.
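The following sketch shows what collecting only p-tag text can look like with beautifulsoup4. The function name and the choice to drop empty paragraphs are assumptions for the example, not a description of the program's own code.

```python
from bs4 import BeautifulSoup


def extract_paragraph_text(html: str) -> list[str]:
    """Return the text of every non-empty <p> element in the page."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        # Join nested text fragments with a space and trim whitespace.
        text = p.get_text(" ", strip=True)
        if text:  # assumption: empty paragraphs are not worth keeping
            paragraphs.append(text)
    return paragraphs
```

Text in headings, divs, list items and other elements outside p-tags is ignored by this approach, which is worth keeping in mind for pages that do not wrap their body text in paragraphs.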
Running scraper_bs4_MAIN.py with valid arguments and a list of seed links produces as many JSON files as the scraper can scrape from the seed file list without problems. The program keeps a log file that is updated continuously, so if the program crashes or hangs, you can check the log to see which seed link caused the problem. Links ending with .mp3 or another media format are currently a problem.
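One possible workaround for media links is to remove them from the seed file before running the scraper. The sketch below is not part of the program. The function name and the list of extensions are assumptions and would need to be adjusted to the material at hand.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list of problematic extensions; extend as needed.
MEDIA_EXTENSIONS = {".mp3", ".mp4", ".wav", ".avi", ".mov", ".jpg", ".jpeg", ".png", ".gif"}


def looks_like_media(url: str) -> bool:
    """Return True if the URL path ends with a known media extension."""
    # Only the path is inspected, so query strings do not hide the extension.
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in MEDIA_EXTENSIONS
```

Filtering a list of seed links is then a one-liner such as `[u for u in links if not looks_like_media(u)]`.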
The program also keeps track of all the processed links, but writes them into files only at the end of the cycle. The program does not scrape social media links but records them for possible future use.
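Separating social media links from links that go into the next scrape cycle could be done along these lines. The domain lists, the bucket names and the function names are assumptions for the example. The program itself may recognise the services differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed domains for each service recorded by the program.
SOCIAL_DOMAINS = {
    "facebook": ("facebook.com", "fb.com"),
    "twitter": ("twitter.com",),
    "vkontakte": ("vk.com", "vkontakte.ru"),
    "linkedin": ("linkedin.com",),
}


def classify_link(url: str) -> Optional[str]:
    """Return the service name for a social media link, or None otherwise."""
    host = (urlparse(url).hostname or "").lower()
    for name, domains in SOCIAL_DOMAINS.items():
        # Match the domain itself or any of its subdomains (e.g. www.).
        if any(host == d or host.endswith("." + d) for d in domains):
            return name
    return None


def sort_links(urls: list[str]) -> dict[str, list[str]]:
    """Group links per service; everything else goes to 'outgoing'."""
    buckets: dict[str, list[str]] = {name: [] for name in SOCIAL_DOMAINS}
    buckets["outgoing"] = []
    for url in urls:
        buckets[classify_link(url) or "outgoing"].append(url)
    return buckets
```

Each bucket can be held in memory during the cycle and written to its own file once the cycle ends.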
The files created by the program:
- JSON files
- outgoing links for next scrape cycle
- facebook links
- twitter links
- vkontakte links
- linkedin links
- all scraped links
- links not found
- links disallowed by robots.txt
- links whose robots.txt file could not be found
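The last two files reflect three possible outcomes of a robots.txt check: allowed, disallowed, or robots.txt not found. The sketch below illustrates that decision. Note that it uses urllib.robotparser from the standard library instead of robotexclusionrulesparser, so that it runs without extra modules. The function name, the status strings and the default user agent are assumptions.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def robots_status(robots_txt: Optional[str], url: str, user_agent: str = "*") -> str:
    """Classify a URL given the text of the site's robots.txt.

    robots_txt is None when the robots.txt file could not be found.
    """
    if robots_txt is None:
        return "robots_not_found"
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return "allowed" if parser.can_fetch(user_agent, url) else "disallowed"
```

For example, with the rules `User-agent: *` and `Disallow: /private/`, a link under /private/ would be recorded as disallowed, while other links on the same site would be scraped.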