PHist

A Python 3 web scraper that uses BeautifulSoup and writes its results to JSON files.

This program was created for the Pseudohistoria project at the University of Turku (https://sites.utu.fi/pseudohistoria/en/). It was written by Anna Ristilä, with contributions from Mila Oiva, who helped with the processing of Russian materials.

Required modules

The program uses the following extra modules:

beautifulsoup4 (https://pypi.org/project/beautifulsoup4/)
robotexclusionrulesparser (https://pypi.org/project/robotexclusionrulesparser/)

Usage

The program is designed to be run from the command line with the following arguments:

-s name_of_seedfile
-p prefix_tag
-l language_tag
-L level_tag
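
For illustration, the four flags above could be parsed with the standard `argparse` module roughly as follows. This is a minimal sketch, not the code in scraper_bs4_MAIN.py: the function name `build_parser`, the destination names, and the choice to make every flag required are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented flags; the real script
    # may parse its arguments differently.
    parser = argparse.ArgumentParser(description="PHist scraper (sketch)")
    parser.add_argument("-s", dest="seedfile", required=True,
                        help="name of the seed file")
    parser.add_argument("-p", dest="prefix", required=True,
                        help="prefix tag")
    parser.add_argument("-l", dest="language", required=True,
                        help="language tag")
    parser.add_argument("-L", dest="level", required=True,
                        help="level tag")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-s", "seeds.txt", "-p", "ph", "-l", "ru", "-L", "1"]
)
print(args.seedfile, args.prefix, args.language, args.level)
```

The tag values shown here (`ph`, `ru`, `1`) are made-up examples.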

The seed file should be a plain-text file with one link per line. Currently, the code only understands Windows newlines ("\r\n").
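
If your seed file uses other line endings, one option is to normalise it before running the scraper. The sketch below shows one way to read links regardless of newline style. It is not part of the program; the function names and the UTF-8 encoding are assumptions.

```python
def parse_seed_links(raw_text):
    # str.splitlines() breaks on "\r\n", "\n" and "\r" (and a few other
    # line boundaries), so the newline convention of the file does not matter.
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def read_seed_file(path):
    # newline="" keeps the original line endings untouched while reading;
    # the UTF-8 encoding is an assumption about the seed file.
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_seed_links(handle.read())


links = parse_seed_links("https://a.example/\r\nhttps://b.example/\n")
print(links)
```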

The other arguments are for distinguishing files and are used in filenames. The prefix and level tags are also used as names of subdirectories.
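
The exact directory nesting and file-name pattern are not documented here. Purely as an illustration of how tags can be combined into an output path, a hypothetical layout might look like this (the function `output_path`, the `prefix/level` nesting, and the numbering scheme are all assumptions):

```python
from pathlib import Path


def output_path(prefix, language, level, index):
    # Hypothetical layout: <prefix>/<level>/<prefix>_<language>_<level>_<index>.json
    # The real program may order or name these parts differently.
    directory = Path(prefix) / level
    filename = f"{prefix}_{language}_{level}_{index:04d}.json"
    return directory / filename


print(output_path("ph", "ru", "L1", 7).as_posix())
```

Before writing to such a path, the directory would need to exist, for example via `Path.mkdir(parents=True, exist_ok=True)`.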

The program only scrapes text inside p-tags.
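
To make "text inside p-tags" concrete, here is a small dependency-free sketch. Note that it uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the program's own code; the class and function names are hypothetical. With BeautifulSoup, the equivalent idea is to collect the text of every `p` element found in the parsed page.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collects the text content of <p> elements and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0   # how many <p> elements are currently open
        self._parts = []  # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._parts).strip()
                if text:
                    self.paragraphs.append(text)
                self._parts = []

    def handle_data(self, data):
        # Text inside inline tags such as <b> or <a> is kept as long as
        # an enclosing <p> is open.
        if self._depth > 0:
            self._parts.append(data)


def extract_paragraphs(html):
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


page = "<h1>Title</h1><p>Hello <b>world</b></p><div>skipped</div><p>Second</p>"
print(extract_paragraphs(page))
```

Headings, `div` content, and other text outside `p` elements are dropped, which matches the limitation described above.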

Running scraper_bs4_MAIN.py with valid arguments and a seed file produces a JSON file for every seed link that the scraper manages to scrape without problems. The program continuously updates a log file, so if it crashes or hangs, you can check which seed link caused the problem. Links ending in .mp3 or another media file extension currently cause problems.
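
Until media links are handled by the program, one possible workaround is to filter them out of the seed file beforehand. The following is a sketch only: the function `looks_like_media` and the extension list are assumptions, not part of the scraper.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list of extensions to skip; extend as needed.
MEDIA_EXTENSIONS = {".mp3", ".mp4", ".wav", ".avi", ".mov",
                    ".jpg", ".jpeg", ".png", ".gif"}


def looks_like_media(url):
    # Look only at the path component so that query strings such as
    # "?download=1" do not hide the file extension.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in MEDIA_EXTENSIONS


seeds = ["https://example.org/article.html",
         "https://example.org/audio/talk.MP3?download=1"]
print([url for url in seeds if not looks_like_media(url)])
```

This check relies on the URL alone, so media served from extension-less URLs would still get through.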

The program also keeps track of all the processed links, but writes them into files only at the end of the cycle. The program does not scrape social media links but records them for possible future use.
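
As a rough illustration of how social media links can be separated from links to be scraped, the sketch below classifies a URL by its host name. It is not the program's actual logic; the function `classify_link` and the domain table are assumptions.

```python
from urllib.parse import urlparse

# Assumed mapping from category name to domains.
SOCIAL_DOMAINS = {
    "facebook": ("facebook.com",),
    "twitter": ("twitter.com",),
    "vkontakte": ("vk.com", "vkontakte.ru"),
    "linkedin": ("linkedin.com",),
}


def classify_link(url):
    """Return the social media category of a URL, or None for ordinary links."""
    host = (urlparse(url).hostname or "").lower()
    for name, domains in SOCIAL_DOMAINS.items():
        for domain in domains:
            # Match the domain itself or any of its subdomains.
            if host == domain or host.endswith("." + domain):
                return name
    return None


print(classify_link("https://www.facebook.com/somepage"))
print(classify_link("https://example.org/article"))
```

Matching on the host name rather than on the whole URL avoids misclassifying ordinary pages that merely mention a social media domain in their path.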

The files created by the program:

JSON files
outgoing links for next scrape cycle
facebook links
twitter links
vkontakte links
linkedin links
all scraped links
links not found
links disallowed by robots.txt
links whose robots.txt file could not be found
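
The robots.txt categories in the list above come from checking each link against the rules of its site. The program uses robotexclusionrulesparser for this. The sketch below shows the same kind of check with the standard library's `urllib.robotparser` instead, so that it runs without extra installs. The helper `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    # Parse robots.txt content that has already been downloaded, then ask
    # whether the given user agent may fetch the URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.org/public/page.html"))
print(is_allowed(rules, "https://example.org/private/page.html"))
```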
