Releases · liserman/archiveRetriever
archiveRetriever 0.4.0
- Replace deprecated functions of dependencies
- Fix bugs in archive_overview() and retrieve_urls()
- New option nonArchive added to retrieve_links() and scrape_urls(). This option allows users to scrape web pages that do not stem from the Internet Archive.
- New feature added to the collapse option of scrape_urls(): collapse can now also take an XPath as input, to collapse results based on a structuring XPath. This works only with XPaths, not with CSS selectors. If used, Paths refers only to children of the structuring XPath given in collapse (see the sketch below).
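A minimal sketch of how these two additions might be combined; the URL, the XPath expressions, and the exact relative-path syntax within the collapse node are illustrative assumptions, not part of the release.

```r
library(archiveRetriever)

# Scrape a live page rather than an Internet Archive memento (nonArchive = TRUE).
# URL and XPaths below are illustrative placeholders.
articles <- scrape_urls(
  Urls = "https://www.example.com/news",
  Paths = c(title = ".//h2", teaser = ".//p[1]"),  # evaluated within each collapse node
  collapse = "//article",                          # structuring XPath: one row per matched node
  nonArchive = TRUE
)
```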
archiveRetriever 0.3.1
- Changes to the testing environment.
- Disable progress bar in non-interactive use.
archiveRetriever 0.3.0
- Fixes to the filtering of links in retrieve_links() to enable link scraping from domains with a multi-part domain ending (e.g. .co.uk).
- New option filter added to retrieve_links(). This option allows users to disable the filtering that restricts links to sub-domains of the top-level domain.
- New option pattern added to retrieve_links(). This option allows users to supply a custom pattern by which links are filtered before output (see the sketch below).
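A brief usage sketch for the two new arguments; the memento URL and the pattern are illustrative assumptions.

```r
library(archiveRetriever)

memento <- "http://web.archive.org/web/20201001000000/https://www.example.com/"  # illustrative

# Keep all links, not only those restricted to the page's own domain
all_links <- retrieve_links(memento, filter = FALSE)

# Alternatively, keep only links matching a custom pattern
politics_links <- retrieve_links(memento, pattern = "/politics/")
```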
archiveRetriever 0.2.0
- New option collapseDate added to retrieve_urls(). This option allows users to choose whether retrieve_urls() outputs all mementos or just one memento per requested day (see the sketch below).
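A minimal sketch, assuming the usual homepage/startDate/endDate arguments of retrieve_urls(); the homepage and dates are illustrative.

```r
library(archiveRetriever)

# Return every available memento per day instead of collapsing to one per day
mementos <- retrieve_urls(
  "www.example.com",         # illustrative homepage
  startDate = "2020-10-01",
  endDate   = "2020-10-07",
  collapseDate = FALSE
)
```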
archiveRetriever 0.1.2
- Fixes to the ignoreErrors option for HTML reading errors in scrape_urls()
- Fixes to retrieve_links() for errors occurring in the last URL
- Improve compatibility between retrieve_links() and scrape_urls() (see the sketch below)
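A sketch of the intended combination of the two functions, with ignoreErrors = TRUE so single failing pages do not abort the run; the memento URL, the XPath, and the name of the link column in the retrieve_links() output are assumptions.

```r
library(archiveRetriever)

links <- retrieve_links(
  "http://web.archive.org/web/20201001000000/https://www.example.com/",  # illustrative memento
  ignoreErrors = TRUE   # skip unreadable pages instead of stopping
)

content <- scrape_urls(
  Urls = links$links,                  # column name assumed; check the returned tibble
  Paths = c(text = "//article//p"),    # illustrative XPath
  ignoreErrors = TRUE
)
```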
archiveRetriever 0.1.1
- Fixes to ignoreErrors option for encoding errors in retrieve_links()
archiveRetriever 0.1.0
- New option collapse added to scrape_urls() (see the sketch below)
- Improved functionality
- Fixes to the test environment
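A hedged sketch of the new argument; as I read the release note, collapse controls whether multiple nodes matched on the same page are collapsed into one cell or kept as separate rows. The URL and XPath are illustrative.

```r
library(archiveRetriever)

# Keep each matched paragraph as its own row instead of collapsing them
paragraphs <- scrape_urls(
  Urls = "http://web.archive.org/web/20201001000000/https://www.example.com/article",  # illustrative
  Paths = c(text = "//p"),
  collapse = FALSE
)
```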
archiveRetriever 0.0.2
Scraping content from archived web pages stored in the 'Internet Archive' (https://archive.org) using a systematic workflow: get an overview of the mementos available for the respective homepage, retrieve the URLs and links of the pages, and finally scrape the content. The final output is stored in tibbles, which can then easily be used for further analysis.
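A sketch of the full workflow the description outlines; the homepage, dates, XPaths, and the assumed link column name are illustrative.

```r
library(archiveRetriever)

# 1. Calendar overview of the mementos available for a homepage (illustrative inputs)
archive_overview("www.example.com", startDate = "2020-10-01", endDate = "2020-10-31")

# 2. Retrieve memento URLs for a date range
mementos <- retrieve_urls("www.example.com", startDate = "2020-10-01", endDate = "2020-10-07")

# 3. Retrieve the links contained in those mementos
links <- retrieve_links(mementos)

# 4. Scrape the linked pages into a tibble
articles <- scrape_urls(
  Urls = links$links,                                 # column name assumed
  Paths = c(title = "//h1", text = "//article//p")    # illustrative XPaths
)
```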