---
layout: post
title: "Web Robot Detection in Academic Publishing"
byline: Lagopoulos et al.
arxiv: 1711.05098
tags:
- internet
- crawlers
summary: It is possible to detect which web visitors are real humans and which are automated crawlers, which improves interest-tracking and viewership statistics in academic publishing.
---

This work focuses on the detection of web-crawling robots on academic publishing sites. Robots, the paper explains, make up around half of internet traffic to these websites, and so it is important to exclude crawlers when determining the "organic" popularity of a website or web asset.

This paper applies several machine-learning techniques to "learn" which users are crawlers and which are real humans, using data from the web logs of both real and experimentally defined servers.
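
As a rough sketch of this setup, the snippet below trains an off-the-shelf classifier on labeled session features. The feature matrix here is synthetic placeholder data, and the random forest is just one plausible choice; the paper evaluates its own features and classifiers rather than this exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one row per session, each column a log-derived
# feature (request counts, timing statistics, etc.). Labels would come from
# known robots and human-verified sessions in the training logs.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = rng.integers(0, 2, size=1000)  # 1 = robot, 0 = human

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```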

The authors employ a few main techniques for making these determinations. First, they inspect traffic patterns, using information about how the user moves around the website, and to and from neighboring sites, to determine whether a session is robotic.
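
A minimal sketch of what such traffic-pattern features could look like, assuming the access log has already been parsed and grouped into per-user sessions; the feature names are illustrative, not the paper's exact feature set:

```python
from statistics import mean

def session_features(requests):
    """Illustrative traffic-pattern features for one session.

    `requests` is a list of (timestamp_seconds, path, referrer) tuples
    parsed from a server access log and grouped into a session
    (e.g. by IP address and user agent).
    """
    times = sorted(t for t, _, _ in requests)
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    paths = [p for _, p, _ in requests]
    referrers = [r for _, _, r in requests]
    return {
        "num_requests": len(requests),
        "unique_pages": len(set(paths)),
        # Robots tend to request pages much faster than human readers.
        "mean_gap_seconds": mean(gaps),
        # Robots often arrive with no referrer at all.
        "empty_referrer_ratio": sum(r in ("", "-") for r in referrers) / len(requests),
    }
```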

They also use page content to calibrate these predictions, which is a novel contribution of this work. The authors use the content of the crawled pages to determine whether the user has a coherent topic of interest, as a human might, or is indiscriminately clicking links and exploring pages, as a robot would.
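
The paper's semantic features are built from the content of the requested pages. As one plausible proxy (not the authors' exact formulation), a session's topical coherence could be scored as the average TF-IDF similarity between the pages it visited:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topical_coherence(page_texts):
    """Mean pairwise similarity of the pages visited in one session.

    A human pursuing a topic of interest tends to read semantically
    similar pages (high coherence); a robot following links
    indiscriminately tends to score low.
    """
    if len(page_texts) < 2:
        return 1.0  # a single page is trivially coherent
    vectors = TfidfVectorizer(stop_words="english").fit_transform(page_texts)
    sims = cosine_similarity(vectors)
    n = sims.shape[0]
    # Average the off-diagonal entries (exclude each page's self-similarity).
    return (sims.sum() - n) / (n * (n - 1))
```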

From this work, it is possible to better establish which users are robots and which are humans, calibrating viewership statistics in academic publishing and likely preventing artificial "view" or "interest" inflation by automated readers.