Auto-detect broken links #152

Closed
MrFlick opened this issue Nov 11, 2021 · 3 comments · Fixed by #154
MrFlick commented Nov 11, 2021

I wrote up a rough R script to locate broken links in the course content. In total it found 952 links, and of those, 43 were broken. That's a p-value of 0.0451, so it's kind of significant 😄. Here's the code:

library(rvest)
library(purrr)
library(dplyr)

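# Return the base directory of a URL (the URL with its last path component removed)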
base_url <- Vectorize(function(url) {
  r <- httr::parse_url(url)
  r$path <- dirname(httr::parse_url(url)$path)
  httr::build_url(r)
}, USE.NAMES = FALSE)

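# GET each URL with a pause between requests; return the HTTP status codes (999 if a request fails)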
polite_GET_status <- function(urls, delay=1) {
  statuses <- rep(0, length(urls))
  safe_status <- possibly(~httr::status_code(httr::GET(.x)), otherwise=999)
  for(i in seq_along(urls)) {
    statuses[i] <- safe_status(urls[i])
    Sys.sleep(delay)
  }
  statuses
}

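# Check all links on a parsed page: resolve hrefs against `base`, reuse statuses already
# recorded in `seen`, and verify that "#anchor" links point at an existing (and unique) id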
check_page_links <- function(page, base="", seen=data.frame(url=character(0)), delay=1) {
  stopifnot(inherits(page, "xml_node"))
  stopifnot("url" %in% names(seen))
  page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
  page_hrefs <- page %>% html_nodes("a") %>% html_attr("href") %>% unique()
  grep_filter <- function(x, pattern, negate=FALSE, ...) x[grepl(pattern, x, ...) != negate]
  links <- page_hrefs %>% 
    grep_filter("^(#|mailto)", negate=TRUE) %>% 
    tibble(href=.) %>% 
    mutate(url = url_absolute(href, base=base))
  new_links <- links %>% 
    anti_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url") %>% 
    mutate(status=polite_GET_status(url, delay=delay),
           result = case_when(status==200~"OK", status==404~"Not Found", TRUE~"Error"))
  seen_links <- links %>% 
    inner_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url")
  dups <- function(x) unique(x[duplicated(x)])
  dup_paged_ids <- dups(page_ids)
  anchors <- page_hrefs %>% 
    grep_filter("^#") %>% 
    tibble(href=.) %>% 
    mutate(result = case_when(
      href %in% dup_paged_ids ~ "Duplicate ID",
      href %in% page_ids ~ "OK",
      TRUE ~ "Not Found"))
  bind_rows(new_links, seen_links, anchors)
}

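# Crawl pages under root_url starting from seed_url and collect link-check results for every page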
crawl_pages <- function(seed_url, root_url=base_url(seed_url), delay=.5) {
  page <- read_html(seed_url)
  results <- check_page_links(page, base=seed_url, delay=delay) %>% 
    mutate(page=seed_url)

  filter_children <- function(x, parent) { unique(x[which(startsWith(x, parent))]) }  
  scanned <- seed_url
  toscan <- setdiff(filter_children(results$url, root_url), scanned)
  seen <- results %>% select(url, status, result) %>% unique()
  while(length(toscan) > 0) {
    url <- toscan[1]
    page <- possibly(read_html,otherwise=NULL)(url)
    if (!is.null(page)) {
      pageresults <- check_page_links(page, base=url, seen=seen, delay=delay) %>% 
        mutate(page=.env$url)
      
      candidates <- filter_children(pageresults %>% filter(status==200) %>% pull(url), root_url)
      toscan <- append(toscan, setdiff(setdiff(candidates, scanned), toscan))
      
      results <- results %>% bind_rows(pageresults)
      seen <- results %>% select(url, status, result) %>% unique()
    }
    toscan <- toscan[-1]
    scanned <- c(scanned, url)
    Sys.sleep(delay)
  }
  results
}

url <- "https://umcarpentries.org/intro-curriculum-r/index.html"
all_links <- crawl_pages(url)
all_links %>% filter(result != "OK")

This will return a table showing the page that was searched, the href used in the <a> tag on that page, the "full" URL that the href resolves to (relative URLs are turned into absolute URLs), the HTTP status code returned from a GET request to that URL, and the result, which is "OK" for good links and non-OK for any potential problems.
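
For example, to summarize the results after a run (a convenience snippet, not part of the script above; the output file name is just a placeholder):

# Tally link-check results by category (OK / Not Found / Error / Duplicate ID)
all_links %>% count(result, sort = TRUE)

# Save the full table, e.g. to attach to an issue or keep as a workflow artifact
write.csv(all_links, "link_check_results.csv", row.names = FALSE)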

Would it be possible to incorporate something like this into the build workflow so it can check for broken links automatically?

It does take some time to run the code, but that's mostly because I've added a delay between HTTP requests in order to be a "polite" web scraper and not bombard any one server with too many requests in a short period of time. The current delay between requests is 0.5 seconds. The code is written to not query the same URL twice, but there are still about 160 unique URLs that are checked.

It checks both URL links and anchor-style links. So if the href starts with "#", it will search the IDs on the page to make sure the anchor is there. Note that since it uses the element ID, and those IDs need to be unique on a page, it will also report if one of those IDs has been duplicated, which would interfere with the links.
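
As a small standalone illustration of the duplicate-ID check (a toy example, not part of the script above):

library(rvest)

# A page with a duplicated id and one anchor link pointing at it
page <- read_html('<h2 id="setup">Setup</h2><h2 id="setup">Setup (again)</h2><a href="#setup">jump</a>')

page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
page_ids[duplicated(page_ids)]
#> [1] "#setup"
# check_page_links() would report the "#setup" href as "Duplicate ID"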

@kelly-sovacool

Awesome, we should definitely incorporate this into the build workflow.

kelly-sovacool commented Nov 22, 2021

Since this checks the rendered site at https://umcarpentries.org/intro-curriculum-r/, it won't catch a link that someone breaks in a PR until after the PR is merged. Links outside our control (e.g. links to the cheatsheets) could also break at any time. Rather than putting this in the build-website workflow, maybe it should be its own GitHub Actions workflow that runs on a cron schedule?
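
Something along these lines might work as a standalone workflow (untested sketch; the file path, schedule, and script name below are placeholders):

# .github/workflows/check-links.yml (placeholder path)
name: check-links

on:
  schedule:
    - cron: "0 6 1 * *"     # roughly once a month; adjust as needed
  workflow_dispatch:        # also allow manual runs

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1
      - name: Install R packages
        run: Rscript -e 'install.packages(c("rvest", "httr", "purrr", "dplyr"), repos = "https://cloud.r-project.org")'
      - name: Check links
        run: Rscript check_links.R   # assumes the R script above is saved as check_links.R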

MrFlick commented Nov 22, 2021

Those are good points. I don't actually have much experience with GitHub Actions myself, so I wasn't sure what was possible. But if the code runs once a month or so to check links, that would be helpful.
