Auto-detect broken links #152

Closed
MrFlick opened this issue Nov 11, 2021 · 3 comments · Fixed by #154
MrFlick commented Nov 11, 2021

I wrote up a rough R script to locate broken links in the course content. In total it found 952 links, and of those, 43 were broken. That's a p-value of 0.0451, so it's kind of significant 😄. Here's the code:

library(rvest)
library(purrr)
library(dplyr)

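# Return the base directory of a URL (the URL with its last path component removed)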
base_url <- Vectorize(function(url) {
  r <- httr::parse_url(url)
  r$path <- dirname(httr::parse_url(url)$path)
  httr::build_url(r)
}, USE.NAMES = FALSE)

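# GET each URL with a pause between requests; return the HTTP status codes (999 if a request fails)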
polite_GET_status <- function(urls, delay=1) {
  statuses <- rep(0, length(urls))
  safe_status <- possibly(~httr::status_code(httr::GET(.x)), otherwise=999)
  for(i in seq_along(urls)) {
    statuses[i] <- safe_status(urls[i])
    Sys.sleep(delay)
  }
  statuses
}

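# Check all links on a parsed page: resolve hrefs against `base`, reuse statuses already
# recorded in `seen`, and verify that "#anchor" links point at an existing (and unique) id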
check_page_links <- function(page, base="", seen=data.frame(url=character(0)), delay=1) {
  stopifnot(inherits(page, "xml_node"))
  stopifnot("url" %in% names(seen))
  page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
  page_hrefs <- page %>% html_nodes("a") %>% html_attr("href") %>% unique()
  grep_filter <- function(x, pattern, negate=FALSE, ...) x[grepl(pattern, x, ...) != negate]
  links <- page_hrefs %>% 
    grep_filter("^(#|mailto)", negate=TRUE) %>% 
    tibble(href=.) %>% 
    mutate(url = url_absolute(href, base=base))
  new_links <- links %>% 
    anti_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url") %>% 
    mutate(status=polite_GET_status(url, delay=delay),
           result = case_when(status==200~"OK", status==404~"Not Found", TRUE~"Error"))
  seen_links <- links %>% 
    inner_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url")
  dups <- function(x) unique(x[duplicated(x)])
  dup_paged_ids <- dups(page_ids)
  anchors <- page_hrefs %>% 
    grep_filter("^#") %>% 
    tibble(href=.) %>% 
    mutate(result = case_when(
      href %in% dup_paged_ids ~ "Duplicate ID",
      href %in% page_ids ~ "OK",
      TRUE ~ "Not Found"))
  bind_rows(new_links, seen_links, anchors)
}

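# Crawl pages under root_url starting from seed_url and collect link-check results for every page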
crawl_pages <- function(seed_url, root_url=base_url(seed_url), delay=.5) {
  page <- read_html(seed_url)
  results <- check_page_links(page, base=seed_url, delay=delay) %>% 
    mutate(page=seed_url)

  filter_children <- function(x, parent) { unique(x[which(startsWith(x, parent))]) }  
  scanned <- seed_url
  toscan <- setdiff(filter_children(results$url, root_url), scanned)
  seen <- results %>% select(url, status, result) %>% unique()
  while(length(toscan) > 0) {
    url <- toscan[1]
    page <- possibly(read_html,otherwise=NULL)(url)
    if (!is.null(page)) {
      pageresults <- check_page_links(page, base=url, seen=seen, delay=delay) %>% 
        mutate(page=.env$url)
      
      candidates <- filter_children(pageresults %>% filter(status==200) %>% pull(url), root_url)
      toscan <- append(toscan, setdiff(setdiff(candidates, scanned), toscan))
      
      results <- results %>% bind_rows(pageresults)
      seen <- results %>% select(url, status, result) %>% unique()
    }
    toscan <- toscan[-1]
    scanned <- c(scanned, url)
    Sys.sleep(delay)
  }
  results
}

url <- "https://umcarpentries.org/intro-curriculum-r/index.html"
all_links <- crawl_pages(url)
all_links %>% filter(result != "OK")

This will return a table showing the page that was searched, the href used in the <a> tag on that page, the "full" URL that the href resolves to (relative URLs are turned into absolute URLs), the HTTP status code returned from a GET request to that URL, and the result, which is "OK" for good links and non-OK for any potential problems.
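
For example, to summarize the results after a run (a convenience snippet, not part of the script above; the output file name is just a placeholder):

# Tally link-check results by category (OK / Not Found / Error / Duplicate ID)
all_links %>% count(result, sort = TRUE)

# Save the full table, e.g. to attach to an issue or keep as a workflow artifact
write.csv(all_links, "link_check_results.csv", row.names = FALSE)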

Would it be possible to incorporate something like this into the build workflow so it can check for broken links automatically?

It does take some time to run the code, but that's mostly because I've added a delay between HTTP requests in order to be a "polite" web scraper and not bombard any one server with too many requests in a short period of time. The current delay between requests is 0.5 seconds. The code is written to not query the same URL twice, but there are still about 160 unique URLs that are checked.

It checks both URL links and anchor-style links. So if the href starts with "#", it will search the IDs on the page to make sure the anchor is there. Note that since it uses the element ID, and those IDs need to be unique on a page, it will also report if one of those IDs has been duplicated, which would interfere with the links.
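
As a small standalone illustration of the duplicate-ID check (a toy example, not part of the script above):

library(rvest)

# A page with a duplicated id and one anchor link pointing at it
page <- read_html('<h2 id="setup">Setup</h2><h2 id="setup">Setup (again)</h2><a href="#setup">jump</a>')

page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
page_ids[duplicated(page_ids)]
#> [1] "#setup"
# check_page_links() would report the "#setup" href as "Duplicate ID"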

@kelly-sovacool

Awesome, we should definitely incorporate this into the build workflow.

kelly-sovacool commented Nov 22, 2021

Since this checks the rendered site at https://umcarpentries.org/intro-curriculum-r/, it won't catch a link that someone breaks in a PR until after the PR is merged. Links outside our control (e.g. links to the cheatsheets) could also break at any time. Rather than putting this in the build-website workflow, maybe it should be its own GitHub Actions workflow that runs on a cron schedule?
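
Something along these lines might work as a standalone workflow (untested sketch; the file path, schedule, and script name below are placeholders):

# .github/workflows/check-links.yml (placeholder path)
name: check-links

on:
  schedule:
    - cron: "0 6 1 * *"     # roughly once a month; adjust as needed
  workflow_dispatch:        # also allow manual runs

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1
      - name: Install R packages
        run: Rscript -e 'install.packages(c("rvest", "httr", "purrr", "dplyr"), repos = "https://cloud.r-project.org")'
      - name: Check links
        run: Rscript check_links.R   # assumes the R script above is saved as check_links.R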

MrFlick commented Nov 22, 2021

Those are good points. I don't actually have much experience with GitHub Actions myself, so I wasn't sure what was possible. But if the code runs once a month or so to check links, that would be helpful.
