I wrote up a rough R script to locate broken links in the course content. In total it found 952 links, and of those, 43 were broken. That's a p-value of 0.0451, so it's kind of significant 😄. Here's the code.
This will return a table that will show the page that was searched, the href used in the `<a>` tag on that page, the "full" URL it evaluates to (turning relative URLs into absolute URLs), the HTTP status code returned from a GET request to that URL, and the result, which is "OK" for good links and non-OK for any potential problems.
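As a rough sketch of the approach (not the original script; the function and package choices below are just one way to build such a table with rvest and httr):

```r
library(rvest)
library(httr)
library(xml2)
library(purrr)
library(dplyr)

# For one page: collect every <a> href, resolve it to an absolute URL,
# and record the HTTP status of a GET request to that URL.
check_page_links <- function(page_url) {
  page  <- read_html(page_url)
  hrefs <- page %>% html_elements("a") %>% html_attr("href")
  hrefs <- unique(hrefs[!is.na(hrefs)])

  map_dfr(hrefs, function(href) {
    # turn relative hrefs into absolute URLs against the page they came from
    full_url <- url_absolute(href, page_url)
    status   <- tryCatch(status_code(GET(full_url)),
                         error = function(e) NA_integer_)
    tibble(
      page     = page_url,
      href     = href,
      full_url = full_url,
      status   = status,
      result   = if (!is.na(status) && status < 400) "OK" else "broken"
    )
  })
}
```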
Would it be possible to incorporate something like this into the build workflow so it can check for broken links automatically?
It does take some time to run the code, but that's mostly because I've added a delay between HTTP requests in order to be a "polite" web scraper and not bombard any one server with too many requests in a short period of time. The current delay between requests is 0.5 seconds. The code is written to not query the same URL twice, but there are still about 160 unique URLs that are checked.
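A minimal sketch of the caching-plus-delay idea, assuming an httr-based checker (`check_url` is an illustrative name, not the original function):

```r
# Cache results per URL so no URL is requested twice, and pause only
# before fresh requests.
check_url <- local({
  cache <- new.env(parent = emptyenv())
  function(url, delay = 0.5) {
    if (exists(url, envir = cache, inherits = FALSE)) {
      return(get(url, envir = cache, inherits = FALSE))
    }
    Sys.sleep(delay)  # polite pause before each new request
    status <- tryCatch(httr::status_code(httr::GET(url)),
                       error = function(e) NA_integer_)
    assign(url, status, envir = cache)
    status
  }
})
```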
It checks both URL links and anchor-style links, so if the URL uses "#" it will search the IDs on the page to make sure the target is there. Note that since it uses the element ID, and those IDs need to be unique on a page, it will also report if one of those IDs has been duplicated, which would interfere with the links.
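A sketch of how the anchor check might look (again illustrative, not the exact script):

```r
library(rvest)

# Collect every id attribute on the target page, warn on duplicates, and
# report whether the "#fragment" part of the link actually exists there.
check_fragment <- function(page_url, fragment) {
  ids <- html_attr(html_elements(read_html(page_url), "[id]"), "id")
  if (anyDuplicated(ids) > 0) {
    warning("duplicate element ids on ", page_url, ": ",
            paste(unique(ids[duplicated(ids)]), collapse = ", "))
  }
  fragment %in% ids
}
```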
Since this checks the rendered site at https://umcarpentries.org/intro-curriculum-r/, it won't catch if someone breaks a link in a PR until after the PR is merged. Links outside our control could also break (e.g. links to the cheatsheets) at any time. Rather than putting this in the build-website workflow, maybe it should be its own GitHub Actions workflow that runs on a cron schedule?
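A minimal sketch of what that scheduled workflow could look like (the script name `check-links.R`, the package list, and the monthly cron expression are placeholders):

```yaml
name: check-links

on:
  schedule:
    - cron: '0 6 1 * *'   # once a month; adjust as needed
  workflow_dispatch:       # allow manual runs too

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Install packages
        run: Rscript -e 'install.packages(c("rvest", "httr", "purrr", "dplyr"))'
      - name: Run link checker
        run: Rscript check-links.R
```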
Those are good points. I don't actually have much experience with GitHub Actions myself, so I wasn't sure what was possible. But if the code runs once a month or something to check links, that would be helpful.