Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

robotstxt #25

Closed
10 tasks done
petermeissner opened this issue Jan 9, 2016 · 24 comments
Closed
10 tasks done

robotstxt #25

petermeissner opened this issue Jan 9, 2016 · 24 comments

Comments

@petermeissner
Copy link

    1. What does this package do? (explain in 50 words or less)

Web scraping allows to gather information of scientific value - mainly social science related in my experience. While scraping web pages one should respect permissions declared in robots.txt files.
The package provides functions to retrieve and parse robots.txt files. The core functionality is to check a bots/users permission for one or more resources (paths) for a given domain. To ease checking all functions have been bundled with relevant data into an R6 robotstxt class but everything works functional or object oriented depending on the users preferences.

    1. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).
Package: robotstxt
Type: Package
Title: A 'robots.txt' Parser and Webbot/Webspider/Webcrawler Permissions Checker
Version: 0.1.0
Author: Peter Meissner
Maintainer: Peter Meissner <[email protected]>
Description: Class ('R6') and accompanying methods to
    parse and check 'robots.txt' files. Data fields are provided as
    data frames and vectors. Permissions can be checked by providing
    path character vectors and optional bot names.
License: MIT + file LICENSE
LazyData: TRUE
BugReports: https://github.com/petermeissner/robotstxt/issues
URL: https://github.com/petermeissner/robotstxt
Imports:
    R6 (>= 2.1.1),
    stringr (>= 1.0.0),
    httr (>= 1.0.0)
Suggests:
    knitr,
    rmarkdown,
    dplyr,
    testthat
Depends:
    R (>= 3.0.0)
VignetteBuilder: knitr
RoxygenNote: 5.0.1
    1. URL for the package (the development repository, not a stylized html page)

https://github.com/petermeissner/robotstxt

    1. What data source(s) does it work with (if applicable)?

robots.txt files like:

Package developers and users that want an easy way to be nice while gathering data from the web.

    1. Are there other R packages that accomplish the same thing? If so, what is different about yours?

None that I know of.

    1. Check the box next to each policy below, confirming that you agree. These are mandatory.
  • This package does not violate the Terms of Service of any service it interacts with
  • The repository has continuous integration with Travis and/or another service: https://travis-ci.org/petermeissner/robotstxt
  • The package contains a vignette:
  • The package contains a reasonably complete readme with devtools install instructions
  • The package contains unit tests
  • The package only exports functions to the NAMESPACE that are intended for end users
    1. Do you agree to follow the [rOpenSci packaging guidelines]

Yes, good guidelines!

  • Are there any package dependencies not on CRAN?

No.

  • Do you intend for this package to go on CRAN?

With or without ropensci.

  • Does the package have a CRAN accepted license?

yes, MIT

  • Did devtools::check() produce any errors or warnings? If so paste them below.

no:

* DONE
Status: OK

R CMD check succeeded
    1. Please add explanations below for any exceptions to the above:

Does not apply.

    1. If this is a resubmission following rejection, please explain the change in cirucmstances.

No, no resubmit.

@sckott
Copy link
Contributor

sckott commented Jan 10, 2016

Thanks for the submission 🚀 - seeking reviewers

@sckott
Copy link
Contributor

sckott commented Jan 10, 2016

Reviewers: @richfitz @Ironholds

@Ironholds
Copy link
Member

  1. Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.
  2. Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
  3. The assignment here seems extraneous.
  4. "did apply" to "applied" here

As you may notice, most of these quibbles are well, quibbles. On the whole the package is excellent :). I'll try to put time into a more detailed review later today or early tomorrow.

@petermeissner
Copy link
Author

Thanks for the ropensci infrastructure and thanks for the feedback - much-much-much appreciated.

in response to @Ironholds

  1. Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.
  • ... that is why devtools::build_win() never called back ... (Windows build only complains about qpdf missing link1 link2)
  • fixed
  1. Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
  • fixed
  1. The assignment here seems extraneous.
  • after checking extranous here and here ...
  • yes, developement relict
  • fixed
  1. "did apply" to "applied" here
  • fixed

question of my own

path_allowed() (link) will response to unforseen permission constallations with a warning and return all data that led to the problem.

  • Do you think that design decision is wise/ok ??
  • Is it better to return a warning and FALSE for each path?
  • Should it stop with an error? I think not - when used by headless programs this might break them also the problem is not fatal/severe

@Ironholds
Copy link
Member

I think that's totally fine behaviour, but I'd caution you, in that scenario, to use more formal/explanatory description text, noting that problematic data will be returned.

(On the subject of formality, working on a more formal review now)

@Ironholds
Copy link
Member

On the whole: solid work! Some thoughts:

  1. There are a lot of functions, such as rt_get_fields, which are documented but not exported. What this means in practice is that the code can't be found or used but the documentation does appear within the package index, which is pretty confusing. I'd suggest adding the "internal" roxygen2 keyword to the docs for those functions so that you don't index them.
  2. The vignette is still using "Vignette Title" for indexing, so that's what CRAN will list it as (do not feel bad, I have made this mistake 150m times and never notice until I've already submitted it to CRAN and they've accepted >.>)
  3. The vignette opens "The package provides a class (‘R6’)". This makes it sound like the class is called R6, at least to me.
  4. The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter at the top, with some examples, rather than just links at the bottom.

@petermeissner
Copy link
Author

in response to @Ironholds II

  1. drop unexported from manual
  • done
  1. vignette title / index entry
  • done
  1. The vignette opens "The package ...
  • done
  1. The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter ...
  • done

@petermeissner
Copy link
Author

in regard to how path_allowed() handles problems with permission deduction

  • path_allowed() will now warn more informative describing what happened:

    robotstxt::path_allowed() says:
    Encountered problems while trying to determine bot permissions, returning NA instead of TRUE/FALSE

  • furthermore, path_allowed() will return NA in cases where the permission cannot be determined

@petermeissner
Copy link
Author

Dear Oliver (@Ironholds ),

thank you very much for taking the time to review my package and comment on it. I think the package gained a lot from your suggestions and I myself learned a lot as well. I really appreciate the effort. I would like to add you as contributor - what is the ropensci policy on that matter - within DESCRIPTION and README, if that's ok with you.

@Ironholds
Copy link
Member

I feel like I should say "the package looks really good now, sorry for not mentioning it earlier, this week has been nuts" before accepting or rejecting ;p.

@richfitz
Copy link
Member

I'm sorry, I have left this very late so I will just jot down the thoughts I have rather than a full review, in the hope that it is useful.

My biggest major comment is that I don't see what R6 is adding here; it seems that the return value could be a list (which contains one function).

Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used it seems an unnecessary weirdness for most users (and in my experience users do find them weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird (x <- robotstxt() not x <- robotstxt$new()).

robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)){
    self$text <- text
    if(!missing(domain)){
      self$domain <- domain
    }
  }else{
    if(!missing(domain)){
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    }else{
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data

  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths="/", bot="*", permission=self$permissions) {
    paths_allowed(permissions=permission, paths=paths, bot=bot)
  }

  class(self) <- "robotstxt"
  self
}

(this passes checks in the package by deleting only the $new in the test file).

I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having to have any sort of object for the users to worry about.

cache <- new.env(parent=emptyenv())
path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    cache[[x$hostname]] <- obj <- robotstxt(hostname)
  }
  obj$check(x$path)
}

Minor comments

Avoid httr::content(rtxt) in package code (see ?httr::content) in favour of checking that the return type was correct. In development versions of httr I find I also have to declare encoding="UTF-8" to avoid a message on every request, too.

In get_robotstxt, is warning the appropriate behaviour? I would have thought that for 403 and 5xx errors it should be an error. For 404 errors could we not assume that there is_ no robots.txt and therefore it's all fair game?

I find that missing() lends itself to hard-to-diagnose errors when functions are several layers deep; consider explicit NULL arguments and is.null() tests for missing-ness (purely style).

You have a partial match of $useragent to $useragents in R/permissions.R and throughout tests/testthat/test_parser.R

@petermeissner
Copy link
Author

Pfuh, some thoughts -aye? Sounds all very interesting and helpful in the sense that I really have to think about it. Now, this will take me a while. Thank you very much - most appreciated!

@petermeissner
Copy link
Author

@richfitz

I am working on implementing all of your suggestion. But I really would like to understand better what happens in ropensci/robotstxt#6 .

Do you have maybe a reference or a catch phrase for me that I can look up? What I specifically have a hard time wrapping my head around is that the function within the list knows where to find the data.

example

# function definition

robotstxt_dev <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)){
    self$text <- text
    if(!missing(domain)){
      self$domain <- domain
    }
  }else{
    if(!missing(domain)){
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    }else{
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data

  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths="/", bot="*", permission=self$permissions) {
    paths_allowed(permissions=permission, paths=paths, bot=bot)
  }

  class(self) <- "robotstxt"
  self
}

# example
library(robotstxt)
blah <- robotstxt_dev("pmeissner.com")$check
blah("_layouts")

At the moment, that feels like pure black magic.

@petermeissner
Copy link
Author

@richfitz

I have now implemented all suggestions.

  • In regard to the caching / memoisation suggestion I took the freedom to implement the caching in another function (get_robotstxt()) instead - I think it is more appropriate to do the caching at this step.
  • Also, I implemented the related suggestion to allow for one single function call to check permissions. Nonetheless, I kept the possibility to create robots.txt objects. (see updated examples in README and Vignette)

Thank you very very much for taking the time to go through my code. The package gained a lot from that, I think and I learned lot for myself. I really appreciate the effort. Thanks.

@petermeissner
Copy link
Author

Just some further thoughts ...

In regard to discarding R6 because of not using "reference semantics" I have done some research - ah, that's what it is.

  • First, I will still leave R6 out because it can be done without - as you have suggested - and its one dependency less.
  • But, using reference semantics is not the only reason to use Oo-design.
  • In this case I think the tight connection of data and functions would be reason enough to use it.
  • In regard to users who might be confused by using robotstxt$new() instead of robotstxt(): I do not think that the first is more confusing than the latter. In the first case the user is explicitly told that he will use something that might be unusual. In the second case it is more like giving the user something that looks like something he understands but than letting it do black magic (the black magic is called clojure and is very well explained in Wickham 2015: "Advanced R", p. 175ff, in particular p. 186)

@petermeissner
Copy link
Author

Hey everyone (@sckott ), so what is the next step? All suggestions have been gratefully aceppted and incorporated. Is there another thumbs up needed from @richfitz or something else I should/can do or?

@sckott
Copy link
Contributor

sckott commented Apr 26, 2016

seems ready to merge in. I do see

checking for missing documentation entries
W  checking for code/documentation mismatches
   Codoc mismatches from documentation object 'robotstxt':
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(domain = "mydomain.com")
     Argument names in code not in docs:
       text
     Mismatches in argument default values:
       Name: 'domain' Code: NULL Docs: "mydomain.com"
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(text = "User-agent: *\nDisallow: /")
     Argument names in code not in docs:
       domain
     Mismatches in argument names:
       Position: 1 Code: domain Docs: text
     Mismatches in argument default values:
       Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"

W  checking Rd \usage sections
   Undocumented arguments in documentation object 'get_robotstxt'warnUndocumented arguments in documentation object 'robotstxt'domain’ ‘text

can these be fixed first?

@petermeissner
Copy link
Author

Sorry, I missed that. @sckott thanks for checking - its all fixed.

All checks now pass locally (Ubuntu) without errors, warnings or notes, the same is true for winbuilder checks.

@sckott
Copy link
Contributor

sckott commented Apr 27, 2016

Great! Thanks. Approved.

I added you to a team with admin access, and now you should be able to transfer the repo to ropenscilabs - we move pkgs to ropensci github org once more mature - your pkg is still a fully ropensci pkg even though in ropenscilabs

@petermeissner
Copy link
Author

Coooool.

Transfer repo? : It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs" - right?
How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.

@sckott
Copy link
Contributor

sckott commented Apr 27, 2016

It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs"

yes

How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.

same as before - you are still the maintainer of your pkg, and can submit to CRAN as you like. You can push new version to CRAN before transferring if you like

@petermeissner
Copy link
Author

Transfered to ropenscilabs - everything works fine.

@sckott
Copy link
Contributor

sckott commented Apr 27, 2016

awesome, thanks for your work and contribution

@petermeissner
Copy link
Author

Nah..., I am the one indebted for getting all that great feadback - thanks for letting me join the club.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants