
Robust checker of whether a URI is live or not #50

Open
opoudjis opened this issue Mar 6, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments

opoudjis commented Mar 6, 2024

As a result of metanorma/metanorma-iso#1114, I have enabled code that was previously deactivated, which checks whether a URI in a bibliographic entry is live. This is done when the bibliography requires a date last accessed to be supplied and one has not been supplied already.

https://github.com/relaton/relaton-render/blob/main/lib/relaton/render/general/uri.rb

The problem is, it isn't working well, and it needs someone who understands HTTP fetching better than I do to fix it.

For example:

      def url_exist?(url_string)
        url = URI.parse(url_string)
        url.host or return true # allow file URLs
        res = access_url(url) or return false
        res.is_a?(Net::HTTPRedirection) and return url_exist?(res["location"])
        res.code[0] != "4"
      rescue Errno::ENOENT, SocketError
        false # false if can't find the server
      end

seems to end up in an infinite loop of redirections when given https://dl.acm.org/doi/10.1145/3425898.3426958.

The server returns HTTP 302 Found, but the redirect target is a cookie-setting URL, https://dl.acm.org/doi/10.1145/3425898.3426958?cookieSet=1, which redirects back again, so the check loops forever. Clearly recursing on res.is_a?(Net::HTTPRedirection) with no loop detection is naive, but TBH I don't have the headspace to make this robust.
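For illustration, the textbook guard is to cap the redirect depth and remember which URLs have already been visited. A minimal sketch of what that could look like here (max_redirects, seen, and the Location resolution are illustrative additions, not anything currently in the gem; access_url is the gem's existing fetch helper):

      require "net/http"
      require "set"
      require "uri"

      # Sketch only: redirect cycles (e.g. ...?cookieSet=1 bouncing back)
      # terminate because each URL is visited at most once, and depth is capped.
      def url_exist?(url_string, max_redirects: 5, seen: Set.new)
        return false if max_redirects.negative?
        seen.add?(url_string) or return false # nil means already visited: a loop
        url = URI.parse(url_string)
        url.host or return true # allow file URLs
        res = access_url(url) or return false
        if res.is_a?(Net::HTTPRedirection)
          target = URI.join(url_string, res["location"]).to_s # resolve relative Location
          return url_exist?(target, max_redirects: max_redirects - 1, seen: seen)
        end
        res.code[0] != "4"
      rescue Errno::ENOENT, SocketError
        false # can't find the server
      end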

PDFs are routinely returning false on the res.code[0] != "4" test. For example:

http://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x returns HTTP 301 Moved Permanently, which really is a redirect, and its res["location"] is simply the same URL over HTTPS, https://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x. When I access that, I get HTTP 403 Forbidden. But HTTP 403 is exactly what I expect for a paywalled resource! The gem should not be reporting a failure there.

So this needs a smarter treatment of the possible HTTP codes. Really, the only cases where a URI should count as invalid are (I think) 404 and the 50x range. But I don't want to do this: I want someone else who is familiar with HTTP codes, paywalled content, and redirects to do it.
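For illustration, that policy could collapse into a small predicate along these lines (dead_status? is a made-up name, and I have added 410 Gone as the obvious sibling of 404):

      # Sketch: only statuses that positively say "gone" count as dead;
      # 401/403 (paywalls), 2xx, and 3xx all count as alive.
      def dead_status?(code)
        code = code.to_i
        code == 404 || code == 410 || (500..599).cover?(code)
      end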

I do not agree with Ronald that a new gem is required, but I am asking that someone else handle this. For now, I'm doing a hotfix that passes all URIs it sees.

opoudjis commented Mar 6, 2024

If you cannot do this, @andrew2net, we can give this to @alexeymorozov. As long as it's not me :)

andrew2net commented:

@opoudjis it would be helpful if you gave this to someone else.

opoudjis commented Mar 7, 2024

Fair. @alexeymorozov?

ronaldtse commented:

There are many things to fix here:

  • We should only care about the HEAD status response instead of retrieving the whole URL response, which can be 100 MB.
  • A URI is not meant to be resolved. ONLY a URL is meant to be accessible.
  • There are MANY sites that require browser access (cookies, etc.) or JS.
  • It is very difficult to check whether a document is still available; it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...

opoudjis commented Mar 7, 2024

There are many things to fix here:

  • We should only care about the HEAD status response instead of retrieving the whole URL response, which can be 100 MB.

Already being done
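(For reference, a HEAD-only probe with Net::HTTP looks roughly like the sketch below. This is a generic illustration, not the gem's actual access_url; note that some servers answer HEAD with 405 Method Not Allowed, so a GET fallback may still be needed.)

      require "net/http"
      require "uri"

      # Sketch: fetch status line and headers only; the response body
      # (which might be a 100 MB PDF) is never downloaded.
      def head_status(url_string, timeout: 5)
        url = URI.parse(url_string)
        Net::HTTP.start(url.host, url.port,
                        use_ssl: url.scheme == "https",
                        open_timeout: timeout, read_timeout: timeout) do |http|
          http.head(url.request_uri)
        end
      end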

  • A URI is not meant to be resolved. ONLY a URL is meant to be accessible.

Fine.

  • There are MANY sites that require browser access (cookies, etc.) or JS.

And as I said, HTTP 403 Forbidden has to be assumed to indicate a valid URL.

  • It is very difficult to check whether a document is still available; it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...

It's a best-effort check. The truly authoritative way of doing this is for the author to insert a manual accessed date, indicating that they have physically sighted the resource.

opoudjis commented:
No developer is currently working on this. @ronaldtse This needs to be addressed.
