
Robust checker of whether a URI is live or not #50

Open
opoudjis opened this issue Mar 6, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments

opoudjis commented Mar 6, 2024

As a result of metanorma/metanorma-iso#1114, I have enabled code that was previously deactivated, which checks whether a URI in a bibliographic entry is live. This is done when the bibliography requires a date last accessed to be supplied and one has not been supplied already.

https://github.com/relaton/relaton-render/blob/main/lib/relaton/render/general/uri.rb

The problem is, it isn't working well, and it needs someone who understands HTTP fetching better than I do to fix it.

For example:

      def url_exist?(url_string)
        url = URI.parse(url_string)
        url.host or return true # allow file URLs
        res = access_url(url) or return false
        res.is_a?(Net::HTTPRedirection) and return url_exist?(res["location"])
        res.code[0] != "4"
      rescue Errno::ENOENT, SocketError
        false # false if can't find the server
      end

seems to end up in an infinite loop of redirections when given https://dl.acm.org/doi/10.1145/3425898.3426958.

The server returns HTTP 302 Found, but the redirect target is a cookie-setting URL, https://dl.acm.org/doi/10.1145/3425898.3426958?cookieSet=1, which redirects back again, so the check loops forever. Clearly recursing on res.is_a?(Net::HTTPRedirection) with no loop detection is naive, but TBH I don't have the headspace to make this robust.
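For illustration, the textbook guard is to cap the redirect depth and remember which URLs have already been visited. A minimal sketch of what that could look like here (max_redirects, seen, and the Location resolution are illustrative additions, not anything currently in the gem; access_url is the gem's existing fetch helper):

      require "net/http"
      require "set"
      require "uri"

      # Sketch only: redirect cycles (e.g. ...?cookieSet=1 bouncing back)
      # terminate because each URL is visited at most once, and depth is capped.
      def url_exist?(url_string, max_redirects: 5, seen: Set.new)
        return false if max_redirects.negative?
        seen.add?(url_string) or return false # nil means already visited: a loop
        url = URI.parse(url_string)
        url.host or return true # allow file URLs
        res = access_url(url) or return false
        if res.is_a?(Net::HTTPRedirection)
          target = URI.join(url_string, res["location"]).to_s # resolve relative Location
          return url_exist?(target, max_redirects: max_redirects - 1, seen: seen)
        end
        res.code[0] != "4"
      rescue Errno::ENOENT, SocketError
        false # can't find the server
      end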

PDFs are routinely returning false on the res.code[0] != "4" test. For example:

http://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x returns HTTP 301 Moved Permanently, which really is a redirect, and its res["location"] is simply the same URL over HTTPS, https://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x. When I access that, I get HTTP 403 Forbidden. But HTTP 403 is exactly what I expect for a paywalled resource! The gem should not be reporting a failure there.

So this needs a smarter treatment of the possible HTTP codes. Really, the only cases where a URI should count as invalid are (I think) 404 and the 50x range. But I don't want to do this: I want someone else who is familiar with HTTP codes, paywalled content, and redirects to do it.
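For illustration, that policy could collapse into a small predicate along these lines (dead_status? is a made-up name, and I have added 410 Gone as the obvious sibling of 404):

      # Sketch: only statuses that positively say "gone" count as dead;
      # 401/403 (paywalls), 2xx, and 3xx all count as alive.
      def dead_status?(code)
        code = code.to_i
        code == 404 || code == 410 || (500..599).cover?(code)
      end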

I do not agree with Ronald that a new gem is required, but I am asking that someone else handle this. For now, I'm doing a hotfix that passes all URIs it sees.

opoudjis commented Mar 6, 2024

If you cannot do this, @andrew2net, we can give this to @alexeymorozov. As long as it's not me :)

andrew2net commented:

@opoudjis it would be helpful if you gave this to someone else.

opoudjis commented Mar 7, 2024

Fair. @alexeymorozov?

ronaldtse commented:

There are many things to fix here:

  • We should only care about the HEAD status response instead of retrieving the whole URL response, which can be 100 MB.
  • A URI is not meant to be resolved. ONLY a URL is meant to be accessible.
  • There are MANY sites that require browser access (cookies, etc.) or JS.
  • It is very difficult to check whether a document is still available; it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...

opoudjis commented Mar 7, 2024

There are many things to fix here:

  • We should only care about the HEAD status response instead of retrieving the whole URL response, which can be 100 MB.

Already being done
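(For reference, a HEAD-only probe with Net::HTTP looks roughly like the sketch below. This is a generic illustration, not the gem's actual access_url; note that some servers answer HEAD with 405 Method Not Allowed, so a GET fallback may still be needed.)

      require "net/http"
      require "uri"

      # Sketch: fetch status line and headers only; the response body
      # (which might be a 100 MB PDF) is never downloaded.
      def head_status(url_string, timeout: 5)
        url = URI.parse(url_string)
        Net::HTTP.start(url.host, url.port,
                        use_ssl: url.scheme == "https",
                        open_timeout: timeout, read_timeout: timeout) do |http|
          http.head(url.request_uri)
        end
      end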

  • A URI is not meant to be resolved. ONLY a URL is meant to be accessible.

Fine.

  • There are MANY sites that require browser access (cookies, etc.) or JS.

And as I said, HTTP 403 Forbidden has to be assumed to indicate a valid URL.

  • It is very difficult to check whether a document is still available; it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...

It's a best-effort check. The truly authoritative way of doing this is for the author to insert a manual accessed date, indicating that they have physically sighted the resource.

opoudjis commented:
No developer is currently working on this. @ronaldtse This needs to be addressed.
