Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linkcheck builder: optionally allow HTTP 401 status code hyperlinks to be reported broken #11684

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Sep 14, 2023

Feature or Bugfix

  • Bugfix

Purpose

Detail

  • The HTTP 401 status code indicates that access to a web resource requires authentication, and that the client has not provided that authentication.
  • Webservers often configure authentication on a per-directory or per-path basis, so client receipt of an authentication-required status code does not necessarily confirm that a URL is valid.
  • 🗒️ With this changeset, the ability to treat HTTP 401 responses from hyperlinks as broken is opt-in - deployments must set the linkcheck_allow_unauthorized config option to False to consider unauthorized links as unavailable during linkchecking. The default reporting status for unauthorized links will change to broken as part of the major-version upgrade to Sphinx 8.0 in future.

Relates

if status_code == 401:
return 'working', 'unauthorized', 0
return 'broken', 'unauthorized', 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).

    As such, I think we could keep the 'working' but change a bit the way that linkcheck_auth is handled:

    • If we specify "fake" credentials by None, we keep 'working' and do not emit warnings.
    • If no fake credential for the corresponding URL is specified, we keep the 'working', but we emit a warning saying that the content couldn't be accessed and that users should specify credentials (fake or not).

That way, this also ensures that credentials are not exposed unintentionally and that the contents with 401 errors are detected correctly.

  • It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?

    In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

Copy link
Contributor Author

@jayaddison jayaddison Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).

Hmm.. point understood, although I think that even when an URL has no linkcheck credentials configured (the "no fake credential" case), an HTTP 401 response should be considered a broken link, because it means that the link is unchecked. This is probably the core of the disruptive/controversial angle I have on this.

If for some reason a user is unable to provide credentials for some hyperlinks included in a documentation set, but wants a success report from linkchecking despite that, I'd argue they should use linkcheck_anchors_ignore linkcheck_ignore to skip the relevant websites.

I'm still thinking about the first point -- using None as a special marker for intentionally-empty credentials. It seems a bit fragile and too-clever; I like simple and hard-to-misunderstand things, because I misunderstand a lot.

It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?

👍

In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

Mostly agree, although we still support Py3.9 at the moment - so this could be a future enhancement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me using linkcheck_anchors_ignore although it's a bit counterintuitive in this case (like I don't really want to ignore).

using None as a special marker for intentionally-empty credentials

It was an example. We could have another configuration value like linkcheck_auth_bypass and put what links we expect to have 401.

Concerning the match, I forgot it was introduced in 3.10.


Actually, when writing my comment, I thought about a much more elegant solution, namely, you could specify the HTTP response code to expect for specific links and treat them as "working".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me using linkcheck_anchors_ignore although it's a bit counterintuitive in this case (like I don't really want to ignore).

Let me try to make a convicing argument that it makes sense :) (I could be wrong! in which case that'll help too)

The HTTP 401 response is zero-information in terms of hyperlink validity. It does confirm that the client should use auth if it wants to gain more-than-zero information, but that's the only feedback it provides -- and we hide that from users at the moment.

So we're making network requests that fail, and we're not informing the user about it. I think we should start by informing the user -- and then they either choose to find auth to gain greater-than-zero info, or they ignore the links and reduce the outbound traffic (and linkchecking time).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 I think we should start by informing the user 

Agreed. Actually, by counterintuitive, I meant that using the linkcheck_anchors_ignore was weird to me since I would have used it to suppress errors related to anchors rather than to HTTP response codes.

So, we could keep what you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agh, sorry. I should have written linkcheck_ignore instead. I completely failed to notice/parse the word anchor in there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's ok for me !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) thank you!

Copy link
Contributor Author

@jayaddison jayaddison Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?
In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

I'm working on some refactoring for this at the moment; the CheckResult class relies upon being JSON-serializable at the moment, and I don't feel like introducing custom serialization code, so I've opted to use the StrEnum type -- Enum with string values isn't serializable by default.

Taking that approach adds a dependency on Py3.11, so I'm experimenting with adding match statements at the same time.

I did read up a little bit about using Literal - it looks like it's mostly a type-checking aid? I think I'd prefer the enum route, even if it may take a while before it could land in the codebase. Maybe I'll change my mind as I learn more, though..

@AA-Turner
Copy link
Member

For this to go into 7.3, it should be opt-in; we can flip to opt-out in 8.0.

A

@jayaddison
Copy link
Contributor Author

Ok, thanks @AA-Turner.

I could wait until 8.0 to merge this, if that's simpler -- or I'll add a config option called linkcheck_permit_unauthorized_responses, default True (existing behaviour).

More concise and/or better config option name ideas welcome.. I'd like to communicate the fact that it's server-communicated auth status that is the condition, and for the way those cases are handled to be fairly clear from the option name.

@picnixz
Copy link
Member

picnixz commented Sep 21, 2023

An alternative: linkcheck_ignore_error_codes containing 401 by default.

Or an alternative name: linkcheck_ignore_unauthorized set to true.

Or: linkcheck_strict_authorized set to false.

@jayaddison
Copy link
Contributor Author

Hmm.. the error-code list approach could be a neat solution, but gives me a sense of too much configurability. Roughly speaking, I think HTTP status codes are well-enough defined to determine success and failure in the case of linkchecking.

The ignore keyword would be nice because it'd visually align with other configuration options and re-uses existing terminology.. but the use case is different here, I think. We do check these unauthorized URLs, so they're not ignored.

strict could also be OK; my concern there is that it infers that there could be a category of checks -- but since this is intended to be a temporary flag until 8.0, I'd prefer to pick something that can't accidentally become repurposed for other things prior to removal.

To shoot down one of my own other alternative ideas as well: linkcheck_allow_unauthorized - concise, re-uses the word allow -- but could be confusing because there is client auth, and it's perfectly fine for the linkchecker to make unauthorized requests in most cases. What we care about is whether the server is saying the request was not authorized.

So.. I suppose linkcheck_allow_unauthorized_responses could be OK. linkcheck_tolerate_unauthorized_responses?

@picnixz
Copy link
Member

picnixz commented Sep 21, 2023

I'm not fond of long configuration value names. And I would prefer using standard terminology (i.e., not "tolerate"). I think we should pick something from: allow, ignore, skip or strict.

We could also have something like linkcheck_ignore_errors = ['unauthorized'] (meaning, you do check but ignore specific errors, and not the link itself). Since we also have a bunch of other errors that turn into warnings, we could also suppress them like that.

@jayaddison
Copy link
Contributor Author

I'm not fond of long configuration value names. And I would prefer using standard terminology (i.e., not "tolerate"). I think we should pick something from: allow, ignore, skip or strict.

Ok, agreed on both of those.

We could also have something like linkcheck_ignore_errors = ['unauthorized'] (meaning, you do check but ignore specific errors, and not the link itself). Since we also have a bunch of other errors that turn into warnings, we could also suppress them like that.

I'd like to keep the scope of this change (and the config option) small, and the management of log-level for some of those errors/warnings seems a bit unrelated.

However: it does make me wonder whether it'd be OK to add a warning message even when HTTP 401 is treated as success?

Regarding the option name: probably enough bikeshedding, so linkcheck_allow_unauthorized is it, I think. I resolved my own concern about the ambiguity with client auth because I suppose technically that refers to authentication and not authorization.

@picnixz
Copy link
Member

picnixz commented Sep 21, 2023

whether it'd be OK to add a warning message even when HTTP 401 is treated as success?

Not good if people are turning warnings into errors.

…setting so that users can opt-in to use the stricter treatment of HTTP 401 unauthorized responses
…false to opt-in to treating HTTP 401 hyperlinks as broken
CHANGES.rst Outdated Show resolved Hide resolved
@jayaddison jayaddison changed the title linkcheck builder: consider HTTP 401 status code to indicate failure linkcheck builder: allow HTTP 401 status code hyperlinks to be reported as broken Sep 22, 2023
@jayaddison jayaddison changed the title linkcheck builder: allow HTTP 401 status code hyperlinks to be reported as broken linkcheck builder: optionally allow HTTP 401 status code hyperlinks to be reported broken Sep 22, 2023
@jayaddison
Copy link
Contributor Author

I don't have any further changes/commits planned here at the moment and would be glad for more review when possible.

doc/usage/configuration.rst Show resolved Hide resolved
sphinx/builders/linkcheck.py Outdated Show resolved Hide resolved
sphinx/builders/linkcheck.py Outdated Show resolved Hide resolved
tests/test_build_linkcheck.py Show resolved Hide resolved
sphinx/builders/linkcheck.py Outdated Show resolved Hide resolved
sphinx/builders/linkcheck.py Outdated Show resolved Hide resolved
tests/test_build_linkcheck.py Outdated Show resolved Hide resolved
@AA-Turner
Copy link
Member

I could wait until 8.0 to merge this

For X.0 releases, I prefer to have a 'clean' release with only removals/changes to defaults -- a new feature I think should be merged in a point release, with an opt-in.

jayaddison and others added 2 commits September 26, 2023 16:09
…-specific Sphinx 8.0 deprecation subclass

Co-authored-by: picnixz <[email protected]>
…riable private, to prevent other code relying on it, since it is intended for removal soon

Co-authored-by: picnixz <[email protected]>
@jayaddison
Copy link
Contributor Author

Ok, I've reached another review-pause now (no further commits planned).

The main question I have open / to resolve is how and when we should emit deprecation warnings -- if at all -- and whether my assumptions in this subthread are correct.

@jayaddison
Copy link
Contributor Author

Ok, reading back through the commentary, there's one detail I missed:

For this to go into 7.3, it should be opt-in; we can flip to opt-out in 8.0.

So we should switch the default, not remove the configuration value entirely. I'll update the deprecation notice about this correspondingly -- and as I understand from the other comments, deprecation warnings don't become errors, so that should be innocuous, and we can remove the warning from 8.0 onwards.

… the default value for it will change; update the deprecation message and config documentation accordingly
…d in-the-wild and legacy behaviour is enabled, emit a notice to inform the deployment about the upcoming behaviour-change
…ld not be logged as part of the warning message; partly for reasons of log-message deduplication, and partly because the information is already available elsewhere in the linkchecker output
CHANGES.rst Outdated Show resolved Hide resolved
sphinx/builders/linkcheck.py Show resolved Hide resolved
@jayaddison
Copy link
Contributor Author

I think this is ready - no further commits/changes planned.

@jayaddison
Copy link
Contributor Author

@AA-Turner @picnixz any further feedback for these changes? I don't know if it'll have much effect in practice, but if the number of people who have broken hyperlinks and don't realize it yet is greater than zero, I'd like to help with that.

@picnixz
Copy link
Member

picnixz commented Oct 21, 2023

Personally I don't have anything else in mind. However note that I willl be extremely busy for the next months as I am leaning towards the end of my thesis.

@AA-Turner AA-Turner merged commit 5211c01 into sphinx-doc:master Jan 9, 2024
21 of 22 checks passed
@jayaddison
Copy link
Contributor Author

Thank you @AA-Turner! (and again @picnixz for your review of this)

I'll offer the GitHub issue reference suggestion in a small follow-up PR.

@jayaddison jayaddison deleted the issue-11433/adjust-linkcheck-http-401-handling branch January 9, 2024 12:48
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 9, 2024
@AA-Turner AA-Turner added this to the 7.3.0 milestone Jul 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

linkchecker: update handling of HTTP 401 unauthorized responses
3 participants