-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
linkcheck builder: optionally allow HTTP 401 status code hyperlinks to be reported broken #11684
linkcheck builder: optionally allow HTTP 401 status code hyperlinks to be reported broken #11684
Conversation
…r it a broken-link condition
sphinx/builders/linkcheck.py
Outdated
if status_code == 401: | ||
return 'working', 'unauthorized', 0 | ||
return 'broken', 'unauthorized', 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).
As such, I think we could keep the
'working'
but change a bit the way thatlinkcheck_auth
is handled:- If we specify "fake" credentials by
None
, we keep'working'
and do not emit warnings. - If no fake credential for the corresponding URL is specified, we keep the
'working'
, but we emit a warning saying that the content couldn't be accessed and that users should specify credentials (fake or not).
- If we specify "fake" credentials by
That way, this also ensures that credentials are not exposed unintentionally and that the contents with 401 errors are detected correctly.
-
It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using
Literal
) for thestatus
instead of using'working'
,'broken'
, etc?In
process_result
, we use if-case but usingmatch
instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).
Hmm.. point understood, although I think that even when an URL has no linkcheck credentials configured (the "no fake credential" case), an HTTP 401 response should be considered a broken link, because it means that the link is unchecked. This is probably the core of the disruptive/controversial angle I have on this.
If for some reason a user is unable to provide credentials for some hyperlinks included in a documentation set, but wants a success report from linkchecking despite that, I'd argue they should use linkcheck_anchors_ignore
linkcheck_ignore
to skip the relevant websites.
I'm still thinking about the first point -- using None
as a special marker for intentionally-empty credentials. It seems a bit fragile and too-clever; I like simple and hard-to-misunderstand things, because I misunderstand a lot.
It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using
Literal
) for thestatus
instead of using'working'
,'broken'
, etc?
👍
In
process_result
, we use if-case but usingmatch
instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).
Mostly agree, although we still support Py3.9 at the moment - so this could be a future enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me using linkcheck_anchors_ignore
although it's a bit counterintuitive in this case (like I don't really want to ignore).
using None as a special marker for intentionally-empty credentials
It was an example. We could have another configuration value like linkcheck_auth_bypass
and put what links we expect to have 401.
Concerning the match, I forgot it was introduced in 3.10.
Actually, when writing my comment, I thought about a much more elegant solution, namely, you could specify the HTTP response code to expect for specific links and treat them as "working".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me using
linkcheck_anchors_ignore
although it's a bit counterintuitive in this case (like I don't really want to ignore).
Let me try to make a convicing argument that it makes sense :) (I could be wrong! in which case that'll help too)
The HTTP 401 response is zero-information in terms of hyperlink validity. It does confirm that the client should use auth if it wants to gain more-than-zero information, but that's the only feedback it provides -- and we hide that from users at the moment.
So we're making network requests that fail, and we're not informing the user about it. I think we should start by informing the user -- and then they either choose to find auth to gain greater-than-zero info, or they ignore the links and reduce the outbound traffic (and linkchecking time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should start by informing the user
Agreed. Actually, by counterintuitive, I meant that using the linkcheck_anchors_ignore
was weird to me since I would have used it to suppress errors related to anchors rather than to HTTP response codes.
So, we could keep what you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agh, sorry. I should have written linkcheck_ignore
instead. I completely failed to notice/parse the word anchor
in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's ok for me !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using
Literal
) for thestatus
instead of using'working'
,'broken'
, etc?
Inprocess_result
, we use if-case but usingmatch
instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).
I'm working on some refactoring for this at the moment; the CheckResult
class relies upon being JSON-serializable at the moment, and I don't feel like introducing custom serialization code, so I've opted to use the StrEnum
type -- Enum
with string values isn't serializable by default.
Taking that approach adds a dependency on Py3.11, so I'm experimenting with adding match
statements at the same time.
I did read up a little bit about using Literal
- it looks like it's mostly a type-checking aid? I think I'd prefer the enum route, even if it may take a while before it could land in the codebase. Maybe I'll change my mind as I learn more, though..
For this to go into 7.3, it should be opt-in; we can flip to opt-out in 8.0. A |
Ok, thanks @AA-Turner. I could wait until 8.0 to merge this, if that's simpler -- or I'll add a config option called More concise and/or better config option name ideas welcome.. I'd like to communicate the fact that it's server-communicated auth status that is the condition, and for the way those cases are handled to be fairly clear from the option name. |
An alternative: Or an alternative name: Or: |
Hmm.. the error-code list approach could be a neat solution, but gives me a sense of too much configurability. Roughly speaking, I think HTTP status codes are well-enough defined to determine success and failure in the case of linkchecking. The
To shoot down one of my own other alternative ideas as well: So.. I suppose |
I'm not fond of long configuration value names. And I would prefer using standard terminology (i.e., not "tolerate"). I think we should pick something from: allow, ignore, skip or strict. We could also have something like |
Ok, agreed on both of those.
I'd like to keep the scope of this change (and the config option) small, and the management of log-level for some of those errors/warnings seems a bit unrelated. However: it does make me wonder whether it'd be OK to add a warning message even when HTTP 401 is treated as success? Regarding the option name: probably enough bikeshedding, so |
Not good if people are turning warnings into errors. |
…setting so that users can opt-in to use the stricter treatment of HTTP 401 unauthorized responses
…false to opt-in to treating HTTP 401 hyperlinks as broken
…kcheck_allow_unauthorized' option will be removed in Sphinx 8.0
I don't have any further changes/commits planned here at the moment and would be glad for more review when possible. |
For X.0 releases, I prefer to have a 'clean' release with only removals/changes to defaults -- a new feature I think should be merged in a point release, with an opt-in. |
…-specific Sphinx 8.0 deprecation subclass Co-authored-by: picnixz <[email protected]>
…riable private, to prevent other code relying on it, since it is intended for removal soon Co-authored-by: picnixz <[email protected]>
Ok, I've reached another review-pause now (no further commits planned). The main question I have open / to resolve is how and when we should emit deprecation warnings -- if at all -- and whether my assumptions in this subthread are correct. |
…ed' to True instead of None Co-authored-by: picnixz <[email protected]>
Ok, reading back through the commentary, there's one detail I missed:
So we should switch the default, not remove the configuration value entirely. I'll update the deprecation notice about this correspondingly -- and as I understand from the other comments, deprecation warnings don't become errors, so that should be innocuous, and we can remove the warning from 8.0 onwards. |
… the default value for it will change; update the deprecation message and config documentation accordingly
…d in-the-wild and legacy behaviour is enabled, emit a notice to inform the deployment about the upcoming behaviour-change
…ld not be logged as part of the warning message; partly for reasons of log-message deduplication, and partly because the information is already available elsewhere in the linkchecker output
Co-authored-by: Bénédikt Tran <[email protected]>
I think this is ready - no further commits/changes planned. |
@AA-Turner @picnixz any further feedback for these changes? I don't know if it'll have much effect in practice, but if the number of people who have broken hyperlinks and don't realize it yet is greater than zero, I'd like to help with that. |
Personally I don't have anything else in mind. However note that I willl be extremely busy for the next months as I am leaning towards the end of my thesis. |
Thank you @AA-Turner! (and again @picnixz for your review of this) I'll offer the GitHub issue reference suggestion in a small follow-up PR. |
Feature or Bugfix
Purpose
Detail
broken
is opt-in - deployments must set thelinkcheck_allow_unauthorized
config option toFalse
to consider unauthorized links as unavailable during linkchecking. The default reporting status for unauthorized links will change tobroken
as part of the major-version upgrade to Sphinx 8.0 in future.Relates