Add retries to link checking #40
Conversation
Signed-off-by: Saswata Mukherjee <[email protected]>
Nice, some comments.
response.Ctx.Put(retryKey, strconv.Itoa(retries+1))
retryAfter, convErr := strconv.Atoi(response.Headers.Get("Retry-After"))
if convErr == nil {
	time.Sleep(time.Duration(retryAfter) * time.Second)
Hmm, I think we need to have a global retry. This won't work if other requests to the same host retry at the same time 🤔
Oh! I didn't think of this before! Will try to implement global retry!
I thought about this and tested it a bit. I think we never make requests to the same host again or at the same time, due to our present caching layer, and retries can only occur after a request! So this should work. 🙂
I think it's fine for now, let's iterate (global retry will be hard to implement), but I think it's needed, since paths are cached, not hosts.
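For illustration only, here is a rough sketch of what a host-level (rather than per-request) retry gate could look like; hostLimiter, Block, and Wait are hypothetical names and this is not part of the PR:

package linkretry

import (
	"context"
	"sync"
	"time"
)

// hostLimiter serializes Retry-After waits per host, so concurrent
// requests to the same host do not all sleep and retry independently.
type hostLimiter struct {
	mu    sync.Mutex
	until map[string]time.Time // earliest time each host may be contacted again
}

func newHostLimiter() *hostLimiter {
	return &hostLimiter{until: map[string]time.Time{}}
}

// Block records a Retry-After delay for the given host.
func (l *hostLimiter) Block(host string, d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if t := time.Now().Add(d); t.After(l.until[host]) {
		l.until[host] = t
	}
}

// Wait blocks until the host may be contacted again or ctx is cancelled.
func (l *hostLimiter) Wait(ctx context.Context, host string) error {
	l.mu.Lock()
	d := time.Until(l.until[host])
	l.mu.Unlock()
	if d <= 0 {
		return nil
	}
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}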
	time.Sleep(time.Duration(retryAfter) * time.Second)
}

retryErr := response.Request.Retry() |
I think we might need a default backoff maybe?
Should this be the default?
Well, a default for when no Retry-After is set? Maybe something like 1s.
Oh okay! Will try to implement this!
Signed-off-by: Saswata Mukherjee <[email protected]>
Some review during our 1:1
switch response.StatusCode {
case http.StatusTooManyRequests:
	if retries <= 0 {
		var retryAfter int
Suggested change:
-var retryAfter int
+var retryAfterSeconds int
If it's an int and not a time.Duration, we need to mention the unit in the var name.
if convErr != nil {
	retryAfter = 1
}
time.Sleep(time.Duration(retryAfter) * time.Second)
Suggested change:
-time.Sleep(time.Duration(retryAfter) * time.Second)
+select {
+case <-time.After(time.Duration(retryAfter) * time.Second):
+case <-response.Context.Done():
+	return
+}
Let's also mock time.After, so we can inject different times in tests.
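As a minimal sketch of that idea (assuming the validator keeps the wait function as a field; the validator and timeAfter names here are hypothetical, not taken from the PR):

package main

import (
	"context"
	"fmt"
	"time"
)

// validator keeps the wait function as a field so tests can replace time.After.
type validator struct {
	timeAfter func(d time.Duration) <-chan time.Time
}

// wait blocks for d or until ctx is cancelled, whichever comes first.
func (v *validator) wait(ctx context.Context, d time.Duration) {
	select {
	case <-v.timeAfter(d):
	case <-ctx.Done():
	}
}

func main() {
	// Production: use the real time.After.
	v := &validator{timeAfter: time.After}
	v.wait(context.Background(), 10*time.Millisecond)

	// Test: inject a channel that fires immediately, so no real sleeping happens.
	v.timeAfter = func(time.Duration) <-chan time.Time {
		c := make(chan time.Time, 1)
		c <- time.Time{}
		return c
	}
	v.wait(context.Background(), time.Hour) // returns immediately
	fmt.Println("done")
}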
Colly allows us to pass in a context and use that, but it isn't released yet. So pinning to the needed commit for now! 🙂 https://pkg.go.dev/github.com/gocolly/colly/[email protected]#Collector
perfect
		v.remoteLinks[response.Ctx.Get(originalURLKey)] = errors.Wrapf(err, "%q rate limited even after retry; status code %v", response.Request.URL.String(), response.StatusCode)
	}
case http.StatusMovedPermanently, http.StatusTemporaryRedirect, http.StatusServiceUnavailable:
	if retries <= 0 {
Let's make the more important block the left-indented one.
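For what that pattern usually looks like, here is a small illustrative sketch; the function names and error message are made up for the example and not taken from the PR:

package main

import (
	"errors"
	"fmt"
)

// Nested shape: the important retry path sits one level deep.
func handleNested(retries int, retry func() error) error {
	if retries <= 0 {
		return retry() // important path, indented
	}
	return errors.New("rate limited even after retry")
}

// Flat shape: handle the exhausted case first, so the retry path stays left-indented.
func handleFlat(retries int, retry func() error) error {
	if retries > 0 {
		return errors.New("rate limited even after retry")
	}
	return retry() // important path, at the outer level
}

func main() {
	retry := func() error { return nil }
	fmt.Println(handleNested(0, retry), handleFlat(1, retry))
}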
retries, _ := strconv.Atoi(retriesStr)
switch response.StatusCode {
case http.StatusTooManyRequests:
	if retries <= 0 {
ditto (indent)
Signed-off-by: Saswata Mukherjee <[email protected]>
I think retrying will be annoying if we cannot interrupt it. Do you mind hooking up the context properly? It will also show if we put it in the right place... (:
	})
	return v, nil
}

// MustNewValidator returns mdformatter.LinkTransformer that crawls all links.
func MustNewValidator(logger log.Logger, linksValidateConfig []byte, anchorDir string) mdformatter.LinkTransformer {
-	v, err := NewValidator(logger, linksValidateConfig, anchorDir)
+	v, err := NewValidator(context.TODO(), logger, linksValidateConfig, anchorDir)
does not sound useful ;p
We can currently interrupt the retry. For instance, testing a file with an envoyproxy link and setting retryAfterSeconds to 10 gives the following results. The first run is when I let it run and retry, the second is when I interrupt:
time mdox fmt -l test.md
error: test.md: "https://envoyproxy.io" not accessible even after retry; status code 301: Moved Permanently
mdox fmt -l test.md 0.14s user 0.06s system 1% cpu 14.995 total
time mdox fmt -l test.md
^Cinfo: caught signal. Exiting.: interrupt
mdox fmt -l test.md 0.12s user 0.05s system 2% cpu 6.762 total
So we can cancel the retry without having to wait. But I'm not sure what the correct way to hook up the context would be?
Looks like it works, but we don't know why, right? That's not the best case either ;p Anyway, let's merge and think later (: I think we pass the context to colly somewhere else?
Yes, we pass it here! That's why we can cancel without waiting. Also, passing it is sort of weird, as colly also has a context package.
This PR adds retries to link checking based on status code. For status code 429, it uses the Retry-After header to wait before retrying, and retries normally for status codes 301, 307 and 503. It can try a maximum of three times before returning an error. This can increase the time of processing docs if a rate limit is encountered, but not too excessively. Fixes #11.
Some questions:
- Should we add a --retry flag so that the user can enable/disable retries and set the number of retries?
- The Retry-After header works well for sites like GitHub, which return 429 with Retry-After set to 60s. But other sites might make the value excessive, so should a limit be placed on it?
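On the limit question, one possible shape for such a cap, sketched under the assumptions of this PR (integer-seconds Retry-After, 1s fallback); maxRetryAfterSeconds and its 60s value are illustrative only, not part of the PR:

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// maxRetryAfterSeconds caps the wait so a misbehaving server cannot make us
// sleep arbitrarily long; the value here is only an example.
const maxRetryAfterSeconds = 60

// retryDelay parses the Retry-After header (integer seconds), falls back to
// 1s when it is missing or unparsable, and clamps it to the cap.
func retryDelay(h http.Header) time.Duration {
	seconds, err := strconv.Atoi(h.Get("Retry-After"))
	if err != nil || seconds < 1 {
		seconds = 1
	}
	if seconds > maxRetryAfterSeconds {
		seconds = maxRetryAfterSeconds
	}
	return time.Duration(seconds) * time.Second
}

func main() {
	h := http.Header{}
	h.Set("Retry-After", "600")
	fmt.Println(retryDelay(h)) // 1m0s, clamped from 600s
}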