Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retries to link checking #40

Merged
merged 3 commits into from
Jun 17, 2021
Merged

Add retries to link checking #40

merged 3 commits into from
Jun 17, 2021

Conversation

saswatamcode
Copy link
Collaborator

This PR adds retries to link checking based on status code. For status code 429, it uses the Retry-After header to wait before retrying and retries normally for status code 301, 307 and 503. Can try a maximum of three times before returning error. This can increase the time of processing docs if rate limit is encountered but not too excessively.
Fixes #11.

Some questions:

  • Should this logic be put behind a --retry flag so that user can en/disable and set number of retries?
  • The Retry-After header works well for sites like GitHub which returns 429 with Retry-After set to 60s. But other sites might make value excessive, so should a limit be placed?

Signed-off-by: Saswata Mukherjee <[email protected]>
@saswatamcode saswatamcode requested a review from bwplotka June 11, 2021 08:45
@saswatamcode saswatamcode self-assigned this Jun 11, 2021
Copy link
Owner

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, some comments.

response.Ctx.Put(retryKey, strconv.Itoa(retries+1))
retryAfter, convErr := strconv.Atoi(response.Headers.Get("Retry-After"))
if convErr == nil {
time.Sleep(time.Duration(retryAfter) * time.Second)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmmmm I think we need to have global retry. You see that will not work if others for same host will try in the same time 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I didn't think of this before! Will try to implement global retry!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this and tested it a bit. I think we never make requests to the same host again or at the same time due to our present caching layer and retries can only occur after request! So this should work. 🙂

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I challenge this statement (:

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine for now, let's iterate (global retry will be hard to implement), but I think it's needed, paths are cached not hosts

time.Sleep(time.Duration(retryAfter) * time.Second)
}

retryErr := response.Request.Retry()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might think we need default back of maybe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in default?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well default if no retry-after is set? maybe something like 1s

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay! Will try to implement this!

Signed-off-by: Saswata Mukherjee <[email protected]>
@saswatamcode saswatamcode requested a review from bwplotka June 13, 2021 15:34
Copy link
Owner

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some review during our 1:1

switch response.StatusCode {
case http.StatusTooManyRequests:
if retries <= 0 {
var retryAfter int
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var retryAfter int
var retryAfterSeconds int

If it's int not time.Duration we need to mention unit in var name

if convErr != nil {
retryAfter = 1
}
time.Sleep(time.Duration(retryAfter) * time.Second)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
time.Sleep(time.Duration(retryAfter) * time.Second)
select {
case <- time.After(time.Duration(retryAfter) * time.Second):
case <- response.Context.Done():
return
}

Let's also mock time.After so we can inject different times

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Colly allows us to pass in context and use that, but it isn't released yet. So pinning to needed commit for now! 🙂https://pkg.go.dev/github.com/gocolly/colly/[email protected]#Collector

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect

v.remoteLinks[response.Ctx.Get(originalURLKey)] = errors.Wrapf(err, "%q rate limited even after retry; status code %v", response.Request.URL.String(), response.StatusCode)
}
case http.StatusMovedPermanently, http.StatusTemporaryRedirect, http.StatusServiceUnavailable:
if retries <= 0 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make more important block being left indendented

retries, _ := strconv.Atoi(retriesStr)
switch response.StatusCode {
case http.StatusTooManyRequests:
if retries <= 0 {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (indent)

Signed-off-by: Saswata Mukherjee <[email protected]>
@saswatamcode saswatamcode requested a review from bwplotka June 16, 2021 11:13
Copy link
Owner

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retrying will be annoying if we cannot interrupt it. Do you mind hooking the context properly. It will also show if we put it in right place... (:

})
return v, nil
}

// MustNewValidator returns mdformatter.LinkTransformer that crawls all links.
func MustNewValidator(logger log.Logger, linksValidateConfig []byte, anchorDir string) mdformatter.LinkTransformer {
v, err := NewValidator(logger, linksValidateConfig, anchorDir)
v, err := NewValidator(context.TODO(), logger, linksValidateConfig, anchorDir)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not sound useful ;p

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can currently interrupt retry. For instance, testing a file with envoyproxy link and setting retryAfterSeconds to 10, gives the following results. First is when I let it run and retry, second is when I interrupt,

time mdox fmt -l test.md
error:          test.md: "https://envoyproxy.io" not accessible even after retry; status code 301: Moved Permanently
mdox fmt -l test.md  0.14s user 0.06s system 1% cpu 14.995 total

ime mdox fmt -l test.md
^Cinfo: caught signal. Exiting.: interrupt
mdox fmt -l test.md  0.12s user 0.05s system 2% cpu 6.762 total

So we can cancel retry without having to wait. But I'm not sure what would be the correct way to hook context?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it works but we don't know why, right? That's not the best case too ;p Anyway, let's merge and think later (: I think we pass context to colly somewhere else?

Copy link
Collaborator Author

@saswatamcode saswatamcode Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we pass it here! That's why we can cancel without waiting. Also passing it is sort of weird as colly also has a context package.

@bwplotka bwplotka merged commit ffae630 into bwplotka:main Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add retries to link checking.
2 participants