Feature #18: Retry FileDownloads #23

Merged
Dtphelan1 merged 28 commits into main from dtp/18-manual-retry on Jan 5, 2024

Conversation

@Dtphelan1 (Collaborator) commented Dec 29, 2023

I'm having trouble debugging some weird test issues, so I figured I'd open a PR and two-birds-one-stone the debugging along with the code review.

Addresses #18 by manually triggering re-downloads of files whenever FileDownload's streaming requests receive one of several retryable HTTP status codes in response. Additionally, this PR adds two new configuration values (yay, more config variables) for customizing this retry behavior: fileDownloadMaxRetries and fileDownloadRetryAfterMSec.
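
For reference, here is a minimal sketch of how the two new options might look in a client config file. The option names come from this PR; the values shown are illustrative, not the project's defaults:

    // config/my-config.js (sketch; values are illustrative, not defaults)
    module.exports = {
        // ...existing options...

        // Maximum number of times to re-attempt a failed file download
        fileDownloadMaxRetries: 5,

        // Milliseconds to wait between file download attempts
        fileDownloadRetryAfterMSec: 5000,
    };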

To test, modify the bulk-data-server repo to track the number of attempts made for a file download before letting it succeed. I'd push a branch, but I don't have push permissions right now, so for the moment forgive the awkward git-diff formatting:

@@ -292,7 +299,7 @@ class ExportManager
 const ABORT_CONTROLLERS = new Map();
+///////////////////////////////////
+// TEMP
+const TRIES_PER_FILE = new Map();
+// TEMP
+///////////////////////////////////
+

@@ -910,6 +917,25 @@ class ExportManager
         const shouldGzip     = (/\bgzip\b/.test(acceptEncoding));

+        ///////////////////////////////////
+        // TEMP
+        // Fail the first two tries for each file, then succeed
+        const fileId   = req.params.file.split(".")[1];
+        const numTries = TRIES_PER_FILE.get(fileId) ?? 0;
+        console.log("NUM TRIES:", numTries, "FOR FILE:", fileId);
+        TRIES_PER_FILE.set(fileId, numTries + 1);
+        if (numTries <= 1) {
+            console.log("IN FAIL STATE");
+            return res.set("x-debugging-header", "someValue")
+                .status(500)
+                .end("");
+        }
+        // TEMP
+        ///////////////////////////////////
+
         // set the response headers
         res.set({
             "Content-Type": exportTypes[this.outputFormat].contentType,

Things to consider:

  • Why are the tests failing? Specifically, the attachment-download-related test seems to fail with the addition of this new retry logic. I've been staring at the file for a bit too long and can't puzzle out why, so I'm hoping to get another pair of eyes on it.
  • Do we want to combine fileDownloadRetryAfterMSec with retryAfterMSec?
    • Arguments against: the latter is only supposed to be used when bulk servers don't respond with a retry-after header, and we don't want to make that variable's responsibility ambiguous. Plus, the rate at which we retry status updates might reasonably differ from the rate at which we retry failed downloads.
    • Arguments for: aren't we adding too many config variables? The meaning of this variable is pretty similar in both cases, and there probably aren't many use cases where the wait time for retrying status updates differs much from the wait time for retrying failed downloads.
  • We could get rid of these specific variables altogether and instead expect users to configure the request option directly.
    • Argument for: minimizes the number of config variables, and improves consistency between retries handled manually by us and retries handled automatically by Got.
    • Argument against: it's a little awkward to expect users of the client to understand Got's config logic. Additionally, Got doesn't have a concept of a fixed retryAfterMSec; it supports a calculateDelay function instead. Our configs are .js files, so users could define one inline, but it feels cumbersome to have users specify a delay function instead of a number (see the sketch after this list).
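
To make that last point concrete, here is a sketch of what users would have to write in their config if we exposed got's retry option directly. A fixed delay has to be expressed as a calculateDelay function; the limit and delay values are illustrative:

    module.exports = {
        requests: {
            retry: {
                limit: 5,
                // got has no fixed-delay option; a constant delay must be a function.
                // computedValue === 0 means got decided not to retry, so pass that through.
                calculateDelay: ({ computedValue }) => computedValue === 0 ? 0 : 5000
            }
        }
    };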

@vlad-ignatov (Contributor):

Now that we (hopefully) know why the test was failing, we know that this is doable. Unfortunately, it's messy and will require a long discussion and some strategic decisions, so let's begin by splitting the problem into multiple topics and trying to agree on them one at a time. Here is the first one:

The client makes several types of requests:

  • Authorization requests to get (or renew) a token - Do we need to (does it make sense to) retry?
  • Bulk Data Kick-off - Do we need to (does it make sense to) retry?
  • Status polling - built-in retry logic following the bulk data spec. Not to be confused with what we are trying to do here!
  • Bulk Data download (as stream) - Use custom retry because got's retry does not work on streams (see the sketch after this list)
  • Attachments download - Use got's built-in retry capabilities
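
Since got's retry is a no-op for streams, the custom retry for the bulk file downloads has to look roughly like the sketch below. Names, status codes, and defaults are illustrative, not the PR's actual code:

    const got = require("got");
    const fs = require("fs");
    const { pipeline } = require("stream/promises");

    // Status codes worth retrying; an illustrative list, not the PR's
    const RETRYABLE_STATUS_CODES = [408, 429, 500, 502, 503, 504];

    async function downloadWithRetry(url, destination, maxRetries = 5, retryAfterMSec = 5000) {
        for (let attempt = 0; ; attempt++) {
            try {
                // got's built-in retry does not apply to streams, hence the manual loop
                await pipeline(got.stream(url), fs.createWriteStream(destination));
                return;
            } catch (err) {
                const status = err.response && err.response.statusCode;
                if (attempt >= maxRetries || !RETRYABLE_STATUS_CODES.includes(status)) {
                    throw err; // out of retries, or a non-retryable failure
                }
                await new Promise(resolve => setTimeout(resolve, retryAfterMSec));
            }
        }
    }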

My intuition here is to:

  • Not retry the kick-off because it is the first request in the sequence and can be reviewed and retried manually if needed.
  • Same for the token request. Perhaps any error should be fatal in this case.

What do you think?

@Dtphelan1 (Collaborator, Author):

I agree with all your intuitions! To repeat:

  • (No Action) Authorization requests to get (or renew) a token - if there is an error, it should probably be fatal. Any errors related to token expiration should be handled by BulkDataClient.ts itself, since every this.request call invokes this.getAccessToken, which requests a new token if the current one is expired.
  • (No Action) Bulk Data Kick-off - I agree, this can be done manually by the user if there is an error.
  • (No Action) Status polling - the built-in retry logic following the bulk data spec is all we need here.
  • (Already Done 🤞) Bulk Data download (as stream) - Agreed, and relieved that you too see the issues with got's stream-retry logic. Sad, though: it would be nice to take advantage of it if it worked.
  • (To-Fix) Attachments download - Use got's built-in retry capabilities, which aren't working as expected right now (see the sketch after this list).
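
For contrast with the stream case, the attachment path uses got's promise API, where the built-in retry is supposed to work out of the box. Roughly (an illustrative sketch, not the actual BulkDataClient code):

    const got = require("got");

    async function downloadAttachment(url) {
        // got's promise API (unlike its stream API) retries retryable
        // status codes and network errors by itself, with built-in backoff
        return got(url, {
            responseType: "buffer",
            retry: { limit: 3 }
        });
    }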

Now that we agree on which requests should actually retry, did you have next steps in mind?

@vlad-ignatov (Contributor):

OK, Yes, multiple next steps :) Next question:

We have 2 separate retries to handle. Let's call them got and custom for simplicity. The user should not have to know that these are different things. Ideally they should be configured in a single place and behave the same.

  • Option 1 - Use requests.retry as the single source of retry config, read by the custom retry method as well
  • Option 2 - Use custom config variables, which are also used to override the requests.retry variables where they apply

If we choose option 1, it means that we have to:

  • TODO: Override retry.limit to 1 for kick-off and auth requests
  • TODO: Explain in comments and docs that retry options are ignored for kick-off and auth requests, even though the requests config variable is supposed to affect every request.
  • Make sure the custom retry uses settings from requests.retry (which should be easily doable)

If we choose option 2, it means that we have to:

  • Duplicate settings that are otherwise provided in requests.retry
  • Make sure our custom settings override requests.retry wherever it makes sense
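
From the user's point of view, the two options produce configs shaped roughly like these two alternative (and illustrative) config files:

    // Option 1: the user tunes the generic got retry options, and the
    // custom stream retry reads its settings from the same place
    module.exports = {
        requests: {
            retry: { limit: 5 }
        }
    };

    // Option 2: a dedicated variable, which the client also applies
    // on top of requests.retry for the downloads it governs
    module.exports = {
        fileDownloadRetries: 5
    };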

What do you think?

P.S.:

  • We probably don't need to provide a custom calculateDelay (and fileDownloadRetryAfterMSec) and can use the one from got. Since we will only retry the file downloads, the single most useful variable we need is retry.limit, and got's built-in backoff should be good enough.
  • Given the previous comment, regardless of the chosen option we may have to add some code to override/ignore retry settings passed to kick-off and auth (see the sketch below).
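
That override could be as small as the sketch below. One assumption worth flagging: in got, retry.limit counts retries rather than total attempts, so 0 (not 1) is what fully disables automatic retrying; baseRequestOptions is a hypothetical name for the user's requests config:

    function withoutRetries(baseRequestOptions) {
        return {
            ...baseRequestOptions,
            retry: { limit: 0 } // never auto-retry kick-off or token requests
        };
    }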

@Dtphelan1 (Collaborator, Author):

Personally, I lean towards Option 2 because it's more direct: if people want to retry file downloads, they can read through the options and README and find the fileDownloadMaxRetries option (or fileDownloadRetries; I'm open to renaming it). If we were enabling wider retry logic across other types of requests, I'd lean towards Option 1, since we'd be affecting all requests, but we're really homing in on retrying in two limited scenarios. Pointing people to a generic retry option, only to explain that it affects only fileDownload requests, seems confusing… What do you think?

If we do go with Option 2, I'll make sure that the order of arguments is such that fileDownloadRetries overrides other retry options (see the sketch below).
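
Concretely, since later object spreads win, something like this (illustrative) guarantees the override:

    const downloadOptions = {
        ...config.requests,
        retry: {
            ...config.requests?.retry,
            // placed last so it beats anything the user set in requests.retry
            limit: config.fileDownloadRetries
        }
    };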

In either case, as for losing access to calculateDelay, I'm sad, but I agree it's not worth the effort right now. If users want a long delay, we can encourage them to set retries high; the built-in exponential backoff should help with that.

@vlad-ignatov (Contributor):

OK, I am not a big fan of either option and it is quite hard to decide, so let's skip it for now and hope that it will become clearer while doing the rest. The next one should be simple.

Even though it is not configured this way by default, one might need to be able to retry successful requests: for example, retrying on 304 (Not Modified) until you get a 200. I know you don't exactly support such a config now, but that is a question for another step. That said, can we agree that this makes sense, and can you move https://github.com/smart-on-fhir/bulk-data-client/blob/dtp/18-manual-retry/src/lib/FileDownload.ts#L152-L159 outside (before) its containing if statement, if (res.statusCode >= 400) {?

Also, do we need to wait to call downloadRequest.destroy(), or can we do that immediately?
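
For illustration, a hypothetical restructuring of that handler (the real FileDownload.ts code differs; shouldRetry and retry stand in for the linked check and the manual re-download):

    downloadRequest.on("response", res => {
        // Proposed: evaluate the retry check for every response, not only
        // errors, so that e.g. a 304 can be retried until a 200 arrives
        if (this.shouldRetry(res)) {
            downloadRequest.destroy();
            return this.retry();
        }

        if (res.statusCode >= 400) {
            // handle the non-retryable error...
        }

        // ...continue streaming the body...
    });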

@Dtphelan1 (Collaborator, Author):

I agree, we can move the manual shouldRetry check outside of the wrapping if-statement. Good catch, I'll do that now.

> I am not a big fan of either option and it is quite hard to decide

I could not agree more ^ 😆

@vlad-ignatov (Contributor):

Thanks Dylan! I tested it and it seems to work fine!

Dtphelan1 merged commit b2a14a7 into main on Jan 5, 2024.
vlad-ignatov deleted the dtp/18-manual-retry branch on February 7, 2025.