
Reprocessing workflow causes many DLQ messages #708

Closed
iliapolo opened this issue Jan 10, 2022 · 2 comments · Fixed by #711

@iliapolo (Contributor) commented:

Whenever our reprocess workflow is triggered, we see a bunch of DLQ messages that say:

Rate exceeded (Service: AmazonECS; Status Code: 400; Error Code: ThrottlingException; Request ID: 1db00da3-2470-4c18-b97a-ffca0c6aa602; Proxy: null)

This happens because reprocessing means spawning around 40,000 ECS tasks, while our current service limit stands at 1,000.

I think we need a two-pronged solution here:

  1. Increase service limits to make reprocessing faster.
  2. Tweak the retry policy, or add some other mechanism, to prevent this from happening (a rough sketch follows below).
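
To make the second point concrete, here is a minimal sketch of what tweaking the retry policy could look like in CDK, assuming the workflow's `EcsRunTask` state is available as `runTask`. The error name and the numbers are placeholders, not a final proposal:

```ts
import { Duration } from 'aws-cdk-lib';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// `runTask` stands in for the EcsRunTask state the reprocess workflow uses
// to launch one ECS task per package version.
declare const runTask: tasks.EcsRunTask;

// Retry throttled RunTask calls with exponential backoff, so bursts above the
// account limit are spread out over time instead of landing in the DLQ.
runTask.addRetry({
  errors: ['ECS.AmazonECSException'], // assumption: the error name reported for ECS throttling
  interval: Duration.seconds(30),     // placeholder values; tune to the desired total wait budget
  backoffRate: 2,
  maxAttempts: 10,
});
```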
@iliapolo iliapolo self-assigned this Jan 10, 2022
@iliapolo iliapolo added bug Something isn't working p1 labels Jan 10, 2022
@Chriscbr (Contributor) commented:

I wonder if it would make sense to fix the throttling using a queue?

We could put all of these requests into an SQS queue first, and then either:

  • create a lambda to process requests into ECS tasks (perhaps in a way that checks that there aren't too many ECS tasks running); a rough sketch follows this list
  • or, configure our ECS service to auto-scale based on the size of the SQS queue, and automatically poll the SQS queue for messages
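
To make the first option concrete, here is a rough sketch of an SQS-triggered Lambda that counts the currently running tasks before launching more. The environment variable names, the container name, and the 900-task threshold are all hypothetical:

```ts
import { ECSClient, ListTasksCommand, RunTaskCommand } from '@aws-sdk/client-ecs';
import type { SQSEvent } from 'aws-lambda';

const ecs = new ECSClient({});
const MAX_RUNNING = 900; // hypothetical threshold, kept below the 1,000-task service limit

// Count RUNNING tasks in the cluster, following pagination.
async function countRunningTasks(cluster: string): Promise<number> {
  let count = 0;
  let nextToken: string | undefined;
  do {
    const page = await ecs.send(new ListTasksCommand({ cluster, desiredStatus: 'RUNNING', nextToken }));
    count += page.taskArns?.length ?? 0;
    nextToken = page.nextToken;
  } while (nextToken);
  return count;
}

export async function handler(event: SQSEvent): Promise<void> {
  const cluster = process.env.CLUSTER_ARN!;         // hypothetical configuration
  const taskDefinition = process.env.TASK_DEF_ARN!; // hypothetical configuration

  // Back-pressure: if the cluster is already near the limit, throw so SQS
  // makes the batch visible again and redelivers it later.
  if (await countRunningTasks(cluster) >= MAX_RUNNING) {
    throw new Error('Too many tasks running, deferring this batch');
  }

  for (const record of event.Records) {
    await ecs.send(new RunTaskCommand({
      cluster,
      taskDefinition,
      launchType: 'FARGATE',
      networkConfiguration: {
        awsvpcConfiguration: { subnets: (process.env.SUBNET_IDS ?? '').split(',') },
      },
      overrides: {
        containerOverrides: [{ name: 'worker', environment: [{ name: 'INPUT', value: record.body }] }],
      },
    }));
  }
}
```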

@iliapolo (Contributor, Author) commented:

@Chriscbr

> create a lambda to process requests into ECS tasks (perhaps in a way that checks that there aren't too many ECS tasks running)

We were trying to avoid implementing our own back-pressure mechanism; I think that tweaking the SFN retry policies gives us pretty much the same benefit.

> or, configure our ECS service to auto-scale based on the size of the SQS queue, and automatically poll the SQS queue for messages

I don't think auto-scaling is relevant here (ignoring the fact that we don't have an ECS service, only tasks, and there is no auto-scaling mechanism for tasks). At any point in time our queue will have more messages than ECS can consume in parallel according to our service limits, so what we need here is back-pressure. If we could configure ECS to poll from a queue and apply that back-pressure on its own, that would be fantastic, but I'm not sure this is possible.

Having said that, I do agree we need to rethink this architecture and come up with something more robust; this will be scheduled later on as a bigger operational-excellence project.

@mergify mergify bot closed this as completed in #711 Jan 12, 2022
mergify bot pushed a commit that referenced this issue Jan 12, 2022 (#711)

During the reprocessing workflow, Step Functions tries to start a burst of 60,000 ECS tasks (the current number of package versions). Since our account limit is only 1,000 parallel tasks, we need to apply a retry policy so the throttled tasks don't end up in the DLQ.

Currently, our retry policy allows for a total wait time of roughly 2.5 hours. Let's do some math to see if this is enough.

Since tasks also have boot time, we don't really run 1,000 in parallel. In practice, what we normally see is:

![Screen Shot 2022-01-12 at 4 12 24 PM](https://user-images.githubusercontent.com/1428812/149156438-9ba5e844-fa62-4294-9760-92887f6825f5.png)

So for simplicity's sake, let's assume 500 parallel tasks. If every task takes about 2 minutes (empirically, and somewhat based on `jsii-docgen` test timeouts), we are able to process 1,000 tasks every 4 minutes.

This means that in order to process 60,000 tasks, we need 4 hours. The current retry policy of 2.5 hours allows us to process only about 35,000 tasks. And indeed, the most recent execution of the workflow resulted in the remaining 25,000 tasks being sent to the DLQ.

The retry policy implemented in this PR gives us 7 hours. 
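
For reference, the total wait budget of a retry policy with exponential backoff is `interval * (backoffRate^maxAttempts - 1) / (backoffRate - 1)`, since the wait grows geometrically before each attempt. A quick sketch to sanity-check candidate parameters (the numbers below are examples, not necessarily the exact values merged in this PR):

```ts
// Sanity-check the cumulative back-off of a retry policy.
// The parameter values below are examples, not necessarily the ones in this PR.
function totalWaitSeconds(intervalSec: number, backoffRate: number, maxAttempts: number): number {
  let total = 0;
  let wait = intervalSec;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += wait;       // wait applied before each retry attempt
    wait *= backoffRate; // exponential back-off
  }
  return total;
}

// Example: 5-minute base interval, 1.5x backoff, 10 attempts
console.log((totalWaitSeconds(300, 1.5, 10) / 3600).toFixed(1)); // ≈ 9.4 hours of cumulative wait
```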

## TODO

- [x] 5 hours might still be a bit too close. Run the reprocess workflow again to see if the numbers have changed following cdklabs/jsii-docgen#553. Follow-up: the `jsii-docgen` improvements did make things better, but not enough to put a significant dent in the total time. I've updated the PR to give us 7 hours.

Fixes #708

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*