
Reprocessing workflow causes many DLQ messages #708

Closed
iliapolo opened this issue Jan 10, 2022 · 2 comments · Fixed by #711

@iliapolo (Contributor) commented:

Whenever our reprocess workflow is triggered, we see a bunch of DLQ messages that say:

Rate exceeded (Service: AmazonECS; Status Code: 400; Error Code: ThrottlingException; Request ID: 1db00da3-2470-4c18-b97a-ffca0c6aa602; Proxy: null)

This happens because reprocessing means spawning around 40,000 ECS tasks, while our current service limit stands at 1,000.

I think we need a two-pronged solution here:

  1. Increase service limits to make reprocessing faster.
  2. Tweak the retry policy, or add some other mechanism, to prevent this from happening (a rough sketch follows below).
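
To make the second point concrete, here is a minimal sketch of what tweaking the retry policy could look like in CDK, assuming the workflow's `EcsRunTask` state is available as `runTask`. The error name and the numbers are placeholders, not a final proposal:

```ts
import { Duration } from 'aws-cdk-lib';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// `runTask` stands in for the EcsRunTask state the reprocess workflow uses
// to launch one ECS task per package version.
declare const runTask: tasks.EcsRunTask;

// Retry throttled RunTask calls with exponential backoff, so bursts above the
// account limit are spread out over time instead of landing in the DLQ.
runTask.addRetry({
  errors: ['ECS.AmazonECSException'], // assumption: the error name reported for ECS throttling
  interval: Duration.seconds(30),     // placeholder values; tune to the desired total wait budget
  backoffRate: 2,
  maxAttempts: 10,
});
```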
@iliapolo iliapolo self-assigned this Jan 10, 2022
@iliapolo iliapolo added bug Something isn't working p1 labels Jan 10, 2022
@Chriscbr (Contributor) commented:

I wonder if it would make sense to fix the throttling using a queue?

We could put all of these requests into an SQS queue first, and then either:

  • create a lambda to process requests into ECS tasks (perhaps in a way that checks that there aren't too many ECS tasks running); a rough sketch follows this list
  • or, configure our ECS service to auto-scale based on the size of the SQS queue, and automatically poll the SQS queue for messages
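
To make the first option concrete, here is a rough sketch of an SQS-triggered Lambda that counts the currently running tasks before launching more. The environment variable names, the container name, and the 900-task threshold are all hypothetical:

```ts
import { ECSClient, ListTasksCommand, RunTaskCommand } from '@aws-sdk/client-ecs';
import type { SQSEvent } from 'aws-lambda';

const ecs = new ECSClient({});
const MAX_RUNNING = 900; // hypothetical threshold, kept below the 1,000-task service limit

// Count RUNNING tasks in the cluster, following pagination.
async function countRunningTasks(cluster: string): Promise<number> {
  let count = 0;
  let nextToken: string | undefined;
  do {
    const page = await ecs.send(new ListTasksCommand({ cluster, desiredStatus: 'RUNNING', nextToken }));
    count += page.taskArns?.length ?? 0;
    nextToken = page.nextToken;
  } while (nextToken);
  return count;
}

export async function handler(event: SQSEvent): Promise<void> {
  const cluster = process.env.CLUSTER_ARN!;         // hypothetical configuration
  const taskDefinition = process.env.TASK_DEF_ARN!; // hypothetical configuration

  // Back-pressure: if the cluster is already near the limit, throw so SQS
  // makes the batch visible again and redelivers it later.
  if (await countRunningTasks(cluster) >= MAX_RUNNING) {
    throw new Error('Too many tasks running, deferring this batch');
  }

  for (const record of event.Records) {
    await ecs.send(new RunTaskCommand({
      cluster,
      taskDefinition,
      launchType: 'FARGATE',
      networkConfiguration: {
        awsvpcConfiguration: { subnets: (process.env.SUBNET_IDS ?? '').split(',') },
      },
      overrides: {
        containerOverrides: [{ name: 'worker', environment: [{ name: 'INPUT', value: record.body }] }],
      },
    }));
  }
}
```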

@iliapolo (Contributor, Author) commented:

@Chriscbr

> create a lambda to process requests into ECS tasks (perhaps in a way that checks that there aren't too many ECS tasks running)

We were trying to avoid implementing our own back-pressure mechanism; I think that tweaking the SFN retry policies gives us pretty much the same benefit.

> or, configure our ECS service to auto-scale based on the size of the SQS queue, and automatically poll the SQS queue for messages

I don't think auto-scaling is relevant here (ignoring the fact that we don't have an ECS service, only tasks, and there is no auto-scaling mechanism for tasks). At any point in time our queue will have more messages than ECS can consume in parallel according to our service limits, so what we need here is back-pressure. If we could configure ECS to poll from a queue and apply that back-pressure on its own, that would be fantastic, but I'm not sure this is possible.

Having said that, I do agree we need to rethink this architecture and come up with something more robust; this will be scheduled later on as a bigger operational-excellence project.

@mergify mergify bot closed this as completed in #711 Jan 12, 2022
mergify bot pushed a commit that referenced this issue Jan 12, 2022 (#711)

During the reprocessing workflow, Step Functions tries to start a burst of 60,000 ECS tasks (the current number of package versions). Since our account limit is only 1,000 parallel tasks, we need to apply a retry policy so the throttled tasks don't end up in the DLQ.

Currently, our retry policy allows for a total wait time of roughly 2.5 hours. Let's do some math to see if this is enough.

Since tasks also have boot time, we don't really run 1,000 in parallel. In practice, what we normally see is:

![Screen Shot 2022-01-12 at 4 12 24 PM](https://user-images.githubusercontent.com/1428812/149156438-9ba5e844-fa62-4294-9760-92887f6825f5.png)

So for simplicity's sake, let's assume 500 parallel tasks. If every task takes about 2 minutes (empirically, and somewhat based on `jsii-docgen` test timeouts), we are able to process 1,000 tasks every 4 minutes.

This means that in order to process 60,000 tasks, we need 4 hours. The current retry policy of 2.5 hours allows us to process only about 35,000 tasks. And indeed, the most recent execution of the workflow resulted in the remaining 25,000 tasks being sent to the DLQ.

The retry policy implemented in this PR gives us 7 hours. 
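
For reference, the total wait budget of a retry policy with exponential backoff is `interval * (backoffRate^maxAttempts - 1) / (backoffRate - 1)`, since the wait grows geometrically before each attempt. A quick sketch to sanity-check candidate parameters (the numbers below are examples, not necessarily the exact values merged in this PR):

```ts
// Sanity-check the cumulative back-off of a retry policy.
// The parameter values below are examples, not necessarily the ones in this PR.
function totalWaitSeconds(intervalSec: number, backoffRate: number, maxAttempts: number): number {
  let total = 0;
  let wait = intervalSec;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += wait;       // wait applied before each retry attempt
    wait *= backoffRate; // exponential back-off
  }
  return total;
}

// Example: 5-minute base interval, 1.5x backoff, 10 attempts
console.log((totalWaitSeconds(300, 1.5, 10) / 3600).toFixed(1)); // ≈ 9.4 hours of cumulative wait
```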

## TODO

- [x] 5 hours might still be a bit too close. Run the reprocess workflow again to see if the numbers have changed following cdklabs/jsii-docgen#553. Follow-up: the `jsii-docgen` improvements did make things better, but not enough to put a significant dent in the total time. I've updated the PR to give us 7 hours.

Fixes #708

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*