SQS timeout bug #1

Open
vm-wylbur opened this issue Jan 10, 2022 · 0 comments

Comments

@vm-wylbur
Contributor

When a stratum takes longer than the SQS visibility timeout, SQS moves the stratum's message from "in flight" back to "queued." The stratum is then assigned to another thread while the first thread may still be running.

A worker thread cannot tell the difference between a stratum that has been abandoned (perhaps because it was dequeued but then its worker failed) and one that is currently being computed but taking a long time.

The problem is that these very long-lived strata end up being computed more than once. In fact, any stratum with RUNTIME > SQS_TIMEOUT will be reinserted in the queue and reassigned ceiling(RUNTIME/SQS_TIMEOUT) times. These are the hardest-to-compute strata, so this can greatly extend the total runtime.
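To get a feel for the cost, here is a minimal sketch that takes the ceiling(RUNTIME/SQS_TIMEOUT) estimate above at face value. The function names are hypothetical, and the waste figure assumes every duplicate assignment runs to completion, which may overstate things if duplicates get interrupted.

```python
import math


def expected_assignments(runtime_s: float, sqs_timeout_s: float) -> int:
    """Estimate from the issue text: ceiling(RUNTIME / SQS_TIMEOUT).

    A stratum that finishes within the timeout is assigned exactly once.
    """
    if runtime_s <= 0 or sqs_timeout_s <= 0:
        raise ValueError("runtime and timeout must be positive")
    return max(1, math.ceil(runtime_s / sqs_timeout_s))


def wasted_compute_s(runtime_s: float, sqs_timeout_s: float) -> float:
    """Extra worker-seconds, ASSUMING each duplicate runs the full runtime."""
    duplicates = expected_assignments(runtime_s, sqs_timeout_s) - 1
    return duplicates * runtime_s
```

For example, with a 30-second timeout, a 100-second stratum would be assigned 4 times under this estimate, costing roughly 300 extra worker-seconds.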

This needs attention. Maybe a central watcher script can harvest the IDs of the currently-running strata and somehow broadcast them to workers so they avoid dequeueing a currently-running stratum?
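A different option from the central watcher, not something this project does today: each worker could keep extending the visibility timeout of its own message while it is still computing. The sketch below assumes a boto3-style SQS client that exposes `change_message_visibility`; the `VisibilityHeartbeat` class and its parameters are hypothetical names for illustration.

```python
import threading


class VisibilityHeartbeat:
    """Context manager that periodically renews a message's visibility timeout.

    `sqs_client` is ASSUMED to be a boto3-style SQS client (anything with a
    compatible `change_message_visibility` method will do).
    """

    def __init__(self, sqs_client, queue_url, receipt_handle,
                 timeout_s=300, interval_s=None):
        self._client = sqs_client
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._timeout_s = timeout_s
        # Renew at half the timeout by default, so a renewal lands before expiry.
        self._interval_s = timeout_s / 2 if interval_s is None else interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval_s):
            self._client.change_message_visibility(
                QueueUrl=self._queue_url,
                ReceiptHandle=self._receipt_handle,
                VisibilityTimeout=self._timeout_s,
            )

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Stop renewing; the caller deletes the message once the work is done.
        self._stop.set()
        self._thread.join()
        return False
```

A worker would wrap the stratum computation in `with VisibilityHeartbeat(client, queue_url, receipt_handle): ...` and delete the message afterwards. If the worker process dies, the renewals stop with it, so the message becomes visible again after one timeout. That would keep abandoned strata recoverable while long-running ones stay in flight.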

@vm-wylbur changed the title from "fix SQS timeout bug" to "SQS timeout bug" on Jan 10, 2022