Jobs get double processed (if removed immediately upon completion) #356
Comments
I have not checked the source code yet, but if the job is removed and moved out of the completed queue atomically, wouldn't it be enough to check whether the job still exists? That would be an O(1) operation.
Yes, that would work too. I see now that job data is removed upon completion. It does, however, leave open the possibility that if the job has failed (and its job data is left around), it'll automatically get double-processed.
Though I suppose we could have it check that the job key exists and also that it's not in the failed set. Are there any other states it could be in if the job key exists?
It could be delayed... but maybe it does not matter; I need to check the code. Quite busy ATM.
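For illustration, here is a minimal sketch of the guard discussed above, written as a Redis Lua script driven from Node. The `ioredis` client and the key names (`bull:<queue>:<id>` for job data, `bull:<queue>:failed` for the failed set) are assumptions loosely modeled on Bull's key layout, not Bull's actual code, and the `delayed` case raised above is deliberately left unhandled:

```js
// Guard sketch: is it safe for the stalled-job sweep to reprocess this job?
// Key names are hypothetical, modeled loosely on Bull's "bull:<queue>:*" layout.
const Redis = require('ioredis');
const redis = new Redis();

const GUARD_SCRIPT = `
  -- KEYS[1] = job data key, KEYS[2] = failed set, ARGV[1] = job id
  if redis.call("EXISTS", KEYS[1]) == 0 then
    return 0 -- job data is gone (removed on completion): skip it
  end
  if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
    return 0 -- job failed and left its data around: don't auto-reprocess
  end
  -- note: the thread above points out the job could also be delayed;
  -- that state is not handled in this sketch
  return 1
`;

async function mayReprocess(queueName, jobId) {
  const ok = await redis.eval(
    GUARD_SCRIPT,
    2,
    'bull:' + queueName + ':' + jobId, // job data key
    'bull:' + queueName + ':failed',   // failed set
    jobId
  );
  return ok === 1;
}
```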
…'t double-process jobs. See OptimalBits#356 for more on how this can happen. The change here is to use a LUA script to atomically iterate the active queue AND get the lock. This ensures that the job is actually in 'active' when it's being processed, which fixes OptimalBits#356. If a stalled job was found, it will continue to call itself recursively until all stalled jobs are processed. This strategy is also slightly more efficient than before because it iterates the jobs entirely within a script, reducing the back-and-forth with the Redis server. Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active, before a lock could be obtained by the worker that moved it (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job as possibly stalled. This gives the (real) worker time to acquire its lock after moving it to active.
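As a rough illustration of the approach this commit describes (not the actual script from the fix), an atomic "scan 'active' and claim the first lockless job" could be expressed as a single Redis Lua script; the lock key layout is hypothetical and the grace-period bookkeeping is omitted:

```js
// Rough sketch of "iterate 'active' and take the lock atomically" in one
// Lua script (NOT the actual script from the commit above; key layout is
// hypothetical and the grace-period bookkeeping is omitted). It would be
// run against Redis via EVAL.
const GET_STALLED_SCRIPT = `
  -- KEYS[1] = active list, ARGV[1] = queue key prefix,
  -- ARGV[2] = lock token, ARGV[3] = lock duration in ms
  local actives = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, jobId in ipairs(actives) do
    local lockKey = ARGV[1] .. jobId .. ":lock"
    -- SET NX succeeds only if no live worker holds the lock, and because
    -- this all runs inside one script, the job is guaranteed to still be
    -- in 'active' at the moment the lock is taken.
    if redis.call("SET", lockKey, ARGV[2], "PX", ARGV[3], "NX") then
      return jobId
    end
  end
  return false
`;
```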
Created #359 and confirmed that it fixes this.
…'t double-process jobs. See OptimalBits#356 for more on how this can happen. Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active, before a lock could be obtained by the worker that moved it (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job as possibly stalled. This gives the (real) worker time to acquire its lock after moving it to active. Note that this includes a small API change: the 'stalled' event is now passed an array of events.
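Under the API change mentioned in that commit, a `stalled` listener would receive an array rather than a single job. A minimal usage sketch, assuming the payload is an array of jobs (the commit message says "events"):

```js
// Sketch only: consuming the changed 'stalled' event, which per the commit
// message above now carries an array instead of a single job.
const Queue = require('bull');
const queue = new Queue('my-queue'); // hypothetical queue name

queue.on('stalled', (jobs) => {
  jobs.forEach((job) => {
    console.warn('job went stalled and was re-locked:', job.id || job);
  });
});
```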
Closed per PR.
There exists a situation where a job can get processed a second time by the `processStalledJobs` periodic cleanup job. This only happens when jobs are removed upon completion by the worker that processed them (a common pattern; see #354, and the sketch after the step list below). We only discovered this because we saw `Error: Missing Job xx when trying to move from active to completed` show up in the server logs of a service running a high-volume (100 events/sec) Bull queue. This is how it happens:
1. A job is moved from `wait` to `active`.
2. `processStalledJobs` fetches the list of active jobs (`this.client.lrangeAsync(this.toKey('active'))`) and begins to iterate over them.
3. It loads a job via `Job.fromId` (the job data still exists at this point).
4. The worker that processed the job calls `job.remove()`, which removes the job data and the lock.
5. `processStalledJobs` calls `scripts.getStalledJob`, which only checks whether the job is in `completed` (which it isn't anymore, so it continues to grab the lock).
6. `Error: Missing Job 59694333 when trying to move from active to completed` is thrown because the job data doesn't exist anymore.
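For context, the removal-on-completion pattern from #354 that makes this race possible looks roughly like the following (the queue name and the work done in the processor are placeholders):

```js
// The removal-on-completion pattern (see #354) that makes the race possible,
// sketched against Bull's classic API. The queue name and the work done in
// the processor are placeholders.
const Queue = require('bull');
const queue = new Queue('video transcoding');

// Placeholder for real work.
const transcode = (data) => Promise.resolve(data);

queue.process((job) => transcode(job.data));

// Removing the job the instant it completes deletes its data and lock in
// exactly the window where processStalledJobs may already be iterating a
// stale snapshot of the 'active' list.
queue.on('completed', (job) => {
  job.remove();
});
```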
Perhaps a solution here is to have `scripts.getStalledJob` ensure the job is in the `active` state, not merely check that it's not in `completed`, since we now know it could have been removed prior to this check. If a job is in `active` AND doesn't have an existing lock, then `processStalledJobs` knows that another worker isn't processing it. However, since the `active` queue is a LIST, checking for the existence of an element in the list is expensive (it requires list iteration in the Lua script).
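A sketch of that proposed check, with the list scan that makes it expensive written out explicitly (again hypothetical, not Bull's script):

```js
// Sketch of the proposed getStalledJob-style check (not Bull's code): only
// treat a job as stalled if it is still in the 'active' LIST and unlocked.
// The LRANGE walk below is the O(N) cost mentioned above. It would be run
// against Redis via EVAL.
const IS_REALLY_STALLED = `
  -- KEYS[1] = active list, KEYS[2] = this job's lock key, ARGV[1] = job id
  local actives = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, id in ipairs(actives) do
    if id == ARGV[1] then
      -- in 'active' with no lock means no live worker owns it
      if redis.call("EXISTS", KEYS[2]) == 0 then
        return 1
      end
      return 0
    end
  end
  return 0 -- not in 'active' anymore (completed or removed): skip
`;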
Our temporary workaround is to delay calling `job.remove()` at the completion of the job, leaving around a 'tombstone' so `processStalledJobs` will see it in the completed queue and skip over it.
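A minimal sketch of that workaround, assuming Bull's classic events API and an arbitrary 30-second tombstone delay of our own choosing:

```js
// Tombstone-style workaround sketch: postpone job.remove() so the job stays
// visible in 'completed' while any in-flight stalled-job sweep finishes.
// The 30-second delay is an arbitrary value for illustration, not from Bull.
const Queue = require('bull');
const queue = new Queue('video transcoding');
const TOMBSTONE_MS = 30 * 1000;

queue.on('completed', (job) => {
  setTimeout(() => {
    job.remove().catch((err) => {
      console.error('deferred job.remove() failed:', err);
    });
  }, TOMBSTONE_MS);
});
```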