
Jobs get double processed (if removed immediately upon completion) #356

Closed
bradvogel opened this issue Oct 12, 2016 · 6 comments

Comments

@bradvogel
Contributor

bradvogel commented Oct 12, 2016

There exists a situation where a job can get processed a second time by the `processStalledJobs` periodic cleanup job. This only happens when jobs are removed upon completion by the worker that processed them (a common pattern - see #354).
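For context, here is a minimal sketch of that remove-on-completion pattern (the queue name and the `handleEvent` helper are illustrative, not taken from this issue):

```js
var Queue = require('bull');

var queue = Queue('events', 6379, '127.0.0.1'); // illustrative queue

queue.process(function (job) {
  return handleEvent(job.data); // application-specific work (hypothetical helper)
});

// Remove each job as soon as it completes. This deletes the job data and its
// lock immediately, which is what opens the race described below.
queue.on('completed', function (job) {
  job.remove().catch(function (err) {
    console.error('failed to remove completed job', err);
  });
});
```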

We only discovered this because we saw `Error: Missing Job xx when trying to move from active to completed` show up in the server logs of a service running a high-volume (100 events/sec) Bull queue.

This is how it happens:

| Time | Process A | Process B |
| --- | --- | --- |
| 1 | In the regular Bull run loop, `getNextJob` moves a job from `wait` to `active`. | |
| 2 | | `processStalledJobs` happens to run, pulls all active jobs (`this.client.lrangeAsync(this.toKey('active'))`), and begins to iterate over them. |
| 3 | Processes the job normally. | Pulls the job data via `Job.fromId` (the job data still exists at this point). |
| 4 | Completes processing the job; the application code calls `job.remove()`, which removes the job data and the lock. | |
| 5 | | Calls `scripts.getStalledJob`, which only checks whether the job is in `completed` (which it no longer is), so it continues and grabs the lock. |
| 6 | | The job is processed again. |
| 7 | | Upon completion, `Error: Missing Job 59694333 when trying to move from active to completed` is thrown because the job data no longer exists. |

Perhaps a solution here is to have `scripts.getStalledJob` ensure the job is in the `active` state rather than merely checking that it's not in `completed`, since we now know the job could have been removed before this check. If a job is in `active` AND doesn't have an existing lock, then `processStalledJobs` knows that another worker isn't processing it. However, since the active queue is a Redis LIST, checking for an element's existence is expensive (it requires iterating the list in the Lua script).
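To make the cost concern concrete, here is a hedged sketch (not Bull's actual script) of what checking membership in the active LIST before taking the lock could look like in Lua; the KEYS/ARGV layout is assumed purely for illustration:

```js
// Illustrative only: confirming that a job id is still in the 'active' LIST
// means scanning the whole list (O(n)) before the lock can be taken.
// KEYS[1] = active list, KEYS[2] = the job's lock key
// ARGV[1] = job id, ARGV[2] = lock token, ARGV[3] = lock TTL in ms
var isActiveThenLock = `
  local active = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, id in ipairs(active) do
    if id == ARGV[1] then
      return redis.call("SET", KEYS[2], ARGV[2], "PX", ARGV[3], "NX")
    end
  end
  return false
`;
```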

Our temporary workaround is to delay calling `job.remove()` when the job completes, leaving a 'tombstone' around so that `processStalledJobs` sees the job in the `completed` queue and skips over it.
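A hedged sketch of that workaround (the delay value is arbitrary, not something specified in this issue):

```js
// Keep the completed job around briefly as a 'tombstone' so that a concurrent
// processStalledJobs run still finds it in 'completed' and skips it, then
// remove it for real.
queue.on('completed', function (job) {
  setTimeout(function () {
    job.remove().catch(function (err) {
      console.error('failed to remove completed job', err);
    });
  }, 30 * 1000); // illustrative delay
});
```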

@manast
Member

manast commented Oct 12, 2016

I have not checked the source code yet, but if the job is removed and moved out of the completed queue atomically, wouldn't it be enough to check whether the job still exists? That would be an O(1) operation.
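In Lua terms, that guard might look something like the sketch below (assumed key layout; not the actual Bull script):

```js
// Illustrative only: skip the job unless its data key still exists (an O(1)
// EXISTS call), then attempt to take the lock atomically.
// KEYS[1] = job data key, KEYS[2] = job lock key
// ARGV[1] = lock token, ARGV[2] = lock TTL in ms
var existsThenLock = `
  if redis.call("EXISTS", KEYS[1]) == 0 then
    return false
  end
  return redis.call("SET", KEYS[2], ARGV[1], "PX", ARGV[2], "NX")
`;
```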

@bradvogel
Contributor Author

Yes, that would work too. I see now that the job data is removed upon completion. It does, however, leave open the possibility that if the job fails (and its job data is left around), it will automatically get double processed.

@bradvogel
Contributor Author

Though I suppose we could have it check that the job key exists and also that it's not in the failed set. Are there any other states it could be in if the job key exists?
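Extending the previous sketch with that extra condition (again illustrative; whether 'failed' is a plain SET here is an assumption about the Bull version in question):

```js
// Illustrative only: a stalled candidate must still have its data key and
// must not already be in the failed set before the lock is taken.
// KEYS[1] = job data key, KEYS[2] = failed set, KEYS[3] = job lock key
// ARGV[1] = job id, ARGV[2] = lock token, ARGV[3] = lock TTL in ms
var stalledGuard = `
  if redis.call("EXISTS", KEYS[1]) == 0 then
    return false
  end
  if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
    return false
  end
  return redis.call("SET", KEYS[3], ARGV[2], "PX", ARGV[3], "NX")
`;
```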

@manast
Copy link
Member

manast commented Oct 13, 2016

It could be delayed... but maybe it does not matter; I need to check the code. Quite busy ATM.

bradvogel added a commit to mixmaxhq/bull that referenced this issue Oct 15, 2016
…'t double-process jobs. See OptimalBits#356 for more on how this can happen.

The change here is to use a Lua script to atomically iterate the active queue AND get the lock. This ensures that the job is actually in 'active' when it's being processed, which fixes OptimalBits#356. If a stalled job is found, the check calls itself recursively until all stalled jobs are processed. This strategy is also slightly more efficient than before because it iterates the jobs entirely within a script, reducing the back-and-forth with the Redis server.

Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active and the worker that moved it has not yet obtained a lock (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job possibly stalled. This gives the (real) worker time to acquire its lock after moving the job to active.
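For readers skimming the thread, a rough sketch of the iterate-and-lock idea that commit describes (key layout and arguments are assumed; this is not the script that actually shipped, and the grace period is applied by the caller, not shown here):

```js
// Illustrative only: walk the 'active' list inside one Lua script and claim
// the first job whose lock key is free, so the membership check and the lock
// acquisition cannot be interleaved with a job.remove() from another process.
// KEYS[1] = active list; ARGV[1] = key prefix, ARGV[2] = lock token,
// ARGV[3] = lock TTL in ms (all assumed for illustration).
var getStalledJobSketch = `
  local active = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, id in ipairs(active) do
    local lockKey = ARGV[1] .. id .. ":lock"
    if redis.call("SET", lockKey, ARGV[2], "PX", ARGV[3], "NX") then
      return id
    end
  end
  return false
`;
```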
@bradvogel
Contributor Author

bradvogel commented Oct 15, 2016

Created #359 and confirmed that it fixes this.

bradvogel added a commit to mixmaxhq/bull that referenced this issue Oct 15, 2016
…'t double-process jobs. See OptimalBits#356 for more on how this can happen.

Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active and the worker that moved it has not yet obtained a lock (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job possibly stalled. This gives the (real) worker time to acquire its lock after moving the job to active.

Note that this includes a small API change: the 'stalled' event is now passed an array of events.
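For consumers of that API change, a hedged sketch of handling the new array form of the 'stalled' event (the handler body is illustrative and doesn't assume the exact shape of each entry):

```js
// Sketch only: the 'stalled' listener now receives an array per the commit
// message above; log each entry so stalled-and-retried work is visible.
queue.on('stalled', function (stalledJobs) {
  stalledJobs.forEach(function (job) {
    console.warn('stalled job will be retried:', job);
  });
});
```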
@bradvogel
Contributor Author

Closed per PR
