
Jobs get double processed (if removed immediately upon completion) #356

Closed
bradvogel opened this issue Oct 12, 2016 · 6 comments

Comments

@bradvogel
Contributor

bradvogel commented Oct 12, 2016

There exists a situation where a job can get processed a second time by the `processStalledJobs` periodic cleanup job. This only happens when jobs are removed upon completion by the worker that processed them (a common pattern - see #354).
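For context, here is a minimal sketch of that remove-on-completion pattern (the queue name and the `handleEvent` helper are illustrative, not taken from this issue):

```js
var Queue = require('bull');

var queue = Queue('events', 6379, '127.0.0.1'); // illustrative queue

queue.process(function (job) {
  return handleEvent(job.data); // application-specific work (hypothetical helper)
});

// Remove each job as soon as it completes. This deletes the job data and its
// lock immediately, which is what opens the race described below.
queue.on('completed', function (job) {
  job.remove().catch(function (err) {
    console.error('failed to remove completed job', err);
  });
});
```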

We only discovered this because we saw `Error: Missing Job xx when trying to move from active to completed` show up in the server logs of a service running a high-volume (100 events/sec) Bull queue.

This is how it happens:

| Time | Process A | Process B |
| --- | --- | --- |
| 1 | In the regular Bull run loop, `getNextJob` moves a job from `wait` to `active`. | |
| 2 | | `processStalledJobs` happens to run, pulls all active jobs (`this.client.lrangeAsync(this.toKey('active'))`), and begins to iterate over them. |
| 3 | Processes the job normally. | Pulls the job data via `Job.fromId` (the job data still exists at this point). |
| 4 | Completes processing the job; the application code calls `job.remove()`, which removes the job data and the lock. | |
| 5 | | Calls `scripts.getStalledJob`, which only checks whether the job is in `completed` (which it no longer is), so it continues and grabs the lock. |
| 6 | | The job is processed again. |
| 7 | | Upon completion, `Error: Missing Job 59694333 when trying to move from active to completed` is thrown because the job data no longer exists. |

Perhaps a solution here is to have `scripts.getStalledJob` ensure the job is in the `active` state rather than merely checking that it's not in `completed`, since we now know the job could have been removed before this check. If a job is in `active` AND doesn't have an existing lock, then `processStalledJobs` knows that another worker isn't processing it. However, since the active queue is a Redis LIST, checking for an element's existence is expensive (it requires iterating the list in the Lua script).
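To make the cost concern concrete, here is a hedged sketch (not Bull's actual script) of what checking membership in the active LIST before taking the lock could look like in Lua; the KEYS/ARGV layout is assumed purely for illustration:

```js
// Illustrative only: confirming that a job id is still in the 'active' LIST
// means scanning the whole list (O(n)) before the lock can be taken.
// KEYS[1] = active list, KEYS[2] = the job's lock key
// ARGV[1] = job id, ARGV[2] = lock token, ARGV[3] = lock TTL in ms
var isActiveThenLock = `
  local active = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, id in ipairs(active) do
    if id == ARGV[1] then
      return redis.call("SET", KEYS[2], ARGV[2], "PX", ARGV[3], "NX")
    end
  end
  return false
`;
```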

Our temporary workaround is to delay calling `job.remove()` when the job completes, leaving a 'tombstone' around so that `processStalledJobs` sees the job in the `completed` queue and skips over it.
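A hedged sketch of that workaround (the delay value is arbitrary, not something specified in this issue):

```js
// Keep the completed job around briefly as a 'tombstone' so that a concurrent
// processStalledJobs run still finds it in 'completed' and skips it, then
// remove it for real.
queue.on('completed', function (job) {
  setTimeout(function () {
    job.remove().catch(function (err) {
      console.error('failed to remove completed job', err);
    });
  }, 30 * 1000); // illustrative delay
});
```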

@manast
Member

manast commented Oct 12, 2016

I have not checked the source code yet, but if the job is removed and moved out of the completed queue atomically, wouldn't it be enough to check whether the job still exists? That would be an O(1) operation.
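In Lua terms, that guard might look something like the sketch below (assumed key layout; not the actual Bull script):

```js
// Illustrative only: skip the job unless its data key still exists (an O(1)
// EXISTS call), then attempt to take the lock atomically.
// KEYS[1] = job data key, KEYS[2] = job lock key
// ARGV[1] = lock token, ARGV[2] = lock TTL in ms
var existsThenLock = `
  if redis.call("EXISTS", KEYS[1]) == 0 then
    return false
  end
  return redis.call("SET", KEYS[2], ARGV[1], "PX", ARGV[2], "NX")
`;
```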

@bradvogel
Contributor Author

Yes, that would work too. I see now that the job data is removed upon completion. It does, however, leave open the possibility that if the job fails (and its job data is left around), it will automatically get double processed.

@bradvogel
Contributor Author

Though I suppose we could have it check that the job key exists and also that it's not in the failed set. Are there any other states it could be in if the job key exists?
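Extending the previous sketch with that extra condition (again illustrative; whether 'failed' is a plain SET here is an assumption about the Bull version in question):

```js
// Illustrative only: a stalled candidate must still have its data key and
// must not already be in the failed set before the lock is taken.
// KEYS[1] = job data key, KEYS[2] = failed set, KEYS[3] = job lock key
// ARGV[1] = job id, ARGV[2] = lock token, ARGV[3] = lock TTL in ms
var stalledGuard = `
  if redis.call("EXISTS", KEYS[1]) == 0 then
    return false
  end
  if redis.call("SISMEMBER", KEYS[2], ARGV[1]) == 1 then
    return false
  end
  return redis.call("SET", KEYS[3], ARGV[2], "PX", ARGV[3], "NX")
`;
```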

@manast
Copy link
Member

manast commented Oct 13, 2016

It could be delayed... but maybe it does not matter; I need to check the code. Quite busy ATM.

bradvogel added a commit to mixmaxhq/bull that referenced this issue Oct 15, 2016
…'t double-process jobs. See OptimalBits#356 for more on how this can happen.

The change here is to use a Lua script to atomically iterate the active queue AND get the lock. This ensures that the job is actually in 'active' when it's being processed, which fixes OptimalBits#356. If a stalled job is found, the check calls itself recursively until all stalled jobs are processed. This strategy is also slightly more efficient than before because it iterates the jobs entirely within a script, reducing the back-and-forth with the Redis server.

Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active and the worker that moved it has not yet obtained a lock (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job possibly stalled. This gives the (real) worker time to acquire its lock after moving the job to active.
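For readers skimming the thread, a rough sketch of the iterate-and-lock idea that commit describes (key layout and arguments are assumed; this is not the script that actually shipped, and the grace period is applied by the caller, not shown here):

```js
// Illustrative only: walk the 'active' list inside one Lua script and claim
// the first job whose lock key is free, so the membership check and the lock
// acquisition cannot be interleaved with a job.remove() from another process.
// KEYS[1] = active list; ARGV[1] = key prefix, ARGV[2] = lock token,
// ARGV[3] = lock TTL in ms (all assumed for illustration).
var getStalledJobSketch = `
  local active = redis.call("LRANGE", KEYS[1], 0, -1)
  for _, id in ipairs(active) do
    local lockKey = ARGV[1] .. id .. ":lock"
    if redis.call("SET", lockKey, ARGV[2], "PX", ARGV[3], "NX") then
      return id
    end
  end
  return false
`;
```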
@bradvogel
Contributor Author

bradvogel commented Oct 15, 2016

Created #359 and confirmed that it fixes this.

bradvogel added a commit to mixmaxhq/bull that referenced this issue Oct 15, 2016
…'t double-process jobs. See OptimalBits#356 for more on how this can happen.

Additionally, this addresses a long-standing issue where a job can be considered "stalled" even though it was just moved to active and the worker that moved it has not yet obtained a lock (OptimalBits#258), by waiting a grace period (with a default of LOCK_RENEW_TIME) before considering a job possibly stalled. This gives the (real) worker time to acquire its lock after moving the job to active.

Note that this includes a small API change: the 'stalled' event is now passed an array of events.
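For consumers of that API change, a hedged sketch of handling the new array form of the 'stalled' event (the handler body is illustrative and doesn't assume the exact shape of each entry):

```js
// Sketch only: the 'stalled' listener now receives an array per the commit
// message above; log each entry so stalled-and-retried work is visible.
queue.on('stalled', function (stalledJobs) {
  stalledJobs.forEach(function (job) {
    console.warn('stalled job will be retried:', job);
  });
});
```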
@bradvogel
Contributor Author

Closed per PR
