-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a single spider queue, for deterministic priority across all projects #187
Comments
The easiest way to fix this --- a/scrapyd/poller.py
+++ b/scrapyd/poller.py
@@ -1,3 +1,4 @@
+from random import sample
from zope.interface import implements
from twisted.internet.defer import DeferredQueue, inlineCallbacks, maybeDeferred, returnValue
@@ -17,11 +18,12 @@ class QueuePoller(object):
def poll(self):
if self.dq.pending:
return
- for p, q in self.queues.iteritems():
- c = yield maybeDeferred(q.count)
- if c:
- msg = yield maybeDeferred(q.pop)
- returnValue(self.dq.put(self._message(msg, p)))
+
+ for proj, queue in sample(self.queues.items(), len(self.queues)):
+ if (yield maybeDeferred(queue.count)):
+ msg = (yield maybeDeferred(queue.pop)).copy()
+ msg = dict(msg, _project=proj, _spider=msg.pop('name'))
+ returnValue(self.dq.put(msg))
def next(self):
return self.dq.get() This is not the kind of fix that we'll release This also makes me ask |
Hello Stopping to use multiple queues sounds good, what would be the requirements of an acceptable solution for this? thanks |
Any solution is more or less backwards incompatible. The requirements are: |
Similar: |
The tests in #344 #349 might be useful. Can also consider #343 Idea from #197 (comment)
Edit: Removing priority milestone as I don’t know how many users actually use priorities. Also, the ideal way to do this is a breaking change. |
Scheduled jobs are not run in FIFO+priority order.
Instead, there are multiple queues
that are also arranged in a queue-like fashion
but not round-robin or anything,
just an "arbitrary but constant" order (python dict).
A scheduled job with the current highest priority
will have to wait for all other jobs
that reside in all queues
that the dict-order puts in front of its queue
The poller demonstrates "preferences" for some projects over others
and lets them monopolize process slots.
The following code in
poller.py
polls projects in whatever order
iteritems()
useswhich is arbitrary but constant.
This becomes obvious when multiple projects
have several spiders queued and waiting for slots.
The first
(project, queue)
pair returned fromiteritems()
that has queued spiders
will prevent any further projects from being polled.
If its queue is constantly full
only this project will have its spiders ran.
Possible solutions:
One way to solve this is suggested by @luc4smoreira in #140.
However it can't be a correct solution (for this issue)
because it can make total slot utilization vary a lot.
E.g. with 5 slots per project and 10 projects,
when 1 project is active there will be only 5 slots utilized
(even if the single active project is in need for more)
but when all 10 are active, there will be 50 slots.
In a workaround for the above, one can set the total slot limit
just above the slots per project
multiplied by the number of projects expected to monopolize slots
so that there will always be a few slots for others.
Another incomplete solution is to use the spider priority earlier in the poller
before picking a project. When I noticed this problem,
I mistakenly tried lowering the priority
of spiders belonging to the project monopolizing the slots.
Of course it didn't work ― the priority is only within the project queue
and not the set of queues.
Update 2019-06-26:
A for-loop in the poller, querying all queues,
if more processes (threads too) write/read queues,
will cause race conditions to occur more often.
We can also stop using multiple queues
and keep everything in a single database.
Although it sounds like a big change,
by doing it we'll drop a lot of code from scrapyd
and we can even implement both of the 2 suggestions above
cleaner, faster and with less code.
The text was updated successfully, but these errors were encountered: