Some CDX queries are pathologically slow #38
Comments
Tried switching to lxml and now all queries fail quickly! Need to enable debug logging.
Hm, now it seems to error on production rather than be slow... Hm.
I've added the XML from OutbackCDX for the example from above that seems not to be working:
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
Parsing with lxml wasn't working due to lxml's more stringent encoding requirements; this is now fixed in the PR above. I'm not sure whether that addresses the original issue of slowness, but it should get the lxml parsing working. The XML query attached above seems valid and parses quickly with no errors.
having got debug enabled a different way, it's something to do with the access list stuff:
Ah, I think the blocks file is malformed:
But this refers to the JSON payload, so it's difficult to tell which line is causing the problem!
EDIT: I take that back - having fixed the blocks.aclj file, it's now back to hanging instead of crashing!
See also webrecorder/pywb#439
Ah, okay, the penny drops.... The queries that hang are ones that are neither in the ALLOWS or BLOCKS lists. If you go to https://www.webarchive.org.uk/wayback/archive/2019/www.jeremycorbyn.co.uk you get a 451, as we should 🎉. But if you perform a CDX query, this hangs.
So, hacked in some debugging, and it's just REALLY SLOW. This is a fragment:
In practice, what this means is this line is really slooow
Thanks for the additional debug info! Based on the above, it seems queries are slow even if they are in the block list, not just if they are neither allowed nor blocked. Is that correct?
- stop checking acl rules linearly if acl key < tld
- use existing rule for same url (at least until date-range checking)
The issue is how the ACL system deals with large sets of ACL rules. Currently, after the exact URL binsearch, it performs a linear search to determine if there are any prefix matches, e.g. if there is no rule
The PR should be an improvement, but I will also look at further improvements.
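To make the binsearch-then-linear-scan behaviour concrete, here is a minimal sketch of the idea, not the actual pywb code. It assumes rules are keyed by SURT-style strings such as `uk,co,example)/`, that the keys are held in an ascending sorted list (the real .aclj ordering may differ), and that no rule key is shorter than its TLD segment. The `find_rule` function, the `ACL` dict and the sample keys are all hypothetical. The second return value counts how many candidates were examined, to show the effect of stopping once the candidate key sorts below the TLD.

```python
from bisect import bisect_right

# Hypothetical rule set: SURT-style key -> access policy.
ACL = {
    "com,example)/": "allow",
    "uk,co,example)/": "block",
    "uk,co,example)/private": "block",
    "uk,gov,": "allow",
}
KEYS = sorted(ACL)  # ascending order, so bisect can be used directly


def find_rule(sorted_keys, rules, url_key):
    """Return (policy, checked) for the longest rule key prefixing url_key.

    policy is None when no rule matches; checked is the number of
    candidate keys compared against url_key during the linear scan.
    """
    # TLD segment of the lookup key, e.g. "uk," for "uk,co,example)/page".
    tld = url_key.split(",", 1)[0] + ","

    # Binary search: index just past the last key <= url_key.
    i = bisect_right(sorted_keys, url_key)

    checked = 0
    # Linear scan backwards through smaller keys; the first prefix hit
    # is the longest one because longer prefixes sort later.
    for j in range(i - 1, -1, -1):
        candidate = sorted_keys[j]
        # Early exit: a key that sorts below the TLD cannot be a prefix.
        if candidate < tld:
            break
        checked += 1
        if url_key.startswith(candidate):
            return rules[candidate], checked
    return None, checked


print(find_rule(KEYS, ACL, "uk,co,example)/private/page"))
print(find_rule(KEYS, ACL, "uk,co,other)/"))
```

Without the `candidate < tld` check, a URL that matches no rule would walk all the way back to the start of the list, which is the pathological case for a large rule file.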
I'd say this cures this issue, as the patch from ukwa/pywb#5 means the queries do complete in a few seconds rather than being extremely slow and timing out.
Some CDX queries time out, some don't. e.g. this doesn't work but this one works fine
i.e. these queries are taking 10 mins!