Some CDX queries are pathologically slow #38

Closed
anjackson opened this issue Feb 6, 2019 · 13 comments

@anjackson
Contributor

Some CDX queries time out while others don't. For example, the query for www.jeremycorbyn.org.uk below doesn't work, but the one for www.bardsey.org works fine:

[pid: 18|app: 0|req: 295807/6259025] 193.61.220.3 () {50 vars in 1267 bytes} [Wed Feb  6 12:10:25 2019] GET /wayback/en/archive/cdx?url=http%3A%2F%2Fwww.bardsey.org%2F&output=json&allowFuzzy=false => generated 37973 bytes in 236 msecs (HTTP/1.1 200) 1 headers in 48 bytes (5 switches on core 359)
[pid: 16|app: 0|req: 436657/6259124] 194.66.232.92 () {48 vars in 1398 bytes} [Wed Feb  6 11:59:59 2019] GET /wayback/en/archive/cdx?url=http%3A%2F%2Fwww.jeremycorbyn.org.uk%2F&output=json&allowFuzzy=false => generated 81575 bytes in 634587 msecs (HTTP/1.1 200) 1 headers in 48 bytes (7 switches on core 366)

i.e. the slow queries are taking over 10 minutes!
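
To spot such requests in bulk, something like the following could be run over the uWSGI access log. This is only a sketch: the regular expression is an assumption based on the two log lines quoted above, and `slow_requests` and its threshold parameter are hypothetical names, not part of pywb or uWSGI.

```python
import re

# Assumed log shape, taken from the two lines quoted above:
#   "... GET <path> => generated <n> bytes in <n> msecs ..."
LOG_PATTERN = re.compile(r'GET (\S+) => generated \d+ bytes in (\d+) msecs')


def slow_requests(lines, threshold_msecs=10000):
    """Return (path, msecs) for every request slower than the threshold."""
    slow = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not a request line in the assumed format
        msecs = int(match.group(2))
        if msecs > threshold_msecs:
            slow.append((match.group(1), msecs))
    return slow
```

Feeding the log through `slow_requests(open(path))` would list only the queries that exceed the threshold, which makes it easier to see what the slow URLs have in common.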

@anjackson anjackson added the bug Something isn't working label Feb 6, 2019
@anjackson
Contributor Author

Tried switching to lxml and now all queries fail quickly! Need to enable debug logging.

@anjackson
Contributor Author

Hm, queries now seem to error on production rather than being slow...

@anjackson
Contributor Author

I've added the XML from OutbackCDX for the example from above that seems not to be working:

jc.xml.txt

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 14, 2019
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
@ikreymer ikreymer mentioned this issue Feb 14, 2019
8 tasks
@ikreymer
Contributor

Parsing with LXML wasn't working due to LXML's more stringent encoding requirements, now fixed in the PR above.

However, I'm not sure whether that addresses the original issue of slowness, but it should get the lxml parsing working.

The XML attached above seems valid and parses quickly with no errors.
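
For reference, the encoding point can be sketched as follows. lxml refuses a unicode `str` that carries an XML encoding declaration, so the raw response bytes are handed to the parser instead. The sample document and the `parse_response` helper are made up for illustration (this is not the actual XmlQueryIndexSource code), and the stdlib fallback is only there so the snippet runs without lxml installed.

```python
import xml.etree.ElementTree as stdlib_etree

try:
    from lxml import etree as lxml_etree  # optional third-party dependency
except ImportError:
    lxml_etree = None

# Made-up sample response; the element names are illustrative only.
SAMPLE = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<wayback><results><result>'
          b'<url>http://www.example.org/</url>'
          b'</result></results></wayback>')


def parse_response(raw_bytes):
    """Parse the raw response bytes, never a decoded unicode string."""
    etree = lxml_etree if lxml_etree is not None else stdlib_etree
    # Both parsers accept bytes and read the encoding from the declaration.
    return etree.fromstring(raw_bytes)
```

Passing `SAMPLE.decode('utf-8')` to lxml's `fromstring` instead would raise a `ValueError`, because the declaration is still present in the decoded text.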

@anjackson
Contributor Author

Having got debug enabled a different way, it looks like it's something to do with the access list handling:

access_pywb-beta.1.kz5it8ubv2bm@access    | Traceback (most recent call last):
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/gevent/pywsgi.py", line 935, in handle_one_response
access_pywb-beta.1.kz5it8ubv2bm@access    |     self.run_application()
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/gevent/pywsgi.py", line 909, in run_application
access_pywb-beta.1.kz5it8ubv2bm@access    |     self.process_result()
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/gevent/pywsgi.py", line 893, in process_result
access_pywb-beta.1.kz5it8ubv2bm@access    |     for data in self.result:
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/handlers.py", line 100, in check_str
access_pywb-beta.1.kz5it8ubv2bm@access    |     for line in lines:
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/handlers.py", line 25, in <genexpr>
access_pywb-beta.1.kz5it8ubv2bm@access    |     return content_type, (cdx.to_json(fields) for cdx in cdx_iter)
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/access_checker.py", line 109, in wrap_iter
access_pywb-beta.1.kz5it8ubv2bm@access    |     rule = self.find_access_rule(url, cdx.get('timestamp'), cdx.get('urlkey'))
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/access_checker.py", line 85, in find_access_rule
access_pywb-beta.1.kz5it8ubv2bm@access    |     for acl in acl_iter:
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/index/cdxops.py", line 132, in <genexpr>
access_pywb-beta.1.kz5it8ubv2bm@access    |     return (cdx for cdx, _ in zip(cdx_iter, range(limit)))
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/heapq.py", line 359, in merge
access_pywb-beta.1.kz5it8ubv2bm@access    |     s[0] = next()           # raises StopIteration when exhausted
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/index/aggregator.py", line 76, in <genexpr>
access_pywb-beta.1.kz5it8ubv2bm@access    |     cdx_iter = (add_source(cdx, name) for cdx in cdx_iter)
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/index/indexsource.py", line 79, in do_load
access_pywb-beta.1.kz5it8ubv2bm@access    |     yield CDXObject(line)
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/index/cdxobject.py", line 124, in __init__
access_pywb-beta.1.kz5it8ubv2bm@access    |     json_fields = self.json_decode(to_native_str(fields[-1], 'utf-8'))
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/site-packages/pywb-2.1.0.dev0-py3.5.egg/pywb/warcserver/index/cdxobject.py", line 251, in json_decode
access_pywb-beta.1.kz5it8ubv2bm@access    |     return json_decode(string, object_pairs_hook=OrderedDict)
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/json/__init__.py", line 332, in loads
access_pywb-beta.1.kz5it8ubv2bm@access    |     return cls(**kw).decode(s)
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/json/decoder.py", line 339, in decode
access_pywb-beta.1.kz5it8ubv2bm@access    |     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
access_pywb-beta.1.kz5it8ubv2bm@access    |   File "/usr/local/lib/python3.5/json/decoder.py", line 355, in raw_decode
access_pywb-beta.1.kz5it8ubv2bm@access    |     obj, end = self.scan_once(s, idx)
access_pywb-beta.1.kz5it8ubv2bm@access    | json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 12 (char 11)
access_pywb-beta.1.kz5it8ubv2bm@access    | Thu Feb 14 10:21:32 2019 {'REMOTE_ADDR': '127.0.0.1', 'REMOTE_PORT': '44964', 'HTTP_HOST': 'localhost:33505', (hidden keys: 20)} failed with JSONDecodeError

@anjackson
Contributor Author

anjackson commented Feb 14, 2019

Ah, I think the blocks file is malformed:

# docker exec -ti access_pywb-beta.1.kz5it8ubv2bmgyoy88twf7ci0 wb-manager acl validate acl/blocks.aclj
Error Occured: Expecting ':' delimiter: line 1 column 12 (char 11)

But this refers to the JSON payload, so it's difficult to tell which line is causing the problem!
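
One way to narrow it down is to parse each line separately and report the file's own line number. The sketch below assumes each .aclj line has the layout `<surt key> <timestamp or -> <JSON>`, which matches how the traceback above decodes the last field; `find_bad_acl_lines` is a hypothetical helper, not part of wb-manager.

```python
import json


def find_bad_acl_lines(lines):
    """Return (line_number, message) for each line that fails to parse."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Assumed layout: "<surt key> <timestamp or -> <JSON payload>"
        fields = line.split(' ', 2)
        if len(fields) < 3:
            problems.append((lineno, 'expected 3 space-separated fields'))
            continue
        try:
            json.loads(fields[2])
        except json.JSONDecodeError as err:
            # The decoder's position is relative to the payload, so pair it
            # with the file line number to locate the bad rule.
            problems.append((lineno, str(err)))
    return problems
```

Running it as `find_bad_acl_lines(open('acl/blocks.aclj'))` would give the file line alongside the same "Expecting ':' delimiter" style of message.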

@anjackson
Contributor Author

anjackson commented Feb 14, 2019

Sorry, this looks like entirely our problem. That said, having lxml working is a good thing. :-)

EDIT: I take that back - having fixed the blocks.aclj file, it's now back to hanging instead of crashing!

@anjackson
Contributor Author

See also webrecorder/pywb#439

@anjackson
Contributor Author

Ah, okay, the penny drops....

The queries that hang are the ones that are in neither the ALLOWS nor the BLOCKS list.

If you go to https://www.webarchive.org.uk/wayback/archive/2019/www.jeremycorbyn.co.uk you get a 451, as we should 🎉. But if you perform a CDX query, this hangs.

@anjackson
Contributor Author

So, hacked in some debugging, and it's just REALLY SLOW. This is a fragment:

access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170501044027 http://www.jeremycorbyn.org.uk/ text/html 200 HASUR43FBXA5C2QAI2567XS7P37OHUR7 156514689 /heritrix/output/warcs/monthly/20170501040804/WREN-monthly-20170501040804-20170501-20170501043944778-00005-boi1gu08.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170522073441 http://www.jeremycorbyn.org.uk/ text/html 200 JJ4HMBBNQWFK2EVOABJTKIDRQFRCJ3WX 4307035 /heritrix/output/warcs/weekly/20170522070124/WREN-weekly-20170522070124-20170522-20170522073509548-00005-ifu93bpo.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170529075008 http://www.jeremycorbyn.org.uk/ text/html 200 JJ4HMBBNQWFK2EVOABJTKIDRQFRCJ3WX 83847315 /heritrix/output/warcs/weekly/20170529070122/WREN-weekly-20170529070122-20170529-20170529074957193-00008-pe8m1n9s.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170606085842 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 583554375 /heritrix/output/warcs/weekly/20170606080322/WREN-weekly-20170606080322-20170606-20170606085541990-00008-4pivrxow.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170612075001 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 426775868 /heritrix/output/warcs/weekly/20170612070128/WREN-weekly-20170612070128-20170612-20170612074747779-00007-s7jw0pyd.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170619133123 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 884287404 /heritrix/output/warcs/weekly/20170619124150/WREN-weekly-20170619124150-20170619-20170619132654348-00007-dk4vhc6q.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170626075325 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 88761038 /heritrix/output/warcs/weekly/20170626070120/WREN-weekly-20170626070120-20170626-20170626075320392-00010-1rljwb6x.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170626075359 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 213069934 /heritrix/output/warcs/weekly/20170626070120/WREN-weekly-20170626070120-20170626-20170626075320392-00010-1rljwb6x.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170703082851 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 835900406 /heritrix/output/warcs/weekly/20170703070144/WREN-weekly-20170703070144-20170703-20170703081613544-00006-d37hcy1m.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170706140644 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 675389065 /heritrix/output/warcs/weekly/20170706133051/WREN-weekly-20170706133051-20170706-20170706140332158-00005-rytik30x.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20170710075853 http://www.jeremycorbyn.org.uk/ text/html 200 PGSL2VJ3ODF44JZMTLQKPDP5JWNAE6GG 314385573 /heritrix/output/warcs/weekly/20170710070126/WREN-weekly-20170710070126-20170710-20170710075720057-00009-92e4bgxf.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171002131640 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 657636100 /heritrix/output/warcs/weekly/20171002080120/WREN-weekly-20171002080120-20171002-20171002125902458-00015-erimtj6v.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171013113338 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 953977910 /heritrix/output/warcs/weekly/20171013104253/WREN-weekly-20171013104253-20171013-20171013112759245-00006-b9x0d1gu.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171016084339 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 295265366 /heritrix/output/warcs/weekly/20171016080122/WREN-weekly-20171016080122-20171016-20171016084220975-00007-uico3jv5.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171023083459 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 587830732 /heritrix/output/warcs/weekly/20171023080113/WREN-weekly-20171023080113-20171023-20171023083252735-00006-t25hrnk4.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171030093502 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 668485727 /heritrix/output/warcs/weekly/20171030090118/WREN-weekly-20171030090118-20171030-20171030093218600-00006-af2b8clw.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171106094119 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 861233894 /heritrix/output/warcs/weekly/20171106090108/WREN-weekly-20171106090108-20171106-20171106093729514-00006-6rkqa4gm.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171113094253 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 264384834 /heritrix/output/warcs/weekly/20171113090120/WREN-weekly-20171113090120-20171113-20171113094135716-00007-80t3bwo1.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171120093423 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 95400008 /heritrix/output/warcs/weekly/20171120090126/WREN-weekly-20171120090126-20171120-20171120093428243-00006-4tnagk7o.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171127094415 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 110644550 /heritrix/output/warcs/weekly/20171127090121/WREN-weekly-20171127090121-20171127-20171127094413015-00007-evg4lyah.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/
access_pywb-beta.1.vc6kxpl7logj@access    |  - block true
access_pywb-beta.1.vc6kxpl7logj@access    | block
access_pywb-beta.1.vc6kxpl7logj@access    | Yielding...
access_pywb-beta.1.vc6kxpl7logj@access    | Looking at uk,org,jeremycorbyn)/ 20171204094124 http://www.jeremycorbyn.org.uk/ text/html 200 VSHIIFZMLSFQSFKHT27RTPQUUKSFSLRO 568584557 /heritrix/output/warcs/weekly/20171204090126/WREN-weekly-20171204090126-20171204-20171204093902857-00006-jgfu5mdr.warc.gz archive archive
access_pywb-beta.1.vc6kxpl7logj@access    | http://www.jeremycorbyn.org.uk/

In practice, this means the following line is really slow:

https://github.com/ukwa/pywb/blob/de90e5c767a13819d6ff7da463c448e521fc5f18/pywb/warcserver/access_checker.py#L111

@ikreymer
Contributor

ikreymer commented Feb 14, 2019

Thanks for the additional debug info! Based on the above, it seems queries are slow even if the URL is in the block list, not just when it is neither allowed nor blocked. Is that correct?

ikreymer added a commit to ukwa/pywb that referenced this issue Feb 14, 2019
- stop checking acl rules linearly if acl key < tld
- use existing rule for same url (at least until date-range checking)
@ikreymer
Contributor

The issue is how the ACL system deals with large ACL rule sets. Currently, after the binary search for the exact URL, it performs a linear search to determine whether there are any prefix matches, e.g. if there is no rule for uk,org,jeremycorbyn)/, it will continue searching in case there is a rule for uk,org, or uk.

The PR should be an improvement, but will also look at further improvements.
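
As a rough illustration of the idea (binary search first, then a bounded linear scan for prefix rules), here is a minimal sketch. It is not the pywb implementation: `AclIndex`, the ascending sort order, and the `default` parameter are all assumptions made for the example, and rules shorter than the TLD itself are ignored.

```python
from bisect import bisect_right


class AclIndex:
    """Hypothetical in-memory ACL index; not pywb's actual data structure."""

    def __init__(self, rules):
        # rules: iterable of (surt_key, access) pairs,
        # e.g. ('uk,org,example)/', 'block')
        self.rules = sorted(rules)
        self.keys = [key for key, _ in self.rules]

    def find_access(self, url_key, default='allow'):
        # Top-level component of the SURT key, e.g. 'uk'.
        tld = url_key.split(',', 1)[0]
        # Binary search for the last rule key sorting at or before url_key.
        index = bisect_right(self.keys, url_key) - 1
        # Walk backwards: the first prefix match found is the longest one.
        while index >= 0:
            key = self.keys[index]
            if key < tld:
                # Keys sorting before the TLD cannot match, so stop early
                # instead of scanning the rest of the list.
                break
            if url_key.startswith(key):
                return self.rules[index][1]
            index -= 1
        return default
```

For a URL with no matching rule, the scan ends as soon as it leaves the keys under the same TLD rather than walking the whole rule list, which is the case that was hanging here.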

@ikreymer ikreymer mentioned this issue Feb 14, 2019
8 tasks
@anjackson
Contributor Author

I'd say this cures this issue, as the patch from ukwa/pywb#5 means the queries do complete in a few seconds rather than being extremely slow and timing out.

N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
- stop checking acl rules linearly if acl key < tld
- use existing rule for same url (at least until date-range checking)