
Timeout errors #156

Closed
edsu opened this issue Aug 18, 2022 · 5 comments

Comments

@edsu
Contributor

edsu commented Aug 18, 2022

This appears to be a site-specific issue when attempting to crawl http://stephenratcliffe.blogspot.com with two workers. After a short period of time a series of net::ERR_ABORTED errors appear, which are eventually followed by a series of long-running timeout errors:
[Screenshot, 10:16 AM: crawler output showing the net::ERR_ABORTED errors]
[Screenshot, 10:56 AM: crawler output showing the subsequent timeout errors]
Once the timeout errors start the screencast window appears blank.

In /crawls/collections/stephenratcliffe/logs/pywb.log I noticed a series of POST requests being routed through pywb, which seemed to generate the aborted connections and sometimes resulted in a timeout. For example:

127.0.0.1 - - [2022-08-18 15:27:49] "POST /live/resource/postreq?param.recorder.coll=stephenratcliffe&url=https%3A%2F%2Fstephenratcliffe.blogspot.com%2F2015%2F&closest=now&matchType=exact HTTP/1.1" 200 15420 0.178694
127.0.0.1 - - [2022-08-18 15:27:49] "POST /live/resource/postreq?param.recorder.coll=stephenratcliffe&url=https%3A%2F%2Fstephenratcliffe.blogspot.com%2F2015%2F&closest=now&matchType=exact HTTP/1.1" 200 15406 0.182478
Thu Aug 18 15:27:49 2022 - uwsgi_response_write_headers_do(): Broken pipe [core/writer.c line 248] during CONNECT 2.bp.blogspot.com:443 (127.0.0.1)
[pid: 163|app: 0|req: 354/1704] 127.0.0.1 () {24 vars in 416 bytes} [Thu Aug 18 15:27:40 2022] CONNECT 2.bp.blogspot.com:443 => generated 0 bytes in 8853 msecs (HTTP/1.1 200) 0 headers in 0 bytes (2 switches on core 993)
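The `url` query parameter in those log lines is percent-encoded, so it can be hard to read at a glance. A small Python snippet (illustrative only, using the standard library) decodes it to show which page pywb was actually being asked to fetch:

```python
from urllib.parse import urlparse, parse_qs

# One of the pywb request paths from the log above
log_request = ("/live/resource/postreq?param.recorder.coll=stephenratcliffe"
               "&url=https%3A%2F%2Fstephenratcliffe.blogspot.com%2F2015%2F"
               "&closest=now&matchType=exact")

# parse_qs decodes the percent-encoding in each query value
params = parse_qs(urlparse(log_request).query)
original_url = params["url"][0]
print(original_url)  # https://stephenratcliffe.blogspot.com/2015/
```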

I was using main at 827c153 with the following command:

docker compose run -p 9037:9037 -it crawler crawl --config /crawls/config.yaml

and the following config file placed in the crawls directory:

collection: stephenratcliffe
workers: 2
generateWACZ: true
text: true
screencastPort: 9037
logging: stats,pywb,behaviors,behaviors-debug
seeds:
  - url: https://stephenratcliffe.blogspot.com
    scopeType: host
    exclude:
      - https?://stephenratcliffe.blogpost.com/search.*
      - https?://stephenratcliffe.blogpost.com//search.*
      - https?://stephenratcliffe.blogspot.com/navbar.g.*
@ikreymer
Member

Thanks for the detailed repro! Hm, I have actually seen this happen on other sites as well. I think the default window context and screencasting are causing this issue in the latest 0.7.0 beta, which uses Chrome/Chromium 101; it does not happen with Chrome 91 (the 0.6.0 release). Of course we want to upgrade to the latest browser, so I will investigate whether there's a reasonable fix.

@edsu
Contributor Author

edsu commented Aug 18, 2022

@ikreymer you're right, it is running a lot smoother without screencasting!

I can't quite understand why it is picking up URLs like https://stephenratcliffe.blogspot.com/search?updated-max=2009-05-12 when the config should be excluding them. But that's a separate issue 😵

@ikreymer
Member

Can you try this branch: https://github.com/webrecorder/browsertrix-crawler/tree/window-context-tweaks? Hopefully this fixes it with the current browser and screencasting enabled.

@edsu edsu changed the title Site specific timeout errors Timeout errors Aug 19, 2022
@edsu
Contributor Author

edsu commented Aug 19, 2022

@ikreymer This branch is working a lot better!

ikreymer added a commit that referenced this issue Aug 19, 2022
…ements (#157)

* new window: use cdp instead of window.open

* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page

* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156

* browser repair:
- when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur
- mark pages as failed and don't reuse if page error or cdp errors occur
- screencaster: clear previous targets if screencasting when repairing browser

* bump version to 0.7.0-beta.3
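The reuse-and-repair strategy described in the commit message can be sketched roughly as follows. This is a hypothetical Python illustration, not the crawler's actual JavaScript; the names `WindowPool` and `MAX_REUSE` are invented for the sketch:

```python
MAX_REUSE = 10  # illustrative cap; the real reuseCount limit may differ

class WindowPool:
    """Hand out browser windows, reusing healthy ones up to a cap."""

    def __init__(self):
        self.windows = []  # each entry: {"id", "reuse", "failed"}
        self.next_id = 0

    def acquire(self):
        # Reuse the first window that has not failed and is under the cap
        for w in self.windows:
            if not w["failed"] and w["reuse"] < MAX_REUSE:
                w["reuse"] += 1
                return w
        # Otherwise open a fresh window
        w = {"id": self.next_id, "reuse": 1, "failed": False}
        self.next_id += 1
        self.windows.append(w)
        return w

    def mark_failed(self, w):
        # A page or CDP error occurred: never hand this window out again
        w["failed"] = True
```

The point of marking failed windows rather than recycling them is that a window whose CDP connection has errored tends to stay broken; retiring it and opening a fresh one avoids the hangs seen in this issue.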
@ikreymer
Member

Should be fixed as of the 0.7.0-beta.3 release.
