
accept4: too many open files; retrying in 1s #2621

Closed
Titan-Node opened this issue Oct 14, 2022 · 8 comments · Fixed by #2628
Assignees
Labels
type: bug Something isn't working

Comments

@Titan-Node

Describe the bug
A couple of reports of this error occurring, putting the node into a frozen state.

E1014 16:14:04.380446   76519 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
2022/10/14 16:14:04 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:04 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:05 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:05 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:06 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:06 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:07 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:07 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:08 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:08 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
E1014 16:14:09.316321   76519 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
2022/10/14 16:14:09 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:09 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:10 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:10 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:11 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:11 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:12 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:12 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:13 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:13 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s

Desktop (please complete the following information):

  • Ubuntu 20.04
  • Standalone O
  • Livepeer Version 0.5.34

I don't know how to reproduce it, but 2 of my nodes have hit this error within 12 hours, and there are other reports as well. No recent updates or upgrades.
Maybe a community node RPC endpoint issue?

@github-actions github-actions bot added the status: triage this issue has not been evaluated yet label Oct 14, 2022
@Franck-UltimaRatio

Franck-UltimaRatio commented Oct 15, 2022

Got this error this morning, even after setting ulimit -n 65000. Our RPC endpoints are Arb1 and Alchemy. The node is running on Ubuntu.

    2022-10-15 12:52:43 2022/10/15 12:52:43 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:42 2022/10/15 12:52:42 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:41 2022/10/15 12:52:41 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:40 2022/10/15 12:52:40 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:39 2022/10/15 12:52:39 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:38 2022/10/15 12:52:38 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:37 2022/10/15 12:52:37 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:36 2022/10/15 12:52:36 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:35 2022/10/15 12:52:35 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:34 2022/10/15 12:52:34 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:34 E1015 12:52:34.084901 4168297 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
    2022-10-15 12:52:33 2022/10/15 12:52:33 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s

@kyriediculous
Contributor

kyriediculous commented Oct 17, 2022

I'm not sure exactly which connection it is, but the server does not seem to be reusing some of these connections properly.

Increasing the limit might work, but it feels like a temporary band-aid until there are enough single-shot requests to reach that limit as well.

You can check the number of open files the next time the error occurs by running

ps aux | grep -i livepeer

Get the livepeer pid and run

lsof -p <livepeer_pid> | wc -l

@Franck-UltimaRatio

Not sure if this is related, but just before this error appeared, I got the error below on 2 of our nodes (combined O/T), so I decided to split them to keep the O running in case the error came from the T. After that, the standalone O started getting hit by the "too many open files" error.
Oct 14 04:40:59 newyorkcity livepeer[3182884]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7fae6c5a2685]
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime stack:
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.throw({0x571796, 0x0})
Oct 14 04:40:59 newyorkcity livepeer[3182884]: /opt/hostedtoolcache/go/1.17.12/x64/src/runtime/panic.go:1198 +0x71
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.sigpanic()
Oct 14 04:40:59 newyorkcity livepeer[3182884]: /opt/hostedtoolcache/go/1.17.12/x64/src/runtime/signal_unix.go:719 +0x396
Oct 14 04:40:59 newyorkcity livepeer[3182884]: goroutine 17006128 [syscall]:
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.cgocall(0x1e74a90, 0xc002ac17e8)

@hobrob

hobrob commented Oct 17, 2022

This error first occurred for me on Friday. Running openSUSE 15.3, my system-wide file limit is ample, but my user-level file limit was only 1024. However, after raising it to 70k+ and restarting the service in a new shell, the same error still crops up within 24 hours. Interestingly, each time I've counted the open files mentioning livepeer with lsof -u root | grep livepeer | wc -l, it always seems to be around the 1000+ mark. Could there be a per-process file limit as well? Anyhow, that still doesn't explain why it suddenly started hitting any limits, and for so many other Os at around the same time.
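
For context: the limit in question is the per-process RLIMIT_NOFILE. ulimit -n only changes it for processes started from that shell, while a service started by systemd keeps whatever LimitNOFILE= it was configured with. A process can also read its own limit directly; the following is a minimal, purely illustrative Go sketch (not part of go-livepeer):

    // Illustrative only: print this process's RLIMIT_NOFILE, the
    // per-process open-file limit that accept4 runs into.
    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        var lim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
            panic(err)
        }
        fmt.Printf("open-file limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
    }

The same numbers are visible in /proc/<pid>/limits for a running process.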

@leszko
Copy link
Contributor

leszko commented Oct 18, 2022

Here's what I believe happens:

  1. There is no timeout on the Orchestrator's /segment endpoint, so when B sends a segment and does not close the connection, the connection stays open.
  2. Each connection holds a file descriptor, so 1024 requests from B to O create 1024 file descriptors.
  3. There is a limit on the number of files a process can have open, and it's 1024 by default.
  4. At that point, O cannot accept new transcoding requests, nor can it catch up with on-chain blocks (because they are stored in the DB, and accessing the DB requires opening a file descriptor).

The simple solution is to add a timeout, and I added it in #2628.

Still, technically, you could cause a DoS attack by sending 1024 requests to O within the HTTP timeout window. But for now, I think this solution is good enough.
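
For illustration, the shape of such a fix on a standard net/http server looks roughly like the sketch below. The port is taken from the logs above, while the handler body and timeout values are made up; the actual change in #2628 may differ.

    // Sketch only: an HTTP server whose timeouts prevent a connection that
    // never completes a request from holding its file descriptor forever.
    package main

    import (
        "net/http"
        "time"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/segment", func(w http.ResponseWriter, r *http.Request) {
            // ... handle the segment / transcode request ...
        })

        srv := &http.Server{
            Addr:              ":8935",
            Handler:           mux,
            ReadHeaderTimeout: 10 * time.Second, // drop clients that stall while sending headers
            ReadTimeout:       1 * time.Minute,  // cap the time to read the whole request
            WriteTimeout:      1 * time.Minute,  // cap the time to write the response
            IdleTimeout:       2 * time.Minute,  // close idle keep-alive connections
        }
        _ = srv.ListenAndServe()
    }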

@leszko leszko removed the status: triage this issue has not been evaluated yet label Oct 18, 2022
@leszko leszko self-assigned this Oct 18, 2022
@leszko leszko added the type: bug Something isn't working label Oct 18, 2022
@stronk-dev
Contributor

I think I might have accidentally caused this issue: I am using a Node.js gRPC library to collect O performance and capabilities (by simply making getOrchestrator gRPC calls and storing the results). The occurrence of the too many open files error seems to coincide with the period where I was experimenting with creating a new client for each call. Closing the client explicitly seems to fix this, so the library probably kept old clients open, even once they went out of scope and were no longer used.

Hopefully the timeout is enough, but it is concerning that such a simple script could cause Os to stop functioning. I can imagine that in the future, as more Bs start experimenting with the network, this issue will resurface. Is there any way we can pace the number of sessions a B can create and keep open?
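
One possible mitigation, sketched below purely for illustration (this is not something go-livepeer does today), is capping how many connections the server will hold open at once using golang.org/x/net/netutil. Note this is a global cap rather than a per-B limit, which would require tracking connections by sender; the cap of 512 and the port are arbitrary.

    // Sketch only: limit concurrent accepted connections so a single client
    // cannot exhaust the process's file descriptors.
    package main

    import (
        "net"
        "net/http"

        "golang.org/x/net/netutil"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8935")
        if err != nil {
            panic(err)
        }
        // Accept at most 512 connections at a time; further connection
        // attempts wait in the kernel backlog instead of consuming fds.
        limited := netutil.LimitListener(ln, 512)
        _ = http.Serve(limited, http.DefaultServeMux)
    }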

@thomshutt
Contributor

Thanks for the surprise stress test @stronk-dev 😄 I was actually discussing with @leszko and @yondonfu whether this was likely caused by someone doing something malicious or accidental!

Agreed that something like this shouldn't make Os stop functioning.

@stronk-dev
Contributor

Yeah, my goal was to give Os a reliable way to gauge their response times in each region and compare them with other Os. There were some weird results, with response times varying a lot for some Os, so I tried to fix this by forcing a new client to be created for each call, as I suspected the clients were being reused.

Apologies to all the Os that were affected by this.
