
accept4: too many open files; retrying in 1s #2621

Closed
Titan-Node opened this issue Oct 14, 2022 · 8 comments · Fixed by #2628
Assignees
Labels
type: bug Something isn't working

Comments

@Titan-Node

Describe the bug
A couple of reports of this error occurring, putting the node into a frozen state.

E1014 16:14:04.380446   76519 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
2022/10/14 16:14:04 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:04 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:05 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:05 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:06 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:06 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:07 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:07 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:08 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:08 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
E1014 16:14:09.316321   76519 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
2022/10/14 16:14:09 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:09 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:10 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:10 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:11 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:11 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:12 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:12 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:13 http: Accept error: accept tcp 161.35.223.50:7935: accept4: too many open files; retrying in 1s
2022/10/14 16:14:13 http: Accept error: accept tcp [::]:8935: accept4: too many open files; retrying in 1s

Desktop (please complete the following information):

  • Ubuntu 20.04
  • Standalone O
  • Livepeer Version 0.5.34

I don't know how to reproduce it, but 2 of my nodes have hit this error within 12 hours, and there are other reports as well. No recent updates or upgrades.
Maybe a community node RPC endpoint issue?

@github-actions github-actions bot added the status: triage this issue has not been evaluated yet label Oct 14, 2022
@Franck-UltimaRatio

Franck-UltimaRatio commented Oct 15, 2022

Got this error this morning, even after setting ulimit -n 65000. Our RPC endpoints are Arb1 and Alchemy. The node is running on Ubuntu.

    2022-10-15 12:52:43 2022/10/15 12:52:43 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:42 2022/10/15 12:52:42 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:41 2022/10/15 12:52:41 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:40 2022/10/15 12:52:40 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:39 2022/10/15 12:52:39 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:38 2022/10/15 12:52:38 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:37 2022/10/15 12:52:37 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:36 2022/10/15 12:52:36 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:35 2022/10/15 12:52:35 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:34 2022/10/15 12:52:34 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s
    2022-10-15 12:52:34 E1015 12:52:34.084901 4168297 block_watcher.go:123] blockwatch.Watcher error encountered - trying again on next polling interval err="unable to open database file"
    2022-10-15 12:52:33 2022/10/15 12:52:33 http: Accept error: accept tcp [::]:8936: accept4: too many open files; retrying in 1s

@kyriediculous
Contributor

kyriediculous commented Oct 17, 2022

I'm not sure exactly which connection it is, but the server does not seem to be reusing some of these connections properly.

Increasing the limit might work, but it feels like a temporary band-aid until there are enough single-shot requests to reach that limit as well.

You can check the number of open files the next time the error occurs by running

ps aux | grep -i livepeer

Get the livepeer pid and run

lsof -p <livepeer_pid> | wc -l

@Franck-UltimaRatio

Not sure if this is related, but just before this error appeared, I got the error below on 2 of our nodes (combined O/T), so I decided to split them to keep the O running in case the error came from the T. After that, the standalone O started getting hit by the "too many open files" error.
Oct 14 04:40:59 newyorkcity livepeer[3182884]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7fae6c5a2685]
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime stack:
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.throw({0x571796, 0x0})
Oct 14 04:40:59 newyorkcity livepeer[3182884]: /opt/hostedtoolcache/go/1.17.12/x64/src/runtime/panic.go:1198 +0x71
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.sigpanic()
Oct 14 04:40:59 newyorkcity livepeer[3182884]: /opt/hostedtoolcache/go/1.17.12/x64/src/runtime/signal_unix.go:719 +0x396
Oct 14 04:40:59 newyorkcity livepeer[3182884]: goroutine 17006128 [syscall]:
Oct 14 04:40:59 newyorkcity livepeer[3182884]: runtime.cgocall(0x1e74a90, 0xc002ac17e8)

@hobrob

hobrob commented Oct 17, 2022

This error first occurred for me on Friday. Running openSUSE 15.3, my system-wide file limit is ample, but my user-level file limit was only 1024. However, after raising it to 70k+ and restarting the service in a new shell, the same error still crops up within 24 hours. Interestingly, each time I've counted the open files mentioning livepeer with lsof -u root | grep livepeer | wc -l, it always seems to be around the 1000+ mark. Could there be a per-process file limit as well? Anyhow, that still doesn't explain why it suddenly started hitting any limits, and for so many other Os at around the same time.
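
For context: the limit in question is the per-process RLIMIT_NOFILE. ulimit -n only changes it for processes started from that shell, while a service started by systemd keeps whatever LimitNOFILE= it was configured with. A process can also read its own limit directly; the following is a minimal, purely illustrative Go sketch (not part of go-livepeer):

    // Illustrative only: print this process's RLIMIT_NOFILE, the
    // per-process open-file limit that accept4 runs into.
    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        var lim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
            panic(err)
        }
        fmt.Printf("open-file limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
    }

The same numbers are visible in /proc/<pid>/limits for a running process.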

@leszko
Copy link
Contributor

leszko commented Oct 18, 2022

Here's what I believe happens:

  1. There is no timeout on the Orchestrator's /segment endpoint, so when B sends a segment and does not close the connection, the connection stays open.
  2. Each connection holds a file descriptor, so 1024 requests from B to O create 1024 file descriptors.
  3. There is a limit on the number of files a process can have open, and it's 1024 by default.
  4. At that point, O cannot accept new transcoding requests, nor can it catch up with on-chain blocks (because they are stored in the DB, and accessing the DB requires opening a file descriptor).

The simple solution is to add a timeout, and I added it in #2628.

Still, technically, you could cause a DoS attack by sending 1024 requests to O within the HTTP timeout window. But for now, I think this solution is good enough.
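
For illustration, the shape of such a fix on a standard net/http server looks roughly like the sketch below. The port is taken from the logs above, while the handler body and timeout values are made up; the actual change in #2628 may differ.

    // Sketch only: an HTTP server whose timeouts prevent a connection that
    // never completes a request from holding its file descriptor forever.
    package main

    import (
        "net/http"
        "time"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/segment", func(w http.ResponseWriter, r *http.Request) {
            // ... handle the segment / transcode request ...
        })

        srv := &http.Server{
            Addr:              ":8935",
            Handler:           mux,
            ReadHeaderTimeout: 10 * time.Second, // drop clients that stall while sending headers
            ReadTimeout:       1 * time.Minute,  // cap the time to read the whole request
            WriteTimeout:      1 * time.Minute,  // cap the time to write the response
            IdleTimeout:       2 * time.Minute,  // close idle keep-alive connections
        }
        _ = srv.ListenAndServe()
    }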

@leszko leszko removed the status: triage this issue has not been evaluated yet label Oct 18, 2022
@leszko leszko self-assigned this Oct 18, 2022
@leszko leszko added the type: bug Something isn't working label Oct 18, 2022
@stronk-dev
Contributor

I think I might have accidentally caused this issue: I am using a Node.js gRPC library to collect O performance and capabilities (by simply making getOrchestrator gRPC calls and storing the results). The occurrence of the too many open files error seems to coincide with the period where I was experimenting with creating a new client for each call. Closing the client explicitly seems to fix this, so the library probably kept old clients open, even once they went out of scope and were no longer used.

Hopefully the timeout is enough, but it is concerning that such a simple script could cause Os to stop functioning. I can imagine that in the future, as more Bs start experimenting with the network, this issue will resurface. Is there any way we can pace the number of sessions a B can create and keep open?
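
One possible mitigation, sketched below purely for illustration (this is not something go-livepeer does today), is capping how many connections the server will hold open at once using golang.org/x/net/netutil. Note this is a global cap rather than a per-B limit, which would require tracking connections by sender; the cap of 512 and the port are arbitrary.

    // Sketch only: limit concurrent accepted connections so a single client
    // cannot exhaust the process's file descriptors.
    package main

    import (
        "net"
        "net/http"

        "golang.org/x/net/netutil"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8935")
        if err != nil {
            panic(err)
        }
        // Accept at most 512 connections at a time; further connection
        // attempts wait in the kernel backlog instead of consuming fds.
        limited := netutil.LimitListener(ln, 512)
        _ = http.Serve(limited, http.DefaultServeMux)
    }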

@thomshutt
Contributor

Thanks for the surprise stress test @stronk-dev 😄 I was actually discussing with @leszko and @yondonfu whether this was likely caused by someone doing something malicious or accidental!

Agreed that something like this shouldn't make Os stop functioning.

@stronk-dev
Contributor

Yeah, my goal was to give Os a reliable way to gauge their response times in each region and compare them with other Os. There were some weird results, with response times varying a lot for some Os, so I tried to fix this by forcing a new client to be created for each call, as I suspected the clients were being reused.

Apologies to all the Os that were affected by this.
