Limit number of simultaneously opened fds #578

dra27 · 2018-03-03T18:01:33Z

Both @jpdeplaix and I have seen some Travis jobs failing with beta18 running out of fds (see, for example, this Travis log

I bisected the problem and found that it was #486 which introduced the issue. However, it's not a bug in the implementation, it's just that it reveals a problem. The redirect function in Action opens the files which will be written to immediately, but the jobs which are going to write to them (in this case, lots of calls to ocamldep.opt) obviously won't be scheduled at once so the fds are "hogged" by the scheduler. This creates a problem for things like mccs in opam's vendored build which does a lot of copy_files which obviously can proceed while ocamldep calls are going on, but eventually fail because of the number of open fds (I think @avsm has seen the same thing with RWO2?). Unusually, this is therefore a failure which becomes more apparently with -j 1 than less!

The proper fix is quite invasive, as it involves threading the state of redirection through Action and Process - I couldn't see a really neat way of guaranteeing that the file is both only opened as late as possible and also that it's always written to. However, what I've done here captures the common case of with-*-to (run foo)

Non-optimal solution: only handles the common case.

rgrinberg · 2018-03-04T18:05:28Z

Would wrapping up stdout_to and and stderr_to in a Lazy.t help out with this? The issue seems to be that resource acquisition is done too early. Why not delay it until it's actually needed?

I know lazy IO has problems, but we aren't actually letting our handles escape. So we aren't pretending to be composable. Also, only the fd acquisition is lazy. Reading/Writing is still done as eagerly as ever.

dra27 · 2018-03-04T18:56:24Z

I had an earlier (flawed) implementation which did the deferring (there's no need here for lazy values, as Process.std_output_to already has an encoding for laziness via the File or Opened_file constructors). It's not that invasive to make Action.exec use Process.std_output_to instead of (string * out_channel) option, the problem is that Process.run is a "sink" - i.e. if Process.run opens the channel, it's difficult to communicate that back to Action.exec (if you don't communicate it back, then this wouldn't work: (with-stdout-to foo (progn (run echo bar) (run echo baz))). I could see two ways to do this:

Return stdout_to and stderr_to from Process.run (that's a fairly big change, I think)
1a. Pass stdout_to and stderr_to as refs (that's a gross change, I think!)
Do something with Scheduler.wait_for_available_job in Action.exec (I think that's possible to do, what I wasn't convinced was whether it was sensible).

The problem with both those ideas is a slight flaw in the underlying design - once you start a progn pipeline, you want to keep executing those jobs, you don't want the scheduler to run something else before continuing with the next thing in your sequence, because you want the fd to be released as soon as feasible.

ghost · 2018-03-05T12:05:44Z

The problem with both those ideas is a slight flaw in the underlying design - once you start a progn pipeline, you want to keep executing those jobs, you don't want the scheduler to run something else before continuing with the next thing in your sequence, because you want the fd to be released as soon as feasible.

This should be fixed by #586

ghost · 2018-03-05T12:10:12Z

I had an earlier (flawed) implementation which did the deferring (there's no need here for lazy values, as Process.std_output_to already has an encoding for laziness via the File or Opened_file constructors). It's not that invasive to make Action.exec use Process.std_output_to instead of (string * out_channel) option, the problem is that Process.run is a "sink" - i.e. if Process.run opens the channel, it's difficult to communicate that back to Action.exec (if you don't communicate it back, then this wouldn't work: (with-stdout-to foo (progn (run echo bar) (run echo baz))).

It should be fine to use (string * out_channel Lazy.t) option though, right? and pass the lazy value to the Process module

rgrinberg · 2018-03-06T12:00:08Z

The implementation here is good, but the change still seems a little ad hoc. Do we still need this after the scheduling fix? I'd say no if it seems to address the problem for everyone. But if this PR makes the fd usage measurably more efficient, I suppose it should be included.

dra27 · 2018-03-06T16:21:27Z

@diml - using a lazy value is sort of my solution "1a", but some reason using a lazy value seems less unclean than passing a ref! I'd be happy to switch it to that - it would also solve the problem without the special case.

@rgrinberg - does the scheduling change solve this? The issue here isn't the scheduler, it's the delay between opening the file for redirection and actually scheduling anything to it, but I'd only glanced at the scheduling change PR.

ghost · 2018-03-06T17:11:36Z

Looking again at process.ml, using a lazy value is not going to be enough as in the File _ case, we also close the fd right after starting the command.

So the only way to generalize this PR is to correctly set the tail field.

ghost · 2018-03-06T17:13:34Z

I think we merge this as it and leave the generalization for later as it's a more invasive change

Non-optimal solution: only handles the common case.

Limit number of simultaneously opened fds

e30fa83

Non-optimal solution: only handles the common case.

dra27 mentioned this pull request Mar 4, 2018

1.0+beta18 "Too many open files" error #581

Closed

dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018

Work around ocaml/dune#578 on Travis macOS

ff8467d

dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018

Work around ocaml/dune#578 on Travis macOS

c451d25

dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018

Work around ocaml/dune#578 on Travis macOS

da5aedd

ghost approved these changes Mar 6, 2018

View reviewed changes

Merge branch 'master' into limit-fds

67bd214

ghost merged commit b604871 into ocaml:master Mar 8, 2018

hhugo mentioned this pull request Mar 11, 2018

Build fails on opam when ulimit -n is too low janestreet/core_kernel#76

Closed

ghost pushed a commit that referenced this pull request Mar 14, 2018

Limit number of simultaneously opened fds (#578)

a439f3e

Non-optimal solution: only handles the common case.

ghost pushed a commit that referenced this pull request Mar 14, 2018

Limit number of simultaneously opened fds (#578)

143e5a0

Non-optimal solution: only handles the common case.

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit number of simultaneously opened fds #578

Limit number of simultaneously opened fds #578

dra27 commented Mar 3, 2018 •

edited

Loading

rgrinberg commented Mar 4, 2018

dra27 commented Mar 4, 2018

ghost commented Mar 5, 2018

ghost commented Mar 5, 2018

rgrinberg commented Mar 6, 2018 •

edited

Loading

dra27 commented Mar 6, 2018

ghost commented Mar 6, 2018

ghost commented Mar 6, 2018

Limit number of simultaneously opened fds #578

Limit number of simultaneously opened fds #578

Conversation

dra27 commented Mar 3, 2018 • edited Loading

rgrinberg commented Mar 4, 2018

dra27 commented Mar 4, 2018

ghost commented Mar 5, 2018

ghost commented Mar 5, 2018

rgrinberg commented Mar 6, 2018 • edited Loading

dra27 commented Mar 6, 2018

ghost commented Mar 6, 2018

ghost commented Mar 6, 2018

dra27 commented Mar 3, 2018 •

edited

Loading

rgrinberg commented Mar 6, 2018 •

edited

Loading