Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit number of simultaneously opened fds #578

Merged
2 commits merged into from
Mar 8, 2018
Merged

Conversation

dra27
Copy link
Member

@dra27 dra27 commented Mar 3, 2018

Both @jpdeplaix and I have seen some Travis jobs failing with beta18 running out of fds (see, for example, this Travis log

I bisected the problem and found that it was #486 which introduced the issue. However, it's not a bug in the implementation, it's just that it reveals a problem. The redirect function in Action opens the files which will be written to immediately, but the jobs which are going to write to them (in this case, lots of calls to ocamldep.opt) obviously won't be scheduled at once so the fds are "hogged" by the scheduler. This creates a problem for things like mccs in opam's vendored build which does a lot of copy_files which obviously can proceed while ocamldep calls are going on, but eventually fail because of the number of open fds (I think @avsm has seen the same thing with RWO2?). Unusually, this is therefore a failure which becomes more apparently with -j 1 than less!

The proper fix is quite invasive, as it involves threading the state of redirection through Action and Process - I couldn't see a really neat way of guaranteeing that the file is both only opened as late as possible and also that it's always written to. However, what I've done here captures the common case of with-*-to (run foo)

Non-optimal solution: only handles the common case.
dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018
dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018
dra27 added a commit to dra27/opam that referenced this pull request Mar 4, 2018
@rgrinberg
Copy link
Member

Would wrapping up stdout_to and and stderr_to in a Lazy.t help out with this? The issue seems to be that resource acquisition is done too early. Why not delay it until it's actually needed?

I know lazy IO has problems, but we aren't actually letting our handles escape. So we aren't pretending to be composable. Also, only the fd acquisition is lazy. Reading/Writing is still done as eagerly as ever.

@dra27
Copy link
Member Author

dra27 commented Mar 4, 2018

I had an earlier (flawed) implementation which did the deferring (there's no need here for lazy values, as Process.std_output_to already has an encoding for laziness via the File or Opened_file constructors). It's not that invasive to make Action.exec use Process.std_output_to instead of (string * out_channel) option, the problem is that Process.run is a "sink" - i.e. if Process.run opens the channel, it's difficult to communicate that back to Action.exec (if you don't communicate it back, then this wouldn't work: (with-stdout-to foo (progn (run echo bar) (run echo baz))). I could see two ways to do this:

  1. Return stdout_to and stderr_to from Process.run (that's a fairly big change, I think)
    1a. Pass stdout_to and stderr_to as refs (that's a gross change, I think!)
  2. Do something with Scheduler.wait_for_available_job in Action.exec (I think that's possible to do, what I wasn't convinced was whether it was sensible).

The problem with both those ideas is a slight flaw in the underlying design - once you start a progn pipeline, you want to keep executing those jobs, you don't want the scheduler to run something else before continuing with the next thing in your sequence, because you want the fd to be released as soon as feasible.

@ghost
Copy link

ghost commented Mar 5, 2018

The problem with both those ideas is a slight flaw in the underlying design - once you start a progn pipeline, you want to keep executing those jobs, you don't want the scheduler to run something else before continuing with the next thing in your sequence, because you want the fd to be released as soon as feasible.

This should be fixed by #586

@ghost
Copy link

ghost commented Mar 5, 2018

I had an earlier (flawed) implementation which did the deferring (there's no need here for lazy values, as Process.std_output_to already has an encoding for laziness via the File or Opened_file constructors). It's not that invasive to make Action.exec use Process.std_output_to instead of (string * out_channel) option, the problem is that Process.run is a "sink" - i.e. if Process.run opens the channel, it's difficult to communicate that back to Action.exec (if you don't communicate it back, then this wouldn't work: (with-stdout-to foo (progn (run echo bar) (run echo baz))).

It should be fine to use (string * out_channel Lazy.t) option though, right? and pass the lazy value to the Process module

@rgrinberg
Copy link
Member

rgrinberg commented Mar 6, 2018

The implementation here is good, but the change still seems a little ad hoc. Do we still need this after the scheduling fix? I'd say no if it seems to address the problem for everyone. But if this PR makes the fd usage measurably more efficient, I suppose it should be included.

@dra27
Copy link
Member Author

dra27 commented Mar 6, 2018

@diml - using a lazy value is sort of my solution "1a", but some reason using a lazy value seems less unclean than passing a ref! I'd be happy to switch it to that - it would also solve the problem without the special case.

@rgrinberg - does the scheduling change solve this? The issue here isn't the scheduler, it's the delay between opening the file for redirection and actually scheduling anything to it, but I'd only glanced at the scheduling change PR.

@ghost
Copy link

ghost commented Mar 6, 2018

Looking again at process.ml, using a lazy value is not going to be enough as in the File _ case, we also close the fd right after starting the command.

So the only way to generalize this PR is to correctly set the tail field.

@ghost
Copy link

ghost commented Mar 6, 2018

I think we merge this as it and leave the generalization for later as it's a more invasive change

@ghost ghost merged commit b604871 into ocaml:master Mar 8, 2018
ghost pushed a commit that referenced this pull request Mar 14, 2018
Non-optimal solution: only handles the common case.
ghost pushed a commit that referenced this pull request Mar 14, 2018
Non-optimal solution: only handles the common case.
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants