instance churn causes xvfb to occasionally hang #561

rosshinkley · 2016-04-11T05:36:50Z

I wasn't sure where to put this - a new issue felt the most logical.

Digging into the test failures that happen intermittently, it looks like the behavior that causes the tests to fail also will present in regular applications (albeit seemingly not as often). Consider the following (extremely contrived) case:

var Nightmare = require('nightmare');
//create a fake url payload for serial execution
var urls = [];
for (var i = 0; i < 500; i++) {
    urls.push('http://localhost:7600/navigation');
}
//go to each url, and then end
urls.reduce(function(accumulator, url, ix) {
    return accumulator.then(function(results) {
      console.log(ix);
      return (Nightmare())
        .goto(url)
        .end()
        .then(function() {
          console.log('results');
          results.push(ix);
          return results;
        });
    });
  }, Promise.resolve([]))
  .then(function(results) {
    console.log('done');
    //console.dir(results);
  }).catch(function(){
    console.dir(arguments);
  });

This example creates a lot of instance churn, causing Electron instances to be spun up and killed, much like how the unit test suite works. After enough iterations (I've had it happen in as few as 8 and as many as ~450, seems to be luck of the draw), Nightmare will hang. In a test context, this will cause Mocha to time out, which at least partially explains the intermittent test failures.

The hang appears to happen when a new BrowserWindow instance is created. Enabling ELECTRON_ENABLE_LOGGING and ELECTRON_ENABLE_STACK_DUMPING doesn't yield useful information. Piping Electron output directly to the parent process std* buffers also doesn't yield anything terribly interesting.

What I'd like to try next:

Try to get node-inspector working with Electron. So far, my attempts to get a debugger attached have been met with limited success.
Create BrowserWindow churn natively under Electron to see if the behavior can be recreated internally to Electron.
If that doesn't hang, I'll take a look at creating browser churn with a simplified node process -> electron process. That should help narrow things down.
If all else fails, I'll try to make a custom Electron build to try and figure out what's going on.

I'm open to suggestions and ideas on all points.

Some other peculiar (possibly unrelated?) things that cropped up as a result:

The event forwarding method always throws an error after the .kill() call is made because the window is destroyed. I'll be pulling together a small bugfix for that as time permits. (I don't think it has any bearing on how the rest of the project behaves, but it certainly can't hurt to fix.)
The close event case for a clean exit never happens unless win.close or win.destroy are manually called. In other words, you'll never see "electron child process exited with code 0: success!".

rosshinkley · 2016-04-11T19:42:34Z

Quick update: I was going through the CircleCI documents to make sure the above wasn't caused by something silly. Per the documentation, CircleCI uses xvfb, same as the chroot I was using to recreate the issues. For grins, I fired up xfce on that chroot to run it under a "real" X client, and could not recreate the issue. It works the same as on my main Ubuntu development box, and it seems to work on pretty much everyone's machine.

To that end, I think this might be xvfb-specific. Anyone else having hanging problems with xvfb?

rosshinkley · 2016-04-12T14:17:39Z

Now that I know what to look for, if I'm reading this correctly, the Chromium team has already hit this. It appears to be an issue with forking and a memory allocation deadlock in one of the underlying libraries.

I'm going to take a swing at a workaround.

Mr0grog · 2016-04-12T16:17:39Z

👍 to spending some time researching this.

That said, I think it’s probably a given that, since every instantiation of Nightmare() creates at least two new processes, that’s going to cause significant overhead. Regardless of deeper potential issues in Electron, it will be better to only create one electron process that handles a collection of windows. We could further pool those windows—when someone calls end() on an instance, that’s a signal that we can reuse a window.

Given that Chromium has also hit this, if we add a pooling mechanism as above, that also gives us a way to cap the number of Nightmare instances running simultaneously. We can just queue up the rest to run when a window becomes available.
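For illustration only, the pooling idea above might be sketched like this. WindowPool, acquire, and release are hypothetical names, not part of Nightmare, and createWindow stands in for whatever would actually open a window in the single Electron process.

```javascript
// Hypothetical sketch: WindowPool and its methods are NOT part of Nightmare.
function WindowPool(maxSize, createWindow) {
  this.maxSize = maxSize;           // cap on windows alive at once
  this.createWindow = createWindow; // assumed factory for a window-like object
  this.idle = [];                   // windows released by end(), ready for reuse
  this.total = 0;                   // windows created so far
  this.waiting = [];                // callbacks queued while the pool is full
}

// Hand a window to `callback`, reusing an idle one when possible.
WindowPool.prototype.acquire = function (callback) {
  if (this.idle.length > 0) {
    callback(this.idle.pop());
  } else if (this.total < this.maxSize) {
    this.total += 1;
    callback(this.createWindow());
  } else {
    // At capacity: wait until some instance calls release().
    this.waiting.push(callback);
  }
};

// Called when an instance ends: pass the window to the next waiter, or park it.
WindowPool.prototype.release = function (win) {
  if (this.waiting.length > 0) {
    this.waiting.shift()(win);
  } else {
    this.idle.push(win);
  }
};
```

With a pool of size 2, a third acquire() simply waits until one of the first two windows is released, which is the queueing behavior described above. A real version would also need to reset window state (cookies, listeners, navigation) before reuse.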

rosshinkley · 2016-04-12T18:25:01Z

Consider the time spent. :)

I've got a fix working locally: I can churn through hundreds of instances and finish without a problem - a huge improvement over the average failure point of a dozen or two. I also still need to verify CircleCI will use my patch to execute tests. Once I'm reasonably confident that's working, I'll open a PR for review.

By the by, the fix opens up an avenue of problems having to do with #502. That's probably a topic best left for that issue or for the eventual PR, and I'll be doing a writeup in one of those spots at some point.

This isn't necessarily a resource problem. Having more memory or a faster CPU would likely mask the problem, but wouldn't eliminate it entirely. It has more to do with the underpinnings of xvfb, how resources are requested, and how IPC messaging is done internally to xvfb and glib (at least, that's my tentative understanding).

(Warning: this thread is about to go offtopic.)

That said, I completely agree with you: spinning up multiple Electron instances gets resource-intensive in a hurry. I also agree that Nightmare should probably be what amounts to a BrowserWindow manager - at least, I think that's the kind of direction you're heading in. (I also seem to remember you making a similar comment elsewhere, and can't find it offhand now. It's been something I've wanted to talk to you about since I saw that.)

That approach has a couple of small hurdles:

The first BrowserWindow Electron opens has a special property where if you close it, the whole process dies. I don't know if this "master window" property is transferable (nor do I know what it's appropriately called). The Electron process + first browser window would likely need to be protected until .end() is called, so that's easy enough to sidestep.
Whatever the approach, I would argue it would need to be compatible with the current test suite.
Constructing Nightmare-compatible BrowserWindows would probably need to be broken out into its own class, somewhat similar to how you implemented FrameManager, and a similar class in the calling process will probably need to be constructed.
I'm betting you had parallel execution in mind as well, which makes the problems in #493 all the worse.

I'd also have a couple of questions:

Do you think Nightmare should return back Nightmare-side instances of BrowserWindow (or NightmareBrowserWindow?) to perform tasks? I'm asking to try to get my head around parallel execution.
What's an acceptable maximum of BrowserWindows that can be open? How does that memory footprint differ?

That's all I can think of off the top of my head. I'm willing to dig into the above and start working through some of the implementation bits as time permits.

Thoughts?

rosshinkley · 2016-04-12T18:32:30Z

PS: the comment I couldn't find? I just found. Of course the second I say I can't find it, I happen to spot it. It's buried in the middle of #553.

Mr0grog · 2016-04-13T08:21:09Z

the fix opens up an avenue of problems having to do with #502

Iiiinteresting. All 👂👂👂

This isn't necessarily a resource problem. Having more memory or a faster CPU would likely mask the problem, but wouldn't eliminate it entirely.

Oh, totally; didn’t mean to imply it was! Just that our heavy resource use probably exacerbates 😬

I also agree that Nightmare should probably be what amounts to a BrowserWindow manager - at least, I think that's the kind of direction you're heading in.

Yup.

The first BrowserWindow Electron opens has a special property where if you close it, the whole process dies.

Well, this is kind of funny. It turns out that, if you use the command line to launch a given app, e.g. electron path/to/app, instead of bundling everything up with Electron for distribution, it uses its “default_app” as a sort of launcher (I had no idea this was how it worked). Interestingly, the default_app installs a handler for the window-all-closed event that shuts down the application!

Due to the way it’s written, the solution is to simply add our own window-all-closed event. So long as there’s more than one listener (the default_app’s and ours), the default_app one doesn’t do anything. Who knew? (Not I.)
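To make the listener-count trick concrete, here is a small simulation that runs without Electron. The app object is a plain Node EventEmitter standing in for Electron's app, and the first handler is only an assumed approximation of what the default_app does, based on the description above.

```javascript
var EventEmitter = require('events').EventEmitter;

// Stand-in for Electron's `app`; a plain emitter so this runs without Electron.
var app = new EventEmitter();
var quitCalled = false;
app.quit = function () { quitCalled = true; };

// Assumed shape of the default_app handler: quit only if it is the sole listener.
app.on('window-all-closed', function () {
  if (app.listeners('window-all-closed').length === 1) {
    app.quit();
  }
});

// Our own (empty) listener; its mere presence keeps the process alive.
app.on('window-all-closed', function () {});

app.emit('window-all-closed');
console.log(quitCalled); // prints false
```

In a real Electron main process, the equivalent would just be registering a no-op window-all-closed listener on the real app object.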

Whatever the approach, I would argue it would need to be compatible with the current test suite.

Insofar as the test suite confines itself to public methods/properties on Nightmare and instances (pretty sure it does), I agree. I’m not totally sure we need to make sure things like nightmareInstance.child continues to exist and work the same, though.

I'm betting you had parallel execution in mind as well, which makes the problems in #493 all the worse.

Indeed :D I have a private laundry list of things I’d love to see changed in Nightmare, but the only one that is a Big Deal™ (aka the one that gives me that good old “is this a jenga tower” feel [I kid, but still]) is that one. However! I think that issue is largely independent of this one (save that care should be taken not to make it worse).

Do you think Nightmare should return back Nightmare-side instances of BrowserWindow (or NightmareBrowserWindow?) to perform tasks?

No. I don’t think switching from “N electrons, 1 window” to “1 electron, N windows” should have any externally visible impact on the API (save for private, non-documented bits). The rule that an instance of Nightmare == a window in Electron, however it’s accomplished, is pretty simple and straightforward. (If I’m missing what you’re getting at here, let me know.)

What's an acceptable maximum of BrowserWindows that can be open?

No idea, but:

10 seems like a maybe reasonable place to start ¯_(ツ)_/¯
This should probably be a constant/global so it’s easy to adjust in future releases or as the work on this goes on
This should probably be configurable by users at runtime for beefy or light-weight machines (e.g. Nightmare.setPoolSize(1000) if you like to live dangerously)
A fancypants implementation might look at os.totalmem() + os.freemem() + electron.screen and decide for itself, but even then you are likely dealing with some wobbly heuristics.

Back in #479 I wrote:

For reference, a really simple nightmare session takes 20-30 MB of RAM on my machine, but once a screenshot is taken, that can boost to as high as 360 MB.

The current screenshot API should be much lighter, but I’d guesstimate ~50 MB per window on my machine (13" retina MacBook Pro, 64 bit). That figure will definitely vary across hardware and OS.
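As a rough sketch of the "decide for itself" option, assuming the ~50 MB per-window guess and the default of 10 mentioned above: suggestPoolSize and both constants are hypothetical names, and the per-window figure is a guess rather than a measurement.

```javascript
var os = require('os');

var DEFAULT_POOL_SIZE = 10;                      // starting point suggested above
var ASSUMED_BYTES_PER_WINDOW = 50 * 1024 * 1024; // rough ~50 MB guess, not measured

// Hypothetical helper: how many windows fit in `freeBytes`, clamped to [1, cap].
function suggestPoolSize(freeBytes, cap) {
  var fits = Math.floor(freeBytes / ASSUMED_BYTES_PER_WINDOW);
  return Math.max(1, Math.min(cap, fits));
}

// The result depends on the machine, which is exactly the "wobbly" part.
console.log(suggestPoolSize(os.freemem(), DEFAULT_POOL_SIZE));
```

Even in this tiny form, the answer changes from run to run with os.freemem(), so a fixed, user-adjustable constant is probably the simpler starting point.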

rosshinkley · 2016-04-13T15:00:55Z

Iiiinteresting. All 👂👂👂

TL;DR: Make isn't the friendliest thing to Windows. There's a night sunk into VMs and tinkering that I haven't had time to do yet.

Oh, totally; didn’t mean to imply it was! Just that our heavy resource use probably exacerbates 😬

It aaaaabsolutely does. The results I get under chroot on my Chromebook (4gb of ram with a Tegra K1 processor) vs my main development box (16gb/i5) are pretty different. :)

Interestingly, the default_app installs a handler for the window-all-closed event that shuts down the application!
Who knew? (Not I.)

Hhhhhuh. The more you know. 🌠

I had seen the event before, but I hadn't realized that Electron internally deals with it in that way, and the documentation is ... misleading? Thanks for the education, and good eye!

Insofar as the test suite confines itself to public methods/properties on Nightmare and instances (pretty sure it does), I agree.

I think the only time internal methods - and specifically, evaluate_now - are used is during the custom action battery.

I’m not totally sure we need to make sure things like nightmareInstance.child continues to exist and work the same, though.

Oooh, I'm not sure I agree. child may take on a different meaning (eg, a Nightmare BrowserWindow could be attached to an Electron BrowserWindow, and I'll get to more on that in a minute).

However! I think [#493] is largely independent of this one (save that care should be taken not to make it worse).

True. I have a foul habit of "since we're going to have this down to the studs anyway, why not also do [insert stupidly complex thing]?" Reorganizing how Nightmare and Electron interact seems like a good time to think about simultaneous calls/multiple runs/handling messaging spaghetti.

No. I don’t think switching from “N electrons, 1 window” to “1 electron, N windows” should have any externally visible impact on the API (save for private, non-documented bits). The rule that an instance of Nightmare == a window in Electron, however it’s accomplished, is pretty simple and straightforward. (If I’m missing what you’re getting at here, let me know.)

Mmmm, don't know that I agree. The "window manager" would need to expose methods for listing, retrieving, creating, and closing windows. One of the biggest side perks to doing this is being able to handle pages that open new tabs/windows: when a new window is created from the browser, it could then be created with the Nightmare sugar and registered with the window manager.

I guess my point is that windows are not just created by users, they can also be created by the site, and both cases should probably be handled.
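A minimal sketch of what such a manager could expose, purely as an illustration: WindowManager and its methods are hypothetical, and the objects it stores are plain stand-ins for real BrowserWindow wrappers.

```javascript
// Hypothetical sketch: WindowManager is an illustration, not Nightmare's API.
function WindowManager() {
  this.windows = {}; // id -> window wrapper
  this.nextId = 1;
}

// Register a window, whether the user asked for it or the page opened it.
WindowManager.prototype.create = function (openedBy) {
  var id = this.nextId++;
  this.windows[id] = { id: id, openedBy: openedBy || 'user' };
  return this.windows[id];
};

// Retrieve a window by id, or null if it is unknown or already closed.
WindowManager.prototype.get = function (id) {
  return this.windows[id] || null;
};

// List every window currently registered.
WindowManager.prototype.list = function () {
  var self = this;
  return Object.keys(this.windows).map(function (id) {
    return self.windows[id];
  });
};

// Forget a window; returns whether it was registered.
WindowManager.prototype.close = function (id) {
  var existed = id in this.windows;
  delete this.windows[id];
  return existed;
};
```

A site-created window would go through the same create() path as a user-created one, for example manager.create('site'), so both cases end up in the same registry.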

10 seems like a maybe reasonable place to start ¯_(ツ)_/¯

Deathly curious what the memory footprint difference is. I'd like to put together something about that, if nothing else for justifying the change.

This should probably be a constant/global so it’s easy to adjust in future releases or as the work on this goes on

Strongly agree.

This should probably be configurable by users at runtime for beefy or light-weight machines (e.g. Nightmare.setPoolSize(1000) if you like to live dangerously)

I also like to live dangerously.

A fancypants implementation might look at os.totalmem() + os.freemem() + electron.screen and decide for itself, but even then you are likely dealing with some wobbly heuristics.

Not that I'm against a highfalutin' implementation like this, but I worry that then you're at the whim and mercy of the OS and may introduce cross-platform problems that are hard to debug/track down. At least for now, I'm in favor of keeping it simple.

Back in #479 I wrote:

For reference, a really simple nightmare session takes 20-30 MB of RAM on my machine, but once a screenshot is taken, that can boost to as high as 360 MB.

The current screenshot API should be much lighter, but I’d guesstimate ~50 MB per window on my machine (13" retina MacBook Pro, 64 bit). That figure will definitely vary across hardware and OS.

(Emphasis mine.) Per BrowserWindow or per Electron instance?

Mr0grog · 2016-04-13T16:07:52Z

Electron internally deals with it in that way, and the documentation is ... misleading?

Well, if I understand it right, that all only applies for “non-bundled” stuff. So if your electron app is sitting inside electron’s resources folder (as it is when a standalone app is all bundled up for distribution), default_app doesn’t get loaded and everything matches the docs. Regardless, that doesn’t matter for us.

The "window manager" would need to expose methods for listing, retrieving, creating, and closing windows.

I guess I was thinking this would be the role of some component that is entirely private—a new nightmare instance would ask that manager for a communications channel to a window when created and would close the channel/tell the manager it’s done when end-ed.

One of the biggest side perks to doing this is being able to handle pages that open new tabs/windows: when a new window is created from the browser, it could then be created with the Nightmare sugar and registered with the window manager.

Ah, hadn’t thought about child windows. This would still necessarily require new API for interacting with those child windows, though, right? Or I guess you could have an event for new-child-window or something that includes as a parameter a wrapper for the window. That could definitely be a Nightmare instance, though. (Seems like it would be the nicest way to do it for users at first blush.)

Deathly curious what the memory footprint difference is. I'd like to put together something about that, if nothing else for justifying the change.

+ 💯

I worry that [with complicated heuristics] you're at the whim and mercy of the OS and may introduce cross-platform problems that are hard to debug/track down. At least for now, I'm in favor of keeping it simple.

Toooootally. Said much more clearly than my “wobbly heuristics” note :P

I’d guesstimate ~50 MB per window on my machine (13" retina MacBook Pro, 64 bit). That figure will definitely vary across hardware and OS.

Per BrowserWindow or per Electron instance?

I have totally forgotten! I think my “20-30 MB normally” and “50-300 MB screenshotting” estimates were a combined 1 Electron + 1 window, so I guess each window would be a bit lighter (but I’m also guessing not a lot—the electron app process is probably relatively light). I also don’t remember whether this was with an essentially blank page or something fairly busy like yahoo.com.

Mr0grog · 2016-04-13T16:21:52Z

Reorganizing how Nightmare and Electron interact seems like a good time to think about simultaneous calls/multiple runs/handling messaging spaghetti.

Not totally disagreeing. I think there are really two different things tied up in #493—one is about pure safety (making sure that two simultaneous attempts to perform a routine in electron’s processes don’t wind up with crossed wires), which fits in more tightly with the work we’re talking about here. The other is whether the overall behavior/API should change to make those kinds of potential crossed-wire situations less likely to occur. That one is a big enough deal all on its own that I think it’s worth separating out.

Both of those changes and this one are all potentially high impact, though. Ideally this process/window management work won’t have any API impact for users, but I suspect there’s a reasonable likelihood it might change the story for plugins. Whatever we do with IPC will almost undoubtedly affect both plugins and end-user usage. It might be worth releasing all these changes together as a major revision given they all might be breaking changes.
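For the "pure safety" half, one generic way to keep wires from crossing is to tag every request with a unique id and only accept the response carrying that id. The sketch below is an assumption-laden illustration: channel is a plain EventEmitter standing in for the real IPC link, and call and flushReversed are hypothetical names.

```javascript
var EventEmitter = require('events').EventEmitter;

// Stand-in for the IPC channel between the Nightmare process and Electron.
var channel = new EventEmitter();
var nextCallId = 1;
var pending = [];

// Fake "Electron side": collects requests so they can be answered out of order.
channel.on('request', function (id, name, arg) {
  pending.push({ id: id, name: name, arg: arg });
});

// Answer the most recent request first, echoing each request's id back.
function flushReversed() {
  while (pending.length > 0) {
    var req = pending.pop();
    channel.emit('response', req.id, req.name + ':' + req.arg);
  }
}

// Caller side: each call listens only for the response carrying its own id.
function call(name, arg, done) {
  var id = nextCallId++;
  function onResponse(responseId, result) {
    if (responseId !== id) return; // someone else's answer, ignore it
    channel.removeListener('response', onResponse);
    done(result);
  }
  channel.on('response', onResponse);
  channel.emit('request', id, name, arg);
}
```

Two simultaneous calls to the same action each get their own result back, even when the responses arrive in the opposite order from the requests.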

rosshinkley mentioned this issue Apr 11, 2016

Crashing with no useful information #405

Closed

rosshinkley changed the title ~~flaky tests/instance churn~~ instance churn causes xvfb to occasionally hang Apr 11, 2016

Mr0grog mentioned this issue Apr 12, 2016

how to write code to prevent any memory issues. #562

Closed

rosshinkley mentioned this issue Apr 12, 2016

run tests under an xvfb wrapper when headless #565

Merged

Mr0grog mentioned this issue Apr 13, 2016

Simultaneous calls to the same action can cause early returns/continues with the wrong results #493

Closed

rosshinkley closed this as completed in #565 Apr 14, 2016

rosshinkley mentioned this issue Apr 14, 2016

npm scripts instead of Makefile #502

Closed

Mr0grog mentioned this issue Apr 20, 2016

Callback-based IPC #579

Merged

rosshinkley mentioned this issue Apr 27, 2016

Code 127 - you may not have electron installed correctly #602

Closed

rosshinkley mentioned this issue May 24, 2016

Question: Any idea why electron update in 2.1.4 broke Jenkins? #525

Closed

rosshinkley mentioned this issue Jun 7, 2016

Running Nightmare headlessly on Linux #224

Closed
