-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
async_wrap,lib: Firing async-wrap callbacks for next-tick callbacks #6082
async_wrap,lib: Firing async-wrap callbacks for next-tick callbacks #6082
Conversation
This change updates nextTick() callback processing to call back into the node runtime to notify the runtime of async state transitions. Currently, the runtime will track the active async ID and fire any registered async-wrap callbacks. This gives async-wrap consistent user semantics for nextTick() callbacks. I anticipate some concerns around impact to perf. Rough measurements show an approximate 2x increase in time to process 1,000,000 nextTick() callbacks. These went from an average of 1094 ms per million nextTick() callbacks to 2133 ms per million with the changes in this PR. I'm open to alternative suggestions around implementation here. :) It's not lost on me that the implementation of next_tick.js is making an effort to minimize transitions from JS to the runtime, and this change increases those transitions by a factor of three. One of the goals here was to have basic entry-points for javascript code to notify the runtime of async transitions (namely, "enqueue", "start", and "end"). This allows a future where async call graph information can be tracked by the runtime and eliminates a number of cases where users will require the async-wrap callbacks. For example, continuation-local storage can be implemented on such an API, without callbacks, assuming each node in the graph contains a tuple of (unique ID, parent ID, pending child callback count, state), and where state is one of enqueued, running, or completed.
I'm sorry. I'm way too busy to review anything right now, @trevnorris will have to take care of this. |
I had an alternate implementation sitting in a local branch. Pushed part of it to trevnorris@4155f0e. This only handles a uid to |
Thanks @trevnorris . I took a quick look at your changes. From my measurements using your array technique, it takes 1-2 milliseconds to read an int value 1,000,000 times. Using the alternative in this PR (creating a v8:Integer each time) takes about 10-12 milliseconds per 1,000,000 reads. So, your approach is clearly faster, but my estimates are it will save only about 1-2% of the total cost of 1,000,000 nextTick callbacks. I didn't run this under a profiler, but I think the dominate factor here will be the transitions from JS->runtime->JS and back, for each of init/begin/complete callbacks. I can implement the different techniques and get some concrete measurements if you think it is important? |
@trevnorris - Any more feedback with these changes? I'd like to move this forward and I believe this PR gets to consistent API behavior for nextTick() callbacks. I tried out the technique using the uint32 array, and it was slower by ~20%! Its possible I did something wrong. Feel free to take a look at the changes here. This change just made it easy to test the compare the two approaches by using calling either Here's the code I ran to test out the perf of 1,000,000 callbacks: var moment = require('moment');
console.log("starting");
var start = new moment();
var count = 0;
var numIterations = 1000000;
for (var i = 0; i < numIterations; i++) {
process.nextTick(function() {
// uncomment below to use retrieve current/next IDs via uint32 array
// process.nextTickTemp(function() {
count++;
if (count === numIterations ) {
var now = new moment();
var elapsed = now.diff(start);
console.log("elapsed time is " + elapsed + " milliseconds");
process.exit();
}
});
} |
There isn't a need to cross over between C++ and JS while executing the callbacks. There's no need to store that much state. It can be a lot simpler and cheaper. This also doesn't take into account the integration with timers, which also are under the same constraints. If using shared flags is slower than an implementation detail is incorrect. I've already tested it and overhead is almost unnoticeable. That var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
(function runner() {
if (++cntr >= ITER) return done();
process.nextTick(runner);
}());
function done() {
t = process.hrtime(t);
console.log(((t[0] * 1e9 + t[1]) / ITER).toFixed(1) + ' ns/op');
} Or if you want to test having more than one callback in the var ITER = 1e6;
var cntr = 0;
var t = process.hrtime();
function noop() {}
(function runner() {
if (++cntr >= ITER) return done();
for (var i = 0; i < 9; i++)
process.nextTick(noop);
process.nextTick(runner);
}());
function done() {
t = process.hrtime(t);
console.log(((t[0] * 1e9 + t[1]) / (ITER * 10)).toFixed(1) + ' ns/op');
} As far as whether the performance difference matters, yes it does. This is the center of the event loop. We are already battling performance hindrances from domains needing to be checked (missed changing |
Thanks @trevnorris. Some comments/questions:
One of my goals here was to consolidate the logic around async enqueue/begin-execution/end-execution (here, here and here respectively). While any logic associated with these events can be replicated in javascript, this comes at the expense of code maintainability/readability.
Sorry, I'm not sure I'm following - specifically what state are you referring to? This PR only adds state to the runtime to track the "current-async-id", and state to each next-tick object (lines 155-157) to allow async-wrap callbacks with correct user semantics.
Interesting... it seemed to run fine for me. I'll play around with your suggestion to see if results differ.
Are there specific benchmarks I can run to measure a user's impact here? Probably my ignorance, but I'm a bit uncomfortable talking in the abstract about "it has to be fast". In my experience, it is more productive to speak quantitatively about how a set of changes impacts users' scenarios (e.g., "this change results in a 10% reduction in http requests/second").
Thanks. I'll update... |
Take a quick look around the file you're working in and you'll see that consolidating is not as important as performance.
You're adding 4 properties to a
Running, and then running in a way that gives you useful numbers are different. There is no real world use case where queue'ing up that many nextTick's exists.
There are too many variables to do a simple benchmark. The most reliable way is to generate the JIT'd code and analyze the difference. But in short, when running any benchmark that uses
I've seen allowing minor regressions happen since v0.8. Thanks to all those small hits, node is now noticeably slower performing I/O and async operations than it used to be. I'm willing to accept compromise in less performance intensive areas, but not here. For the most basic test, say that the nextTick benchmarks I posted don't regress any more than 5 maybe 10%. |
This was the necessary state to call into existing AsyncWrap callbacks. Are there specific suggestions you have to simplify? A few things I can think of here are to drop the parent ID (should be available after init() call) and to drop the "asyncState" object, however not having a state object associated with the callbacks would be inconsistent current form of async-wrap APIs.
It would be really helpful to me if you could point me at some existing tests and benchmarks that quantify the impact of changes such as this.
I'll take a look at trying to reduce the overhead here. |
The c++ code linting here doesn't actually pass. I'm guessing you are running on windows, since our vcbuild file isn't able to invoke our cpplint. Here's the errors:
|
Yeah, sorry ran this on windows. I'll get this cleaned up. Looks like whitespace/line-length issues. |
JS to runtime. Did this by doing the following: - Added a JS file at ./lib/async_wrap.js to expose certain async-wrap functionality to JS code. - exposed array of async-wrap callbacks from runtime to JS via an array. This is in anticpation of eventually supporting multiple async-wrap callbacks. - exposed current-async-ID & next-async-ID counters from runtime through JS via mapping an int64 to int32[2]. This allows reading/ writing of these values without transition to runtime & re-allocaiton of v8 types on each read/write.
Updated PR to reduce the number of times the nextTick logic needs to transition from
I'm going to look into quantifying how this change impacts node benchmarks. Please let me know of any concerns or specific suggestions around improving perf. |
7da4fd4
to
c7066fb
Compare
c133999
to
83c7a88
Compare
ping @mike-kaufman Can you rebase this? Thanks. |
I think this is superseded by #11883 |
Yes, closing this. |
Checklist
Caveats:
Affected core subsystem(s)
async_wrap;lib
Description of change
This change updates nextTick() callback processing to call back into
the node runtime to notify the runtime of async state transitions.
Currently, the runtime will track the active async ID and fire any
registered async-wrap callbacks. This gives async-wrap consistent user
semantics for nextTick() callbacks.
I anticipate some concerns around impact to perf. Rough measurements
show an approximate 2x increase in time to process 1,000,000 nextTick()
callbacks. These went from an average of 1094 ms per million
nextTick() callbacks to 2133 ms per million with the changes in this PR.
I'm open to alternative suggestions around implementation here. :) It's
not lost on me that the implementation of next_tick.js is making an
effort to minimize transitions from JS to the runtime, and this change
increases those transitions by a factor of three. One of the goals
here was to have basic entry-points for javascript code to notify the
runtime of async transitions (namely, "enqueue", "start", and "end").
This allows a future where async call graph information can be tracked
by the runtime and eliminates a number of cases where users will
require the async-wrap callbacks. For example, continuation-local
storage can be implemented on such an API, without callbacks, assuming
each node in the graph contains a tuple of (unique ID, parent ID,
pending child callback count, state), and where state is one of
enqueued, running, or completed.