Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v10.x] src: dispose of V8 platform in process.exit() #26048

Closed
wants to merge 2 commits into from

Conversation

MylesBorins
Copy link
Contributor

Calling process.exit() calls the C exit() function, which in turn
calls the destructors of static C++ objects. This can lead to race
conditions with other concurrently executing threads; disposing of all
Worker threads and then the V8 platform instance helps with this
(although it might not be a full solution for all problems of
this kind).

Refs: #24403
Refs: #25007

PR-URL: #25061
Reviewed-By: Gireesh Punathil [email protected]
Reviewed-By: Joyee Cheung [email protected]
Reviewed-By: Ben Noordhuis [email protected]

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. v10.x labels Feb 11, 2019
@MylesBorins MylesBorins requested a review from addaleax February 11, 2019 17:28
BethGriggs
BethGriggs previously approved these changes Feb 11, 2019
@MylesBorins
Copy link
Contributor Author

@BethGriggs
Copy link
Member

BethGriggs commented Feb 12, 2019

BethGriggs pushed a commit that referenced this pull request Feb 13, 2019
Calling `process.exit()` calls the C `exit()` function, which in turn
calls the destructors of static C++ objects. This can lead to race
conditions with other concurrently executing threads; disposing of all
Worker threads and then the V8 platform instance helps with this
(although it might not be a full solution for all problems of
this kind).

Refs: #24403
Refs: #25007

Backport-PR-URL: #26048
PR-URL: #25061
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
@BethGriggs
Copy link
Member

Landed in 3c3f350

@BethGriggs BethGriggs closed this Feb 13, 2019
@MylesBorins MylesBorins reopened this Feb 19, 2019
@MylesBorins
Copy link
Contributor Author

This commit breaks some repos for v10.x. For example gulp-util

> [email protected] test /Users/mborins/code/gulp-util
> jshint *.js lib/*.js test/*.js && mocha

sh: line 1: 23190 Abort trap: 6           jshint *.js lib/*.js test/*.js
npm ERR! Test failed.  See above for more details.

We are going to back it out of the v10.15.2 proposal and we can run CITGM against this PR and figure out what is going on. FWIW there were no breakages for citgm with this change on 11.x, so it is possible that there are some other required changes.

@gireeshpunathil I know that you wanted this to land on 10.x to fix some other related issues, perhaps you want to dig in to try and get this back in the 10.15.2 proposal

@MylesBorins MylesBorins dismissed stale reviews from gireeshpunathil and BethGriggs February 19, 2019 16:51

Needs new reviews as this PR broke 10.x

Copy link
Member

@BethGriggs BethGriggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 until CITGM errors are resolved

@addaleax
Copy link
Member

@MylesBorins Do you think you could create a core dump/stack trace for the crash you are seeing? I can’t reproduce it locally.

@MylesBorins
Copy link
Contributor Author

@addaleax I'll dig back in a bit later this week. What command should I run with to get the core dump? 😅

@gireeshpunathil
Copy link
Member

gireeshpunathil commented Feb 20, 2019

@MylesBorins - also can you tell which platform the issue is seen?

@addaleax
Copy link
Member

addaleax commented Feb 20, 2019

@MylesBorins Err, I’m not sure how to do that reliably on macOS. Running ulimit -c unlimited before the crashing command could help, but I’m not sure where the file would end up. (If you get a core dump, the exact node executable that was used to generated it would be helpful for investigating.)

@gireeshpunathil
Copy link
Member

Unable to recreate locally.
@BethGriggs had a core dump which she gave me, but lldb does not recognize it.
Suggested a couple of more options in expectation of getting a good core.

For any one else interested, looks like it is an easy recreate on mac os mojave with this step (node v10.15.2-rc.0): npm install gulp-util

@gireeshpunathil
Copy link
Member

(lldb) bt
* thread #9, stop reason = signal SIGSTOP
  * frame #0: 0x00000001006f3528 node`v8::internal::Parser::ParseFunctionLiteral(v8::internal::AstRawString const*, v8::internal::Scanner::Location, v8::internal::FunctionNameValidity, v8::internal::FunctionKind, int, v8::internal::FunctionLiteral::FunctionType, v8::internal::LanguageMode, v8::internal::ZoneList<v8::internal::AstRawString const*>*, bool*) + 216
    frame #1: 0x00000001006f49b3 node`v8::internal::Parser::DoParseFunction(v8::internal::ParseInfo*, v8::internal::AstRawString const*) + 1347
    frame #2: 0x00000001006f4230 node`v8::internal::Parser::ParseFunction(v8::internal::Isolate*, v8::internal::ParseInfo*, v8::internal::Handle<v8::internal::SharedFunctionInfo>) + 720
    frame #3: 0x000000010071a4e0 node`v8::internal::parsing::ParseFunction(v8::internal::ParseInfo*, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Isolate*) + 832
    frame #4: 0x00000001002cd54c node`v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Compiler::ClearExceptionFlag) + 924
    frame #5: 0x00000001002cdafd node`v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag) + 173
    frame #6: 0x00000001007afe94 node`v8::internal::Runtime_CompileLazy(int, v8::internal::Object**, v8::internal::Isolate*) + 132
    frame #7: 0x00001c866d25be3d
    frame #8: 0x00001c866d21232c
    frame #9: 0x00001c866d2118d5
    frame #10: 0x00001c866d2118d5
...
    frame #20: 0x000000010009e410 node`node::worker::MessagePort::OnMessage() + 320
    frame #21: 0x00000001009a789e node`uv__async_io + 317
    frame #22: 0x00000001009b7295 node`uv__io_poll + 1934
    frame #23: 0x00000001009a7d11 node`uv_run + 315
    frame #24: 0x00000001000d16f1 node`node::worker::Worker::Run() + 1467
    frame #25: 0x00007fff7d1b9661 libsystem_pthread.dylib`_pthread_body + 340
    frame #26: 0x00007fff7d1b950d libsystem_pthread.dylib`_pthread_start + 377
    frame #27: 0x00007fff7d1b8bf9 libsystem_pthread.dylib`thread_start + 13

(lldb) di -s 0x1006f3510 -c 20
node`v8::internal::Parser::ParseFunctionLiteral:
    0x1006f3510 <+192>: movb   $0x48, %r8b
    0x1006f3513 <+195>: movl   $0x0, -0x30(%rbp)
    0x1006f351a <+202>: leaq   0x106a96b(%rip), %rcx     ; v8::internal::FLAG_runtime_stats
    0x1006f3521 <+209>: testq  %rdi, %rdi
    0x1006f3524 <+212>: je     0x1006f3530               ; <+224>
    0x1006f3526 <+214>: movl   (%rcx), %ecx
->  0x1006f3528 <+216>: testl  %ecx, %ecx

(lldb) register read rcx
     rcx = 0x0000000000000000
(lldb) 

RCX is supposed to hold v8::internal::FLAG_runtime_stats

if (V8_UNLIKELY(FLAG_runtime_stats) && did_preparse_successfully) {

(lldb) thread select 1
* thread #1, stop reason = signal SIGSTOP
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7cff1d82 libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff7d1bd824 libsystem_pthread.dylib`_pthread_join + 626
    frame #2: 0x00000001009b2b68 node`uv_thread_join + 14
    frame #3: 0x00000001000d1c1a node`node::worker::Worker::JoinThread() + 38
    frame #4: 0x0000000100029067 node`node::Environment::stop_sub_worker_contexts() + 89
    frame #5: 0x0000000100029001 node`node::Environment::Exit(int) + 43
    frame #6: 0x000000010023667f node`v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo*) + 623

ok, got one crash with this version. Though this (SIGSEGV , unhandled crash) is different from what @BethGriggs reported (SIGABRT, controlled crash), chances are that the root cause may be same?

here, the main thread has cleaned up the execution environment and is waiting for the workers to stop, but one of the worker does not seem to have stopped. I don't know if there is an associated PR that applies here. @addaleax will know!

@addaleax
Copy link
Member

RCX is supposed to hold v8::internal::FLAG_runtime_stats

That’s a static global variable, I’m not sure how accessing it could fail? This looks more like execution was stopped at some random point due to the signal?

here, the main thread has cleaned up the execution environment and is waiting for the workers to stop, but one of the worker does not seem to have stopped.

What is the Worker thread doing at this point? Or is that the first stack trace in your comment?

I don't know if there is an associated PR that applies here.

Not that I know of, at least without knowing why the Worker hasn’t stopped.

@gireeshpunathil
Copy link
Member

@addaleax - yes, the first stack trace is for the worker.
rcx - let me look at it again!

@gireeshpunathil
Copy link
Member

I did 2 mistakes last time: i) used the master source ii) bluntly trusted the lldb's annotations in the right hand side.
But however, given this sequence, the rcx definitely is coming from static area (known to the compiler, computed as a known offset from the current instruction) , and rcx is null, and it is being dereferenced.

    0x1006f351a <+202>: leaq   0x106a96b(%rip), %rcx     ; v8::internal::FLAG_runtime_stats
    0x1006f3521 <+209>: testq  %rdi, %rdi
    0x1006f3524 <+212>: je     0x1006f3530               ; <+224>
    0x1006f3526 <+214>: movl   (%rcx), %ecx
->  0x1006f3528 <+216>: testl  %ecx, %ecx

moving on, I examined several dumps, and get different patterns, but one thing in common is: worker threads are running while main thread is waiting.

Could it be that isolate->TerminateExecution() was not called as part of the worker cleanup? But that does not explain the originally reported issue (no separate isolates exist in that test).

@gireeshpunathil
Copy link
Member

Able to get the original crash! (gulp-util, no worker in the picture)

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff5864223e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff586f8c1c libsystem_pthread.dylib`pthread_kill + 285
    frame #2: 0x00007fff585ab1c9 libsystem_c.dylib`abort + 127
    frame #3: 0x00000001009b28ea node`uv_mutex_lock + 20
    frame #4: 0x00000001000b4cf4 node`node::NodePlatform::ForIsolate(v8::Isolate*) + 32
    frame #5: 0x00000001000b4d9b node`node::NodePlatform::CallOnForegroundThread(v8::Isolate*, v8::Task*) + 33
    frame #6: 0x000000010058e7a1 node`v8::internal::IncrementalMarkingJob::Task::RunInternal() + 369
    frame #7: 0x00000001000b4ac8 node`node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task> >) + 142
    frame #8: 0x00000001000b4472 node`node::PerIsolatePlatformData::FlushForegroundTasksInternal() + 630
    frame #9: 0x00000001000b4668 node`node::PerIsolatePlatformData::Shutdown() + 24
    frame #10: 0x00000001000b4618 node`node::PerIsolatePlatformData::~PerIsolatePlatformData() + 24

...

    frame #14: 0x00000001000b4a11 node`node::NodePlatform::Shutdown() + 43
    frame #15: 0x000000010003b332 node`node::$_0::Dispose() + 18
    frame #16: 0x0000000100029006 node`node::Environment::Exit(int) + 48
    frame #17: 0x000000010023667f node`v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo*) + 623
    frame #18: 0x0000000100235bc1 node`v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous 

...

    frame #44: 0x0000000100042d8a node`node::Start(uv_loop_s*, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) + 326
    frame #45: 0x00000001000428b2 node`node::Start(int, char**) + 711
    frame #46: 0x0000000100001034 node`start + 52

there are only 2 threads, the other thread being inspector on uv_sem_wait.

The mutex does not have a destructor so goes down only with NodePlatform object, so not sure what is happening here!

@addaleax
Copy link
Member

@gireeshpunathil Looks like the issue is that we’re trying to recursively lock per_isolate_mutex_, once from NodePlatform::Shutdown() and once from NodePlatform::ForIsolate?

This reminds me of bafd808. We removed the FlushForegroundTasksInternal() call entirely and replaced it with checks that should make sure that there are no tasks to run, but maybe older versions of V8 behave differently?

I’m wondering whether it would be okay to just remove that call, not replacing it with checks.

@gireeshpunathil
Copy link
Member

wow! that matches.

but further progression from FlushForegroundTasksInternal and presence of v8::internal::IncrementalMarkingJob::Task::RunInternal in the stack indicates that there ARE pending / scheduled tasks right?

But on this very edge of termination, making sure gc activities are fully completed etc. does not seem to be required, so makes sense to me!

@BethGriggs
Copy link
Member

I have done a test build on MacOS and removing the FlushForegroundTasksInternal() call fixes the crash seen with gulp-utils. Thanks @addaleax, @gireeshpunathil!

@MylesBorins
Copy link
Contributor Author

Did this end up getting landed?

@BethGriggs
Copy link
Member

@MylesBorins, no. On merge, this either needs to remove the FlushForegroundTasksInternal() at bafd808#diff-5b8e0acae193b6f40922fb5cda94eec8L268, or should we raise a separate PR to remove that line on v10.x and make sure both backports land together?

@MylesBorins
Copy link
Contributor Author

@BethGriggs do you want to push those chnages to this pr?

addaleax and others added 2 commits March 27, 2019 11:37
Calling `process.exit()` calls the C `exit()` function, which in turn
calls the destructors of static C++ objects. This can lead to race
conditions with other concurrently executing threads; disposing of all
Worker threads and then the V8 platform instance helps with this
(although it might not be a full solution for all problems of
this kind).

Refs: nodejs#24403
Refs: nodejs#25007

PR-URL: nodejs#25061
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Node first calls `Isolate::Dispose`, then
`NodePlatform::UnregisterIsolate`.
This again calls `PerIsolatePlatformData::Shutdown`, which (before this
patch) called `FlushForegroundTasksInternal`, which might call
`RunForegroundTask` if it finds foreground tasks to be executed. This
will fail however, since `Isolate::GetCurrent` was already reset during
`Isolate::Dispose`.
Hence remove the check to `FlushForegroundTasksInternal` and add checks
instead that no more foreground tasks are scheduled.

Refs: v8#86

PR-URL: nodejs#25653
Reviewed-By: Gus Caplan <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
@nodejs-github-bot
Copy link
Collaborator

@BethGriggs
Copy link
Member

@MylesBorins, I cherry-picked the commit we need and rebased the PR

@BethGriggs BethGriggs self-requested a review March 27, 2019 15:20
BethGriggs pushed a commit that referenced this pull request Mar 28, 2019
Calling `process.exit()` calls the C `exit()` function, which in turn
calls the destructors of static C++ objects. This can lead to race
conditions with other concurrently executing threads; disposing of all
Worker threads and then the V8 platform instance helps with this
(although it might not be a full solution for all problems of
this kind).

Refs: #24403
Refs: #25007

Backport-PR-URL: #26048
PR-URL: #25061
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
BethGriggs pushed a commit that referenced this pull request Mar 28, 2019
Node first calls `Isolate::Dispose`, then
`NodePlatform::UnregisterIsolate`.
This again calls `PerIsolatePlatformData::Shutdown`, which (before this
patch) called `FlushForegroundTasksInternal`, which might call
`RunForegroundTask` if it finds foreground tasks to be executed. This
will fail however, since `Isolate::GetCurrent` was already reset during
`Isolate::Dispose`.
Hence remove the check to `FlushForegroundTasksInternal` and add checks
instead that no more foreground tasks are scheduled.

Refs: v8#86

Backport-PR-URL: #26048
PR-URL: #25653
Reviewed-By: Gus Caplan <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
@BethGriggs
Copy link
Member

Landed on v10.x-staging

@BethGriggs BethGriggs closed this Mar 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants