
[KVM][warm reboot] syncd crashes during shutdown with double-free-corruption #6466

Closed
vaibhavhd opened this issue Jan 15, 2021 · 24 comments · Fixed by sonic-net/sonic-sairedis#779 or #6581


vaibhavhd (Contributor) commented Jan 15, 2021

sonic_dump_vlab-01_20210120_172005.tar.gz

Description
Tested image: SONiC-OS-202012.5-264ecb18 on a KVM switch; the warm-reboot test leads to a syncd crash.

There are multiple messages about missing OIDs in the switch state map prior to the crash.

Steps to reproduce the issue:

  1. Run a continuous warm reboot test on a KVM DUT.
  2. The DUT recovers after the crash.

Describe the results you received:

Jan 14 04:01:47.221771 vlab-01 NOTICE syncd#syncd: :- ~SwitchState: begin
Jan 14 04:01:47.221771 vlab-01 NOTICE syncd#syncd: :- ~SwitchState: switch oid:0x2100000000
Jan 14 04:01:47.221771 vlab-01 NOTICE syncd#syncd: :- ~SwitchState: end
Jan 14 04:01:47.238132 vlab-01 NOTICE syncd#syncd: :- clean_after_switch_remove: cleaning metadata for switch: oid:0x2100000000
Jan 14 04:01:47.263660 vlab-01 NOTICE syncd#syncd: :- clean_after_switch_remove: removed all objects related to switch oid:0x2100000000
Jan 14 04:01:47.265251 vlab-01 NOTICE syncd#syncd: :- removeAllSwitches: removing switch RID oid:0x2100000000 took 7.503044 sec
Jan 14 04:01:47.266023 vlab-01 NOTICE syncd#syncd: :- run: calling api uninitialize
Jan 14 04:01:47.266171 vlab-01 NOTICE syncd#syncd: :- uninitialize: begin
Jan 14 04:01:47.266318 vlab-01 NOTICE syncd#syncd: :- uninitialize: stopping threads
Jan 14 04:01:47.266523 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: begin
Jan 14 04:01:47.266858 vlab-01 NOTICE syncd#syncd: :- unittestChannelThreadProc: exit VS unittest channel thread

Jan 14 04:01:47.281354 vlab-01 WARNING syncd#syncd: message repeated 58 times: [ :- syncProcessEventPacket: oid oid:0x10000001f and it switch oid:0x2100000000 don't exists in switch state map]


Jan 14 04:01:49.797548 vlab-01 NOTICE root: Finished warm shutdown syncd process ...
Jan 14 04:01:49.931777 vlab-01 INFO syncd#supervisord 2021-01-14 04:01:49,929 INFO exited: syncd (terminated by SIGABRT (core dumped); not expected)

Jan 14 04:01:50.085997 vlab-01 DEBUG /container: container_stop: BEGIN
Jan 14 04:01:50.092784 vlab-01 DEBUG /container: read_data: config:True feature:syncd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Jan 14 04:01:50.093828 vlab-01 DEBUG /container: read_data: config:False feature:syncd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Jan 14 04:01:50.095413 vlab-01 DEBUG /container: container_stop: syncd: set_owner:local current_owner:none remote_state:none docker_id:syncd
Jan 14 04:01:50.935707 vlab-01 INFO syncd#supervisord 2021-01-14 04:01:50,933 WARN received SIGTERM indicating exit request
Jan 14 04:01:50.936418 vlab-01 INFO syncd#supervisord 2021-01-14 04:01:50,935 INFO waiting for supervisor-proc-exit-listener, rsyslogd to die
Jan 14 04:01:50.942039 vlab-01 INFO syncd#/supervisor-proc-exit-listener: Process syncd exited unxepectedly. Terminating supervisor...
Jan 14 04:01:50.942578 vlab-01 INFO syncd#supervisord 2021-01-14 04:01:50,942 WARN received SIGTERM indicating exit request

Describe the results you expected:
Crash-free warm reboot.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
vaibhavhd (Contributor, Author):

From the core dump, the traces of the 3 threads running at the time of the crash are not very helpful:


Core was generated by `/usr/bin/syncd -u -s -p /usr/share/sonic/hwsku/sai.profile'.


(gdb) thread 1
[Switching to thread 1 (Thread 0x7f1aca72a700 (LWP 32))]
#0  0x00007f1accbe17bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) t
[Current thread is 1 (Thread 0x7f1aca72a700 (LWP 32))]
(gdb) bt
#0  0x00007f1accbe17bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1accbcc535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f1accc23508 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f1accc29c1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f1accc2b730 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f1accc2eaf3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f1acd13afc9 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f1accca34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f1acc734bc0 (LWP 19))]
#0  0x00007f1acd13c495 in __pthread_timedjoin_ex () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) 
[Switching to thread 2 (Thread 0x7f1acc734bc0 (LWP 19))]
#0  0x00007f1acd13c495 in __pthread_timedjoin_ex () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007f1acd13c495 in __pthread_timedjoin_ex () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1accfc3d53 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f1acd22a303 in saivs::NetMsgRegistrar::~NetMsgRegistrar() () from /usr/lib/x86_64-linux-gnu/libsaivs.so.0
#3  0x00007f1accbe3d8c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f1accbe3eba in exit () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f1accbce0a2 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x0000563b2987775a in _start ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f1acc72e700 (LWP 28))]
#0  0x00007f1accca37ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f1accca37ef in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1acd2dd66f in swss::Select::poll_descriptors(swss::Selectable**, unsigned int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#2  0x00007f1acd2dd89b in swss::Select::select(swss::Selectable**, int) () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#3  0x00007f1acd2a82c3 in swss::Logger::settingThread() () from /usr/lib/x86_64-linux-gnu/libswsscommon.so.0
#4  0x00007f1accfc3b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f1acd13afa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007f1accca34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

kcudnik (Contributor) commented Jan 20, 2021

Could we get a Redis DB operation dump in the log, to see whether we get an empty VIDTORID map or a missing 0x21000 switch there, and whether FLUSHALL was maybe called on Redis?

vaibhavhd (Contributor, Author):

Could we get a Redis DB operation dump in the log, to see whether we get an empty VIDTORID map or a missing 0x21000 switch there, and whether FLUSHALL was maybe called on Redis?

I don't have the recordings from the time when I reported the issue. But I just reproduced this and attached a new set of logs in the dump: https://github.com/Azure/sonic-buildimage/files/5844189/sonic_dump_vlab-01_20210120_172005.tar.gz

kcudnik (Contributor) commented Jan 20, 2021

From some internet digging, this seems like pthread_cancel was invoked:

(gdb) bt
#0  0x00007f1accbe17bb in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1accbcc535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f1accc23508 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f1accc29c1a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f1accc2b730 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f1accc2eaf3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f1acd13afc9 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f1accca34cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

kcudnik (Contributor) commented Jan 20, 2021

Attaching syslog from the dump:

Jan 20 17:10:25.916734 vlab-01 NOTICE syncd#syncd: :- removeAllSwitches: removing switch RID oid:0x2100000000 took 7.337535 sec
Jan 20 17:10:25.917240 vlab-01 NOTICE syncd#syncd: :- run: calling api uninitialize
Jan 20 17:10:25.917344 vlab-01 NOTICE syncd#syncd: :- uninitialize: begin
Jan 20 17:10:25.917344 vlab-01 NOTICE syncd#syncd: :- uninitialize: stopping threads
Jan 20 17:10:25.917454 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: begin
Jan 20 17:10:25.917651 vlab-01 NOTICE syncd#syncd: :- unittestChannelThreadProc: exit VS unittest channel thread
Jan 20 17:10:25.917755 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: end
Jan 20 17:10:25.917858 vlab-01 NOTICE syncd#syncd: :- stopFdbAgingThread: begin
Jan 20 17:10:25.917989 vlab-01 NOTICE syncd#syncd: :- fdbAgingThreadProc: end
Jan 20 17:10:25.918037 vlab-01 NOTICE syncd#syncd: :- stopFdbAgingThread: end
Jan 20 17:10:25.918104 vlab-01 NOTICE syncd#syncd: :- stopEventQueueThread: begin
Jan 20 17:10:25.918238 vlab-01 NOTICE syncd#syncd: :- processQueueEvent: received EVENT_TYPE_END_THREAD, will process all messages and end
Jan 20 17:10:25.918321 vlab-01 NOTICE syncd#syncd: :- stopEventQueueThread: end
Jan 20 17:10:25.921015 vlab-01 NOTICE syncd#syncd: :- uninitialize: end
Jan 20 17:10:25.921100 vlab-01 NOTICE syncd#syncd: :- run: uninitialize finished
Jan 20 17:10:25.922445 vlab-01 NOTICE syncd#syncd: :- ~NetMsgRegistrar: begin
Jan 20 17:10:25.922898 vlab-01 NOTICE syncd#syncd: :- run: netlink msg listener ended
Jan 20 17:10:25.924744 vlab-01 INFO syncd#/supervisord: syncd double free or corruption (out)
Jan 20 17:10:27.001034 vlab-01 NOTICE root: Finished warm shutdown syncd process ...
Jan 20 17:10:27.196909 vlab-01 DEBUG /container: container_stop: BEGIN
Jan 20 17:10:27.197889 vlab-01 DEBUG /container: read_data: config:True feature:syncd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Jan 20 17:10:27.198560 vlab-01 DEBUG /container: read_data: config:False feature:syncd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Jan 20 17:10:27.199428 vlab-01 DEBUG /container: container_stop: syncd: set_owner:local current_owner:none remote_state:none docker_id:syncd
Jan 20 17:10:27.377792 vlab-01 INFO syncd#supervisord 2021-01-20 17:10:27,376 INFO exited: syncd (terminated by SIGABRT (core dumped); not expected)

Notice that this one is slightly different from the top one: we have a double free or corruption here, and there is no message

Jan 20 17:10:25.917755 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: end

in the previous one.

vaibhavhd changed the title from "[KVM][warm reboot] syncd crash after missing OIDs in switch state map" to "[KVM][warm reboot] syncd crashes during shutdown with double-free-corruption" on Jan 25, 2021
vaibhavhd (Contributor, Author):

From some internet digging, this seems like pthread_cancel was invoked

I tried to dig into the shutdown flow and could not find an explanation for pthread_cancel being invoked. Can you think of a reason pthread_cancel would be called?

Notice that this one is slightly different from the top one: we have a double free or corruption here, and there is no message

Jan 20 17:10:25.917755 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: end

in the previous one.

I think that in the previous one I just missed capturing those log lines while creating the issue. The logs remain very similar in every crash.
Do you know why

kcudnik (Contributor) commented Jan 25, 2021

That's interesting, since there is a difference between the two. If you did not just forget to capture that, there may be a clue in it; from what is currently attached, it seems like it broke in two different places.
Do you maybe still have the previous logs?

vaibhavhd (Contributor, Author):

I don't have the previous logs. But what I am saying is that there was no difference between the previous and the new logs.

Also, I noticed that syncd uninitialize completes before ~NetMsgRegistrar is called:

Jan 20 17:10:25.917344 vlab-01 NOTICE syncd#syncd: :- uninitialize: begin
Jan 20 17:10:25.917344 vlab-01 NOTICE syncd#syncd: :- uninitialize: stopping threads
Jan 20 17:10:25.917454 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: begin
Jan 20 17:10:25.917651 vlab-01 NOTICE syncd#syncd: :- unittestChannelThreadProc: exit VS unittest channel thread
Jan 20 17:10:25.917755 vlab-01 NOTICE syncd#syncd: :- stopUnittestThread: end
Jan 20 17:10:25.917858 vlab-01 NOTICE syncd#syncd: :- stopFdbAgingThread: begin
Jan 20 17:10:25.917989 vlab-01 NOTICE syncd#syncd: :- fdbAgingThreadProc: end
Jan 20 17:10:25.918037 vlab-01 NOTICE syncd#syncd: :- stopFdbAgingThread: end
Jan 20 17:10:25.918104 vlab-01 NOTICE syncd#syncd: :- stopEventQueueThread: begin
Jan 20 17:10:25.918238 vlab-01 NOTICE syncd#syncd: :- processQueueEvent: received EVENT_TYPE_END_THREAD, will process all messages and end
Jan 20 17:10:25.918321 vlab-01 NOTICE syncd#syncd: :- stopEventQueueThread: end
Jan 20 17:10:25.921015 vlab-01 NOTICE syncd#syncd: :- uninitialize: end
Jan 20 17:10:25.921100 vlab-01 NOTICE syncd#syncd: :- run: uninitialize finished

>>>>>>>>> after syncd uninit

Jan 20 17:10:25.922445 vlab-01 NOTICE syncd#syncd: :- ~NetMsgRegistrar: begin
Jan 20 17:10:25.922898 vlab-01 NOTICE syncd#syncd: :- run: netlink msg listener ended
Jan 20 17:10:25.924744 vlab-01 INFO syncd#/supervisord: syncd double free or corruption (out)

  1. Is it possible that, since syncd uninit has finished, something invokes pthread_cancel for every other syncd-initiated thread?
  2. Shouldn't the ~NetMsgRegistrar call be made before declaring that syncd uninitialization has finished?
  3. I noticed that no processing is done for EVENT_TYPE_END_THREAD here: https://github.com/Azure/sonic-sairedis/blob/master/vslib/src/SaiEventQueue.cpp#L60 Is this expected?

kcudnik (Contributor) commented Jan 26, 2021

  1. Still not sure why; I'm just assuming it is pthread_cancel since the stack trace looks exactly the same as others found on the web.
  2. NetMsgRegistrar is a singleton (a static object), and it's destroyed after main() ends (see the sketch below).
  3. That's expected.
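
For illustration of point 2, a minimal sketch (a hypothetical `Singleton` class, not the real NetMsgRegistrar): a function-local static is constructed on first use and destroyed by the C++ runtime only during exit(), i.e. after main() has returned:

```cpp
#include <cstdio>

struct Singleton
{
    static Singleton& getInstance()
    {
        static Singleton instance;   // constructed on first use
        return instance;
    }

    ~Singleton()
    {
        printf("~Singleton: runs during exit(), after main() has returned\n");
    }
};

int main()
{
    Singleton::getInstance();

    printf("main end\n");

    return 0;   // the runtime destroys the static instance only after this point
}
```

Running it prints "main end" first and the destructor message afterwards, which is why any work done in such a destructor happens outside the normal syncd shutdown path.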

vaibhavhd (Contributor, Author):

So I tried removing the newly added unregisterMessageHandler call from the thread, and could not reproduce this error.
So I think this issue has to be related to unregistering the handler, possibly because m_link_thread_event.notify() is sent in the ~NetMsgRegistrar() destructor?

Also, I got a new trace from the core, this time with libc-dbg. As can be seen from thread 1, the error occurs as part of __libc_thread_freeres, which is a thread shutdown function.

(gdb) thread apply all backtrace

Thread 3 (Thread 0x7f7e92d4f700 (LWP 29)):
#0  0x00007f7e932c47ef in epoll_wait (epfd=5, events=0x7f7e8c0018c0, maxevents=46, timeout=1000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f7e938ff7bf in swss::Select::poll_descriptors (this=0x7f7e92d4e850, c=0x7f7e92d4e7c8, timeout=<optimized out>) at select.cpp:128
#2  0x00007f7e938ff9eb in swss::Select::resultToString[abi:cxx11](int) (result=-1831540792) at select.cpp:194
#3  0x0000000000000005 in ?? ()
#4  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f7e92d55bc0 (LWP 20)):
#0  0x00007f7e9375d495 in __GI___pthread_timedjoin_ex (threadid=140181572466432, thread_return=0x0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
#1  0x00007f7e935e4d53 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f7e9384b303 in __gthread_mutex_lock (__mutex=0x7f7e93894178 <saivs::NetMsgRegistrar::getInstance()::instance+56>) at /usr/include/x86_64-linux-gnu/c++/8/bits/gthr-default.h:748
#3  std::mutex::lock (this=0x7f7e93894178 <saivs::NetMsgRegistrar::getInstance()::instance+56>) at /usr/include/c++/8/bits/std_mutex.h:103
#4  std::lock_guard<std::mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/8/bits/std_mutex.h:162
#5  saivs::NetMsgRegistrar::~NetMsgRegistrar (this=0x7f7e93894140 <saivs::NetMsgRegistrar::getInstance()::instance>, __in_chrg=<optimized out>) at NetMsgRegistrar.cpp:34
#6  0x00007f7e93204d8c in __run_exit_handlers (status=0, listp=0x7f7e93386718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#7  0x00007f7e93204eba in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x00007f7e931ef0a2 in __libc_start_main (main=0x55d016c4b080 <main(int, char**)>, argc=5, argv=0x7fff2f7cb838, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff2f7cb828) at ../csu/libc-start.c:342
#9  0x000055d016c4c75a in _start ()

Thread 1 (Thread 0x7f7e90d4b700 (LWP 33)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f7e931ed535 in __GI_abort () at abort.c:79
#2  0x00007f7e93244508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f7e9334f28d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f7e9324ac1a in malloc_printerr (str=str@entry=0x7f7e93350ff8 "double free or corruption (out)") at malloc.c:5341
#4  0x00007f7e9324c730 in _int_free (av=0x7f7e93386c40 <main_arena>, p=0x7f7e7c0000a0, have_lock=<optimized out>) at malloc.c:4306
#5  0x00007f7e9324faf3 in tcache_thread_shutdown () at malloc.c:2961
#6  __malloc_arena_thread_freeres () at arena.c:950
#7  0x00007f7e93252b6c in __libc_thread_freeres () at thread-freeres.c:38
#8  0x00007f7e9375bfc9 in start_thread (arg=<optimized out>) at pthread_create.c:500
#9  0x00007f7e932c44cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) Connection to 10.250.0.101 closed by remote host.

vaibhavhd (Contributor, Author):

I have added a draft PR here: sonic-net/sonic-sairedis#779
I have some more changes in mind that we can discuss before finalizing the fix.

kcudnik (Contributor) commented Jan 26, 2021

Hey, I reproduced this issue locally, and it's fairly consistent (3/3). I will work on figuring out where the issue is.

kcudnik (Contributor) commented Jan 26, 2021

I built a small app:

#include "NetMsgRegistrar.h"
#include "swss/logger.h"
#include <unistd.h>

void asyncOnLinkMsg(
        _In_ int nlmsg_type,
        _In_ struct nl_object *obj)
{
    printf("async\n");
}

int main()
{
    swss::Logger::linkToDbNative("syncd");

    uint64_t m_linkCallbackIndex = saivs::NetMsgRegistrar::getInstance().registerCallback(&asyncOnLinkMsg);

    sleep(1);

    saivs::NetMsgRegistrar::getInstance().unregisterCallback(m_linkCallbackIndex);

    return 0;
}

and it gives:

 $ ./tests
async
async
async
remove handler
remove handler
double free or corruption (out)

And surprisingly, when I comment out swss::Logger::linkToDbNative("syncd"), it works XD

kcudnik (Contributor) commented Jan 26, 2021

OK, I think I found the issue:

NetDispatcher::getinstance
NetDispatcher::ctr
NetDispatcher::getinstance
NetDispatcher::getinstance
async
NetDispatcher::getinstance
async
NetDispatcher::getinstance
async
main exit
NetDispatcher::dtr
net registrar::dtr
NetDispatcher::getinstance
remove handler
NetDispatcher::getinstance
remove handler
double free or corruption (out)

So after main exits, static function locals (singletons) are destroyed; that's why NetDispatcher::dtr is called, and at this point the memory no longer exists. But then the thread notification from the NetMsgRegistrar destructor comes in, the thread is woken up and calls NetDispatcher::unregisterMessageHandler, but the memory was already freed. But! Since getInstance is called and the previous instance was destroyed, a new singleton should be created at this point, and the exception throw "Trying to register on already registerd netlink message"; should be triggered. However, a new singleton is not created after the first one was destroyed, and the probable cause is that a static variable is only initialized once.
At this point I don't know whether this is expected behavior, or whether our singleton model should be revisited.

I will work on a very small example to demonstrate this behaviour.

kcudnik (Contributor) commented Jan 27, 2021

Reopened, since a better solution needs to be found; the issue was automatically closed when the related PR was merged.

kcudnik (Contributor) commented Jan 27, 2021

I found this interesting article: https://sourcemaking.com/design_patterns/to_kill_a_singleton which mentions the problem we are having here: one singleton depends on another, and the order of destruction is not guaranteed. They propose a solution of explicitly destroying the singleton before exiting the program.

There is also another mechanism where the order is guaranteed using atexit(): handlers run in reverse order of registration. But in this case we would need to guarantee that the constructor of NetDispatcher calls atexit() before the NetMsgRegistrar constructor does, and the atexit handler would need to handle the singleton destruction.
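
A minimal sketch of that atexit() approach, using hypothetical `Dispatcher` and `Registrar` classes (names, members, and the `unregisterHandler()` call are made up for illustration; this is not the real swss/saivs code): atexit() handlers run in reverse order of registration, so registering the Dispatcher cleanup first guarantees it runs last, after the Registrar cleanup that still needs the Dispatcher.

```cpp
#include <cstdio>
#include <cstdlib>

class Dispatcher
{
public:
    static Dispatcher& getInstance()
    {
        if (!s_instance)
        {
            s_instance = new Dispatcher();
            std::atexit(destroy);          // registered first -> runs last
        }
        return *s_instance;
    }

    void unregisterHandler() { printf("Dispatcher::unregisterHandler\n"); }

private:
    static void destroy()
    {
        printf("~Dispatcher\n");
        delete s_instance;
        s_instance = nullptr;
    }

    static Dispatcher* s_instance;
};

Dispatcher* Dispatcher::s_instance = nullptr;

class Registrar
{
public:
    static Registrar& getInstance()
    {
        if (!s_instance)
        {
            Dispatcher::getInstance();     // make sure Dispatcher's cleanup is registered first
            s_instance = new Registrar();
            std::atexit(destroy);          // registered second -> runs first
        }
        return *s_instance;
    }

private:
    static void destroy()
    {
        printf("~Registrar\n");
        Dispatcher::getInstance().unregisterHandler();   // Dispatcher is still alive here
        delete s_instance;
        s_instance = nullptr;
    }

    static Registrar* s_instance;
};

Registrar* Registrar::s_instance = nullptr;

int main()
{
    Registrar::getInstance();
    return 0;   // exit(): ~Registrar runs first, then ~Dispatcher
}
```

At exit this prints ~Registrar before ~Dispatcher, so the handler is unregistered while the dispatcher is still alive; the getInstance() ordering change discussed below relies on the same idea, just using the compiler's own exit-time destruction of function-local statics instead of explicit atexit() calls.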

kcudnik (Contributor) commented Jan 27, 2021

I created a simple program to demonstrate the cause of our issue:

#include <stdio.h>

class A
{
    public:

        static A& getInstance()
        {
            printf("A::getInstance()\n");

            static A a;

            return a;
        }

        void doSomething()
        {
            printf("A::doSomething()\n");
        }

    protected:

        ~A() { printf("A::~A\n"); }

    private:

        A() { printf("A::A\n"); }

        A(const A&) = delete;
};

class B
{
    public:

        static B& getInstance()
        {
            printf("B::getInstance()\n");

            static B a;

            return a;
        }

    protected:

        ~B()
        {
            printf("B::~B begin\n");

            printf("at this point A::~A() dtr could be already executed\n");

            A::getInstance().doSomething();

            printf("B::~B end\n");
        }

    private:

        B() { printf("B::B\n"); }

        B(const B&) = delete;
};

int main()
{
    printf("main begin\n");

    auto& b = B::getInstance();
    auto& a = A::getInstance();

    printf("main end\n");
}

and the output is like this:

 $ make && ./foo
make: Nothing to be done for 'all'.
main begin
B::getInstance()
B::B
A::getInstance()
A::A
main end
A::~A  // !!
B::~B begin
at this point A::~A() dtr could be already executed
A::getInstance()
A::doSomething()
B::~B end

In our case, A::doSomething is NetDispatcher::unregisterMessageHandler(), which tries to modify the memory of a std::map that was already destroyed, hence the corruption/double free error.

kcudnik (Contributor) commented Jan 27, 2021

Interestingly enough, if we swap the order:

auto& b = B::getInstance();
auto& a = A::getInstance();

to

auto& a = A::getInstance();
auto& b = B::getInstance();

output is like this

 $ make && ./foo
g++  -I/usr/include/libnl3  -c -o foo.o foo.cpp
g++ foo.o -o foo
main begin
A::getInstance()
A::A
B::getInstance()
B::B
main end
B::~B begin
at this point A::~A() dtr could be already executed
A::getInstance()
A::doSomething()
B::~B end
A::~A

and A::~A() is executed at the end, and no issue is present. But this destruction order could be a GCC internal implementation detail and we can't rely on that; it would need to be checked against the C++ standard rules for static storage duration, and whether GCC follows them.

vaibhavhd (Contributor, Author):

Thanks Kamil, that is a nice and simple example. Two questions:

I think NetDispatcher::dtr should not be called before NetMsgRegistrar::dtr, as the NetDispatcher still holds the static instance that NetMsgRegistrar will use in its destruction path. This is the change that we should focus on, perhaps.

in our case A::doSomething is NetDispatcher::unregisterMessageHandler() which tries to modify memory of std::map which was already destroyed, hence corruption/double free error

Based on syslog, the double-free error was not seen at this call, as the caller gets a return, prints something, and only immediately after the thread's completion is the corruption seen. Below is the relevant log line and code reference:
Jan 20 17:10:25.922898 vlab-01 NOTICE syncd#syncd: :- run: netlink msg listener ended
https://github.com/Azure/sonic-sairedis/blob/master/vslib/src/NetMsgRegistrar.cpp#L137

and A::~A() is executed at the end, and no issue is present, but this destruction order is strictly a GCC internal implementation detail and we can't rely on that

Why can we not rely on the destructor order that GCC's internal implementation follows?

kcudnik (Contributor) commented Jan 27, 2021

So the problem is that we don't have control over when NetDispatcher::dtr and NetMsgRegistrar::dtr are called, at least in the current implementation.

As for the error, it could be a double free or corruption; when exactly that corruption happens is a good question. The memory of the std::map was released and could be pointing to something else, but since erase(it) is called, it can modify some unknown memory, and this corruption could cause a crash at any time, like at function scope exit or thread exit; I'm not able to predict exactly when this will happen.

We could call NetDispatcher::getInstance() inside NetMsgRegistrar::getInstance() to make sure the NetDispatcher static variable is initialized before the NetMsgRegistrar static is initialized, but as mentioned, if they are then destroyed in reverse order, we would need to make sure the C++ standard specifies this and GCC follows it in this case.

kcudnik (Contributor) commented Jan 27, 2021

I added this:

NetMsgRegistrar& NetMsgRegistrar::getInstance()
{
    SWSS_LOG_ENTER();

    // make sure the NetDispatcher instance is created before the net registrar,
    // so it can be destroyed after the net registrar instance
    swss::NetDispatcher::getInstance();

    static NetMsgRegistrar instance;

    return instance;
}

and it solves the issue, if we assume that the creation order (and hence the reverse destruction order) is guaranteed.

kcudnik (Contributor) commented Jan 27, 2021

If we can rely on this draft: https://eel.is/c++draft/basic.start.term

If an object is initialized statically, the object is destroyed in the same order as if the object was dynamically initialized.

I'm not sure this exactly explains it, or whether it is what we are looking for to confirm the order; also, in https://eel.is/c++draft/basic.stc#static there is nothing about order, but it could be in the subparagraphs.

vaibhavhd (Contributor, Author):

I still have a problem connecting all the pieces here.

since the memory of the std::map was released and could be pointing to something else, but since erase(it) is called, it can modify some unknown memory

Before the erase(it) call, we check for nlmsg_type presence in the map m_handlers and also find it present. So I think the map's memory is not released at this point. It is also possible that the
https://github.com/Azure/sonic-swss-common/blob/master/common/netdispatcher.cpp#L30
Additionally, we are not freeing anything here; I mean the erase call removes the key from the map that it found. The value is a pointer, and nothing is done to it.

kcudnik (Contributor) commented Jan 27, 2021

Yes, it is released, as well as the mutex memory; maybe it's just not cleared, and it could be a coincidence that we don't get an exception when acquiring the mutex and when calling find() on the map even though that memory is already released. Take a look at this call sequence (I added printf logging locally in each function to see where each event happens):

NetDispatcher::getinstance
NetDispatcher::ctr
NetDispatcher::getinstance
NetDispatcher::getinstance
async
NetDispatcher::getinstance
async
NetDispatcher::getinstance
async
main exit
NetDispatcher::dtr // here !!
net registrar::dtr
NetDispatcher::getinstance
remove handler
NetDispatcher::getinstance
remove handler
double free or corruption (out)

Here the NetDispatcher dtr has already finished, so the std::map and the mutex in that class are no longer valid memory. They can still contain the previous values, but they are not valid, and when erase() comes it modifies that memory (the map's internal memory, since std::map contains internal pointers to its contents), and then the modification of released memory goes south.
