[KVM][warm reboot] syncd crashes during shutdown with double-free-corruption #6466
From the core, a not very helpful trace is generated from the 3 threads running at the time of the crash:
Could we get a Redis DB operation dump in the log, to see whether the VIDTORID map is empty or the 0x21000 switch is missing there, and whether FLUSHALL was perhaps called on Redis?
I don't have the recordings from the time when I reported the issue. But I just reproduced this and attached a new set of logs in the dump: https://github.com/Azure/sonic-buildimage/files/5844189/sonic_dump_vlab-01_20210120_172005.tar.gz
From some internet digging, this looks like pthread_cancel was invoked.
Attaching syslog from the dump:
Notice that this one is slightly different from the top one: here we have a double free or corruption, and that message is not present in the previous one.
I tried to dig into the shutdown flow and could not find an explanation for
I think that in the previous one I just missed capturing those log lines while creating the issue. The logs remain very similar in every crash.
That's interesting, since there is a difference between them. If you forgot to capture that, there may be some clue in it; from the currently attached logs it seems like it broke in 2 different places.
I don't have the previous logs. But I am saying that there was no difference between the previous and the new logs. Also, I noticed that syncd uninitialize is completed before
So I tried with removing the newly added Also, I got a new trace from core, this time utilizing
I have added a draft PR here: sonic-net/sonic-sairedis#779
Hey, I reproduced this issue locally, and it's fairly consistent (3/3). I will work on this to figure out where the issue could be.
I built a small app:
and it gives:
and surprisingly, when I comment out `swss::Logger::linkToDbNative("syncd");`, it works XD
OK, I think I found the issue:
After main exits, static function locals (singletons) are destroyed; that's why NetDispatcher's destructor is called, and at this point its memory no longer exists. But then the thread notification for the NetMsgRegistrar destructor comes in, the thread is woken up and calls NetDispatcher::unregisterMessageHandler, even though that memory was already freed. But! Since getInstance is called after the previous value was destroyed, a new singleton should be created at this point and the exception `throw "Trying to register on already registerd netlink message";` should be triggered. A new singleton is not created after the first one was destroyed, though, probably because a static variable will only be initialized once. I will work on a very small example to demonstrate this behaviour.
Reopened, since a better solution needs to be found; the issue was automatically closed when the related PR was merged.
I found this interesting article: https://sourcemaking.com/design_patterns/to_kill_a_singleton which mentions the problem we are having here: one singleton depends on another, and the order of destruction is not guaranteed. They propose explicitly destroying the singleton before exiting the program. There is also another mechanism where the order is guaranteed using atexit(), but in that case we would need to guarantee that the constructor of NetDispatcher calls atexit() before the NetMsgRegistrar constructor does, and the atexit handler would need to perform the singleton destruction.
I created a simple program to demonstrate the cause of our issue:

```cpp
#include <stdio.h>

class A
{
public:
    static A& getInstance()
    {
        printf("A::getInstance()\n");
        static A a;
        return a;
    }

    void doSomething()
    {
        printf("A::doSomething()\n");
    }

protected:
    ~A() { printf("A::~A\n"); }

private:
    A() { printf("A::A\n"); }
    A(const A&) = delete;
};

class B
{
public:
    static B& getInstance()
    {
        printf("B::getInstance()\n");
        static B a;
        return a;
    }

protected:
    ~B()
    {
        printf("B::~B begin\n");
        printf("at this point A::~A() dtr could be already executed\n");
        A::getInstance().doSomething();
        printf("B::~B end\n");
    }

private:
    B() { printf("B::B\n"); }
    B(const B&) = delete;
};

int main()
{
    printf("main begin\n");
    auto& b = B::getInstance();
    auto& a = A::getInstance();
    printf("main end\n");
}
```

and the output is like this:
In our case, A::doSomething is NetDispatcher::unregisterMessageHandler(), which tries to modify the memory of a std::map that was already destroyed, hence the corruption/double free error.
Interestingly enough, if we swap the order:
to
the output is like this:
and A::~A() is executed at the end, and no issue is present. But this destruction order could be a GCC internal implementation detail that we can't rely on; it would need to be checked against the C++ standard's rules for static storage duration, and whether GCC follows them.
Thanks Kamil, that is a nice and simple example. Two questions: I think
Based on the syslog, the double-free error was not seen at this call, as the caller gets a return, prints something, and only after the thread's completion is the corruption seen. Below is the
Why can we not rely on the destructor order that GCC's internal implementation follows?
So the problem is that we don't have control over when the NetDispatcher and NetMsgRegistrar destructors are called, at least in the current implementation. As for the error, it could be a double free or corruption; when that corruption will surface is a good question, since the memory of the std::map was released and could already be pointing to something else. But since erase(it) is called, it can modify some unknown memory, and this corruption could cause a crash at any time, e.g. at function scope exit or thread exit; I'm not able to predict exactly when it will happen. We could call NetDispatcher::getInstance() inside NetMsgRegistrar::getInstance() to make sure the NetDispatcher static variable is initialized before the NetMsgRegistrar static. But, as mentioned, for them to then be destroyed in reverse order, we would need to make sure that the C++ standard specifies this and that GCC follows it.
I added this:

```cpp
NetMsgRegistrar& NetMsgRegistrar::getInstance()
{
    SWSS_LOG_ENTER();

    // make sure the dispatcher instance is created before the net registrar,
    // so it is destroyed after the net registrar instance
    swss::NetDispatcher::getInstance();

    static NetMsgRegistrar instance;

    return instance;
}
```

and it solves the issue, if we assume that the order of creation is guaranteed.
If we can rely on this draft: https://eel.is/c++draft/basic.start.term
I'm not sure whether this exactly explains it, and whether it is what we are looking for to confirm the order. Also, in https://eel.is/c++draft/basic.stc#static there is nothing about order, but it could be in the subparagraphs.
I still have a problem connecting all the pieces here.
Before the
Yes, it is released, as well as the mutex memory; maybe it's just not cleared, and it could be coincidence that we don't get an exception when acquiring the mutex and when calling find() on the map, even though that memory is released already. Take a look at this call sequence (I added printf logging locally in each function to see where each event happens):
Here: the NetDispatcher destructor finished here, so the std::map and the mutex in that class are no longer valid memory. They can still contain the previous values, but they are not valid, and when erase() comes it modifies the internal map memory (since std::map contains an internal pointer to its contents), and then the modification of released memory goes south.
sonic_dump_vlab-01_20210120_172005.tar.gz
Description
Tested image: SONiC-OS-202012.5-264ecb18 on a KVM switch; the warm-reboot test leads to a syncd crash.
There are multiple messages about missing OIDs in the switch state map prior to the crash.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Crash free warm reboot.
Additional information you deem important (e.g. issue happens only occasionally):