zio->io_lock() spinlock damaged #2523
The same issue was observed again. The z_fr_iss_0/1 threads are stuck spinning on the mutex's internal spin lock, which protects its wait list. The damage looks like a double free, but it's unclear how that could happen. Critical details from a second instance follow:
I hesitate to comment, since it is very unlikely to be related. We had many panics in zio->io_lock which seemed connected to memory corruption, but it was related to how our SPL memory manager was designed. In the end it was caused by: .. because .. But I don't think ZoL's allocator cares as much about the size parameter as ours does, so it is likely not relevant.
As your links above pointed out, in fb8e608, the handling of
Any insights here are welcome. But as @dweeezil pointed out, in this case we're directly mapping … Your comment however reminded me that ZoL's SPL has a …
Ah I see, you directly map it to Linux-specific allocation calls which do not take a size argument, and yet you still have spa_strdup(). Anyway, sorry I couldn't help...
@behlendorf If a double free occurs on an object allocated by spl_kmem_cache_alloc(), then it can subsequently be allocated twice, which would result in something like this happening. I managed to trigger something like this when building ztest with Clang not that long ago, but I did not identify the cause of the double free.
One theory for openzfs#2523 is that somehow the spinlock at the start of the mutex is being accidentally overwritten. To test this theory a magic value is added before the mutex. By checking this value periodically against a known magic value we can detect if this is somehow happening. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#2523
There exists a plausible cache concurrency issue with zio_wait(). This might occur because the zio->io_waiter is not assigned under a lock in zio_wait(), is not checked under a lock in zio_done(), and the zio may be dispatched to another thread for handling. That said, none of the actual crash dumps I've looked at show that this has ever occurred. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#2523
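For anyone following along, here is a hedged sketch of what that ordering change amounts to. It is based on the stock zio_wait() of that era rather than the exact patch; the only point it illustrates is that io_waiter is published under io_lock so zio_done() can never observe a half-set value.

```c
/*
 * Hedged sketch, not the actual patch: zio_wait() with io_waiter
 * assigned under io_lock instead of before taking it.
 */
int
zio_wait(zio_t *zio)
{
	int error;

	mutex_enter(&zio->io_lock);
	zio->io_waiter = curthread;		/* published under the lock */
	mutex_exit(&zio->io_lock);

	__zio_execute(zio);

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL)	/* cleared by zio_done() */
		cv_wait(&zio->io_cv, &zio->io_lock);
	mutex_exit(&zio->io_lock);

	error = zio->io_error;
	zio_destroy(zio);

	return (error);
}
```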
@ryao Yes, a double free is a possible explanation for this but thus far I haven't been able to determine how that's possible. Have you been able to reproduce anything like this? It's been difficult to resolve because it occurs often enough to be a problem but I've been completely unable to reproduce it otherwise.
Is it possible to make some kind of debug wrapper to catch the 2nd free in
Oh never mind, only just now saw your patch to try to track this down :)
@dswartz It should be, but thus far I haven't been able to hit the issue with any of the debug patches in place. Generally I dislike adding this kind of debug to the main tree but we may have to just to get some additional information.
@behlendorf I suspect that I could reproduce the double free that I encountered when building with Clang, but I do not have instructions to do that on hand at the moment.
From the information gathered about issue openzfs#2523 it's clear that somehow the spin lock is being damaged. This could occur if the spin lock is reinitialized, the memory is accidentally overwritten, or the object is freed back to the cache too soon. To determine exactly what is happening this patch adds a couple of new sanity tests. * A zio->io_magic field is added before the io_lock. This field is designed to act as a red zone, allowing us to detect if the zio has been overwritten. * The zio->io_magic field is also used to detect if somehow the constructor or destructor is running multiple times for the object. This would effectively cause the spin lock to be reinitialized. * The destructor has been updated to poison the entire structure. This should cause us to quickly detect any use-after-free bugs. Once the root cause of this issue has been determined this patch should be reverted. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#2523
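A hedged sketch of the kind of checks described above. The io_magic field, the ZIO_MAGIC value, and the poison byte are illustrative assumptions; the actual debug patch may differ in detail.

```c
#define	ZIO_MAGIC	0x210ab10cULL		/* arbitrary red-zone pattern */

/* Assumes zio_t gains a uint64_t io_magic field placed just before io_lock. */

static int
zio_cons(void *buf, void *unused, int kmflag)
{
	zio_t *zio = buf;

	/* Catch the constructor running twice on the same object. */
	ASSERT3U(zio->io_magic, !=, ZIO_MAGIC);

	bzero(zio, sizeof (zio_t));
	zio->io_magic = ZIO_MAGIC;
	mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);

	return (0);
}

static void
zio_dest(void *buf, void *unused)
{
	zio_t *zio = buf;

	/* Catch a double destroy or an overwritten red zone. */
	ASSERT3U(zio->io_magic, ==, ZIO_MAGIC);

	cv_destroy(&zio->io_cv);
	mutex_destroy(&zio->io_lock);

	/* Poison the whole object so any use-after-free trips quickly. */
	memset(zio, 0xdb, sizeof (zio_t));
}
```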
I ran with the assert magic patch last night and hit an assert failure under load. Hope this helps narrow this nasty issue down!
@mgmartin Thanks, I was able to catch a failure in overnight stress testing as well. This should help us get to the root cause.
This might help too. Here are the hung task stacks after the panic:
@behlendorf I'll try it out and report back
@olw2005 Using just the patch in openzfs/spl#421 is enough.
@tuxoko awesome! I've been running pretty well with a current 3.14 kernel and using the temporary fix of spl_kmem_cache_slab_limit=0. I'll remove that module option and apply the patch across a few servers that were locking up pretty consistently.
May one ask how this can be checked? I am of course curious if it is something we have to look into over in XNU. The locks (mutex) code appears to come from Mach: http://fxr.watson.org/fxr/source//osfmk/i386/i386_lock.s?v=xnu-2050.18.24;im=excerpts
@lundman This is specific to Linux's mutex design, which was made faster at the expense of safety without anyone realizing it. Linus' explanation at the following URL explains it fairly well: http://lwn.net/Articles/575477/ You will need to analyze XNU's locks to see if the same scenario is possible with them.
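To make the failure mode concrete, here is a small userspace analogue of the pattern the article warns about. It uses pthreads and hypothetical names, not ZFS code: main() is the waiter and frees the object as soon as it sees the work is done, while the signalling thread may still be inside pthread_mutex_unlock() on the lock embedded in that same object.

```c
#include <pthread.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		done;
};

static void *signaller(void *arg)
{
	struct obj *o = arg;

	pthread_mutex_lock(&o->lock);
	o->done = 1;
	pthread_cond_signal(&o->cv);
	pthread_mutex_unlock(&o->lock);	/* may still touch o->lock here */
	return NULL;
}

int main(void)
{
	struct obj *o = calloc(1, sizeof (*o));
	pthread_t tid;

	pthread_mutex_init(&o->lock, NULL);
	pthread_cond_init(&o->cv, NULL);
	pthread_create(&tid, NULL, signaller, o);

	pthread_mutex_lock(&o->lock);
	while (!o->done)
		pthread_cond_wait(&o->cv, &o->lock);
	pthread_mutex_unlock(&o->lock);

	free(o);			/* races with the unlock above */
	pthread_join(tid, NULL);
	return 0;
}
```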
It is known that mutexes in Linux are not safe when using them to synchronize the freeing of the object in which the mutex is embedded: http://lwn.net/Articles/575477/ The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx: zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done(), and dbuf uses dbuf->db_mtx to protect reference counting. Also, it's been reported multiple times that zio->io_lock and dbuf->db_mtx suffer from corruption, which is exactly the symptom of the race. openzfs/zfs#2523 openzfs/zfs#2897 This patch fixes this kind of race by forcing serialization on mutex_exit() with a spinlock, making the mutex safer by sacrificing a bit of performance and memory overhead. Signed-off-by: Chunwei Chen <[email protected]>
It is known that mutexes in Linux are not safe when using them to synchronize the freeing of the object in which the mutex is embedded: http://lwn.net/Articles/575477/ The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx. * zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done(). * dbuf uses dbuf->db_mtx to protect reference counting. This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead. This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances. This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues: openzfs/zfs#401 openzfs/zfs#2523 openzfs/zfs#2679 openzfs/zfs#2684 openzfs/zfs#2704 openzfs/zfs#2708 openzfs/zfs#2517 openzfs/zfs#2827 openzfs/zfs#2850 openzfs/zfs#2891 openzfs/zfs#2897 openzfs/zfs#2247 openzfs/zfs#2939 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Closes #421
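For readers who want the shape of the fix without digging into the pull request, here is a conceptual sketch of the serialized mutex_exit() approach described above. It is not the actual SPL patch; the kmutex_t, m_mutex, m_lock, kmutex_enter() and kmutex_exit() names are illustrative.

```c
#include <linux/mutex.h>
#include <linux/spinlock.h>

typedef struct kmutex {
	struct mutex	m_mutex;
	spinlock_t	m_lock;		/* serializes kmutex_exit() */
} kmutex_t;

static inline void
kmutex_init(kmutex_t *mp)
{
	mutex_init(&mp->m_mutex);
	spin_lock_init(&mp->m_lock);
}

static inline void
kmutex_enter(kmutex_t *mp)
{
	mutex_lock(&mp->m_mutex);
}

static inline void
kmutex_exit(kmutex_t *mp)
{
	/*
	 * The spinlock makes the unlock appear atomic: a later
	 * kmutex_exit() blocks here until this one has completely
	 * finished touching m_mutex.
	 */
	spin_lock(&mp->m_lock);
	mutex_unlock(&mp->m_mutex);
	spin_unlock(&mp->m_lock);
}
```

Because the thread that frees the structure (for example the waiter in zio_wait()) performs its own kmutex_exit() first, that exit spins on m_lock until every earlier exit has finished touching the embedded mutex, so the free can no longer race with an in-flight unlock.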
@tuxoko's fix for this issue has been merged to the spl master branch. Because of how well the root cause explains all the symptoms we've seen I've closed out all the duplicates of this issue I'm aware of and linked them to the fix. And while I'm highly confident that this issue is now fixed, I'd like to leave this issue open a while longer to get absolute confirmation of that. Please let us know if you observe any similar symptoms with the fix applied. If things look good for the next few weeks we can close out this bug too.
Commit: openzfs/zfs@a3c1eb7 From: Chunwei Chen <[email protected]> Date: Fri, 19 Dec 2014 11:31:59 +0800 Subject: mutex: force serialization on mutex_exit() to fix races It is known that mutexes in Linux are not safe when using them to synchronize the freeing of the object in which the mutex is embedded: http://lwn.net/Articles/575477/ The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx. * zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done(). * dbuf uses dbuf->db_mtx to protect reference counting. This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead. This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances. This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues: openzfs/zfs#401 openzfs/zfs#2523 openzfs/zfs#2679 openzfs/zfs#2684 openzfs/zfs#2704 openzfs/zfs#2708 openzfs/zfs#2517 openzfs/zfs#2827 openzfs/zfs#2850 openzfs/zfs#2891 openzfs/zfs#2897 openzfs/zfs#2247 openzfs/zfs#2939 Signed-off-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Richard Yao <[email protected]> Backported-by: Darik Horn <[email protected]> Closes #421 Conflicts: include/sys/mutex.h
Fix confirmation: I've been running with this patch in place on a few servers for over a week now and have had zero lockups. Prior to the patch, I seemed to hit the issue within a day or two. I ran for a few days on a 3.14 kernel, then moved everything back to a 3.18 kernel. I used the default spl module options with spl_kmem_cache_kmem_limit=1024. Looks good to me. Thanks everyone for all the work in tracking this down and finding the issue!
Is there a hot-fix patch for this? We are running the stable release (0.6.3) from July. How would we apply this patch cleanly without pulling in other dependencies?
@lidongyang This patch was pulled into the stable ZFS packages for Fedora, CentOS, Ubuntu, Debian, and Gentoo. If you're using any of these repositories simply update your packages and reboot to apply the fix. If you're building from source, use the 0.6.3-1.2 tag.
Closing this issue. After over two weeks of run time on numerous systems we've confirmed that the proposed fix has resolved the issue as expected. Thanks everybody.
@behlendorf Great information. Thank you. I did not know there was a 0.6.3-1.2 tag.
@behlendorf I'm not sure where to put this, but it'd be really great to start tracking some of these differences between Linux and illumos w.r.t. this project. Subtle incompatibilities like this will make it extremely difficult to ever move to a single OpenZFS repository unless we start to collect and document the known differences and kernel API requirements. @tuxoko great job finding the root cause!
The zio_cons() constructor and zio_dest() destructor don't exist in the upstream Illumos code. They were introduced as a workaround to avoid issue openzfs#2523. Since this issue has now been resolved this code is being reverted to bring ZoL back in sync with Illumos. Signed-off-by: Brian Behlendorf <[email protected]>
The zio_cons() constructor and zio_dest() destructor don't exist in the upstream Illumos code. They were introduced as a workaround to avoid issue openzfs#2523. Since this issue has now been resolved this code is being reverted to bring ZoL back in sync with Illumos. Signed-off-by: Brian Behlendorf <[email protected]> Signed-off-by: Ned Bass <[email protected]> Issue openzfs#3063
The following hang was observed running zfs-0.6.3 under RHEL 6.5. Very similar hangs in zio_wait() have been reported rarely for prior releases. However, this time a crash dump was available for detailed inspection which helps shed some light on the issue.
The hang in zio_wait() shown below is the result of the txg_sync thread getting blocked in __mutex_unlock_slowpath on the associated spin lock. This is somewhat surprising because the spin lock is managed entirely by the Linux kernel mutex_lock and mutex_unlock functions. Because we have access to the crash dump we're able to inspect the exact state of the spin lock.

Under x86_64 on Linux, ticket spin locks are used. For those not familiar with ticket spin locks there's a good LWN article explaining them: http://lwn.net/Articles/267968/. Very briefly, each lock is split into two parts, a next value and an owner value. When the next value equals the owner value the lock is not held and can be acquired.
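For reference, here is a minimal userspace model of a ticket lock, just to make the next/owner description concrete. This is an illustration of the scheme, not the kernel's implementation.

```c
#include <stdatomic.h>

struct ticket_lock {
	atomic_ushort next;	/* incremented by each lock() caller   */
	atomic_ushort owner;	/* incremented by each unlock() caller */
};

static void ticket_lock(struct ticket_lock *l)
{
	/* Take a ticket, then spin until that ticket is being served. */
	unsigned short ticket = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != ticket)
		;	/* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
	/* Serve the next ticket in line. */
	atomic_fetch_add(&l->owner, 1);
}
```

An extra unlock, or a reinitialization while the lock is held, skews the two counters so that a waiter's ticket is never served, which matches the behavior described below.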
Here's where it gets interesting. In the case shown below the spin lock has a value of 0x00020002, which means that since the lock was initialized it has been unlocked twice but locked only once. In this state any thread trying to take it will effectively block forever and never acquire the lock.
There are only two plausible ways this can happen.
I've inspected the code and thus far haven't found a way that a multiple unlock could occur. This also seems particularly unlikely because, as I mentioned above, the management of this lock is handled entirely by the Linux mutex primitive, and it seems very unlikely that it suffers from any kind of bug like this.
That leaves the spin lock being reinitialized while it is locked. I like this theory much better because it cleanly explains how the lock can end up in this state and also why this problem is so rare. The spin lock would need to be damaged at exactly the right instant. The problem is that the surrounding zio structure doesn't appear to be damaged as one might expect. I also haven't found any code which might cause this.
So unfortunately, while we now have solid evidence for what happened, it's still unclear to me how it can happen.