arc_adapt : Soft Lockups on ZFS when exporting zvols over iSCSI #2517
Comments
This happens to me on NFS, so it isn't specific to zvols.
Thanks for this. Do you have a full soft-lockup message from when it happens to you? What kernel are you using? Cheers
I'm using the following kernel: 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64. I do not have the logs, but I remember the "blocked for more than 120 ...
We had a very similar lockup on our main storage system last night, which forced a hard reboot. System / software configuration is similar to the above:
zpool config:
zfs modules options:
ZPool / zvols have default values other than lz4 compression=on and 128kB blocksize; dedup / checksum / atime / xattr are all off.
Uptime was about 90 days, so memory usage / fragmentation was a little on the high side, but it had been stable for weeks with at least 5-10 GB free RAM and the system was operating normally.
The lockup occurred at 18:28 EST. Prior to that, the system had spent the bulk of the afternoon under normal (light) load and (this part is probably key) freeing (via SCSI unmap commands from VMware ESXi 5.1) approximately 600 GB of data from a zvol with 23T logical referenced data. The space freeing (at least from VMware's perspective; not sure if this is a sync or async operation in ZFS?) had completed about half an hour before (~ 1800 hrs).
Not long afterward the system stopped responding to I/O requests from VMware, and we got this in /var/log/messages for an hour, until I noticed the problem and hard reset the system (after it failed to shut down cleanly):
Oct 1 18:27:11 dtc-san kernel: BUG: soft lockup - CPU#4 stuck for 67s! [zvol/30:2080]
Could there be a common thread in these lockups? Most of the stack trace business is Greek to me, but this line from the original poster's trace looks related to a SCSI unmap command (which my system was also doing a lot of shortly before going down in flames):
Jun 28 05:39:52 nodeB kernel: [] ? __vunmap+0x2e/0x120
Could this be another instance of issue #2523?
I think there are a couple of issues getting conflated in this bug. The backtraces from the original post look like memory contention in
It is known that mutexes in Linux are not safe when using them to synchronize the freeing of an object in which the mutex is embedded: http://lwn.net/Articles/575477/
The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx.
* zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.
This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead.
This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances.
This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues: openzfs/zfs#401 openzfs/zfs#2523 openzfs/zfs#2679 openzfs/zfs#2684 openzfs/zfs#2704 openzfs/zfs#2708 openzfs/zfs#2517 openzfs/zfs#2827 openzfs/zfs#2850 openzfs/zfs#2891 openzfs/zfs#2897 openzfs/zfs#2247 openzfs/zfs#2939
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #421
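To make the idea in the commit message above concrete, here is a minimal userspace sketch of "forcing serialization on mutex_exit() with a spin lock". It is not the SPL patch itself: it uses POSIX threads as a stand-in for the kernel primitives, and every name in it (sketch_mutex_t, sketch_mutex_enter(), sketch_mutex_exit(), obj_t, drop_ref(), run_demo()) is hypothetical and chosen only for illustration. The object embeds its own lock and is freed by whichever thread drops the last reference, which is the pattern the commit message describes for zio and dbuf.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a mutex with a serialized exit path. */
typedef struct sketch_mutex {
    pthread_mutex_t    m_mutex; /* the sleeping lock that callers use */
    pthread_spinlock_t m_lock;  /* serializes the unlock path only */
} sketch_mutex_t;

static void sketch_mutex_init(sketch_mutex_t *mp) {
    pthread_mutex_init(&mp->m_mutex, NULL);
    pthread_spin_init(&mp->m_lock, PTHREAD_PROCESS_PRIVATE);
}

static void sketch_mutex_destroy(sketch_mutex_t *mp) {
    pthread_spin_destroy(&mp->m_lock);
    pthread_mutex_destroy(&mp->m_mutex);
}

static void sketch_mutex_enter(sketch_mutex_t *mp) {
    pthread_mutex_lock(&mp->m_mutex);
}

/*
 * The unlock is wrapped in a spin lock, so a later exiter cannot return
 * from this function while an earlier exiter is still inside the unlock.
 */
static void sketch_mutex_exit(sketch_mutex_t *mp) {
    pthread_spin_lock(&mp->m_lock);
    pthread_mutex_unlock(&mp->m_mutex);
    pthread_spin_unlock(&mp->m_lock);
}

/* An object that embeds the lock protecting its own reference count. */
typedef struct obj {
    sketch_mutex_t lock;
    int refs;
} obj_t;

static int frees; /* demo bookkeeping: objects freed so far */

/* Drop one reference; whoever drops the last one frees the object. */
static void *drop_ref(void *arg) {
    obj_t *o = arg;
    sketch_mutex_enter(&o->lock);
    int last = (--o->refs == 0);
    sketch_mutex_exit(&o->lock);
    if (last) {
        /* The other holder has fully left sketch_mutex_exit() by now. */
        sketch_mutex_destroy(&o->lock);
        free(o);
        frees++;
    }
    return NULL;
}

/* Run `rounds` objects through a two-holder lifecycle; returns the free count. */
int run_demo(int rounds) {
    frees = 0;
    for (int i = 0; i < rounds; i++) {
        obj_t *o = malloc(sizeof(*o));
        if (o == NULL)
            return -1;
        sketch_mutex_init(&o->lock);
        o->refs = 2; /* one reference for the worker, one for this thread */
        pthread_t worker;
        if (pthread_create(&worker, NULL, drop_ref, o) != 0) {
            sketch_mutex_destroy(&o->lock);
            free(o);
            return -1;
        }
        drop_ref(o);
        pthread_join(worker, NULL);
    }
    return frees;
}
```

Calling run_demo(n) from a main() and checking that it returns n confirms that each object is freed exactly once. The cost is the one the commit message names: an extra lock word per mutex and an extra lock/unlock pair on every exit.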
This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article. openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <[email protected]>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races
It is known that mutexes in Linux are not safe when using them to synchronize the freeing of an object in which the mutex is embedded: http://lwn.net/Articles/575477/
The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx.
* zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.
This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead.
This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances.
This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues: openzfs/zfs#401 openzfs/zfs#2523 openzfs/zfs#2679 openzfs/zfs#2684 openzfs/zfs#2704 openzfs/zfs#2708 openzfs/zfs#2517 openzfs/zfs#2827 openzfs/zfs#2850 openzfs/zfs#2891 openzfs/zfs#2897 openzfs/zfs#2247 openzfs/zfs#2939
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Backported-by: Darik Horn <[email protected]>
Closes #421
Conflicts: include/sys/mutex.h
@behlendorf This is not a duplicate of #2523. Instead, it is a duplicate of #3091.
Hi All,
I'm experiencing soft lockups on ZFS during normal iSCSI I/O on the clients. The lockups only happen if I am using zvols exported over iSCSI, and they happen more often if the zvols are doing lots of I/O. These lockups are normally followed by the server entering an unresponsive state and having to be rebooted. zpool and zfs commands seem to hang.
Environment:
Examples of the lockups:
Examples of the lockups in detail: