BUG via zil_replay_log_record & zfs_replay_remove #527
Comments
Thanks for posting this, clearly we've got a bug in the rmdir case of zil_replay(). For now you can avoid the issue by briefly setting 'zil_replay_disable=1'. You'll discard the zil and lose the last few changes made before your pool crashed, but you should be able to mount the fs.
I'm also getting this problem. I never hit it with 3.1.6 with behlendorf@a576bc0. Running a debug version hits 2 assertions. I have the /tmp/spl-log* files, if needed.
I also found a case of the transaction error in my log: SPLError: 8221:0:(spl-err.c:67:vcmn_err()) WARNING: ZFS replay transaction error 5, dataset data/misc, seq 0x1c, txtype 9. There was no oops, and the system had been idle before a clean reboot. I set the pool to readonly and rebooted with 0.6.0.50 to make a backup of the pool, which worked fine (no errors after the reboot). After that I set the pool back to read/write and booted back into 0.6.0.54, but the error didn't appear again. I'm using the following custom setting: 'options zfs zfs_arc_max=0x200000000'. I'm not sure whether my issue is the same as this one: the transaction error message is very similar, but the rest of the symptoms are not. A scrub is running now, with no errors so far. The version running before the reboot was 0.6.0.51, which was likely the version that caused the corrupted zil entry.
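For anyone comparing their own log against the one above, here is a minimal Python sketch that decodes the fields of a "ZFS replay transaction error" line. The `TX_TYPES` table is an assumption based on the `TX_*` constants in `zil.h` and should be checked against your source tree. The helper name `decode_replay_error` is hypothetical. The sketch also assumes the error number is a plain positive errno value.

```python
import errno
import re

# Assumed mapping of ZIL transaction type codes (TX_* in zil.h).
# Verify against the zil.h of the version you are running.
TX_TYPES = {
    1: "TX_CREATE", 2: "TX_MKDIR", 3: "TX_MKXATTR", 4: "TX_SYMLINK",
    5: "TX_REMOVE", 6: "TX_RMDIR", 7: "TX_LINK", 8: "TX_RENAME",
    9: "TX_WRITE", 10: "TX_TRUNCATE", 11: "TX_SETATTR",
    12: "TX_ACL_V0", 13: "TX_ACL",
}

REPLAY_RE = re.compile(
    r"ZFS replay transaction error (?P<error>\d+), "
    r"dataset (?P<dataset>[^,\s]+), "
    r"seq (?P<seq>0x[0-9a-fA-F]+), "
    r"txtype (?P<txtype>\d+)"
)


def decode_replay_error(line):
    """Return the decoded fields of a replay error line, or None if absent."""
    m = REPLAY_RE.search(line)
    if m is None:
        return None
    err = int(m.group("error"))
    txtype = int(m.group("txtype"))
    return {
        "dataset": m.group("dataset"),
        "seq": int(m.group("seq"), 16),
        # errno.errorcode maps numeric errno values to their symbolic names.
        "errno": errno.errorcode.get(err, "unknown"),
        "txtype": TX_TYPES.get(txtype, "unknown"),
    }


line = ("WARNING: ZFS replay transaction error 5, "
        "dataset data/misc, seq 0x1c, txtype 9")
print(decode_replay_error(line))
```

Under the assumed table, error 5 decodes to EIO and txtype 9 to TX_WRITE, whereas the oops in the original report goes through zfs_replay_remove and zfs_rmdir. That would fit the suspicion that the two reports do not share all symptoms, but it is only a hint, not a diagnosis.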
While I'm not 100% sure of the root cause of this issue, my best guess is that it was caused by a small regression accidentally introduced by commit 2c6d0b1. The patch was designed to minimize the odds of hitting a particular deadlock, but it introduced a case where a zio might not be properly reinitialized, which could lead to a failure like this one. The flaw exists in the Ubuntu 0.6.0.53 - 0.6.0.54 releases and was fixed in 0.6.0.55 and the stable PPAs. It has also been fixed in the master branch but was present in the -rc7 release tag, so expect an -rc8 release fairly quickly. If you've seen this issue, I'd suggest updating to one of these fixed versions. If you encounter it, you can most likely resolve it by disabling zil replay.
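The "briefly setting zil_replay_disable=1" workaround mentioned in this thread could be scripted roughly as follows. This is a sketch, not a tested recovery procedure. It assumes the parameter is exposed as a writable file under /sys/module/zfs/parameters (the usual sysfs location for module parameters) and that the script runs as root. The helper name `zil_replay_disabled` is hypothetical. As noted above, mounting with replay disabled discards the zil, so the last few changes before the crash are lost.

```python
from contextlib import contextmanager
from pathlib import Path

# Usual sysfs directory for zfs module parameters (assumed to exist
# and to be writable by root on the affected system).
ZFS_PARAMS = Path("/sys/module/zfs/parameters")


@contextmanager
def zil_replay_disabled(params_dir=ZFS_PARAMS):
    """Set zil_replay_disable=1 for the duration of the with-block.

    The previous value is restored on exit, even if the block raises.
    """
    param = Path(params_dir) / "zil_replay_disable"
    previous = param.read_text().strip()
    param.write_text("1\n")
    try:
        yield previous
    finally:
        param.write_text(previous + "\n")


# Example use (not executed here; the dataset name is taken from the
# log line quoted in this thread):
#
#     import subprocess
#     with zil_replay_disabled():
#         subprocess.run(["zfs", "mount", "data/misc"], check=True)
```

Restoring the old value afterwards keeps replay enabled for later mounts, which matches the "briefly" in the suggestion. Updating to a fixed version remains the actual fix.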
This is with 0.6.0.34-0ubuntu1~oneiric1 from the PPA:
[ 152.451302] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 152.451308] IP: [] mutex_lock+0x20/0x50
[ 152.451344] PGD 3d427067 PUD 3d426067 PMD 0
[ 152.451348] Oops: 0002 [#1] SMP
[ 152.451354] CPU 0
[ 152.451355] Modules linked in: parport_pc ppdev dm_crypt lp parport rfcomm bnep bluetooth zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq psmouse virtio_balloon serio_raw snd_timer snd_seq_device snd soundcore snd_page_alloc i2c_piix4 binfmt_misc squashfs overlayfs nls_utf8 isofs dm_raid45 xor dm_mirror dm_region_hash dm_log btrfs zlib_deflate libcrc32c virtio_blk 8139too sym53c8xx 8139cp scsi_transport_spi virtio_pci virtio_ring virtio floppy
[ 152.451382]
[ 152.451388] Pid: 6971, comm: mount.zfs Tainted: P 3.0.0-12-generic #20-Ubuntu Bochs Bochs
[ 152.451391] RIP: 0010:[] [] mutex_lock+0x20/0x50
[ 152.451395] RSP: 0018:ffff88003f709838 EFLAGS: 00010246
[ 152.451397] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000010
[ 152.451398] RDX: ffff88003f709fd8 RSI: 0000000000000000 RDI: 0000000000000000
[ 152.451400] RBP: ffff88003f709848 R08: 0000000000000000 R09: e018000000000000
[ 152.451402] R10: ffff88003f709860 R11: 000000000000000c R12: ffff88003d803b50
[ 152.451403] R13: ffff88003d04e000 R14: ffff88003d803b50 R15: ffff88003d803760
[ 152.451407] FS: 00007fe7e1676b80(0000) GS:ffff88007f000000(0000) knlGS:0000000000000000
[ 152.451409] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 152.451410] CR2: 0000000000000000 CR3: 000000003d417000 CR4: 00000000000006f0
[ 152.451416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 152.451421] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 152.451423] Process mount.zfs (pid: 6971, threadinfo ffff88003f708000, task ffff88003d87c560)
[ 152.451425] Stack:
[ 152.451426] 0000000000000000 ffff88003d803b50 ffff88003f709898 ffffffffa0323d81
[ 152.451429] ffff88003fab4190 ffff88003f7098b0 0000000000000000 ffffffff00000010
[ 152.451432] ffffc900020a9a60 ffff88003d8038f8 ffff88003d04e000 ffff88003d803b50
[ 152.451434] Call Trace:
[ 152.451569] [] sa_lookup+0x31/0x60 [zfs]
[ 152.451604] [] zfs_inode_update+0x40/0x180 [zfs]
[ 152.451625] [] zfs_rmdir+0x51e/0x760 [zfs]
[ 152.451646] [] zfs_replay_remove+0xc9/0xd0 [zfs]
[ 152.451665] [] zil_replay_log_record+0xf3/0x1f0 [zfs]
[ 152.451685] [] zil_parse+0x416/0x850 [zfs]
[ 152.451706] [] ? perf_event_task_sched_out+0x2e/0xa0
[ 152.451726] [] ? zil_aitx_compare+0x20/0x20 [zfs]
[ 152.451746] [] ? zil_replay_error.isra.5+0xc0/0xc0 [zfs]
[ 152.451766] [] zil_replay+0xb1/0x100 [zfs]
[ 152.451787] [] zfs_sb_setup+0x130/0x140 [zfs]
[ 152.451806] [] zfs_domount+0x233/0x280 [zfs]
[ 152.451821] [] ? sget+0x1b4/0x230
[ 152.451823] [] ? unlock_super+0x30/0x30
[ 152.451936] [] ? zpl_mount+0x30/0x30 [zfs]
[ 152.451956] [] zpl_fill_super+0xe/0x20 [zfs]
[ 152.451958] [] mount_nodev+0x5d/0xc0
[ 152.451977] [] zpl_mount+0x25/0x30 [zfs]
[ 152.451979] [] mount_fs+0x43/0x1b0
[ 152.451993] [] vfs_kern_mount+0x6a/0xc0
[ 152.451996] [] do_kern_mount+0x54/0x110
[ 152.451999] [] do_mount+0x1a4/0x260
[ 152.452001] [] sys_mount+0x90/0xe0
[ 152.452016] [] system_call_fastpath+0x16/0x1b
[ 152.452017] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 5d f0 4c 89 65 f8 66 66 66 66 90 48 89 fb e8 83 f5 ff ff 48 89 df <3e> ff 0f 79 05 e8 16 03 00 00 65 48 8b 04 25 80 cd 00 00 4c 8b
[ 152.452038] RIP [] mutex_lock+0x20/0x50
[ 152.452041] RSP
[ 152.452042] CR2: 0000000000000000
[ 152.452074] ---[ end trace 4f6c2b5ca5cbb66c ]---