-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround issue cleaning up automounted snapshots on Linux #12670
Conversation
i have no comments on this PR other than to say your work on the project hasn't gone unnoticed and i wanted to thank you for picking it up. you seem to be having fun learning all of the ropes. |
Heh, more like playing whack-a-mole every time I run into an irritation with the FS I've used for so long rather than just shrugging, but thank you. I'd say it's all enlightened self-interest, but I don't use encryption and keep staring at bugs in that too... |
module/os/linux/zfs/zfs_ctldir.c
Outdated
ASSERT3S(se->se_taskqid, ==, TASKQID_INVALID); | ||
rw_exit(&se->se_taskqid_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just move this ASSERT
down so it's right before the call to taskq_dispatch_delay
under the new lock. I think we can live without ASSERTing this in the delay <= 0
case in the name of readability (even though it's true). Plus it's a bit silly to take the lock here when the ASSERT will get compiled out on production builds anyway.
This could then also be changed to a mutex since this seems to be the only place we take it as a reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh huh, I pushed an old version - I thought taking the lock for nothing was silly too, and #ifdef ZFS_DEBUG'd it, not sure where that change went in this push...
But sure, that can be done instead.
7eb39ed
to
3f8d02f
Compare
@rincebrain Would you mind squashing the commits? |
Squash! e: And again with a pull --rebase to grab 6d680e6... |
On Linux, sometimes, when ZFS goes to unmount an automounted snap, it fails a VERIFY check on debug builds, because taskq_cancel_id returned ENOENT after not finding the taskq it was trying to cancel. This presumably happens when it already died for some reason; in this case, we don't really mind it already being dead, since we're just going to dispatch a new task to unmount it right after. So we just ignore it if we get back ENOENT trying to cancel here, retry a couple times if we get back the only other possible condition (EBUSY), and log to dbgmsg if we got anything but ENOENT or success. (We also add some locking around taskqid, to avoid one or two cases of two instances of trying to cancel something at once.) Closes: openzfs#11632 Signed-off-by: Rich Ercolani <[email protected]>
Great. Thanks! |
On Linux, sometimes, when ZFS goes to unmount an automounted snap, it fails a VERIFY check on debug builds, because taskq_cancel_id returned ENOENT after not finding the taskq it was trying to cancel. This presumably happens when it already died for some reason; in this case, we don't really mind it already being dead, since we're just going to dispatch a new task to unmount it right after. So we just ignore it if we get back ENOENT trying to cancel here, retry a couple times if we get back the only other possible condition (EBUSY), and log to dbgmsg if we got anything but ENOENT or success. (We also add some locking around taskqid, to avoid one or two cases of two instances of trying to cancel something at once.) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#11632 Closes openzfs#12670
On Linux, sometimes, when ZFS goes to unmount an automounted snap, it fails a VERIFY check on debug builds, because taskq_cancel_id returned ENOENT after not finding the taskq it was trying to cancel. This presumably happens when it already died for some reason; in this case, we don't really mind it already being dead, since we're just going to dispatch a new task to unmount it right after. So we just ignore it if we get back ENOENT trying to cancel here, retry a couple times if we get back the only other possible condition (EBUSY), and log to dbgmsg if we got anything but ENOENT or success. (We also add some locking around taskqid, to avoid one or two cases of two instances of trying to cancel something at once.) Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #11632 Closes #12670
Motivation and Context
Getting annoyed by #11632 popping up
Description
Sometimes, for cleaning up automatic .zfs/snapshot mounts, we go to cancel taskqids which we subsequently can't find before dispatching a new unmount task. On a non-DEBUG build, it just barrels ahead and spawns a new one even if the cancel errored out and didn't set taskqid to TASKQID_INVALID. On a DEBUG build, we get a VERIFY failure instead.
Since the cancel failing seems pretty harmless to ignore here, changed the logic to just shrug and act like it succeeded if it gets back ENOENT (which is what it gets if the task can't be found), and retry a couple times before giving up like before on EBUSY (the other status you can get back from the cancel call). (dbgmsg print added when the retry logic is hit in order to make it easier to see what's going on...)
How Has This Been Tested?
Well, on my 2.0.6 system where this was happening regularly after a lot of snapshots were mounted, a previous version which printed on hitting the retry logic and didn't permit ENOENT printed at most once before succeeding on a subsequent run.
(Curiously, at least once it printed, still only once, that it was retrying on taskqid 0 (== TASKQID_INVALID), so...is it getting here before taskqid is properly set/updated somewhere sometimes, somehow, and then it gets updated by someone else in situ? Because otherwise, it should have printed 5 times...hm.)
Types of changes
Checklist:
Signed-off-by
.