zfs destroy (a snapshot) SOMETIMES fails silently #1007

Closed
ghost opened this issue Oct 3, 2012 · 16 comments


ghost commented Oct 3, 2012

zfs destroy fs@somesnap
returns no error, but the snapshot is still there (as zfs list can show).

This never happened right after a fresh boot, only after some time of operation.
When destroying recursively, only SOME child snapshots are excluded from destruction while the parent is destroyed (!), making zfs_autosnap (and other scripts) fail to accomplish their goal.

This might be related to an earlier problem that (prior to rc10) caused zfs destroy to fail with an error message along the lines of "Could not destroy fs@somesnap because snapshot is in use". That message is gone now, and the snapshots simply stay around, introducing a silent misbehavior.

Forum threads (likely) related to this, with additional information:
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/hf64pJT9psU/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/bE66hKauCP0/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/IvEGaK_uOes/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/5GPAxLXQTTA/discussion
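
For scripts that depend on the destroy actually happening, a defensive re-check can at least make the silent failure visible. A minimal sketch (not part of the original report; fs@somesnap is a placeholder name):

  # Verify that the snapshot is really gone after "zfs destroy",
  # since the command may return success while the snapshot stays around.
  SNAP="fs@somesnap"   # placeholder snapshot name
  zfs destroy "$SNAP"
  if zfs list -H -t snapshot -o name | grep -qx "$SNAP"; then
      echo "WARNING: $SNAP still exists after zfs destroy" >&2
  fi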


ghost commented Oct 5, 2012

Links replaced with permalinks. (sorry for that mistake)

ghost closed this as completed Oct 5, 2012
ghost reopened this Oct 5, 2012

ryao commented Oct 11, 2012

Does this still happen in latest HEAD?


ghost commented Oct 11, 2012

Richard Yao:

Does this still happen in latest HEAD?

What does that mean? I am using rc11 (stable). Do you want me to try
the daily builds instead?

UbuntuNewbie (meaning: I am no developer)


ryao commented Oct 11, 2012

Please try the daily PPA.


ghost commented Oct 12, 2012

Richard Yao:

Please try daily.

OK. At first glance it looks excellent. Apparently the memory management has
improved, as I was unable to recreate the weird condition on purpose.

But don't expect too much: with rc11 I observed one occurrence in two days.
So I'll have it running for the coming week and see...

U.N.


ghost commented Oct 12, 2012

Richard Yao:

Please try daily.

Hello. The symptoms have resurfaced.

Before rebooting I witnessed some irregular behavior: sometimes zfs
destroy would actually work, sometimes not. No error message whatsoever.
Also no error message when attempting to destroy a non-existent snapshot.
After the reboot everything is working fine again, as usual, and my cleanup
scripts restored the desired system state.
Unfortunately, I lack the know-how to investigate any further.

But system stability has improved and ZFS RAM usage has diminished,
leaving some free memory for other tasks without the system going into
"deep think", a behavior I had already gotten used to.


ghost commented Oct 18, 2012

I wrote:
"No error message whatsoever. Also no error message when attempting to destroy a non-existent snapshot."

That was a mistake. There are error messages complaining if an attempt is made to destroy a non-existent snapshot.

Still, the phenomenon reappears at an alarming rate (at least once every 2 days, at times repeatedly during one day).

This happens so frequently that I rely on the following cleanup script after boot:

#!/usr/bin/awk -f
# zfs list -t snapshot -o name -r RAID |
# { awk -f } ~/scripts/snapfilter.awk -v mode=exec |
# xargs -l zfs destroy

# exec:     only print the deletions
# listall:  all output
# list:     only the surplus ones
# { none }: statistics only
# check:    only the scratch count

# How do the braces pair up?

# ind []   holds the names of the snapshot types
# stat []  holds the number of occurrences of each type
# max []   holds the currently configured upper limits (NOT ACCURATE)
# count [] counts the corresponding occurrences WITHIN one dataset

BEGIN { FS="@zfs-auto-snap_"
    a [i++] = "DUMMY"
    j++ # initialize the (8) known snapshot types:
    ind [j] = "frequent"
    max [ind [j]] = 4
    j++
    ind [j] = "hourly"
    max [ind [j]] = 24
    j++
    ind [j] = "daily"
    max [ind [j]] = 8
    j++
    ind [j] = "weekly"
    max [ind [j]] = 4
    j++
    ind [j] = "monthly"
    max [ind [j]] = 12
    j++
    ind [j] = "reboot"
    max [ind [j]] = 8
    j++
    ind [j] = "vmstart"
    max [ind [j]] = 10
    j++
    ind [j] = "vmexit"
    max [ind [j]] = 10
    j++
    for (;j > 0;j--) stat [ind [j]] = 0
}

NF == 2 { 
    a [i++] = $0 # store the recognized autosnaps
    cat = substr ($2, 1, index ($2, "-") - 1) # determine the snapshot subtype
    stat [cat] ++ # count the category
    autos ++
#   res = res substr (cat, 1, 1)
}

NF == 1 { if ($0 != "NAME") reste++ } # only count the others (manual snapshots)


END { 
# reset the variables:
scratch = 0

# The actual analysis starts here; the snaps are processed in REVERSE order:
for (j = i - 1; j >= 0; j--) {
    split (a [j], parts)
    fs = parts [1]

    i = index (parts [2], "-")
    cat = substr (parts [2], 1, i - 1) # determine the snapshot subtype
    rest = substr (parts [2], i + 1)

#   print j, a [j], fs, cat, rest # Debugging

    if (fs == lastfs) {
        count [cat] ++
        if (count [cat] > max [cat]) {
            # overflow: deletion starts here
            scratch ++
            if ( mode == "exec" ) {
                print a [j] # equivalent to --> fs FS cat "-" rest
            }
        }
    }
    else {
        if ( mode == "list" ) {
            for (i = 1; i < 9; i++) {
                if (count [ind [i]] > max [ind [i]]) {
                    printf "%6s %s (%s)\n", count [ind [i]] "/" max [ind [i]], lastfs, ind [i]
                }
            }
        }
        if ( mode == "listall" ) {
            for (i = 1; i < 9; i++) {
                if (count [ind [i]] > 0) {
                    printf "%6s %s (%s)\n", count [ind [i]] "/" max [ind [i]], lastfs, ind [i]
                }
            }
        }
        for (i = 1; i < 9; i++) {
            count [ind [i]] = 0
        }
        count [cat] ++
        lastfs = fs
        lastcat = cat
    }
}

    # statistics output:
    if ( mode == "check" ) print scratch
    else {
        if ( mode != "exec" ) {
            print "gesamt", autos + reste, "scratch", scratch
            for (j = 1;j < 9;j++) printf "%5d %s\n", stat [ind [j]], ind [j] "-snapshots"
            printf "%5d %s\n", autos, "autosnaps insgesamt"
            printf "%5d %s\n", reste, "andere (manuelle)"
            printf "%5d %s\n", scratch, "ZU LÖSCHEN"
        }
    }
}
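
For reference, the pipeline from the header comments, spelled out (assuming the script is saved as ~/scripts/snapfilter.awk and the pool is named RAID, as above):

  zfs list -t snapshot -o name -r RAID \
    | awk -f ~/scripts/snapfilter.awk -v mode=exec \
    | xargs -l zfs destroy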


jvsalo commented Oct 21, 2012

Hi,

For me (Debian testing, rc11), the sequence below reproduces this 100% of the time:

  root@thinkpad:# zfs create -o mountpoint=/mnt rpool/testpool
  root@thinkpad:# zfs snapshot rpool/testpool@testsnap
  root@thinkpad:# cd /mnt/.zfs/snapshot/testsnap/
  root@thinkpad:/mnt/.zfs/snapshot/testsnap# cd -
  /root
  root@thinkpad:# zfs destroy -v rpool/testpool@testsnap
  will destroy rpool/testpool@testsnap
  will reclaim 0
  root@thinkpad:# zfs list -t all -r rpool/testpool
  NAME                      USED  AVAIL  REFER  MOUNTPOINT
  rpool/testpool             30K  64.4G    30K  /mnt
  rpool/testpool@testsnap      0      -    30K  -
  root@thinkpad:# zfs destroy -v rpool/testpool@testsnap
  will destroy rpool/testpool@testsnap
  will reclaim 0
  root@thinkpad:# zfs list -t all -r rpool/testpool
  NAME                      USED  AVAIL  REFER  MOUNTPOINT
  rpool/testpool             30K  64.4G    30K  /mnt
  root@thinkpad:#


rlaager commented Oct 21, 2012

As another data point, I saw this behaviour on OpenIndiana a few days ago. I don't know if this was before or after I applied updates. A reboot fixed it and I haven't reproduced it there since. I haven't tried the steps @jvsalo just posted.


jvsalo commented Oct 21, 2012

It appears lingering snapshot mounts (/proc/mounts) may be related. When a snapshot is accessed, a new mount is created:

rpool/testfs@testsnap /mnt/.zfs/snapshot/testsnap zfs ro,relatime,xattr 0 0

However, even when all file descriptors (according to lsof) have been closed, the mount might not disappear. Destroying the snapshot once merely makes the mount disappear, and a second destroy actually destroys the snapshot. After the first destroy the snapshot is still accessible and would be re-mounted if accessed again.

The above applies if I enter commands by hand (so there is delay), but if I run a script like this, things get even worse:

  #!/bin/sh
  zfs create rpool/testfs
  mount -t zfs rpool/testfs /mnt
  zfs snapshot rpool/testfs@testsnap
  cd /mnt/.zfs/snapshot/testsnap/
  cd -
  zfs destroy -v rpool/testfs@testsnap
  zfs list -t all -r rpool/testfs
  zfs destroy -v rpool/testfs@testsnap
  zfs list -t all -r rpool/testfs
  umount /mnt
  zfs destroy -v -r rpool/testfs

I would get

  root@thinkpad:/tmp# sh reprod.sh 
  /tmp
  will destroy rpool/testfs@testsnap
  will reclaim 0
  NAME                    USED  AVAIL  REFER  MOUNTPOINT
  rpool/testfs             30K  64.2G    30K  legacy
  rpool/testfs@testsnap      0      -    30K  -
  will destroy rpool/testfs@testsnap
  will reclaim 0
  NAME                    USED  AVAIL  REFER  MOUNTPOINT
  rpool/testfs             30K  64.2G    30K  legacy
  rpool/testfs@testsnap      0      -    30K  -
  umount: /mnt: device is busy.
          (In some cases useful info about processes that use
           the device is found by lsof(8) or fuser(1))
  will destroy rpool/testfs@testsnap
  cannot destroy 'rpool/testfs@testsnap': dataset is busy

... and rpool/testfs@testsnap is invulnerable to destroy unless I manually unmount /mnt/.zfs/snapshot/testsnap first. After that destroy works as expected.
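
For clarity, the manual workaround described above boils down to the following two commands (using the names from the reproducer):

  umount /mnt/.zfs/snapshot/testsnap
  zfs destroy -v rpool/testfs@testsnap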


ghost commented Oct 21, 2012

Interesting theory there. Only I cannot confirm that my case is
related to it:
The snapshots that typically fail to get destroyed DO have the
snapdir=visible attribute set, but positively have NEVER been accessed (up
to now). I just set snapdir=hidden everywhere, just for fun, to see
whether that stops the issue from showing itself, which I seriously doubt.

In any case: I have no reproducer; one time it appeared overnight on a
PC that was basically idle (as I posted in the forum). I (also: up to now)
cannot even think of a condition that increases the likelihood of the issue
reappearing (except time ;-))

So in any case, even if the issue mentioned gets resolved, it might
not resolve my frequent destroy failures. But I will start to use the -v
switch for upcoming deletes outside the zfs-auto-snapshot command/script.

On 21.10.2012 18:07, Jaakko wrote:

It appears lingering snapshot mounts (/proc/mounts) may be related.
When a snapshot is accessed, a new mount is created:

rpool/testfs@testsnap /mnt/.zfs/snapshot/testsnap zfs
ro,relatime,xattr 0 0

However, even when all file descriptors (according to lsof) have been
closed, the mount might not disappear and destroying the snapshot once
causes the mount to disappear, and a second destroy actually destroys
the snapshot. The snapshot is accessible and would be re-mounted if
doing so after the first destroy.

(...)


jvsalo commented Oct 21, 2012

However, I can only reproduce this on my 3.5.x hosts:

  • Produces on my laptop: ZFS rootfs, Debian testing, kernel 3.5.3 (x86_64), rc11 built by hand
  • Produces on server #1: ZFS for data only, Debian testing, kernel 3.5.4 (x86_64), rc11 Ubuntu PPA
  • Doesn't produce on server #2: ZFS for data only, Debian squeeze, kernel 3.2.2 (x86_64), rc11 Ubuntu PPA
  • Doesn't produce on server #3: ZFS for data only, Debian testing, kernel 3.2.0-3-amd64 (x86_64), rc11 Ubuntu PPA

On all hosts ZFS controls the disk devices directly. The laptop has an SSD, the others rotating storage.

@UbuntuNewbie: yeah, could be a different issue, I'm opening a new ticket for my reproducer.


ghost commented Nov 5, 2012

Additional information confirming the suspicion that the observed behavior is a weird side effect of another problem:

Ever since this has been going on (at least since zfs-auto-snap started creating a lot of snapshots), I have been running gnome-system-monitor along with automated scripts to catch this issue as early as possible. And today I saw something (OK, only the very outer edge of it) that I cannot explain. On an almost idle system, with only a handful of non-greedy tasks running, memory consumption all of a sudden began to rise steadily, taking up gigabytes of free memory. The moment there was none left, ALL processors became extremely busy (100%), leaving the system unresponsive for some time.

After a while the system became responsive again, which I used to shut down most of the tasks cleanly. Then I checked the status of the snaps and THERE IT WAS: some snapshots had not been destroyed and could not be destroyed by hand either, until the next reboot.

So it looks like the extreme memory pressure, and the way ZFS created and handled it, left some corruption behind that, at least up to now, has required a reboot to get resolved.

I understand this comment lacks the precision a developer might be looking for and gives only a general outline. But since I had never witnessed this building up before, I thought it might be useful to someone...


devsk commented Dec 24, 2012

The theory of the snapshot folder being mounted on access is correct. I manually went in and unmounted the snapshot-specific folder under .zfs, and I could then destroy the snapshot.

Looks like the auto-snapshot script needs to make sure that if it finds any mounts pointing into a .zfs folder in /proc/mounts, it unmounts them, and complains with lsof output if it cannot unmount them.
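
A rough sketch of what such a pre-destroy check could look like (an assumption about a possible implementation, not the actual zfs-auto-snapshot code; escaped characters in mount paths are not handled):

  # Unmount lingering .zfs/snapshot mounts before destroying snapshots;
  # if an unmount fails, show what is holding it via lsof.
  awk '$2 ~ /\/\.zfs\/snapshot\// { print $2 }' /proc/mounts |
  while read -r mnt; do
      umount "$mnt" || lsof +D "$mnt"
  done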

maxximino commented:

With 7973e46, automatic unmount seems to be working:

  sigmaii ~ # cd .zfs/snapshot/newsnapshot/
  sigmaii newsnapshot # zfs destroy mypool/myhomedir@newsnapshot
  cannot destroy snapshots in mypool/myhomedir@newsnapshot: dataset is busy
  sigmaii newsnapshot # cd ..
  sigmaii snapshot # zfs destroy mypool/myhomedir@newsnapshot
  sigmaii snapshot # cd newsnapshot
  bash: cd: newsnapshot: No such file or directory

behlendorf commented:

@maxximino Thanks for commenting in all these related issues and verifying that it's fixed. Closing issue.
