zfs destroy (a snapshot) SOMETIMES fails silently #1007

Closed
ghost opened this issue Oct 3, 2012 · 16 comments


ghost commented Oct 3, 2012

zfs destroy fs@somesnap
returns no error, but the snapshot is still there (as zfs list can show).

This never happened right after a fresh boot, only after some time of operation.
When destroying recursively, only SOME child snapshots are excluded from destruction while the parent is destroyed (!), making zfs_autosnap (and other scripts) fail to accomplish their goal.

This might be related to an earlier problem that (prior to rc10) caused zfs destroy to fail with an error message along the lines of "Could not destroy fs@somesnap because snapshot is in use". That message is gone now, and the snapshots simply stay around, introducing a silent misbehavior.

Forum threads (likely) related to this, with additional information:
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/hf64pJT9psU/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/bE66hKauCP0/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/IvEGaK_uOes/discussion
https://groups.google.com/a/zfsonlinux.org/d/topic/zfs-discuss/5GPAxLXQTTA/discussion
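
For scripts that depend on the destroy actually happening, a defensive re-check can at least make the silent failure visible. A minimal sketch (not part of the original report; fs@somesnap is a placeholder name):

  # Verify that the snapshot is really gone after "zfs destroy",
  # since the command may return success while the snapshot stays around.
  SNAP="fs@somesnap"   # placeholder snapshot name
  zfs destroy "$SNAP"
  if zfs list -H -t snapshot -o name | grep -qx "$SNAP"; then
      echo "WARNING: $SNAP still exists after zfs destroy" >&2
  fi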


ghost commented Oct 5, 2012

Links replaced with permalinks. (sorry for that mistake)

ghost closed this as completed Oct 5, 2012
ghost reopened this Oct 5, 2012

ryao commented Oct 11, 2012

Does this still happen in latest HEAD?


ghost commented Oct 11, 2012

Richard Yao:

Does this still happen in latest HEAD?

What does that mean? I am using rc11 (stable). Do you want me to try
the daily builds instead?

UbuntuNewbie (meaning: I am no developer)


ryao commented Oct 11, 2012

Please try the daily PPA.


ghost commented Oct 12, 2012

Richard Yao:

Please try daily.

OK. At first glance it looks excellent. Apparently the memory management has
improved, as I was unable to recreate the weird condition on purpose.

But don't expect too much: with rc11 I observed one occurrence in two days.
So I'll have it running for the coming week and see...

U.N.


ghost commented Oct 12, 2012

Richard Yao:

Please try daily.

Hello. The symptoms have resurfaced.

Before rebooting I witnessed some irregular behavior: sometimes zfs
destroy would actually work, sometimes not. No error message whatsoever.
Also no error message when attempting to destroy a non-existent snapshot.
After the reboot everything is working fine again, as usual, and my cleanup
scripts restored the desired system state.
Unfortunately, I lack the know-how to investigate any further.

But system stability has improved and ZFS RAM usage has diminished,
leaving some free memory for other tasks without the system going into
"deep think", a behavior I had already gotten used to.


ghost commented Oct 18, 2012

I wrote:
"No error message whatsoever. Also no error message when attempting to destroy a non-existent snapshot."

That was a mistake. There are error messages complaining if an attempt is made to destroy a non-existent snapshot.

Still, the phenomenon reappears at an alarming rate (at least once every 2 days, at times repeatedly during one day).

This happens so frequently that I rely on the following cleanup script after boot:

#!/usr/bin/awk -f
# zfs list -t snapshot -o name -r RAID |
# { awk -f } ~/scripts/snapfilter.awk -v mode=exec |
# xargs -l zfs destroy

# exec:     only print the deletions
# listall:  all output
# list:     only the surplus ones
# { none }: statistics only
# check:    only the scratch count

# How do the braces pair up?

# ind []   holds the names of the snapshot types
# stat []  holds the number of occurrences of each type
# max []   holds the currently configured upper limits (NOT ACCURATE)
# count [] counts the corresponding occurrences WITHIN one dataset

BEGIN { FS="@zfs-auto-snap_"
    a [i++] = "DUMMY"
    j++ # initialize the (8) known snapshot types:
    ind [j] = "frequent"
    max [ind [j]] = 4
    j++
    ind [j] = "hourly"
    max [ind [j]] = 24
    j++
    ind [j] = "daily"
    max [ind [j]] = 8
    j++
    ind [j] = "weekly"
    max [ind [j]] = 4
    j++
    ind [j] = "monthly"
    max [ind [j]] = 12
    j++
    ind [j] = "reboot"
    max [ind [j]] = 8
    j++
    ind [j] = "vmstart"
    max [ind [j]] = 10
    j++
    ind [j] = "vmexit"
    max [ind [j]] = 10
    j++
    for (;j > 0;j--) stat [ind [j]] = 0
}

NF == 2 { 
    a [i++] = $0 # store the recognized autosnaps
    cat = substr ($2, 1, index ($2, "-") - 1) # determine the snapshot subtype
    stat [cat] ++ # count the category
    autos ++
#   res = res substr (cat, 1, 1)
}

NF == 1 { if ($0 != "NAME") reste++ } # only count the others (manual snapshots)


END { 
# reset the variables:
scratch = 0

# The actual analysis starts here; the snaps are processed in REVERSE order:
for (j = i - 1; j >= 0; j--) {
    split (a [j], parts)
    fs = parts [1]

    i = index (parts [2], "-")
    cat = substr (parts [2], 1, i - 1) # determine the snapshot subtype
    rest = substr (parts [2], i + 1)

#   print j, a [j], fs, cat, rest # Debugging

    if (fs == lastfs) {
        count [cat] ++
        if (count [cat] > max [cat]) {
            # overflow: deletion starts here
            scratch ++
            if ( mode == "exec" ) {
                print a [j] # equivalent to --> fs FS cat "-" rest
            }
        }
    }
    else {
        if ( mode == "list" ) {
            for (i = 1; i < 9; i++) {
                if (count [ind [i]] > max [ind [i]]) {
                    printf "%6s %s (%s)\n", count [ind [i]] "/" max [ind [i]], lastfs, ind [i]
                }
            }
        }
        if ( mode == "listall" ) {
            for (i = 1; i < 9; i++) {
                if (count [ind [i]] > 0) {
                    printf "%6s %s (%s)\n", count [ind [i]] "/" max [ind [i]], lastfs, ind [i]
                }
            }
        }
        for (i = 1; i < 9; i++) {
            count [ind [i]] = 0
        }
        count [cat] ++
        lastfs = fs
        lastcat = cat
    }
}

    # statistics output:
    if ( mode == "check" ) print scratch
    else {
        if ( mode != "exec" ) {
            print "gesamt", autos + reste, "scratch", scratch
            for (j = 1;j < 9;j++) printf "%5d %s\n", stat [ind [j]], ind [j] "-snapshots"
            printf "%5d %s\n", autos, "autosnaps insgesamt"
            printf "%5d %s\n", reste, "andere (manuelle)"
            printf "%5d %s\n", scratch, "ZU LÖSCHEN"
        }
    }
}
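
For reference, the pipeline from the header comments, spelled out (assuming the script is saved as ~/scripts/snapfilter.awk and the pool is named RAID, as above):

  zfs list -t snapshot -o name -r RAID \
    | awk -f ~/scripts/snapfilter.awk -v mode=exec \
    | xargs -l zfs destroy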


jvsalo commented Oct 21, 2012

Hi,

For me (Debian testing, rc11), the sequence below reproduces this 100% of the time:

  root@thinkpad:# zfs create -o mountpoint=/mnt rpool/testpool
  root@thinkpad:# zfs snapshot rpool/testpool@testsnap
  root@thinkpad:# cd /mnt/.zfs/snapshot/testsnap/
  root@thinkpad:/mnt/.zfs/snapshot/testsnap# cd -
  /root
  root@thinkpad:# zfs destroy -v rpool/testpool@testsnap
  will destroy rpool/testpool@testsnap
  will reclaim 0
  root@thinkpad:# zfs list -t all -r rpool/testpool
  NAME                      USED  AVAIL  REFER  MOUNTPOINT
  rpool/testpool             30K  64.4G    30K  /mnt
  rpool/testpool@testsnap      0      -    30K  -
  root@thinkpad:# zfs destroy -v rpool/testpool@testsnap
  will destroy rpool/testpool@testsnap
  will reclaim 0
  root@thinkpad:# zfs list -t all -r rpool/testpool
  NAME                      USED  AVAIL  REFER  MOUNTPOINT
  rpool/testpool             30K  64.4G    30K  /mnt
  root@thinkpad:#


rlaager commented Oct 21, 2012

As another data point, I saw this behaviour on OpenIndiana a few days ago. I don't know if this was before or after I applied updates. A reboot fixed it and I haven't reproduced it there since. I haven't tried the steps @jvsalo just posted.


jvsalo commented Oct 21, 2012

It appears lingering snapshot mounts (/proc/mounts) may be related. When a snapshot is accessed, a new mount is created:

rpool/testfs@testsnap /mnt/.zfs/snapshot/testsnap zfs ro,relatime,xattr 0 0

However, even when all file descriptors (according to lsof) have been closed, the mount might not disappear. Destroying the snapshot once merely makes the mount disappear, and a second destroy actually destroys the snapshot. After the first destroy the snapshot is still accessible and would be re-mounted if accessed again.

The above applies if I enter commands by hand (so there is delay), but if I run a script like this, things get even worse:

  #!/bin/sh
  zfs create rpool/testfs
  mount -t zfs rpool/testfs /mnt
  zfs snapshot rpool/testfs@testsnap
  cd /mnt/.zfs/snapshot/testsnap/
  cd -
  zfs destroy -v rpool/testfs@testsnap
  zfs list -t all -r rpool/testfs
  zfs destroy -v rpool/testfs@testsnap
  zfs list -t all -r rpool/testfs
  umount /mnt
  zfs destroy -v -r rpool/testfs

I would get

  root@thinkpad:/tmp# sh reprod.sh 
  /tmp
  will destroy rpool/testfs@testsnap
  will reclaim 0
  NAME                    USED  AVAIL  REFER  MOUNTPOINT
  rpool/testfs             30K  64.2G    30K  legacy
  rpool/testfs@testsnap      0      -    30K  -
  will destroy rpool/testfs@testsnap
  will reclaim 0
  NAME                    USED  AVAIL  REFER  MOUNTPOINT
  rpool/testfs             30K  64.2G    30K  legacy
  rpool/testfs@testsnap      0      -    30K  -
  umount: /mnt: device is busy.
          (In some cases useful info about processes that use
           the device is found by lsof(8) or fuser(1))
  will destroy rpool/testfs@testsnap
  cannot destroy 'rpool/testfs@testsnap': dataset is busy

... and rpool/testfs@testsnap is invulnerable to destroy unless I manually unmount /mnt/.zfs/snapshot/testsnap first. After that destroy works as expected.
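
For clarity, the manual workaround described above boils down to the following two commands (using the names from the reproducer):

  umount /mnt/.zfs/snapshot/testsnap
  zfs destroy -v rpool/testfs@testsnap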


ghost commented Oct 21, 2012

Interesting theory there. Only I cannot confirm that my case is
related to it:
The snapshots that typically fail to get destroyed DO have the
snapdir=visible attribute set, but positively have NEVER been accessed (up
to now). I just set snapdir=hidden everywhere, just for fun, to see
whether that stops the issue from showing itself, which I seriously doubt.

In any case: I have no reproducer; one time it appeared overnight on a
PC that was basically idle (as I posted in the forum). I (also: up to now)
cannot even think of a condition that increases the likelihood of the issue
reappearing (except time ;-))

So in any case, even if the issue mentioned gets resolved, it might
not resolve my frequent destroy failures. But I will start to use the -v
switch for upcoming deletes outside the zfs-auto-snapshot command/script.

On 21.10.2012 18:07, Jaakko wrote:

It appears lingering snapshot mounts (/proc/mounts) may be related.
When a snapshot is accessed, a new mount is created:

rpool/testfs@testsnap /mnt/.zfs/snapshot/testsnap zfs
ro,relatime,xattr 0 0

However, even when all file descriptors (according to lsof) have been
closed, the mount might not disappear and destroying the snapshot once
causes the mount to disappear, and a second destroy actually destroys
the snapshot. The snapshot is accessible and would be re-mounted if
doing so after the first destroy.

(...)


jvsalo commented Oct 21, 2012

However, I can only reproduce this on my 3.5.x hosts:

  • Produces on my laptop: ZFS rootfs, Debian testing, kernel 3.5.3 (x86_64), rc11 built by hand
  • Produces on server #1: ZFS for data only, Debian testing, kernel 3.5.4 (x86_64), rc11 Ubuntu PPA
  • Doesn't produce on server #2: ZFS for data only, Debian squeeze, kernel 3.2.2 (x86_64), rc11 Ubuntu PPA
  • Doesn't produce on server #3: ZFS for data only, Debian testing, kernel 3.2.0-3-amd64 (x86_64), rc11 Ubuntu PPA

On all hosts ZFS controls the disk devices directly. The laptop has an SSD, the others rotating storage.

@UbuntuNewbie: yeah, could be a different issue, I'm opening a new ticket for my reproducer.


ghost commented Nov 5, 2012

Additional information confirming the suspicion that the observed behavior is a weird side effect of another problem:

Ever since this has been going on (at least since zfs-auto-snap started creating a lot of snapshots), I have been running gnome-system-monitor along with automated scripts to catch this issue as early as possible. And today I saw something (OK, only the very outer edge of it) that I cannot explain. On an almost idle system, with only a handful of non-greedy tasks running, memory consumption all of a sudden began to rise steadily, taking up gigabytes of free memory. The moment there was none left, ALL processors became extremely busy (100%), leaving the system unresponsive for some time.

After a while the system became responsive again, which I used to shut down most of the tasks cleanly. Then I checked the status of the snaps and THERE IT WAS: some snapshots had not been destroyed and could not be destroyed by hand either, until the next reboot.

So it looks like the extreme memory pressure, and the way ZFS created and handled it, left some corruption behind that, at least up to now, has required a reboot to get resolved.

I understand this comment lacks the precision a developer might be looking for and gives only a general outline. But since I had never witnessed this building up before, I thought it might be useful to someone...


devsk commented Dec 24, 2012

The theory of the snapshot folder being mounted on access is correct. I manually went in and unmounted the snapshot-specific folder under .zfs, and I could then destroy the snapshot.

Looks like the auto-snapshot script needs to make sure that if it finds any mounts pointing into a .zfs folder in /proc/mounts, it unmounts them, and complains with lsof output if it cannot unmount them.
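
A rough sketch of what such a pre-destroy check could look like (an assumption about a possible implementation, not the actual zfs-auto-snapshot code; escaped characters in mount paths are not handled):

  # Unmount lingering .zfs/snapshot mounts before destroying snapshots;
  # if an unmount fails, show what is holding it via lsof.
  awk '$2 ~ /\/\.zfs\/snapshot\// { print $2 }' /proc/mounts |
  while read -r mnt; do
      umount "$mnt" || lsof +D "$mnt"
  done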

maxximino commented:

With 7973e46, automatic unmount seems to be working:

  sigmaii ~ # cd .zfs/snapshot/newsnapshot/
  sigmaii newsnapshot # zfs destroy mypool/myhomedir@newsnapshot
  cannot destroy snapshots in mypool/myhomedir@newsnapshot: dataset is busy
  sigmaii newsnapshot # cd ..
  sigmaii snapshot # zfs destroy mypool/myhomedir@newsnapshot
  sigmaii snapshot # cd newsnapshot
  bash: cd: newsnapshot: No such file or directory

behlendorf commented:

@maxximino Thanks for commenting in all these related issues and verifying that it's fixed. Closing issue.
