There's been a weird condition happening every once in a while that's bugging me, and I'd really like to dig in and figure out what is going on. I've got some VMs with ZFS pools built on a single LVM2 device each. Every now and again one of those VMs fills its ZFS filesystem to 100%, and occasionally that causes:
```
WARNING: Pool 'vg01-data' has encountered an uncorrectable I/O failure and has been suspended
```
The pool gets suspended, all writes to the filesystem stop, and applications hang or crash.
I have been unable to reproduce this condition in a controlled test. No matter how I fill up a filesystem built in exactly the same way to 100%, ZFS does not report uncorrectable I/O failures.
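For reference, my reproduction attempts look roughly like this (the device, pool, and dataset names and sizes here are illustrative, not my exact setup):

```sh
# Build a pool on a single LV, the same shape as production (sizes/names illustrative)
lvcreate -L 20G -n data vg01
zpool create vg01-data /dev/vg01/data
zfs create vg01-data/fs

# Fill the filesystem until writes fail with ENOSPC
dd if=/dev/zero of=/vg01-data/fs/fill bs=1M
sync

# In my tests the pool stays ONLINE with no I/O errors reported
zpool status -v vg01-data
```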
This is very puzzling, and I'd really like some help from people who know the ZFS codebase better as to where to look for clues. If it happens again, what diagnostics would be most useful? And where is the problem likely to lie: in the ZFS code or in the LVM2 layer?
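To make the question concrete, this is the sort of output I'm planning to capture the next time a pool suspends; please tell me if something else would be more useful:

```sh
# Per-vdev error counters and the list of failed I/Os, if any
zpool status -v vg01-data
# Event history around the failure (write/probe errors, statechange, etc.)
zpool events -v vg01-data
# Kernel log: ZFS messages plus anything from the block/device-mapper layer
dmesg | grep -iE 'zfs|zio|dm-|i/o error'
# Internal ZFS debug log (only populated if zfs_dbgmsg_enable is set)
cat /proc/spl/kstat/zfs/dbgmsg
```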
I'm only guessing at this point, but I wonder whether there is some ZFS metadata that cannot be written out because the underlying LVM2 volume is full, metadata that ZFS expected to be able to write. I'm not sure why I can't reproduce the issue, though, unless the data has to hit an exact size to trigger it?
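To help rule the LVM2 layer in or out, I'm assuming checks like these would show whether the device underneath ZFS ever returned an error when the pool filled up (for example, if the LV turned out to be thin-provisioned and its thin pool ran short of space):

```sh
# Is the LV a plain linear volume or something thin/sparse?
lvs -o lv_name,vg_name,segtype,lv_size,data_percent vg01
# Device-mapper status for the volume; thin targets report out-of-space conditions here
dmsetup status
# Any block-layer I/O errors logged against the dm device?
journalctl -k | grep -iE 'dm-|blk_update_request|i/o error'
```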
One possible mitigation I've considered is setting up a quota or a reservation so that the filesystem I create can never use 100% of the underlying pool. I'm tempted to just try it and, if I never see the error again, call it as good as solved. But since I can't reproduce the problem, I might simply hit the same issue again in two months and be back to the drawing board.
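Concretely, the mitigation I have in mind is something like the following (the dataset names and sizes are just examples): either cap the data filesystem directly, or hold back a slice of the pool with a reservation on an otherwise empty dataset.

```sh
# Option 1: cap the data filesystem below the pool's capacity
zfs set quota=18G vg01-data/fs

# Option 2: reserve space via an empty dataset so the pool can never hit 100%
zfs create vg01-data/slack
zfs set reservation=1G vg01-data/slack

# Check what's in effect
zfs get -r quota,reservation vg01-data
```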
If anyone has any suggestions for how to tackle this one, I'd love to hear your ideas.
Thanks
Mark