VM disk corruption with Apple Silicon #1957
In case it is relevant, I was compiling in a separate APFS (Case-sensitive) Volume as described here. This volume seems absolutely fine, so the corruption seems limited to the VM itself. I can't see how this could have happened with 100GB, but I wonder if it's possible that it ran out of space? I could try increasing the disk size, but the whole point of using an external volume was that this would not be necessary.
Hmm. I just tried again but compiling in

Answering your other questions:
This would, I think, really help us. Our use-case is this: we want to be able to edit files from within macOS, but then compile inside AlmaLinux 9. The codebase we are compiling is relatively large (>4 million lines of C++) and can take up to 400 GB of temporary compilation space. I was reluctant to make separate VMs with this much local storage, especially since a lot of us will be working on laptops. Ideally we would have a large build area (possibly on an external drive), accessible from several VMs, and with very fast disk I/O to the VM (since otherwise the build time can become unusably slow). We do NOT, in general, need to be able to access this build area from the host (at least, not with fast I/O; it would mainly be to examine compilation failures).

(I will get back to the other tests shortly, but I'm currently travelling with limited work time, and it seems very likely that the issue is related to compiling outside the VM.)
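For reference, a shared build area like the one described above would normally be declared in the instance's lima.yaml. The fragment below is an illustrative sketch only: the `/Volumes/BuildArea` path is a placeholder rather than the reporter's actual config, and, as far as I understand, virtiofs mounts require the vz VM type on macOS.

```yaml
# Illustrative lima.yaml fragment (placeholder path, not the reporter's config).
# Mount a large host volume into the guest as writable build scratch space.
mounts:
- location: "/Volumes/BuildArea"  # placeholder: a separate or external APFS volume
  writable: true
mountType: "virtiofs"             # fast shared-folder I/O; needs --vm-type=vz
```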
I'm not sure how virtiofs affects the XFS disk, but maybe this issue should be reported to Apple?
I was under the impression that the problem was with the

```yaml
- location: /Volumes/Lima
  writable: true
```

So the remote filesystem is a separate topic*, from this ARM64 disk corruption. Sorry for the added noise. Though I don't see how switching from remote

\* should continue in a different discussion

Note that disk images cannot be shared...
Is this relevant? (UTM uses vz too.) It looks like people began hitting this issue in September, so I wonder if Apple introduced a regression around that time? I still can't repro the issue locally though.
Can anybody confirm this rumor?
Removing these lines will disable ballooning: Lines 598 to 604 in 7cb2b2e
For what it's worth, I believe I've narrowed down the problem that I've noticed in utmapp/UTM#4840 to having used an external SSD drive. I've not reproduced the corruption if the VM lives on my Mac's internal storage. @EdwardMoyse Your separate APFS volume... is it on the same storage device that your Mac runs on, or is it a separate external device? @AkihiroSuda I've not seen disabling the Balloon device help with preventing corruption. At least, if I'm working with a QEMU-based VM that lives on my external SSD storage, it has

Probably you are hitting a different issue with a similar symptom?

@wdormann my APFS volume is on the same device (SSD) as macOS. It's not an external device in my case.

Thanks for the input. I've been testing the disk itself, and it has yet to report errors.
I think I reproduced the issue with the default Ubuntu template:
(Non-minimum, non-deterministic) repro steps:
Filesystems:
The VM disk is located in the default path
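Independent of Lima, a crude way to detect this kind of silent corruption during a repro run is a write/fsync/re-read pass over freshly written files. The sketch below is illustrative only (file names, counts, and sizes are arbitrary choices, not from the thread); note that a re-read immediately after writing may still be served from the guest's page cache, so this catches only some failure modes.

```python
import hashlib
import os
import tempfile

def write_sync_verify(root: str, n_files: int = 8, size: int = 1 << 20) -> list[str]:
    """Write n_files of random data, fsync them, then re-read and compare digests.

    Returns the paths whose contents no longer match (empty list means this
    crude check detected no corruption).
    """
    digests = {}
    for i in range(n_files):
        path = os.path.join(root, f"blob{i:03d}.bin")
        data = os.urandom(size)
        digests[path] = hashlib.sha256(data).hexdigest()
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data out of the process buffers
    mismatches = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                mismatches.append(path)
    return mismatches

if __name__ == "__main__":
    # Run this inside the guest, pointed at the filesystem under test.
    with tempfile.TemporaryDirectory() as scratch:
        print("corrupted files:", write_sync_verify(scratch))
```

On a healthy filesystem this prints an empty list; during the corruption runs described above, mismatches (or I/O errors from a remounted-read-only filesystem) would show up instead.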
Tried to remove the balloon, but the filesystem still seems to break intermittently.
I might be perhaps misunderstanding you, but I don't think I am using "non-OS volume for the underlying VM OS storage". For clarity, here is my setup:
So I would characterise this as rather a problem with using a non-OS volume for the intensive disk operations from within the VM.
I'll admit I'm not familiar with Lima.
Perhaps Lima does this all for you under the hood, but I suppose that I'd need to know exactly what it's doing to have any hope of understanding what's going on.
It's the latter (but I cannot tell you the technicalities of how it works). From within both the host and the VM I can access
Do you specify a mount type in your |
My apologies for the delay in replying, but I have been looking into this. The workflow is the same: compile https://gitlab.cern.ch/atlas/atlasexternals using the attached template with various configurations of host, qemu/vz, cores and memory. TL;DR: updating to

With
(from hint here)
Notes:
FWIW, I've added some test results and comments here: utmapp/UTM#4840 (comment). I've not ruled out that there is some issue with the macOS filesystem/hypervisor layer, but I've only seen corruption with a Linux VM, and not with macOS or Windows doing the exact same thing from the exact same VM disk backing. What is interesting to me is that if I take the exact same disk and reformat it as APFS instead of ExFAT, Linux 6.5.6 or 6.4.15 will not experience disk corruption. My theory is that given an unfortunate combination of speed/latency/something-else in the disk backing, a Linux VM might experience disk corruption.

Could you submit your insight to Apple? Probably via https://www.apple.com/feedback/macos.html

Oh, so that might be why it is mostly affecting external disks? Did people forget to (re-)format them before using them? EDIT: no, not so simple: "I create a separate APFS (Case-sensitive) Volume,"

And for me, I'm not using external (to the VM) disks any more; if you look at the table I posted here you will see that in the
In my case it occurs with the internal disk, and very frequently on Fedora images. Just create a Fedora VM and do dnf update; corruption happens immediately. EDIT: vz in my case
I don't recall if I mentioned it here, but through eliminating variables I was able to pinpoint a configuration likely to corrupt older Linux kernels, and that is having the VM hosted on an ExFAT-formatted partition (which just happens to be on an external disk for me). Based on how macOS/APFS works, I don't think it's even possible for me to test how ExFAT might perform on my internal disk, at least not without major reconfiguration of my system drive. If others are able to reproduce the disk corruption without relying on ExFAT at the host level, that at least helps eliminate the ExFAT layer as the possible location of the problem.

At least for me, I've been able to avoid the problem by reformatting my external disk to APFS, as that seems to tweak at least one of the variables required to see this bug happen, at least if the Linux kernel version is new enough.

At a conceptual level, it is indeed possible that Linux may be doing nothing wrong at all. In other words, Linux may just happen to be unlucky enough to express the disk usage patterns that trigger a bug whose symptom is a corrupted (BTRFS in my case) filesystem. But I suspect that positively distinguishing between a somewhat-unlikely-to-see Linux data corruption bug and a bug at the macOS hypervisor/storage level is probably beyond my skill set.
So it seems like there are a lot of references to people mentioning issues related to external disks and non-APFS filesystems. I am using the internal disk on my M2 mini with the default APFS filesystem, and I've experienced disk corruption once. I haven't specifically been able to force it to reproduce (though, to be honest, I haven't tried very hard), but I did want to point out that external disks and other filesystems may not be the specific cause; they may just make the issue easier to trigger compared to internal APFS. I run Debian Bookworm, and after repairing the filesystem with an fsck I did also upgrade my kernel from
The above table also lists corruption when running with qemu/hvf, so it might not even be unique to vz...
It is not unique to vz, and it is not unique to external disks. With

:-(
Okay, I updated the title and the original comment to hopefully clarify that this is a problem with every conceivable permutation of Lima. Unfortunately, Lima is completely unusable for me at the moment, and so for now I'm giving up.
I can reproduce this with 2 methods:

Are you able to reproduce this as well?
It still seems to be unique to one operating system and one hardware architecture, though? Maybe even Apple's issue.

Sorry, yes. I was being very single-minded in my statement above! I will rephrase the title.
This issue might be worth reporting to https://gitlab.com/qemu-project/qemu/-/issues too, if the issue is reproducible with bare QEMU (without using Lima).

At the risk of further fragmentation of the discussion of this issue, but at the potential benefit of getting the right eyeballs, I've filed: https://gitlab.com/qemu-project/qemu/-/issues/1997 (i.e., yes, this can be reproduced with QEMU, as opposed to the Apple Hypervisor framework).
This may fix the issue for vz:

(Thanks to @wpiekutowski utmapp/UTM#4840 (comment) and @wdormann utmapp/UTM#4840 (comment))
Oh wow - I've run my test twice with the patched version of Lima and no corruption or crashes! From reading the ticket, it's more a workaround than a complete fix, but I'll happily take it! Thanks @AkihiroSuda
Tip

EDIT by @AkihiroSuda: For --vm-type=vz, this issue seems to have been solved in Lima v0.19 (#2026)

Description
Lima version: 0.18.0
macOS: 14.0 (23A344)
VM: Almalinux9
I was trying to do a big compile, using a VM with the attached configuration (vz)
The build aborted with:
And afterwards, even in a different terminal, I see:
I was also logged into a display, and there I saw e.g.
If I try to log in again with:
each time I see something like the following appear in the display window:
Edit: there has been a lot of discussion below, and the corruption can happen with both vz and qemu, and on external (to the VM) and internal disks. Some permutations seem more likely to provoke a corruption than others. I have reproduced my experiments in the table in the following comment below.