Reboot hangs for UKI images #2384
Comments
@vipsharm as we can't reproduce this, can you share the ISO you are using so we can try to reproduce with that one? Also, did you give the machine enough RAM/CPU resources? |
it turns out the state partition gets filled up. In my case it was even full after the upgrade before kubernetes even started (one need to hold I increased the size of the state partition with:
and this allowed me to boot and see the Pods coming up. After everything was running, I did a cold I think somehow containerd ends up writing the container filesystems on the state partition and it gets that filled up. Maybe there is some directory we should be mounting elsewhere and we don't? |
The last thing printed before the "No space left on device" errors started was, afaict, "simple directory" (about Given the errors refer to directories in `/sysroot/usr`, I guess the copying here is what fails: https://github.com/kairos-io/immucore/blob/8a142fe41f046098eb5676e055f50de02d342c13/pkg/state/steps_uki.go#L410 |
We only perform a check on the type of directory (mount point or not) on the top level dir (in this case |
What was the reason we only scanned top level directories again? (https://github.com/kairos-io/immucore/blob/8a142fe41f046098eb5676e055f50de02d342c13/pkg/state/steps_uki.go#L379). @Itxaka do you remember maybe? |
We didn't have anything in submounts that is private, no? So any submounts in the subdirs of the root dirs would be propagated when moving the mountpoints? Something like that, I think. |
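To make the top-level-only check discussed above concrete, here is a minimal Go sketch that lists mount points nested below a directory by reading text in the `/proc/self/mountinfo` format. It is not the immucore code: the function name `nestedMounts` and the sample mount table are hypothetical, made up for illustration only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nestedMounts returns every mount point listed in mountinfo that lies
// strictly below root. mountinfo is text in the /proc/self/mountinfo format,
// where the fifth whitespace-separated field is the mount point.
// The kernel escapes spaces in paths as \040; this sketch does not unescape them.
func nestedMounts(mountinfo, root string) []string {
	prefix := strings.TrimSuffix(root, "/") + "/"
	var found []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue // not a valid mountinfo line
		}
		mp := fields[4]
		if mp != root && strings.HasPrefix(mp, prefix) {
			found = append(found, mp)
		}
	}
	return found
}

func main() {
	// Hypothetical mount table, invented for this example.
	const sample = `22 1 0:20 / / rw - rootfs rootfs rw
30 22 8:3 / /usr/local rw - ext4 /dev/sda3 rw
31 22 0:25 / /run rw - tmpfs tmpfs rw
32 30 8:3 /containerd /usr/local/lib/containerd rw - ext4 /dev/sda3 rw`

	for _, mp := range nestedMounts(sample, "/usr") {
		fmt.Println(mp)
	}
}
```

With the sample table, `/usr` itself is not a mount point, so a check limited to the top level would treat it as a simple directory, yet two mounts live underneath it and their contents would be included in a plain copy.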
maybe it's the order of things then? Maybe it starts copying |
Could it be that your issue comes from RAM? After all, the new sysroot is mounted as tmpfs, so if the VM has low memory or the image is big, it can lead to RAM exhaustion, and that would cause the copying to fail. I just wonder why we aren't seeing the same issue anywhere else when testing. |
I thought the same and increased my VM's RAM to 15GB, but the error is still there. We probably don't see the error because we are not running much on Kubernetes. If you look at the screenshot with the errors, it's writing complete container filesystems there. This could easily get huge in size. Given we are only trying to "fake" a chroot environment, I don't think it was ever the intention to copy/duplicate such huge files. Also the directory they reside ( |
I tried to clean up my cluster before rebooting by removing helm releases and deployments. I wanted to see whether it would still fail after cleaning up as many containers as I could. Either I didn't delete enough containers or it doesn't make any difference, but the error is still there after reboot. |
I increased the VM's RAM to 30GB and, after taking quite some more time, it eventually booted with no disk errors. So this proves that the problem is the insane amount of data we are copying to the in-memory filesystem. |
Blergh. We tried to make it really simple, but then it didn't work. So maybe we would need to rework it and mount things directly under the new fake sysroot to avoid all the copying? Currently:
We probably should do:
The only reason we needed to move from / to /sysroot was that / was of type rootfs, and that broke the kubernetes thingie. So now we would need to rework it to be on the tmpfs from scratch: just have a minimal / rootfs, move /proc, /dev, /run and such into /sysroot, and chroot into it from the first immucore step so we run everything in the proper final system. We will still need to do the copy there, BUT we do it before all the mounting of state and such, so it's simpler, as it's basically a 1-1 copy with no extra mounts in there. |
I think it's easier like that, doing all the preparation before the copying. Otherwise, another approach is to make the cp/mount function recursive... or use rsync or something more intelligent for the copying? |
If it's possible to work on the tmpfs from the beginning, I agree, it sounds like the better option. The rest sounds complicated and might result in different problems down the line. |
Seems to be the case for real HW as well (not only affecting VMs) |
does this issue only occur on VMs? |
No it shouldn't matter if it's a VM or not. It's now fixed on master. |
Closing now as this is fixed in master and will be part of the upcoming 3.0.4 release (see #2428). |
Kairos version:
3.0.1
CPU architecture, OS, and Version:
Ubuntu 23.10
After the cluster is set up, rebooting the OS hangs in a QEMU VM.
IMG_1572.MOV
IMG_1571.MOV