Boot failed on hosts with LVM thin/raid logical volumes #9300
We reproduced the issue on a VM. Logs and VM images can be found here: https://drive.google.com/drive/folders/1aj4GUOjO_EqSFdz4GgAZXhpCJ75ggZOm In production, we are in a situation where we cannot reboot or update servers that have LVs of thin or raid1 type. After a reboot, the server fails to boot for anywhere from 30 minutes to several hours; eventually it comes up. Of course, all servers have the correct modules configured, and the LVs were created within the past year (using an LVM CSI driver). It looks like a race in kernel module initialization. As a workaround, could you provide a machine config option to not activate LVM at boot time? (All known LVM CSI implementations activate LVs themselves.)
@mst1711 FYI
Can you extract the relevant bits as text? What is the root cause that you're seeing? It's almost impossible to follow screenshots, and they are not searchable.
We provided a shared Google Drive folder with details (https://drive.google.com/drive/folders/1aj4GUOjO_EqSFdz4GgAZXhpCJ75ggZOm). The VM boot logs are in that folder. Unfortunately, we can't provide such logs from bare metal because of the lack of a serial console.
The problem was discovered when updating from version 1.6.7 to version 1.7.5.
@eshikhov would you be able to test with a custom iso/installer?
Yes, I can try.
Could you send a message on the Sidero Labs Slack?
Drop `activateLogicalVolumes` sequencer step. LVM package already ships proper udev rules to handle this.

```text
❯ tree lvm2/usr/lib/udev/rules.d/
lvm2/usr/lib/udev/rules.d/
├── 10-dm.rules
├── 11-dm-lvm.rules
├── 13-dm-disk.rules
├── 69-dm-lvm.rules
└── 95-dm-notify.rules

1 directory, 5 files
```

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
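For context on how those rules replace the sequencer step: per lvmautoactivation(7), `69-dm-lvm.rules` invokes LVM's `pvscan --cache` for each physical volume udev detects, and a volume group is auto-activated once all of its PVs are online. A minimal sketch of that flow, run by hand against a hypothetical PV `/dev/sdb1`:

```text
# What the udev rule effectively does when a PV appears: record it in the
# devices cache and auto-activate ("ay") any VG that is now complete.
❯ pvscan --cache --activate ay /dev/sdb1
```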
I checked it on a custom image; the problem can no longer be reproduced. After booting, the LVM volumes are available. Everything works.
@smira can the fix be backported to 1.7?
Fantastic support, thanks! Waiting for the release to "unlock" our servers.
I don't know; this seems to be a bigger/breaking change if things go wrong in some setups.
I understand your concern: changing the boot sequence in a patch release is against the usual practice. Anyway, thanks again for the good OS and support.
Yes, that's true as well, but it's too big of a change for a patch release if it breaks some setups. Talos 1.8.0-beta.1 will be released next Monday, so you can use that.
Is it safe to use 1.8 worker nodes with a 1.7 control plane?
In general no, but it should be fine for this pair (1.7 <> 1.8).
Drop `activateLogicalVolumes` sequencer step. LVM package already ships proper udev rules to handle this.

```text
❯ tree lvm2/usr/lib/udev/rules.d/
lvm2/usr/lib/udev/rules.d/
├── 10-dm.rules
├── 11-dm-lvm.rules
├── 13-dm-disk.rules
├── 69-dm-lvm.rules
└── 95-dm-notify.rules

1 directory, 5 files
```

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit e17fafa)
Just tested 1.8.0-beta.1:
The udev rules do not activate LVM logical volumes, regardless of their type (linear, thin, raid, or whatever). For our case that is OK, but it seems it should at least be noted in the release notes. IMO volume activation should be done by the CSI driver for the volumes it controls.
@eshikhov @vladimirfx could you test with this installer image:
Of course, we will check it soon. Can you share what changes it contains?
Yes, I restructured the code to behave like a normal OS: if a volume group and all the disks backing it are available, it gets activated (driven by udev events). The previous implementation tried to activate every volume group unconditionally. Following this reference: https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html
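For reference, the man page ties this event-driven behavior to the `event_activation` setting in lvm.conf; a short excerpt showing that knob with the upstream default (not necessarily what Talos ships):

```text
# /etc/lvm/lvm.conf (excerpt)
global {
    # 1 = activate a VG from udev events once all of its PVs have appeared;
    # 0 = activate everything at a fixed point during startup instead.
    event_activation = 1
}
```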
Support lvm auto-activation as per https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html.

Previously Talos unconditionally tried to activate all volume groups; activation is now driven by udev events.

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
Does it fail the boot in case of an LVM activation failure?
It doesn't; it's a controller, so it'll just retry, and if LVM reports it's not healthy it just ignores it.
LVM2's configure uses `AC_PATH_TOOL` to find the modprobe path if it is not set; since we build inside tools, it was picking up `/toolchain/bin/modprobe`. Explicitly set `MODPROBE_CMD=/sbin/modprobe`, the path of the `modprobe` binary in Talos.

Fixes: siderolabs/talos#9300
Signed-off-by: Noel Georgi <[email protected]>
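Because `MODPROBE_CMD` comes from an autoconf check (`AC_PATH_TOOL`), it can be pinned by passing the variable directly to `configure`; a minimal sketch of the idea (all other configure flags omitted):

```text
# Pin modprobe to its runtime path instead of whatever configure finds
# on the build host's PATH (here, /toolchain/bin/modprobe inside tools).
❯ ./configure MODPROBE_CMD=/sbin/modprobe
```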
Use LVM2 tests that rely on module loading by lvm.

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
LVM2's configure uses `AC_PATH_TOOL` to find the modprobe path if it is not set; since we build inside tools, it was picking up `/toolchain/bin/modprobe`. Explicitly set `MODPROBE_CMD=/sbin/modprobe`, the path of the `modprobe` binary in Talos.

Fixes: siderolabs/talos#9300
Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit ca2e8c8)
Support lvm auto-activation as per https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html.

Previously Talos unconditionally tried to activate all volume groups; activation is now driven by udev events.

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit d8ab498)
It does not work on this image: ghcr.io/frezbo/installer:v1.8.0-alpha.2-31-g9fa08e843-dirty
Use LVM2 tests that rely on module loading by lvm.

Fixes: siderolabs#9300
Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit 76318bd)
Hmm, weird; added a test that needs
Are you sure these lvm volumes have auto-activation? Also, would it be possible to upload
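One way to answer that on a recent lvm2 (2.03.12+; the report field and flag below are assumptions about the installed version) is to query and set the per-VG/LV autoactivation property:

```text
# Show whether autoactivation is enabled for each VG and LV.
❯ vgs -o name,autoactivation
❯ lvs -o name,vg_name,autoactivation

# Enable it explicitly for a hypothetical VG named "data".
❯ vgchange --setautoactivation y data
```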
On the old release, the LVM volumes were automatically activated. The file you requested, from a non-working node:
I think this issue can be closed in favor of #9365, because node boot-up is fixed.
Discussed in #9140
Originally posted by eshikhov August 9, 2024
Hi,
When using the LVM mirror and LVM thin pool functionality, the OS boots only intermittently.
When setting up for the first time, everything is fine: the kernel modules (dm_raid and dm_thin_pool) are loaded via the config. Then the LVM mirror and LVM thin pool volumes are created, and everything works well until the server is rebooted (and the reboot can be caused by various reasons, be it talosctl upgrade with the --reboot-mode powercycle option or just talosctl reboot --mode powercycle). After a restart, Talos does not start but goes into cyclic reboots, and on some restart it may eventually boot successfully.
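For reference, loading those modules through the Talos machine config looks roughly like this (a sketch of the documented `machine.kernel.modules` section; module names taken from the description above):

```yaml
machine:
  kernel:
    modules:
      - name: dm_raid       # LVM mirror/raid target
      - name: dm_thin_pool  # LVM thin pool target
```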
Screenshots of unsuccessful launches that I managed to take:
Debug information dump for the cluster:
support.zip
Talos v1.7.6