Boot failed on hosts with LVM thin/raid logical volumes #9300

Closed
eshikhov opened this issue Sep 11, 2024 Discussed in #9140 · 44 comments · Fixed by #9303, #9333, siderolabs/pkgs#1034 or #9346

Comments

@eshikhov

Discussed in #9140

Originally posted by eshikhov August 9, 2024
Hi,

When using the LVM mirror and LVM thin pool functionality, the OS boots only intermittently.

On first setup everything is fine: the kernel modules (dm_raid and dm_thin_pool) are loaded via the config. Then the LVM mirror and LVM thin pool volumes are created, and everything works well until the server is rebooted (the reboot can have various causes, such as talosctl upgrade with the --reboot-mode powercycle option or simply talosctl reboot --mode powercycle). After a restart, Talos fails to start and goes into a reboot loop; at some point one of the restarts boots successfully.

Screenshots of unsuccessful launches that I managed to take:

screenshot_1, screenshot_2, screenshot_3, screenshot_4, screenshot_5

Dump debug information about the cluster:

support.zip

Talos v1.7.6

@vladimirfx

vladimirfx commented Sep 11, 2024

We reproduced the issue on VM. Logs and VMs can be found here: https://drive.google.com/drive/folders/1aj4GUOjO_EqSFdz4GgAZXhpCJ75ggZOm

In production, we are in a situation where we cannot reboot or update servers that have LVs of type thin or raid1. After a reboot, a server fails to boot for anywhere from 30 minutes to several hours. Eventually the server boots up...

Of course, all servers have the correct modules configured and LVs created in the past year (using LVM CSI).

It looks like some race in kernel module initialization.

As a workaround, can you provide a machine config option to not activate LVM at boot time? (All known LVM CSI implementations activate LVs themselves.)

@vladimirfx

@mst1711 FYI

@smira
Member

smira commented Sep 11, 2024

Can you extract the relevant bits as text? What is the root cause that you're seeing? It's almost impossible to follow screenshots, and they are not searchable.

@vladimirfx

> Can you extract the relevant bits as text? What is the root cause that you're seeing? It's almost impossible to follow screenshots, and they are not searchable.

We provided a shared Google Drive folder with details: https://drive.google.com/drive/folders/1aj4GUOjO_EqSFdz4GgAZXhpCJ75ggZOm

VM boot logs from that folder:
failed-boot-serial-console-output.txt
success-boot-serial-console-output.txt

Unfortunately, we can't provide such logs from bare metal because those servers lack a serial console.

@eshikhov
Author

The problem was discovered when updating from version 1.6.7 to version 1.7.5.
I'll check again on the 1.6 releases.

@frezbo
Member

frezbo commented Sep 11, 2024

@eshikhov would you be able to test with a custom iso/installer?

@eshikhov
Author

Yes, I can try

@frezbo
Member

frezbo commented Sep 11, 2024

Could you send a message on the siderolabs Slack?

frezbo added a commit to frezbo/talos that referenced this issue Sep 11, 2024
Drop `activateLogicalVolumes` sequencer step.

LVM package already ships proper udev rules to handle this.

```text
❯ tree lvm2/usr/lib/udev/rules.d/
lvm2/usr/lib/udev/rules.d/
├── 10-dm.rules
├── 11-dm-lvm.rules
├── 13-dm-disk.rules
├── 69-dm-lvm.rules
└── 95-dm-notify.rules

1 directory, 5 files
```

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@eshikhov
Author

I checked it on a custom image, the problem cannot be reproduced. After booting, LVM volumes are available. Everything works.

@vladimirfx

@smira can the fix be backported to 1.7?

@vladimirfx

Fantastic support, thanks!

Waiting for the release to "unlock" our servers.

@smira
Member

smira commented Sep 11, 2024

> @smira can the fix be backported to 1.7?

I don't know, this seems to be a bigger, potentially breaking change if things go wrong in some setups.

@vladimirfx

> > @smira can the fix be backported to 1.7?
>
> I don't know, this seems to be a bigger, potentially breaking change if things go wrong in some setups.

I understand your concern: changing the boot sequence in a patch release goes against usual practice.
But the fixed version initializes LVM correctly (using udev rules), in contrast to the unfixed one, which makes the OS unbootable. So this is not just a chore that drops a legacy part of the boot sequence: it fixes a critical bug affecting real users, without any user-visible behavior change for others.

Anyway - thanks again for the good OS and support.

@smira
Member

smira commented Sep 11, 2024

> But the fixed version initializes LVM correctly (using udev rules), in contrast to the unfixed one, which makes the OS unbootable. So this is not just a chore that drops a legacy part of the boot sequence: it fixes a critical bug affecting real users, without any user-visible behavior change for others.

yes, that's true as well, but it's too big of a change for a patch release if it breaks some setups.

Talos 1.8.0-beta.1 will be released next Monday, so you can use that.

@vladimirfx

> Talos 1.8.0-beta.1 will be released next Monday, so you can use that.

Is it safe to use 1.8 worker nodes with a 1.7 control plane?

@smira
Member

smira commented Sep 11, 2024

> > Talos 1.8.0-beta.1 will be released next Monday, so you can use that.
>
> Is it safe to use 1.8 worker nodes with a 1.7 control plane?

In general no, but should be fine for this pair (1.7 <> 1.8).

smira pushed a commit to smira/talos that referenced this issue Sep 13, 2024
Drop `activateLogicalVolumes` sequencer step.

LVM package already ships proper udev rules to handle this.

```text
❯ tree lvm2/usr/lib/udev/rules.d/
lvm2/usr/lib/udev/rules.d/
├── 10-dm.rules
├── 11-dm-lvm.rules
├── 13-dm-disk.rules
├── 69-dm-lvm.rules
└── 95-dm-notify.rules

1 directory, 5 files
```

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit e17fafa)
@vladimirfx

Just tested 1.8.0-beta.1:

  1. Nodes boot normally - ok
  2. Talos does not activate LVM volumes - warn

The udev rules do not activate LVM logical volumes, regardless of their type (linear, thin, raid, or whatever).

For our case this is OK, but it seems it should at least be noted in the release notes. IMO volume activation should be done by the CSI driver for the volumes it controls.

@frezbo
Member

frezbo commented Sep 18, 2024

@eshikhov @vladimirfx could you test with this installer image: ghcr.io/frezbo/installer:v1.8.0-alpha.2-26-g18daedb51-dirty ?

@vladimirfx

> @eshikhov @vladimirfx could you test with this installer image: ghcr.io/frezbo/installer:v1.8.0-alpha.2-26-g18daedb51-dirty ?

Of course, we will check it soon. Can you share what changes it contains?

@frezbo
Member

frezbo commented Sep 18, 2024

> > @eshikhov @vladimirfx could you test with this installer image: ghcr.io/frezbo/installer:v1.8.0-alpha.2-26-g18daedb51-dirty ?
>
> Of course, we will check it soon. Can you share what changes it contains?

Yes, I restructured the code to behave like a normal OS: if a volume group and the disks backing it are available, it gets activated (on events from udev). The previous implementation tried to activate every volume group unconditionally. This follows the reference at https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html
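
The event-driven behavior described above can be sketched roughly as follows. This is a minimal illustration in Python, not Talos code (Talos is written in Go). `VolumeGroup`, `on_device_event`, and the device paths are hypothetical names, and the `vgchange -aay` mention in the comment only marks the point where a real system would activate the group, in the way lvmautoactivation(7) describes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VolumeGroup:
    """Hypothetical in-memory view of a volume group (not an LVM API)."""
    name: str
    expected_pvs: frozenset[str]                      # PVs the VG is made of
    seen_pvs: set[str] = field(default_factory=set)   # PVs reported so far
    active: bool = False

    def is_complete(self) -> bool:
        # The VG is complete once every expected PV has been seen.
        return self.expected_pvs <= self.seen_pvs


def on_device_event(vgs: dict[str, VolumeGroup], vg_name: str, pv: str) -> bool:
    """Handle one 'PV appeared' event; return True if the VG was activated now."""
    vg = vgs.get(vg_name)
    if vg is None or vg.active:
        return False
    vg.seen_pvs.add(pv)
    if vg.is_complete():
        # A real system would activate here, e.g. with `vgchange -aay <vg>`.
        vg.active = True
        return True
    return False
```

With a two-disk group, the first event only records the disk and the second one triggers activation. An unconditional approach would instead attempt activation for every group at a fixed point in the boot sequence, whether or not its disks had appeared.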

frezbo added a commit to frezbo/talos that referenced this issue Sep 18, 2024
Support lvm auto-activation as per
https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html.

This changes from how Talos previously used to unconditionally tried to
activate all volume groups to based on udev events.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@vladimirfx

> Yes, I restructured the code to behave like a normal OS: if a volume group and the disks backing it are available, it gets activated (on events from udev). The previous implementation tried to activate every volume group unconditionally. This follows the reference at https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html

Does it fail the boot in case of an LVM activation failure?
In our case, activation failed because of a race in kernel module loading (not device drivers, but pure software components), so all PVs are available at boot and the VGs start activating. I am worried that the new implementation will fail the same way as the original...

@frezbo
Member

frezbo commented Sep 18, 2024

> > Yes, I restructured the code to behave like a normal OS: if a volume group and the disks backing it are available, it gets activated (on events from udev). The previous implementation tried to activate every volume group unconditionally. This follows the reference at https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html
>
> Does it fail the boot in case of an LVM activation failure? In our case, activation failed because of a race in kernel module loading (not device drivers, but pure software components), so all PVs are available at boot and the VGs start activating. I am worried that the new implementation will fail the same way as the original...

It doesn't. It's a controller, so it will just retry, and if LVM reports that a volume group is not healthy, it is simply ignored.
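
The retry-and-ignore behavior described above might look roughly like this. It is a sketch in Python under stated assumptions, not the actual Talos controller: `reconcile`, `try_activate`, and `is_healthy` are hypothetical names, and the two callbacks stand in for whatever LVM calls a real controller would make.

```python
from __future__ import annotations

from typing import Callable


def reconcile(
    pending: set[str],
    try_activate: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
) -> set[str]:
    """Run one reconcile pass and return the VGs to retry on the next pass."""
    retry: set[str] = set()
    for vg in sorted(pending):
        if not is_healthy(vg):
            # Unhealthy volume groups are ignored rather than treated as fatal.
            continue
        if not try_activate(vg):
            # Activation failed: keep the VG queued for the next pass.
            retry.add(vg)
    return retry
```

The point of the design is that a failed activation never propagates into the boot sequence; it only leaves the volume group queued for a later pass.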

frezbo added a commit to frezbo/pkgs that referenced this issue Sep 20, 2024
LVM2 configure uses `AC_PATH_TOOL` to find modprobe path if not set, as
we're building inside tools it was picking up `/toolchain/bin/modprobe`.
Explicitly set `MODPROBE_CMD=/sbin/modprobe` to the path of `modprobe`
binary in talos.

Fixes: siderolabs/talos#9300

Signed-off-by: Noel Georgi <[email protected]>
frezbo added a commit to frezbo/talos that referenced this issue Sep 20, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
smira pushed a commit to smira/pkgs that referenced this issue Sep 20, 2024
LVM2 configure uses `AC_PATH_TOOL` to find modprobe path if not set, as
we're building inside tools it was picking up `/toolchain/bin/modprobe`.
Explicitly set `MODPROBE_CMD=/sbin/modprobe` to the path of `modprobe`
binary in talos.

Fixes: siderolabs/talos#9300

Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit ca2e8c8)
frezbo added a commit to frezbo/talos that referenced this issue Sep 21, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
smira pushed a commit to smira/talos that referenced this issue Sep 21, 2024
Support lvm auto-activation as per
https://man7.org/linux/man-pages/man7/lvmautoactivation.7.html.

This changes from how Talos previously used to unconditionally tried to
activate all volume groups to based on udev events.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit d8ab498)
frezbo added a commit to frezbo/talos that referenced this issue Sep 21, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@eshikhov
Author

eshikhov commented Sep 22, 2024

It does not work with this image: ghcr.io/frezbo/installer:v1.8.0-alpha.2-31-g9fa08e843-dirty

dmesg_after_reboot.log
controller-runtime.log

smira pushed a commit to smira/talos that referenced this issue Sep 23, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit 76318bd)
@frezbo
Member

frezbo commented Sep 23, 2024

Hmm, weird. I added a test that needs dm_raid, and pvscan seems to load the modules.

frezbo added a commit to frezbo/talos that referenced this issue Sep 23, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@frezbo
Member

frezbo commented Sep 23, 2024

Are you sure these lvm volumes have auto-activation? Also would it be possible to upload support.zip from talosctl support going forward?

frezbo added a commit to frezbo/talos that referenced this issue Sep 23, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@eshikhov
Author

On the old release, LVM volumes were activated automatically.

The file you requested from a non-working node:
support.zip

@vladimirfx

I think this issue can be closed in favor of #9365 because node boot up is fixed.

@smira smira closed this as completed Sep 24, 2024
smira pushed a commit to frezbo/talos that referenced this issue Sep 30, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
smira pushed a commit to frezbo/talos that referenced this issue Oct 1, 2024
Use LVM2 tests that relies on module loading by lvm.

Fixes: siderolabs#9300

Signed-off-by: Noel Georgi <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2024