Reboot Issues After Resizing #695
Ok, to make this testable in a scalable way, I've converted what you've described into a userData script. That script looks like:

```bash
#!/bin/bash
#
# Bail on errors
set -euo pipefail
#
# Be verbose
set -x
#
################################################################################
# Log everything below into syslog
exec 1> >( logger -s -t "$( basename "${0}" )" ) 2>&1
# Patch-up the system
dnf update -y
# Allocate the additional storage
if [[ $( mountpoint /boot/efi ) =~ "is a mountpoint" ]]
then
if [[ -d /sys/firmware/efi ]]
then
echo "Partitioning for EFI-enabled instance-type..."
else
echo "Partitioning for EFI-ready AMI..."
fi
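  # /boot/efi present: grow partition 4, then extend each LV and grow its filesystem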
growpart --free-percent=50 /dev/nvme0n1 4
lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
xfs_growfs /dev/mapper/RootVG-rootVol
lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
xfs_growfs /dev/mapper/RootVG-varVol
lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
xfs_growfs /dev/mapper/RootVG-auditVol
else
echo "Partitioning for BIOS-boot instance-type..."
growpart --free-percent=50 /dev/nvme0n1 2
lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
xfs_growfs /dev/mapper/RootVG-rootVol
lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
xfs_growfs /dev/mapper/RootVG-varVol
lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
xfs_growfs /dev/mapper/RootVG-auditVol
fi
# Reboot
systemctl reboot
```

The above should function for either BIOS-boot or EFI-boot EC2s launched from the various spel AMIs and replicate what you described as your process. Some notes:
To launch a batch (of 30) instances, I use a BASH one-liner like:

```bash
mapfile -t INSTANCES < <(
aws ec2 run-instances \
--image-id ami-021ba76fc66135488 \
--instance-type t2.xlarge \
--subnet-id <SUBNET_ID> \
--security-group-id <SECURITY_GROUP_ID> \
--iam-instance-profile 'Name=<IAM_ROLE_NAME>' \
--key-name <PROVISIONING_KEY_NAME> \
--block-device-mappings 'DeviceName=/dev/sda1,Ebs={
DeleteOnTermination=true,
VolumeType=gp3,
VolumeSize=100,
Encrypted=false
}' \
--user-data file:///tmp/userData.spel_695 \
--count 30 --query 'Instances[].InstanceId' \
--output text | \
tr '\t' '\n'
)
```

This saves all of the newly-launched instances' IDs to a BASH array (named INSTANCES).
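Before driving the batch over SSH, it can help to wait until every instance passes its status checks. A minimal sketch using the standard AWS CLI waiter (not part of the original one-liner; any profile/region flags are whatever your environment needs):

```bash
# Block until all newly-launched instances report passing status checks
aws ec2 wait instance-status-ok --instance-ids "${INSTANCES[@]}"
```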
As a quick "ferinstance": to reboot all of the instances in a batch (captured into the INSTANCES array):

```bash
for INSTANCE in "${INSTANCES[@]}"
do
echo "Rebooting $INSTANCE..."
INSTANCE_IP="$(
aws ec2 describe-instances \
--instance-id "${INSTANCE}" \
--query 'Reservations[].Instances[].PrivateIpAddress' \
--output text
)"
timeout 10 ssh "maintuser@${INSTANCE_IP}" "sudo systemctl reboot"
done
```
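To tally which instances actually came back from such a reboot, one option is to poll each one over SSH and check the mount unit that fails later in this thread. This is only a hedged sketch along the lines of the loop above — the maintuser account and the var-log-audit.mount unit name are taken from other comments here, and the sleep interval is arbitrary:

```bash
# Give the whole batch time to reboot (arbitrary grace period)
sleep 120

for INSTANCE in "${INSTANCES[@]}"
do
  INSTANCE_IP="$(
    aws ec2 describe-instances \
      --instance-id "${INSTANCE}" \
      --query 'Reservations[].Instances[].PrivateIpAddress' \
      --output text
  )"
  # An unreachable host or an inactive mount unit both count as a failure here
  if timeout 10 ssh "maintuser@${INSTANCE_IP}" \
       "systemctl is-active --quiet var-log-audit.mount"
  then
    echo "${INSTANCE} (${INSTANCE_IP}): OK"
  else
    echo "${INSTANCE} (${INSTANCE_IP}): FAILED"
  fi
done
```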
Note: after spinning up 30 t3 instances from the "This AMI has the booting issues" AMI … Ultimately, I would invite you to replicate what I've done and notify me if you have failures and/or provide a more-complete description of how to reproduce the issues you're seeing. If you wish to further discuss but don't want to include potentially-sensitive information in this public discussion, you can email me at [email protected] (obviously, this is a "throwaway" address used for establishing initial, private communications between you and me).
One last question, @mrabe142: if you're currently finding success with deploying using the 03.2 AMI, are you able to patch it? Asking because, sometime after the introduction of the EFI-ready AMIs, we received reports that the …
Yes, I do a … I did some preliminary testing using … I used the same userdata block as you above, same storage size. I tested with two different instance types: …
I ran two instances of each type. The only difference between the two was the instance type. It might be worth spinning up with the second instance type and seeing if you see the same thing. I can run more tests when I have more time.
Ok. I'll try with the t3.2xlarge. That said, the build-platform used for creating the Amazon Machine Images is … At any rate, "I guess we'll see": it gives me one more thing to try out.
Alright. Interesting. Launched 50 of the t3.2xlarge instances using the … I'll see what, if anything, I can do to get information from them.
Curiouser and curiouser… I hadn't realized from prior communications that your reboot failures were leaving you at emergency-mode, so I hadn't included setting a root-password in my userData payload. So, I remedied that and launched a new set of 50. However, wanting to try to save money on the troubleshooting, I'd changed that batch to a …
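For reference, "setting a root-password in the userData payload" can be as small as the sketch below; the password value is a placeholder (not anything used in this thread) and this is only sensible for short-lived troubleshooting instances:

```bash
# Set a known root password so the emergency-mode console is actually usable
# (placeholder value; throwaway troubleshooting instances only)
echo 'root:ChangeMe-TroubleshootingOnly' | chpasswd
```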
Oof… Switched back to t3.2xlarge and got 19/50 failures.
Just as a hedge against there being some kind of race-condition causing issues when executing:

```bash
growpart --free-percent=50 /dev/nvme0n1 4
lvm lvresize --size=+4G /dev/mapper/RootVG-rootVol
xfs_growfs /dev/mapper/RootVG-rootVol
lvm lvresize --size=+8G /dev/mapper/RootVG-varVol
xfs_growfs /dev/mapper/RootVG-varVol
lvm lvresize --size=+6G /dev/mapper/RootVG-auditVol
xfs_growfs /dev/mapper/RootVG-auditVol
```

I changed it to the more-compact form (with internally-enforced resize workflows):

```bash
growpart --free-percent=50 /dev/nvme0n1 4
lvresize -r --size=+4G /dev/mapper/RootVG-rootVol
lvresize -r --size=+8G /dev/mapper/RootVG-varVol
lvresize -r --size=+6G /dev/mapper/RootVG-auditVol
```

Sadly, this made no difference in the presence of reboot failures: the batch that used the changed content failed at 12/50.
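As an aside, a quick way to sanity-check that the resizes landed before the reboot — not part of the userData above, just a hedged verification sketch:

```bash
# Confirm the partition table, LVs, and filesystems all reflect the new sizes
lsblk /dev/nvme0n1
lvs RootVG
df -h / /var /var/log/audit
```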
Ok. This may be an issue with the systemd version in RHEL 8.x. When I looked up the error message coming from systemd's attempt to mount /var/log/audit:

```
# systemctl --no-pager status -l var-log-audit.mount
● var-log-audit.mount - /var/log/audit
Loaded: loaded (/etc/fstab; generated)
Active: failed (Result: protocol) since Mon 2024-07-01 13:27:30 UTC; 32min ago
Where: /var/log/audit
What: /dev/mapper/RootVG-auditVol
Docs: man:fstab(5)
man:systemd-fstab-generator(8)
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Mounting /var/log/audit...
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Mount process finished, but there is no mount.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: var-log-audit.mount: Failed with result 'protocol'.
Jul 01 13:27:30 ip-140-48-100-118 systemd[1]: Failed to mount /var/log/audit.
```

I was able to turn up an issue filed in late 2018 against the …
And, when I check the failed EC2s, I see:

```
# systemctl --version
systemd 239 (239-82.el8)
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy
```

Why I didn't see this on …
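If an instance does land in emergency mode this way, the thread doesn't spell out a recovery sequence; the following is only a suggested sketch — list what failed, retry the mounts, then continue the boot:

```bash
# From the emergency-mode shell (root password required, per the earlier comment)
systemctl --failed --no-pager            # see which mount units hit the 'protocol' failure
mount -a                                 # retry everything listed in /etc/fstab
systemctl restart var-log-audit.mount    # or retry just the specific unit
systemctl default                        # attempt to continue the normal boot
```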
Just as a sanity-check, I tried two more tests to validate that the problem only occurs when modifying either the "OS volumes" (i.e., …
As noted previously, we haven't encountered issues when adding secondary EBS volumes to host non-OS data. Similarly, our processes around creating AMIs only test whether the boot-EBS can be grown, but do not make any LVM modifications or do any reboots. At any rate, the manner in which we generally use Linux EC2s and test the AMIs we publish likely accounts for why the underlying problem hasn't previously manifested.

Going to open an issue with Red Hat. In the interim, I would suggest not trying to grow or create further volumes within the root LVM2 volume-group. Which is to say, if you've got application-data that's been driving you to alter the boot volume's sizes, place that data on LVM objects outside of the root volume-group (a sketch of that pattern follows below).
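A minimal sketch of that workaround — attach a second EBS volume and give the application data its own volume-group. The device name (/dev/nvme1n1), the AppVG/appVol names, and the mount-point are assumptions for illustration only:

```bash
# Carve the secondary EBS volume into its own VG, leaving RootVG untouched
pvcreate /dev/nvme1n1
vgcreate AppVG /dev/nvme1n1
lvcreate -l 100%FREE -n appVol AppVG
mkfs.xfs /dev/mapper/AppVG-appVol

# Mount it where the application data lives
mkdir -p /opt/appdata
echo '/dev/mapper/AppVG-appVol /opt/appdata xfs defaults 0 0' >> /etc/fstab
mount /opt/appdata
```

Per the comments above, secondary volumes handled this way haven't exhibited the reboot failure.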
Issue opened with Red Hat. Engineers are reviewing.
Red Hat has moved the case from their Storage team – the ones who deal with LVM, etc. – to their Services team – the ones that oversee …

In the near term, if you're expanding volumes to host application-data, switch to hosting that data on volumes separate from the Root volume-group and mount as appropriate. Otherwise, until Red Hat can identify a real fix, they recommend adding the …
Got a response back from the Service Team, last night.
Following that RHEL-5907 issue-link, it looks like this has been going on since at least September. Not sure why any of the AMIs have worked for you. I don't have the time to verify, but I'm going to assume that the issue is present in all of our RHEL 8 AMIs: …

(We have AMIs older than the above; it's just that the deprecation-tags mean they won't show up in a search.)
This turned out to be a vendor (Red Hat) issue. Closing this case as there's (currently) nothing to be done via this project.
Update: the vendor-assigned engineer finally updated their Jira associated with this problem. That engineer has decided it's a WONTFIX because Red Hat 8 is too late in its lifecycle to be worth fixing what he characterized as a "nice to have" (poor word-choice: "rare" or "corner case" would probably have been a less-loaded choice). From the vendor's Jira (RHEL-5907): …
This only ever happens on Nitro-enabled EC2 instances for me, but it was absolutely brutal to debug. It only happens if I include an … Definitely still a problem today, even if the responsibility is on Red Hat. Glad I found these threads, at least.
Thanks for the report @crispysipper. Yeah, we're aware, but still haven't found a good general solution for EL8 (Red Hat at this point has told us to just use RHEL9 🤦). If you find any further details or have more suggestions, please reach out and let us know!
Thanks @lorengordon - I'm a big consumer of your images, so a big thank you. Makes my job less painful for sure :) Right now I am working around it by mounting the typically-failed LVs with …
Creating new ticket, as it seems like the OP for #691 may no longer be having issues (no comment since June 14th) and I don't want to continue spamming that thread if such is the case.
Originally posted by @mrabe142 in #691 (comment):
And, per @mrabe142 in #691 (comment):
And, per @mrabe142 in #691 (comment):