Boothook not executing for cloud-init 'ubuntu/23.3.1-0ubuntu1_20.04.1' #4572

Closed
supershal opened this issue Nov 1, 2023 · 10 comments
Labels
bug Something isn't working correctly incomplete Action required by submitter

Comments

@supershal

Bug report

Cluster API Provider AWS uses a boothook to run a script that fetches the user data from the metadata server and saves it to a file, which is then executed to initialize the instance.
The Ubuntu 20.04 AMI ships with cloud-init 'ubuntu/23.3.1-0ubuntu1_20.04.1'. On this version cloud-init fails to run the #boothook, which breaks the bootstrap process. The related bug is filed at kubernetes-sigs/image-builder#1333

Please suggest if there is a better way to test or simulate boothook execution.

Steps to reproduce the problem

  1. Create a base AMI with ERROR_ON_USER_DATA_FAILURE=false to disable the feature.
    Without the above settings the instance does not initialize and I am unable to ssh to it.
    Feature Info: https://cloudinit.readthedocs.io/en/20.3/topics/hacking.html#cloudinit.features.ERROR_ON_USER_DATA_FAILURE
    Instruction to override it: https://cloudinit.readthedocs.io/en/20.3/topics/hacking.html#module-cloudinit.features

  2. Create an AWS instance using the AMI created above and the following sample user-data.
    user-data.txt:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="3f88ff831aa9188003bb992d697f4a566650ab0bb50290082e2c8b2a70aa"

--3f88ff831aa9188003bb992d697f4a566650ab0bb50290082e2c8b2a70aa
content-type: text/cloud-boothook

#cloud-boothook
#!/bin/bash


set -o errexit
set -o nounset
set -o pipefail

echo "creating /etc/secret-userdata.txt file"

echo '#!/bin/bash' >> /etc/secret-userdata.txt
echo "echo 'boothook created file'" >> /etc/secret-userdata.txt

echo "restarting cloud-init"
systemctl restart cloud-init
log::success_exit

--3f88ff831aa9188003bb992d697f4a566650ab0bb50290082e2c8b2a70aa
content-type: text/x-include-url

file:///etc/secret-userdata.txt

--3f88ff831aa9188003bb992d697f4a566650ab0bb50290082e2c8b2a70aa--

  3. Create the AWS instance with the user-data using the CLI or the AWS console:
aws ec2 run-instances --image-id <IMAGE_CREATED_IN_STEP_1> --key-name <YOUR_KEYPAIR> --security-groups <SG_GROUP>  --instance-type t2.large
  4. SSH to the instance and check for failures in /var/log/cloud-init-output.log. The log content is posted below.
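
For anyone who wants to regenerate the multipart user-data from step 2 rather than edit it by hand, here is a minimal sketch using only the Python standard library. The shortened boothook body and the helper name `build_user_data` are illustrative assumptions, not part of the original reproducer.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Shortened stand-in for the boothook script in the report.
BOOTHOOK = """#cloud-boothook
#!/bin/bash
echo "boothook ran"
"""

# Same include target as in the report.
INCLUDE = "file:///etc/secret-userdata.txt\n"


def build_user_data(boothook: str, include_urls: str) -> MIMEMultipart:
    """Assemble a two-part multipart/mixed user-data message."""
    message = MIMEMultipart("mixed")
    # The MIME subtype of each part is what selects the handler for it.
    message.attach(MIMEText(boothook, "cloud-boothook"))
    message.attach(MIMEText(include_urls, "x-include-url"))
    return message


user_data = build_user_data(BOOTHOOK, INCLUDE)
part_types = [part.get_content_type() for part in user_data.get_payload()]
print(part_types)
```

Writing `user_data.as_string()` to user-data.txt should give a file with the same two-part structure as the sample above, with a generated boundary string.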

Environment details

  • Cloud-init version: ubuntu/23.3.1-0ubuntu1_20.04.1
  • Operating System Distribution: Ubuntu-20.04
  • Cloud provider, platform or installer type: AWS . Base AMI: ami-04bad3c587fe60d89 (name: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230112)

cloud-init logs

[2023-10-25 13:51:53] 2023-10-25 13:51:53,258 - util.py[WARNING]: failed stage init
[2023-10-25 13:51:53] failed run of stage init
[2023-10-25 13:51:53] ------------------------------------------------------------
[2023-10-25 13:51:53] Traceback (most recent call last):
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 78, in read_file_or_url
[2023-10-25 13:51:53]     with open(file_path, "rb") as fp:
[2023-10-25 13:51:53] FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-25 13:51:53]
[2023-10-25 13:51:53] The above exception was the direct cause of the following exception:
[2023-10-25 13:51:53]
[2023-10-25 13:51:53] Traceback (most recent call last):
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 238, in _do_include
[2023-10-25 13:51:53]     resp = read_file_or_url(
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 84, in read_file_or_url
[2023-10-25 13:51:53]     raise UrlError(cause=e, code=code, headers=None, url=url) from e
[2023-10-25 13:51:53] cloudinit.url_helper.UrlError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'
[2023-10-25 13:51:53]
[2023-10-25 13:51:53] The above exception was the direct cause of the following exception:
[2023-10-25 13:51:53]
[2023-10-25 13:51:53] Traceback (most recent call last):
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 766, in status_wrapper
[2023-10-25 13:51:53]     ret = functor(name, args)
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 453, in main_init
[2023-10-25 13:51:53]     init.update()
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 484, in update
[2023-10-25 13:51:53]     self._store_processeddata(self.datasource.get_userdata(), "userdata")
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 599, in get_userdata
[2023-10-25 13:51:53]     self.userdata = self.ud_proc.process(self.get_userdata_raw())
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 88, in process
[2023-10-25 13:51:53]     self._process_msg(convert_string(blob), accumulating_msg)
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 159, in _process_msg
[2023-10-25 13:51:53]     self._do_include(payload, append_msg)
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 264, in _do_include
[2023-10-25 13:51:53]     _handle_error(message, urle)
[2023-10-25 13:51:53]   File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 72, in _handle_error
[2023-10-25 13:51:53]     raise RuntimeError(error_message) from source_exception
[2023-10-25 13:51:53] RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
[2023-10-25 13:51:53] ------------------------------------------------------------
@TheRealFalcon
Member

@supershal , thanks for the bug report. I'm having a hard time understanding exactly what the root issue is, and given your reproducer, there's no difference in behavior between 23.3 and any earlier versions. The reproducer fails because you're referencing a file that does not exist yet, but this should be expected. content-type: text/x-include-url will happen before a boothook runs. In the absence of the include, the boothook still runs.
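
The ordering described here can be illustrated with a toy model. This is not cloud-init's code: `read_include` and `process` are hypothetical names, and the only behavior taken from the thread is that includes are resolved before any boothook runs and that the missing file surfaces as a chained RuntimeError.

```python
from urllib.parse import urlparse


def read_include(url: str) -> bytes:
    """Read a file:// include, chaining the error as the traceback does."""
    path = urlparse(url).path
    try:
        with open(path, "rb") as fp:
            return fp.read()
    except FileNotFoundError as err:
        raise RuntimeError(f"{err} for url: {url}") from err


def process(parts, ran_hooks):
    """Phase 1 resolves every include; phase 2 runs the boothooks."""
    for kind, body in parts:
        if kind == "text/x-include-url":
            read_include(body)
    for kind, body in parts:
        if kind == "text/cloud-boothook":
            ran_hooks.append(body)


# The path is deliberately one that does not exist on the test machine.
parts = [
    ("text/cloud-boothook", "create /nonexistent-dir-for-sketch/userdata.txt"),
    ("text/x-include-url", "file:///nonexistent-dir-for-sketch/userdata.txt"),
]
ran_hooks = []
try:
    process(parts, ran_hooks)
    error = None
except RuntimeError as err:
    error = err
print(error, ran_hooks)
```

In this model the boothook that would have created the file never gets a chance to run, which mirrors the failure in the posted log.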

It would be helpful to get either the exact userdata that was used or the tarball resulting from cloud-init collect-logs --include-userdata (with sensitive information redacted from both).

@TheRealFalcon added the incomplete (Action required by submitter) label and removed the new (An issue that still needs triage) label on Nov 14, 2023
@supershal
Author

@TheRealFalcon We found the root cause. We are running cloud-init ubuntu/23.3.1-0ubuntu1_20.04.1 and overriding the feature flag by creating cloudinit/feature_overrides.py with ERROR_ON_USER_DATA_FAILURE=false. Apparently this mechanism for overriding feature flags was removed recently (#4228).

We were able to run cloud-init successfully by setting ERROR_ON_USER_DATA_FAILURE=false directly in the cloudinit/features.py file.

Can you please tell us whether there are any instructions for setting feature overrides when creating an instance?

@TheRealFalcon
Member

Can you please tell us whether there are any instructions for setting feature overrides when creating an instance?

There is no supported way. The documentation could be clearer, but these flags are not meant to be user-modifiable. They exist exclusively for distro packagers (i.e., people who support cloud-init on their OS distribution) to ease patching the code. By modifying the features, you are essentially modifying the source of cloud-init and are outside the supported path.
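
Reading a flag is harmless even though changing it is unsupported. A small sketch for checking which value an image actually carries, assuming it is run with the same Python interpreter that cloud-init uses; the helper name `read_feature_flag` is made up for this example.

```python
import importlib


def read_feature_flag(name: str):
    """Return the flag value from cloudinit.features, or None if unavailable."""
    try:
        features = importlib.import_module("cloudinit.features")
    except ImportError:
        # cloud-init is not importable from this interpreter.
        return None
    return getattr(features, name, None)


flag = read_feature_flag("ERROR_ON_USER_DATA_FAILURE")
print("ERROR_ON_USER_DATA_FAILURE =", flag)
```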

We found the root cause

I think the failure you highlighted is actually highlighting a root cause further up the chain. Cloud-init will raise the error that you're seeing if it cannot unzip some gzipped userdata, or if it cannot download/process an #include line in the user data of multipart mime type. Are either of those things true in your case?
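
Both conditions can be checked before booting an instance. The following is a rough pre-flight sketch, not anything cloud-init ships: `preflight` is a hypothetical helper, it only understands MIME parts of type text/x-include-url with file:// targets, and it checks paths on the machine where it runs.

```python
import gzip
import os
import zlib
from email import message_from_bytes
from urllib.parse import urlparse

GZIP_MAGIC = b"\x1f\x8b"


def preflight(blob: bytes) -> list:
    """Return a list of problems that would make user-data processing fail."""
    problems = []
    if blob.startswith(GZIP_MAGIC):
        try:
            blob = gzip.decompress(blob)
        except (OSError, EOFError, zlib.error) as err:
            return [f"cannot decompress gzipped user data: {err}"]
    message = message_from_bytes(blob)
    for part in message.walk():
        if part.get_content_type() != "text/x-include-url":
            continue
        for line in part.get_payload().splitlines():
            url = line.strip()
            if url.startswith("file://") and not os.path.exists(urlparse(url).path):
                problems.append(f"include target missing: {url}")
    return problems


# The include path is deliberately one that does not exist.
SAMPLE = b"""MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/x-include-url

file:///nonexistent-dir-for-sketch/secret-userdata.txt

--BOUNDARY--
"""
problems = preflight(SAMPLE)
print(problems)
```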

We were able to run cloud-init successfully by setting ERROR_ON_USER_DATA_FAILURE=false directly in the cloudinit/features.py file.

When you do this, do you find WARNINGs in /var/log/cloud-init.log complaining that it could not download or process user data? If so, is there a reason you cannot fix that issue?
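
One way to answer that question quickly is to filter the log by level. A minimal sketch, assuming the line format seen in the log posted above (timestamp, module, level in brackets, message); `find_warnings` is a hypothetical helper and the sample lines are taken from this report.

```python
import re

# Matches lines such as:
# 2023-10-25 13:51:53,258 - util.py[WARNING]: failed stage init
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<module>[\w.]+)\[(?P<level>[A-Z]+)\]: (?P<message>.*)$"
)


def find_warnings(log_text: str) -> list:
    """Return (module, message) pairs for WARNING and more severe lines."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match["level"] in ("WARNING", "ERROR", "CRITICAL"):
            hits.append((match["module"], match["message"]))
    return hits


SAMPLE_LOG = """\
2023-10-25 13:51:53,258 - util.py[WARNING]: failed stage init
failed run of stage init
Traceback (most recent call last):
"""
warnings = find_warnings(SAMPLE_LOG)
print(warnings)
```

On an instance, the same function could be fed the contents of /var/log/cloud-init.log.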

@supershal
Author

I think the failure you highlighted is actually highlighting a root cause further up the chain. Cloud-init will raise the error that you're seeing if it cannot unzip some gzipped userdata, or if it cannot download/process an #include line in the user data of multipart mime type. Are either of those things true in your case?

The current implementation relies on the #include failing on the first cloud-init run and on that failure being ignored. The #include succeeds on the second cloud-init run.
The sequence is as follows:

  1. Create a custom AMI with ERROR_ON_USER_DATA_FAILURE=false.

  2. With ERROR_ON_USER_DATA_FAILURE=false set in the AMI, cloud-init ignores the #include failure and continues on to execute the boothook.

  3. The boothook downloads the user data from the AWS SSM store to a file (/etc/secret-userdata.txt) and restarts cloud-init.

  4. The second cloud-init run does not fail on the #include because it finds the user-data at /etc/secret-userdata.txt.

This mechanism was working before feature flag override was removed.
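
The four steps can be simulated to show why the flow depends on the flag. This is a toy model under stated assumptions: `cloud_init_run` and `boothook` are hypothetical stand-ins, the `strict` argument plays the role of ERROR_ON_USER_DATA_FAILURE, and calling the function twice stands in for the restart triggered by the boothook.

```python
import logging
import os
import tempfile

log = logging.getLogger("two_pass_sketch")


def cloud_init_run(include_path, boothook, strict):
    """One simulated run: resolve the include first, then run the boothook."""
    included = None
    try:
        with open(include_path) as fp:
            included = fp.read()
    except FileNotFoundError as err:
        if strict:
            raise RuntimeError(f"{err} for include {include_path}") from err
        log.warning("ignoring failed include: %s", err)
    boothook(include_path)
    return included


def boothook(path):
    """Stand-in for the SSM download: write the real user-data to disk."""
    if not os.path.exists(path):
        with open(path, "w") as fp:
            fp.write("#!/bin/bash\necho 'boothook created file'\n")


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "secret-userdata.txt")
    # Non-strict: the first run tolerates the missing file, the second finds it.
    first = cloud_init_run(target, boothook, strict=False)
    second = cloud_init_run(target, boothook, strict=False)

    # Strict: the very first run aborts before the boothook can write the file.
    os.remove(target)
    try:
        cloud_init_run(target, boothook, strict=True)
        strict_failed = False
    except RuntimeError:
        strict_failed = True

print(first, second, strict_failed)
```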

When you do this, do you find WARNINGs in /var/log/cloud-init.log complaining that it could not download or process user data? If so, is there a reason you cannot fix that issue?

Yes, we do get a warning in the log about /etc/secret-userdata.txt not existing. It looks like the only short-term fix is to modify cloudinit/features.py directly and set ERROR_ON_USER_DATA_FAILURE=false. We still need to find a long-term fix, though.

@holmanb
Member

holmanb commented Jan 21, 2024

and restart cloud-init.

By "restart cloud-init", are you referring to this line?

The boothook will download userdata from AWS SSM store to the file (/etc/secret-userdata.txt)

This basically hacks a whole custom datasource definition into a boothook. That is unexpected, and I'm honestly surprised that this ever worked.

@holmanb
Member

holmanb commented Jan 21, 2024

I think that what you really want is a custom datasource definition. Cloud-init's datasources are responsible for getting the user-data from the platform that cloud-init is running on. Doing this in the expected way (defining a datasource) means that in the long term you will experience less broken behavior, since what you are doing in that boothook is really what a datasource definition is for. This will prevent you from having to hack cloud-init to override userdata or restart cloud-init or anything like that. With a custom datasource it will "just work" during boot on the first try. To do this you will need to:

  1. Define a datasource. This will consist of a Python module DataSource<yourname>.py.
  2. Put the datasource into the sources directory (on Ubuntu that is /usr/lib/python3/dist-packages/cloudinit/sources/)[1]
  3. Set datasource_list in cloud.cfg to contain just your datasource[1]

Once you do this, at runtime cloud-init will discover the datasource and then use it to get the user data.
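
For step 3, the entry in datasource_list is the class name without its DataSource prefix, as with the built-in Ec2 and GCE entries. A small sketch that renders such a setting; the helper name `datasource_dropin` is an assumption, and placing the output in a drop-in file under /etc/cloud/cloud.cfg.d/ instead of editing cloud.cfg itself is one option among several.

```python
def datasource_dropin(class_name: str) -> str:
    """Render a config snippet that restricts cloud-init to one datasource."""
    prefix = "DataSource"
    if not class_name.startswith(prefix) or class_name == prefix:
        raise ValueError(f"expected a DataSource<name> class, got {class_name!r}")
    # datasource_list uses the short name, i.e. the part after "DataSource".
    short_name = class_name[len(prefix):]
    return f"datasource_list: [ {short_name} ]\n"


snippet = datasource_dropin("DataSourceSSM")
print(snippet, end="")
```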

How to define a datasource python module:

Unfortunately, this isn't well documented. We have a todo item for that, but these are the core requirements for a datasource, and some pointers on how to get started:

  1. The Python module will have a class DataSource<yourname>. The class will inherit from cloudinit.sources.DataSource. This class must implement _get_data(). See the upstream datasources here for examples (I'd suggest looking at a simpler one to start out, like GCE).
  2. The Python module must have a function named get_datasource_list(). In your case, since network is required, I would expect the datasources list that it uses to contain a single entry: [(DataSource<yourname>, (cloudinit.sources.DEP_FILESYSTEM, cloudinit.sources.DEP_NETWORK))]

Example DataSourceSSM.py

from cloudinit.sources import (
    DEP_FILESYSTEM, DEP_NETWORK, DataSource, list_from_depends
)


class DataSourceSSM(DataSource):
    def _get_data(self):
        """get configuration data from platform

        Puts configuration data in the following properties:

            self.metadata
            self.userdata_raw

        Returns True on success, False on failure
        """
        pass


# Return a list of data sources that match this set of dependencies.
def get_datasource_list(depends):
    return list_from_depends(
        depends,
        [(DataSourceSSM, (DEP_FILESYSTEM, DEP_NETWORK))],
    )

[1] I assume your ansible role would do this.
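
To give a feel for what the body of _get_data() might look like, here is a sketch that runs without cloud-init installed. Everything in it is an assumption for illustration: the `DataSource` class is a stand-in for cloudinit.sources.DataSource, `_fetch_parameter`, `parameter_name` and the instance-id value are hypothetical, and the actual SSM call is left out on purpose.

```python
class DataSource:
    """Stand-in for cloudinit.sources.DataSource so the sketch runs anywhere."""

    def __init__(self):
        self.metadata = None
        self.userdata_raw = None


class DataSourceSSM(DataSource):
    # Hypothetical parameter name; the thread does not say where the data lives.
    parameter_name = "/example/userdata"

    def _fetch_parameter(self, name):
        """Placeholder for the real SSM lookup, which this sketch leaves out."""
        raise NotImplementedError("wire this to your SSM client")

    def _get_data(self):
        try:
            userdata = self._fetch_parameter(self.parameter_name)
        except Exception as err:
            print(f"SSM lookup failed: {err}")
            return False
        if not userdata:
            return False
        # Placeholder metadata; a real datasource would report the actual instance id.
        self.metadata = {"instance-id": "i-sketch"}
        self.userdata_raw = userdata
        return True


class FakeSSM(DataSourceSSM):
    """Test double that returns canned user-data instead of calling AWS."""

    def _fetch_parameter(self, name):
        return "#cloud-config\nruncmd:\n  - echo hello\n"


source = FakeSSM()
ok = source._get_data()
unwired = DataSourceSSM()._get_data()
print(ok, unwired)
```

Keeping the lookup in its own method makes it possible to exercise the success and failure paths of _get_data() without network access.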

@supershal
Author

Thank you @holmanb for providing the instructions for custom python module. We will look into it for long term.

For now we have decided to keep the boothook, but have it place the secret-userdata.txt file at /etc/cloud/cloud.cfg.d/secret-userdata.txt instead of /etc/secret-userdata.txt, and to remove file:///etc/secret-userdata.txt from the user-data so that the cloud-init run won't fail. The boothook then restarts cloud-init.
When cloud-init runs, it merges the user-data from /etc/cloud/cloud.cfg.d/secret-userdata.txt and runs as expected.

Do you see any issue with this approach?

@TheRealFalcon
Member

Do you see any issue with this approach?

Cloud-init has several boot stages, and by restarting only cloud-init.service, most of the modules that act on user data will not be run. In general, restarting any of cloud-init's services manually shouldn't be considered a production-supported operation. This only "worked" for you previously because you were relying on incidental behavior of cloud-init failure. Additionally, since many modules only run once per instance, restarting the services won't cause module execution to re-run.

See the documentation on how to re-run cloud-init.
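
The once-per-instance point can be pictured with a marker-file model. This is a simplification and not cloud-init's implementation: `run_once_per_instance`, the directory layout and the marker file name are all assumptions chosen for the example.

```python
import os
import tempfile


def run_once_per_instance(sem_dir, module, action):
    """Run action unless a marker for this module already exists."""
    marker = os.path.join(sem_dir, f"config_{module}")
    if os.path.exists(marker):
        return False
    action()
    os.makedirs(sem_dir, exist_ok=True)
    with open(marker, "w") as fp:
        fp.write("done\n")
    return True


with tempfile.TemporaryDirectory() as state:
    sem_dir = os.path.join(state, "instances", "i-sketch", "sem")
    calls = []
    first = run_once_per_instance(sem_dir, "runcmd", lambda: calls.append("runcmd"))
    # Simulates a service restart on the same instance: the marker is still there.
    second = run_once_per_instance(sem_dir, "runcmd", lambda: calls.append("runcmd"))

print(first, second, calls)
```

In this model a restart changes nothing, because the marker from the first run is still present for the same instance.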

@holmanb
Member

holmanb commented Jan 23, 2024

Cloud-init has several boot stages, and by restarting only cloud-init.service, most of the modules that act on user data will not be run. In general, restarting any of cloud-init's services manually shouldn't be considered a production-supported operation. This only "worked" for you previously because you were relying on incidental behavior of cloud-init failure. Additionally, since many modules only run once per instance, restarting the services won't cause module execution to re-run.

@supershal I agree with @TheRealFalcon's statements, he explains well the broken behavior that I was alluding to. This is why I asked about the line where you restart the cloud-init.service.

@supershal
Author

Thank you @TheRealFalcon and @holmanb for your input and the links to the docs. We will follow the guide to come up with a better solution, as you suggested. We can close this issue now.

@TheRealFalcon closed this as not planned on Jan 25, 2024