This repository has been archived by the owner on Jul 27, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 414
Implement testing in docker containers #1289
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Right now, users have to install terraform and ansible, and their dependencies, in order to user Mantl. Vagrant is a demo/test-drive option, but not for other use cases. After the installation(s), then they have to run multiple commands in order to get a cluster up. What I'm working towards is a way for users to be able to get started with Mantl in one or two steps, that removes the need to install dependenies. This commit reworks Dockerfile and docker_launch.sh, and in the tests that I have done, it can accomplish this. Using an alpine linux base for this image reduces the size of the image, although we do have to add build utils back in for the python dependencies. As we reduce those dependencies, this image will get smaller. I added a MANTL_CONFIG_DIR env var to the dockerfile, and set it to /local. This directory is there to hold custom configuration for users. It has to be different than /mantl (where the source code goes) because we don't want users to have to clone the repository in order to add in their custom files. We add the `/bin/sh -c` entrypoint to make this image closer to an executable. Removing the ONBUILD statements from the Dockerfile simplifies using this image as a base to build other images from. They are replaced with logic in docker_launch.sh to check for user-provided files in a particular place. If these files don't exist, then we copy/generate the files. When users mount a host volume in the MANTL_CONFIG_DIR, they will get copies of all of the generated files in that volume, which they are free to customize further. Documentation will need to be added for what needs to be mounted where. TERRAFORM_STATE_ROOT is changed to TERRAFORM_STATE. Because the expectation is that users will mount the config dir onto their local machines, they will want the tfstate file persist between docker runs.
I'm changing the semantics of the SSH_KEY env. Since all the SSH tools assume that the key is at ~/.ssh/id_rsa, I decided that it would be simpler to symlink any user provided keys to that location. Therefore, the SSH_KEY env var will refer to the location of the user-provided key
1. the Travis file has lost many of its directives in favor with moving them over to a new home in a setup script 2. The docker_launch script has also lost its setup logic, and now contains only the steps necessary to get a cluster up and running 3. the docker_setup python script takes care of getting the container into proper state. As long as you mount a volume onto /local, your config files will be symlinked to their proper locations. Otherwise, they are generated for you, and saved to the mounted volume. 4. I'm changing the semantics of the SSH_KEY env. Since all the SSH tools assume that the key is at ~/.ssh/id_rsa, I decided that it would be simpler to symlink any user provided keys to that location. Therefore, the SSH_KEY env var will refer to the location of the user-provided key. We're going to have to decide if that is really necessary
When I ported the setup logic to python, I did not adding any logging output. I have fixed that, by including helpful messages to users on what the setup is currently doing. I also have the setup script set the SSH_KEY env if none is set, instead of aborting. I was forgetting to add settings to multiple places, so I figured that it was time for some DRY refactoring. This adds global envs for the docker run lines, to make it easier to keep things consistent. For example, I changed docker run to attach stdout & stderr instead of allocating a pseudo-tty. This change allows the security setup script to generate credentials randomly, as they did before. The volume parts of the docker run call need to be placed on the same lines, because that's the most direct way to ensure that they are populated to the correct values. If there is a DRY way of doing this, that would be better, but I can't think of anything right now. During my heavy edits of the travis file, I found that the indentation was inconsistent. I ensured that my editor had the correct tab/space settings, and adjusted the indentation.
I was on the wrong path when I was working on this last night. I found a better way of ensuring that the symlink target is correctly assigned: using the os.path.join function instead of the abspath/relpath functions. This fixes the symlink logic for terraform_files. After that, I was researching the docker run `--env` flag, I discovered this line in the official docker docs: > [I]f no `=` is provided, then that variable's current value is passed > through I just have to pass in the names of the secret vars, and docker should pull them in. Until I found this particular solution, I spent many commits trying to debug this problem. Because ssh-agent and other ssh-related code has to be run often in the docker containers, I moved the needed invocations to a separate script, docker_ssh.sh. This will make docker run invocations shorter. Because I prepended this to the build-cluster command, I removed ssh keygen logic from build-cluster. We need ssh keys to be symlinked for launch and testing, but we don't want to run the whole setup script each time. Assuming that the keys have already been generated from a run of docker_setup.py, all that function does is symlink them. The destroy steps need to be executed in the docker container as well, even though I have not been able to get to this point in my testing.
ssh-agent has a 'lifetime option' where you can pass in a command and args. The agent then only runs for a specified time, and the command is run in a subshell. With the /bin/sh -c invocation at the start of the ssh-agent's subshell, the image remains flexible. The other change in here is that I fixed the relpath for the docker_setup script in destroy_build.sh. With this change, the builds will be destroyed properly once again.
1. I added a couple of new scripts to the root directory, and the ssh one in particular is not useful. We want to wrap all of that functionality into `docker.py`, which will have these scripts plus other functions possibly. 2. I removed the pip self-upgrade 3. I add git to the list of packages to delete from apk
We are rolling setup and deploy commands into one python script, docker.py. It doesn't have much functionality, but this PR is getting out of scope quickly, and it's needed for CI. The destroy_build and start_build scripts were just wrappers around a couple of shell commands, which can be wrapped by docker.py instead. I also had logic in the security_setup wrapper function to move the generated security file into MANTL_CONFIG_DIR, then symlink it. However, Travis threw an error. Since that functionality is for users, and not CI, I elected to remove that functionality, and add it in later if needed. To optimize for the size of the docker image, I have apk delete the package. curl is a dependency of git, so when I removed that package, I also removed curl. Therefore, we install it separately. I checked out run-tests.sh from testing/switchboard, and ported the documentation skipping, and skipping on master pushes.
I added the shlex split function to the script, so the git command can get interpreted more accurately. I had to add the git package back into the image (I orginally removed it to save on space). The commit_range_cmd required a subshell, and using a subshell for a subprocess is risky. I ported the grep invocation to python if/elif/else block, with logging.info for files detected. The log lines are probably too verbose.
I need to pass in non-secret travis env vars that I wish to use in the ci commands for docker.py.
I'm quickly iterating through these changes for travis, and I'm finding that some of the assumptions that I made for a user are not working out well for travis. Because we are focusing on getting CI working before we get this working for end users, I decided that we are going to have commands specifically for CI. Those commands are prefixed with "ci-". I want to explore the option of detecting a "CI" env var, but I felt that this was a straighter path to our goal. The TF files need to be symlinked during the build and destroy steps, not just during setup. @siddharthist made a good point about the terraform files without their extension. The "clever" way that I circumvented terraform grabbing multiple terraform files was clumsy, so I replaced it with a better approach. I added logic to the setup function in docker.py to symlink the file defined in TERRAFORM_FILE if the env exists. Additionally, I added OS_PRT to the secret envs. This represents CiscoCloud's secret ssh port number.
I replaced return statements with calls to sys.exit with command return codes. Also, I added in the ssh command for accessing OS
The travis user was created for the shipped cluster, so now we can use that username and a new key. Now we can run actual testing code on OS. With the key I was using previously, I was running into a problem where ssh asked for a passphrase, even though I created the key without one. It seems the problem was from when the file was converted to a env string then back to a file. I'm using travis's file encryption feature to add the ssh key directly. Additionally, I disabled strict host key checking, added the remote ssh command logic to ci_destroy.
For testing against OpenStack, we should have separate clones of mantl for each test run, to avoid collisions during concurrent runs. The remote command for this now clones the repository into a new directory that's the same name as the commit, checks out that commit, runs tests, then removes the directory after tests are completed.
The strings that represent the OpenStack ssh commands in ci_build and ci_destroy were getting very long and hard to edit, and the format call was also getting unruly. Therefore, I turned the strings into multiline strings, and the positional arguments to the format method call are now kwargs. The multiline string gets turned back into a single line before being passed into the call(split()) call.
For a while now, when GCE is tested in travis, the remote-exec step would fail. After much tinkering, I found that all I needed to do was to edit the testing/gce.tf file to point to the pub ssh key generated during CI runs, and it worked on my local machine. There were a few edits here that seems to have been lost during the rebase. A formatting change and a line edit isn't bad to loose in squashing about 200 commits!
I was certain during my testing that the public key was the thing that was supposed to be used, but Travis was still having errors
The final ; char in the shell script to remove the clone from OpenStack was throwing an error, and is also redundant, since no other commands follow it.
Looking throught the travis logs of other branches, the ssh key in question is supposed to be a public key. So, I am changing this back to a public key, since it wasn't related to my problem.
I was comparing the gce terraform file that I made for testing to the most recent gce.sample.tf file. It turns out that I never updated the file to use the modules! I need to work on getting the testing system to test against the sample files instead of custom ones.
The short/long-name vars in my testing/gce.tf file had string interpolations in them, and that is not allowed in terraform. Therefore, I have to move them to the module definitions themeselves. The default values now assume that the build number is 0-0.
When I was refactoring code, it seems that I did not include ssh-add everywhere, as I believed. I doubt that it was lost in the rebasing: I'm sure I just forgot to put it in there.
This module is not currently needed, since we don't have any tests that rely on a DNS registry. Our tests rely upon terraform state to keep track of cluster nodes, so we don't need it here.
The ci-build logic for OS included a lot of setup. I took out the setup logic and put it in ci-setup, and I added a call to ln to create the symlink for the terraform file that I want to use for the test. (I changed the matrix in .travis.yml to specify the actual terraform file to use.) Also, I've added logging statements to track the final ssh command sent to the jump box. Also, I changed the format for logs. The time should come first
Using the curly braces with the python string format method was causing problems, and we only need the raw value for the PID, so I removed the curly braces.
We need to exclude other branches from the filter, because if there are more recent merges in the repository, then there will be issues with the wrong branch being pulled up. This error came up the first time I tried this, because a different PR was merged into master just minutes after I submitted this.
I didn't quote the subshells, so travis paired the wrong parens together. Quoting them should fix it
I can't nest subshells with $(), apparently. So, I'm going to use backticks for the outer command, and $() for the inner. I'm going to draft an issue with Travis for this.
I'm moving the complex subshell assignment logic to before_script, would should be interpreted better than in the env section.
Adding the export to the var assignment in the before_script section will hopefully fix this behavior in travis, where the var is not properly global
The way that I was manipulating the strings for the destroy command was inserting the entire template of the ssh command into itself again, without interpolations. To fix this, I named a separate variable
The external network uuid variable was set to the default in the sample. This replaces it with an actual value.
There was a merge conflict in testing/gce.tf file that occured when the k8s dedicated workers feature was added to master. The master version of this file used the old GCE module that was kept for compatibility. This branch's version of that file uses the modular version. I'll need to copy the modular version with the dedicated k8s feature, but I needed to test changes to the OpenStack tf file, so that will come in a separate commit.
I removed the floating ip modules for control and workers, but I didn't remove the variable from the instance modules themselves. This has corrected that.
The OS network name for CI was set to the default, "mantl". Naturally, this is not specific enough for our purposes, so I added it explicitly
I was trying to kill the ssh agent process without specifying what signal to send. I've chosen SIGTERM for gentle destruction I reduced the nunber of nodes that are created for OS clusters by 4 nodes. Also, I've reduced the number of subproc calls for the destruction phase by repeating the terraform command in the same call.
The remote command to kill the ssh-agent process and destroy the cluster was not wrapped in single quotes. This was causing problems similar to what was happening in the build-ci command.
At this time, we are blocked on testing against OpenStack, but we want to have our v1.1 release. So, we will add OpenStack support in v1.2, once we work out the issues.
2 tasks
vagrant-config.yml is on the ignore list, so I needed to force add the file.
When a node has a consul health check fail, we can't see the reason why in the Travis build logs. This fixes that problem by printing the Output field on health check failure for a node.
I've had two hyphens for the 'force' flag for terrafrom destroy for a while, but for some reason, the last couple of builds have not destroyed their resources, and print out error messages about it.
The terraform destroy command invocation currently has the form of `cmd -flag; cmd -flag`. shlex parses the semicolon as part of the first flag, which it is not. I changed it so it has the form `cmd || cmd`. This should fix the bug at hand. Additionally, the short curcuit feature of || in bash cuts down on redundant destroy invocations
The || was not being parsed correctly in the destroy command, so I removed them. Originally, the multiple destroy command was for the AWS terraform provider, but I don't even know if that bug still exists.
This is ready to be merged. A last minute inclusion is printing out the "Output" for consul health checks that fail, so that will help debug those checks in the future |
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This branch is the rebased version of #1261