Implement testing in docker containers #1289

sehqlr · 2016-03-22T16:29:24Z

This branch is the rebased version of #1261

Installs cleanly on a fresh build of most recent master branch
Upgrades cleanly from the most recent release
Updates documentation relevant to the changes

Right now, users have to install terraform and ansible, and their dependencies, in order to user Mantl. Vagrant is a demo/test-drive option, but not for other use cases. After the installation(s), then they have to run multiple commands in order to get a cluster up. What I'm working towards is a way for users to be able to get started with Mantl in one or two steps, that removes the need to install dependenies. This commit reworks Dockerfile and docker_launch.sh, and in the tests that I have done, it can accomplish this. Using an alpine linux base for this image reduces the size of the image, although we do have to add build utils back in for the python dependencies. As we reduce those dependencies, this image will get smaller. I added a MANTL_CONFIG_DIR env var to the dockerfile, and set it to /local. This directory is there to hold custom configuration for users. It has to be different than /mantl (where the source code goes) because we don't want users to have to clone the repository in order to add in their custom files. We add the `/bin/sh -c` entrypoint to make this image closer to an executable. Removing the ONBUILD statements from the Dockerfile simplifies using this image as a base to build other images from. They are replaced with logic in docker_launch.sh to check for user-provided files in a particular place. If these files don't exist, then we copy/generate the files. When users mount a host volume in the MANTL_CONFIG_DIR, they will get copies of all of the generated files in that volume, which they are free to customize further. Documentation will need to be added for what needs to be mounted where. TERRAFORM_STATE_ROOT is changed to TERRAFORM_STATE. Because the expectation is that users will mount the config dir onto their local machines, they will want the tfstate file persist between docker runs.

I'm changing the semantics of the SSH_KEY env. Since all the SSH tools assume that the key is at ~/.ssh/id_rsa, I decided that it would be simpler to symlink any user provided keys to that location. Therefore, the SSH_KEY env var will refer to the location of the user-provided key

1. the Travis file has lost many of its directives in favor with moving them over to a new home in a setup script 2. The docker_launch script has also lost its setup logic, and now contains only the steps necessary to get a cluster up and running 3. the docker_setup python script takes care of getting the container into proper state. As long as you mount a volume onto /local, your config files will be symlinked to their proper locations. Otherwise, they are generated for you, and saved to the mounted volume. 4. I'm changing the semantics of the SSH_KEY env. Since all the SSH tools assume that the key is at ~/.ssh/id_rsa, I decided that it would be simpler to symlink any user provided keys to that location. Therefore, the SSH_KEY env var will refer to the location of the user-provided key. We're going to have to decide if that is really necessary

When I ported the setup logic to python, I did not adding any logging output. I have fixed that, by including helpful messages to users on what the setup is currently doing. I also have the setup script set the SSH_KEY env if none is set, instead of aborting. I was forgetting to add settings to multiple places, so I figured that it was time for some DRY refactoring. This adds global envs for the docker run lines, to make it easier to keep things consistent. For example, I changed docker run to attach stdout & stderr instead of allocating a pseudo-tty. This change allows the security setup script to generate credentials randomly, as they did before. The volume parts of the docker run call need to be placed on the same lines, because that's the most direct way to ensure that they are populated to the correct values. If there is a DRY way of doing this, that would be better, but I can't think of anything right now. During my heavy edits of the travis file, I found that the indentation was inconsistent. I ensured that my editor had the correct tab/space settings, and adjusted the indentation.

I was on the wrong path when I was working on this last night. I found a better way of ensuring that the symlink target is correctly assigned: using the os.path.join function instead of the abspath/relpath functions. This fixes the symlink logic for terraform_files. After that, I was researching the docker run `--env` flag, I discovered this line in the official docker docs: > [I]f no `=` is provided, then that variable's current value is passed > through I just have to pass in the names of the secret vars, and docker should pull them in. Until I found this particular solution, I spent many commits trying to debug this problem. Because ssh-agent and other ssh-related code has to be run often in the docker containers, I moved the needed invocations to a separate script, docker_ssh.sh. This will make docker run invocations shorter. Because I prepended this to the build-cluster command, I removed ssh keygen logic from build-cluster. We need ssh keys to be symlinked for launch and testing, but we don't want to run the whole setup script each time. Assuming that the keys have already been generated from a run of docker_setup.py, all that function does is symlink them. The destroy steps need to be executed in the docker container as well, even though I have not been able to get to this point in my testing.

ssh-agent has a 'lifetime option' where you can pass in a command and args. The agent then only runs for a specified time, and the command is run in a subshell. With the /bin/sh -c invocation at the start of the ssh-agent's subshell, the image remains flexible. The other change in here is that I fixed the relpath for the docker_setup script in destroy_build.sh. With this change, the builds will be destroyed properly once again.

1. I added a couple of new scripts to the root directory, and the ssh one in particular is not useful. We want to wrap all of that functionality into `docker.py`, which will have these scripts plus other functions possibly. 2. I removed the pip self-upgrade 3. I add git to the list of packages to delete from apk

We are rolling setup and deploy commands into one python script, docker.py. It doesn't have much functionality, but this PR is getting out of scope quickly, and it's needed for CI. The destroy_build and start_build scripts were just wrappers around a couple of shell commands, which can be wrapped by docker.py instead. I also had logic in the security_setup wrapper function to move the generated security file into MANTL_CONFIG_DIR, then symlink it. However, Travis threw an error. Since that functionality is for users, and not CI, I elected to remove that functionality, and add it in later if needed. To optimize for the size of the docker image, I have apk delete the package. curl is a dependency of git, so when I removed that package, I also removed curl. Therefore, we install it separately. I checked out run-tests.sh from testing/switchboard, and ported the documentation skipping, and skipping on master pushes.

I added the shlex split function to the script, so the git command can get interpreted more accurately. I had to add the git package back into the image (I orginally removed it to save on space). The commit_range_cmd required a subshell, and using a subshell for a subprocess is risky. I ported the grep invocation to python if/elif/else block, with logging.info for files detected. The log lines are probably too verbose.

I need to pass in non-secret travis env vars that I wish to use in the ci commands for docker.py.

I'm quickly iterating through these changes for travis, and I'm finding that some of the assumptions that I made for a user are not working out well for travis. Because we are focusing on getting CI working before we get this working for end users, I decided that we are going to have commands specifically for CI. Those commands are prefixed with "ci-". I want to explore the option of detecting a "CI" env var, but I felt that this was a straighter path to our goal. The TF files need to be symlinked during the build and destroy steps, not just during setup. @siddharthist made a good point about the terraform files without their extension. The "clever" way that I circumvented terraform grabbing multiple terraform files was clumsy, so I replaced it with a better approach. I added logic to the setup function in docker.py to symlink the file defined in TERRAFORM_FILE if the env exists. Additionally, I added OS_PRT to the secret envs. This represents CiscoCloud's secret ssh port number.

I replaced return statements with calls to sys.exit with command return codes. Also, I added in the ssh command for accessing OS

The travis user was created for the shipped cluster, so now we can use that username and a new key. Now we can run actual testing code on OS. With the key I was using previously, I was running into a problem where ssh asked for a passphrase, even though I created the key without one. It seems the problem was from when the file was converted to a env string then back to a file. I'm using travis's file encryption feature to add the ssh key directly. Additionally, I disabled strict host key checking, added the remote ssh command logic to ci_destroy.

For testing against OpenStack, we should have separate clones of mantl for each test run, to avoid collisions during concurrent runs. The remote command for this now clones the repository into a new directory that's the same name as the commit, checks out that commit, runs tests, then removes the directory after tests are completed.

The strings that represent the OpenStack ssh commands in ci_build and ci_destroy were getting very long and hard to edit, and the format call was also getting unruly. Therefore, I turned the strings into multiline strings, and the positional arguments to the format method call are now kwargs. The multiline string gets turned back into a single line before being passed into the call(split()) call.

For a while now, when GCE is tested in travis, the remote-exec step would fail. After much tinkering, I found that all I needed to do was to edit the testing/gce.tf file to point to the pub ssh key generated during CI runs, and it worked on my local machine. There were a few edits here that seems to have been lost during the rebase. A formatting change and a line edit isn't bad to loose in squashing about 200 commits!

I was certain during my testing that the public key was the thing that was supposed to be used, but Travis was still having errors

The final ; char in the shell script to remove the clone from OpenStack was throwing an error, and is also redundant, since no other commands follow it.

Looking throught the travis logs of other branches, the ssh key in question is supposed to be a public key. So, I am changing this back to a public key, since it wasn't related to my problem.

I was comparing the gce terraform file that I made for testing to the most recent gce.sample.tf file. It turns out that I never updated the file to use the modules! I need to work on getting the testing system to test against the sample files instead of custom ones.

The short/long-name vars in my testing/gce.tf file had string interpolations in them, and that is not allowed in terraform. Therefore, I have to move them to the module definitions themeselves. The default values now assume that the build number is 0-0.

When I was refactoring code, it seems that I did not include ssh-add everywhere, as I believed. I doubt that it was lost in the rebasing: I'm sure I just forgot to put it in there.

This module is not currently needed, since we don't have any tests that rely on a DNS registry. Our tests rely upon terraform state to keep track of cluster nodes, so we don't need it here.

The ci-build logic for OS included a lot of setup. I took out the setup logic and put it in ci-setup, and I added a call to ln to create the symlink for the terraform file that I want to use for the test. (I changed the matrix in .travis.yml to specify the actual terraform file to use.) Also, I've added logging statements to track the final ssh command sent to the jump box. Also, I changed the format for logs. The time should come first

Using the curly braces with the python string format method was causing problems, and we only need the raw value for the PID, so I removed the curly braces.

We need to exclude other branches from the filter, because if there are more recent merges in the repository, then there will be issues with the wrong branch being pulled up. This error came up the first time I tried this, because a different PR was merged into master just minutes after I submitted this.

I didn't quote the subshells, so travis paired the wrong parens together. Quoting them should fix it

I can't nest subshells with $(), apparently. So, I'm going to use backticks for the outer command, and $() for the inner. I'm going to draft an issue with Travis for this.

I'm moving the complex subshell assignment logic to before_script, would should be interpreted better than in the env section.

Adding the export to the var assignment in the before_script section will hopefully fix this behavior in travis, where the var is not properly global

The way that I was manipulating the strings for the destroy command was inserting the entire template of the ssh command into itself again, without interpolations. To fix this, I named a separate variable

The external network uuid variable was set to the default in the sample. This replaces it with an actual value.

There was a merge conflict in testing/gce.tf file that occured when the k8s dedicated workers feature was added to master. The master version of this file used the old GCE module that was kept for compatibility. This branch's version of that file uses the modular version. I'll need to copy the modular version with the dedicated k8s feature, but I needed to test changes to the OpenStack tf file, so that will come in a separate commit.

I removed the floating ip modules for control and workers, but I didn't remove the variable from the instance modules themselves. This has corrected that.

The OS network name for CI was set to the default, "mantl". Naturally, this is not specific enough for our purposes, so I added it explicitly

I was trying to kill the ssh agent process without specifying what signal to send. I've chosen SIGTERM for gentle destruction I reduced the nunber of nodes that are created for OS clusters by 4 nodes. Also, I've reduced the number of subproc calls for the destruction phase by repeating the terraform command in the same call.

The remote command to kill the ssh-agent process and destroy the cluster was not wrapped in single quotes. This was causing problems similar to what was happening in the build-ci command.

At this time, we are blocked on testing against OpenStack, but we want to have our v1.1 release. So, we will add OpenStack support in v1.2, once we work out the issues.

vagrant-config.yml is on the ignore list, so I needed to force add the file.

When a node has a consul health check fail, we can't see the reason why in the Travis build logs. This fixes that problem by printing the Output field on health check failure for a node.

I've had two hyphens for the 'force' flag for terrafrom destroy for a while, but for some reason, the last couple of builds have not destroyed their resources, and print out error messages about it.

The terraform destroy command invocation currently has the form of `cmd -flag; cmd -flag`. shlex parses the semicolon as part of the first flag, which it is not. I changed it so it has the form `cmd || cmd`. This should fix the bug at hand. Additionally, the short curcuit feature of || in bash cuts down on redundant destroy invocations

The || was not being parsed correctly in the destroy command, so I removed them. Originally, the multiple destroy command was for the AWS terraform provider, but I don't even know if that bug still exists.

sehqlr · 2016-04-19T16:51:42Z

This is ready to be merged. A last minute inclusion is printing out the "Output" for consul health checks that fail, so that will help debug those checks in the future

sehqlr and others added 18 commits March 21, 2016 17:01

Fix bad merge

cd7f33c

Pass in travis env vars explicitly

bae504c

I need to pass in non-secret travis env vars that I wish to use in the ci commands for docker.py.

Replace return with exit, add ssh call

1718388

I replaced return statements with calls to sys.exit with command return codes. Also, I added in the ssh command for accessing OS

Change ssh key from public to private

e8e8193

I was certain during my testing that the public key was the thing that was supposed to be used, but Travis was still having errors

BrianHicks mentioned this pull request Mar 23, 2016

Docker image-only deployments #1261

Closed

BrianHicks added enhancement ci/testing labels Mar 23, 2016

BrianHicks added this to the 1.1 milestone Mar 23, 2016

BrianHicks assigned sehqlr Mar 23, 2016

sehqlr added 7 commits March 23, 2016 10:45

Remove final ; from destroy command

851ed2b

The final ; char in the shell script to remove the clone from OpenStack was throwing an error, and is also redundant, since no other commands follow it.

Change private to public ssh key

0ad92b6

Looking throught the travis logs of other branches, the ssh key in question is supposed to be a public key. So, I am changing this back to a public key, since it wasn't related to my problem.

Call ssh-add during ci operations

13097c1

When I was refactoring code, it seems that I did not include ssh-add everywhere, as I believed. I doubt that it was lost in the rebasing: I'm sure I just forgot to put it in there.

Remove cloud-dns module for gce

cad48c5

This module is not currently needed, since we don't have any tests that rely on a DNS registry. Our tests rely upon terraform state to keep track of cluster nodes, so we don't need it here.

sehqlr added 20 commits April 1, 2016 11:21

Echo out CI_HEAD_COMMIT, checking for correctness

96b6a9d

Make the directory for the clone

95e8c54

Wrap ci-build cmd with ', fix indent in ci-destroy

0e5ceee

Remove curly braces for SSH_AGENT_PID env var

e3bf02f

Using the curly braces with the python string format method was causing problems, and we only need the raw value for the PID, so I removed the curly braces.

Wrap subshells with quotes

0724251

I didn't quote the subshells, so travis paired the wrong parens together. Quoting them should fix it

Remove outer double quotes

1f3df4e

Use backticks for CI_HEAD_COMMIT export

e704121

I can't nest subshells with $(), apparently. So, I'm going to use backticks for the outer command, and $() for the inner. I'm going to draft an issue with Travis for this.

Move CI_HEAD_COMMIT assignment to before_script

e9885b5

I'm moving the complex subshell assignment logic to before_script, would should be interpreted better than in the env section.

Export CI_HEAD_COMMIT during assignment

3f5e8c2

Adding the export to the var assignment in the before_script section will hopefully fix this behavior in travis, where the var is not properly global

Fix assignment bug in the ci_destroy command

a494de0

The way that I was manipulating the strings for the destroy command was inserting the entire template of the ssh command into itself again, without interpolations. To fix this, I named a separate variable

Add external network uuid to cisco-os.tf file

a02d0a4

The external network uuid variable was set to the default in the sample. This replaces it with an actual value.

Remove floating-ip allocations for non-edge nodes

e25bc6a

Rm flaoting-ip vars from control & worker modules

0612fd0

I removed the floating ip modules for control and workers, but I didn't remove the variable from the instance modules themselves. This has corrected that.

Downsize edge nodes, add security-setup to OS

2fdd60a

Add build-number specific name to OS network

d791e71

The OS network name for CI was set to the default, "mantl". Naturally, this is not specific enough for our purposes, so I added it explicitly

Wrap destroy-ci command in single quotes

c15a240

The remote command to kill the ssh-agent process and destroy the cluster was not wrapped in single quotes. This was causing problems similar to what was happening in the build-ci command.

Remove OS from testing matrix for 1.1 release

27fd7be

At this time, we are blocked on testing against OpenStack, but we want to have our v1.1 release. So, we will add OpenStack support in v1.2, once we work out the issues.

langston-barrett mentioned this pull request Apr 18, 2016

Run CI tests with vagrant #1359

Closed

2 tasks

sehqlr added 5 commits April 18, 2016 13:31

Force add of testing/vagrant-config.yml

3641356

vagrant-config.yml is on the ignore list, so I needed to force add the file.

Print Output to build log on healthcheck fail

01fce57

When a node has a consul health check fail, we can't see the reason why in the Travis build logs. This fixes that problem by printing the Output field on health check failure for a node.

Reduce num of hyphens for destroy flag

f3ddc8d

I've had two hyphens for the 'force' flag for terrafrom destroy for a while, but for some reason, the last couple of builds have not destroyed their resources, and print out error messages about it.

Remove second part of destroy command

eb17d1a

The || was not being parsed correctly in the destroy command, so I removed them. Originally, the multiple destroy command was for the AWS terraform provider, but I don't even know if that bug still exists.

stevendborrelli merged commit 776c983 into master Apr 19, 2016

stevendborrelli deleted the testing/docker branch April 19, 2016 16:53

sehqlr mentioned this pull request Apr 19, 2016

Move travis build steps into a "switchboard" run-tests script #1249

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement testing in docker containers #1289

Implement testing in docker containers #1289

sehqlr commented Mar 22, 2016

sehqlr commented Apr 19, 2016

Implement testing in docker containers #1289

Implement testing in docker containers #1289

Conversation

sehqlr commented Mar 22, 2016

sehqlr commented Apr 19, 2016