Skip to content
This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

Implement testing in docker containers #1289

Merged
merged 59 commits into from
Apr 19, 2016
Merged

Implement testing in docker containers #1289

merged 59 commits into from
Apr 19, 2016

Conversation

sehqlr
Copy link
Contributor

@sehqlr sehqlr commented Mar 22, 2016

This branch is the rebased version of #1261

  • Installs cleanly on a fresh build of most recent master branch
  • Upgrades cleanly from the most recent release
  • Updates documentation relevant to the changes

sehqlr and others added 18 commits March 21, 2016 17:01
Right now, users have to install terraform and ansible, and their
dependencies, in order to user Mantl. Vagrant is a demo/test-drive
option, but not for other use cases. After the installation(s), then
they have to run multiple commands in order to get a cluster up.

What I'm working towards is a way for users to be able to get started
with Mantl in one or two steps, that removes the need to install
dependenies. This commit reworks Dockerfile and docker_launch.sh, and in
the tests that I have done, it can accomplish this.

Using an alpine linux base for this image reduces the size of the image,
although we do have to add build utils back in for the python
dependencies. As we reduce those dependencies, this image will get
smaller.

I added a MANTL_CONFIG_DIR env var to the dockerfile, and set it to
/local. This directory is there to hold custom configuration for users.
It has to be different than /mantl (where the source code goes) because
we don't want users to have to clone the repository in order to add in
their custom files. We add the `/bin/sh -c` entrypoint to make this
image closer to an executable.

Removing the ONBUILD statements from the Dockerfile simplifies using
this image as a base to build other images from. They are replaced with
logic in docker_launch.sh to check for user-provided files in a
particular place.  If these files don't exist, then we copy/generate the
files. When users mount a host volume in the MANTL_CONFIG_DIR, they will
get copies of all of the generated files in that volume, which they are
free to customize further. Documentation will need to be added for what
needs to be mounted where.

TERRAFORM_STATE_ROOT is changed to TERRAFORM_STATE. Because the
expectation is that users will mount the config dir onto their local
machines, they will want the tfstate file persist between docker runs.
I'm changing the semantics of the SSH_KEY env. Since all the SSH tools
assume that the key is at ~/.ssh/id_rsa, I decided that it would be
simpler to symlink any user provided keys to that location. Therefore,
the SSH_KEY env var will refer to the location of the user-provided key
1. the Travis file has lost many of its directives in favor with moving
them over to a new home in a setup script
2. The docker_launch script has also lost its setup logic, and now
contains only the steps necessary to get a cluster up and running
3. the docker_setup python script takes care of getting the container
into proper state. As long as you mount a volume onto /local, your
config files will be symlinked to their proper locations. Otherwise,
they are generated for you, and saved to the mounted volume.
4. I'm changing the semantics of the SSH_KEY env. Since all the SSH
tools assume that the key is at ~/.ssh/id_rsa, I decided that it would
be simpler to symlink any user provided keys to that location.
Therefore, the SSH_KEY env var will refer to the location of the
user-provided key. We're going to have to decide if that is really
necessary
When I ported the setup logic to python, I did not adding any logging
output. I have fixed that, by including helpful messages to users on
what the setup is currently doing. I also have the setup script set the
SSH_KEY env if none is set, instead of aborting.

I was forgetting to add settings to multiple places, so I figured that
it was time for some DRY refactoring. This adds global envs for the
docker run lines, to make it easier to keep things consistent.

For example, I changed docker run to attach stdout & stderr instead of
allocating a pseudo-tty. This change allows the security setup script to
generate credentials randomly, as they did before.

The volume parts of the docker run call need to be placed on the same
lines, because that's the most direct way to ensure that they are
populated to the correct values. If there is a DRY way of doing this,
that would be better, but I can't think of anything right now.

During my heavy edits of the travis file, I found that the indentation
was inconsistent. I ensured that my editor had the correct tab/space
settings, and adjusted the indentation.
I was on the wrong path when I was working on this last
night. I found a better way of ensuring that the symlink target is
correctly assigned: using the os.path.join function instead of the
abspath/relpath functions. This fixes the symlink logic for terraform_files.

After that, I was researching the docker run `--env` flag, I discovered this
line in the official docker docs:

> [I]f no `=` is provided, then that variable's current value is passed
> through

I just have to pass in the names of the secret vars, and
docker should pull them in. Until I found this particular solution, I
spent many commits trying to debug this problem.

Because ssh-agent and other ssh-related code has to be run often in the
docker containers, I moved the needed invocations to a separate script,
docker_ssh.sh. This will make docker run invocations shorter.

Because I prepended this to the build-cluster command, I removed
ssh keygen logic from build-cluster.
We need ssh keys to be symlinked for launch and testing, but we don't
want to run the whole setup script each time.
Assuming that the keys have already been generated from a run of
docker_setup.py, all that function does is symlink them.

The destroy steps need to be executed in the docker container as well,
even though I have not been able to get to this point in my testing.
ssh-agent has a 'lifetime option' where you can pass in a command and
args. The agent then only runs for a specified time, and the command is
run in a subshell. With the /bin/sh -c invocation at the start of the
ssh-agent's subshell, the image remains flexible.

The other change in here is that I fixed the relpath for the
docker_setup script in destroy_build.sh. With this change, the builds
will be destroyed properly once again.
1. I added a couple of new scripts to the root directory, and the ssh
one in particular is not useful. We want to wrap all of that
functionality into `docker.py`, which will have these scripts plus other
functions possibly.
2. I removed the pip self-upgrade
3. I add git to the list of packages to delete from apk
We are rolling setup and deploy commands into one python script,
docker.py. It doesn't have much functionality, but this PR is getting
out of scope quickly, and it's needed for CI.

The destroy_build and start_build scripts were just wrappers around a
couple of shell commands, which can be wrapped by docker.py instead. I
also had logic in the security_setup wrapper function to move the
generated security file into MANTL_CONFIG_DIR, then symlink it. However,
Travis threw an error. Since that functionality is for users, and not
CI, I elected to remove that functionality, and add it in later if
needed.

To optimize for the size of the docker image, I have apk delete the
package. curl is a dependency of git, so when I removed that package, I
also removed curl. Therefore, we install it separately.

I checked out run-tests.sh from testing/switchboard, and ported the
documentation skipping, and skipping on master pushes.
I added the shlex split function to the script, so the git command can
get interpreted more accurately. I had to add the git package back into
the image (I orginally removed it to save on space).

The commit_range_cmd required a subshell, and using a subshell for a
subprocess is risky. I ported the grep invocation to python if/elif/else
block, with logging.info for files detected. The log lines are probably
too verbose.
I need to pass in non-secret travis env vars that I wish to use in the
ci commands for docker.py.
I'm quickly iterating through these changes for travis, and I'm finding
that some of the assumptions that I made for a user are not working out
well for travis.

Because we are focusing on getting CI working before we get this working
for end users, I decided that we are going to have commands specifically
for CI. Those commands are prefixed with "ci-". I want to explore the
option of detecting a "CI" env var, but I felt that this was a
straighter path to our goal.

The TF files need to be symlinked during the build and destroy steps,
not just during setup.  @siddharthist made a good point about the
terraform files without their extension. The "clever" way that I
circumvented terraform grabbing multiple terraform files was clumsy, so
I replaced it with a better approach. I added logic to the setup
function in docker.py to symlink the file defined in TERRAFORM_FILE if
the env exists.

Additionally, I added OS_PRT to the secret envs. This represents
CiscoCloud's secret ssh port number.
I replaced return statements with calls to sys.exit with command return
codes. Also, I added in the ssh command for accessing OS
The travis user was created for the shipped cluster, so now we can use
that username and a new key. Now we can run actual testing code on OS.

With the key I was using previously, I was running into a problem where
ssh asked for a passphrase, even though I created the key without one.
It seems the problem was from when the file was converted to a env
string then back to a file. I'm using travis's file encryption feature
to add the ssh key directly. Additionally, I disabled strict host key
checking, added the remote ssh command logic to ci_destroy.
For testing against OpenStack, we should have separate clones of mantl
for each test run, to avoid collisions during concurrent runs. The
remote command for this now clones the repository into a new directory
that's the same name as the commit, checks out that commit, runs tests,
then removes the directory after tests are completed.
The strings that represent the OpenStack ssh commands in ci_build and
ci_destroy were getting very long and hard to edit, and the format call
was also getting unruly. Therefore, I turned the strings into multiline
strings, and the positional arguments to the format method call are now
kwargs. The multiline string gets turned back into a single line before
being passed into the call(split()) call.
For a while now, when GCE is tested in travis, the remote-exec step
would fail. After much tinkering, I found that all I needed to do was to
edit the testing/gce.tf file to point to the pub ssh key generated
during CI runs, and it worked on my local machine.

There were a few edits here that seems to have been lost during the
rebase. A formatting change and a line edit isn't bad to loose in
squashing about 200 commits!
I was certain during my testing that the public key was the thing that
was supposed to be used, but Travis was still having errors
sehqlr added 7 commits March 23, 2016 10:45
The final ; char in the shell script to remove the clone from OpenStack
was throwing an error, and is also redundant, since no other commands
follow it.
Looking throught the travis logs of other branches, the ssh key in
question is supposed to be a public key. So, I am changing this back to
a public key, since it wasn't related to my problem.
I was comparing the gce terraform file that I made for testing to the
most recent gce.sample.tf file. It turns out that I never updated the
file to use the modules! I need to work on getting the testing system to
test against the sample files instead of custom ones.
The short/long-name vars in my testing/gce.tf file had string
interpolations in them, and that is not allowed in terraform. Therefore,
I have to move them to the module definitions themeselves. The default
values now assume that the build number is 0-0.
When I was refactoring code, it seems that I did not include ssh-add
everywhere, as I believed. I doubt that it was lost in the rebasing: I'm
sure I just forgot to put it in there.
This module is not currently needed, since we don't have any tests that
rely on a DNS registry. Our tests rely upon terraform state to keep
track of cluster nodes, so we don't need it here.
The ci-build logic for OS included a lot of setup. I took out the setup
logic and put it in ci-setup, and I added a call to ln to create the
symlink for the terraform file that I want to use for the test. (I
changed the matrix in .travis.yml to specify the actual terraform file
to use.) Also, I've added logging statements to track the final ssh
command sent to the jump box.

Also, I changed the format for logs. The time should come first
sehqlr added 20 commits April 1, 2016 11:21
Using the curly braces with the python string format method was causing
problems, and we only need the raw value for the PID, so I removed the
curly braces.
We need to exclude other branches from the filter, because if there are
more recent merges in the repository, then there will be issues with the
wrong branch being pulled up. This error came up the first time I tried
this, because a different PR was merged into master just minutes after I
submitted this.
I didn't quote the subshells, so travis paired the wrong parens
together. Quoting them should fix it
I can't nest subshells with $(), apparently. So, I'm going to use
backticks for the outer command, and $() for the inner. I'm going to
draft an issue with Travis for this.
I'm moving the complex subshell assignment logic to before_script, would
should be interpreted better than in the env section.
Adding the export to the var assignment in the before_script section
will hopefully fix this behavior in travis, where the var is not
properly global
The way that I was manipulating the strings for the destroy command was
inserting the entire template of the ssh command into itself again,
without interpolations. To fix this, I named a separate variable
The external network uuid variable was set to the default in the sample.
This replaces it with an actual value.
There was a merge conflict in testing/gce.tf file that occured when the
k8s dedicated workers feature was added to master. The master version of
this file used the old GCE module that was kept for compatibility. This
branch's version of that file uses the modular version. I'll need to
copy the modular version with the dedicated k8s feature, but I needed to
test changes to the OpenStack tf file, so that will come in a separate
commit.
I removed the floating ip modules for control and workers, but I didn't
remove the variable from the instance modules themselves. This has
corrected that.
The OS network name for CI was set to the default, "mantl". Naturally,
this is not specific enough for our purposes, so I added it explicitly
I was trying to kill the ssh agent process without specifying what
signal to send. I've chosen SIGTERM for gentle destruction

I reduced the nunber of nodes that are created for OS clusters by 4
nodes.

Also, I've reduced the number of subproc calls for the destruction phase
by repeating the terraform command in the same call.
The remote command to kill the ssh-agent process and destroy the cluster
was not wrapped in single quotes. This was causing problems similar to
what was happening in the build-ci command.
At this time, we are blocked on testing against OpenStack, but we want
to have our v1.1 release. So, we will add OpenStack support in v1.2,
once we work out the issues.
@langston-barrett langston-barrett mentioned this pull request Apr 18, 2016
2 tasks
sehqlr added 5 commits April 18, 2016 13:31
vagrant-config.yml is on the ignore list, so I needed to force add the
file.
When a node has a consul health check fail, we can't see the reason why
in the Travis build logs. This fixes that problem by printing the Output
field on health check failure for a node.
I've had two hyphens for the 'force' flag for terrafrom destroy for a
while, but for some reason, the last couple of builds have not destroyed
their resources, and print out error messages about it.
The terraform destroy command invocation currently has the form of `cmd
-flag; cmd -flag`. shlex parses the semicolon as part of the first flag,
which it is not.  I changed it so it has the form `cmd || cmd`. This
should fix the bug at hand. Additionally, the short curcuit feature of
|| in bash cuts down on redundant destroy invocations
The || was not being parsed correctly in the destroy command, so I
removed them. Originally, the multiple destroy command was for the AWS
terraform provider, but I don't even know if that bug still exists.
@sehqlr
Copy link
Contributor Author

sehqlr commented Apr 19, 2016

This is ready to be merged. A last minute inclusion is printing out the "Output" for consul health checks that fail, so that will help debug those checks in the future

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants