core

Deploy Nomad, Vault and Consul clusters using Terraform

This is based on the example by HashiCorp.

The Vault deployment is based on this example.

Basic Concepts

This Terraform module allows you to bootstrap an initial cluster of Consul servers, Vault servers, Nomad servers, and Nomad clients.

After the initial bootstrap, you will need to perform additional configuration for production hardening. In particular, you will have to initialise and configure Vault before you can use it for anything.

You can use the other companion Terraform modules in this repository to perform some of the configuration. They are documented briefly below, and in more detail in their own directories.

Requirements

You should consider using named profiles to store your AWS credentials.

Prerequisites

Terraform remote state

Terraform is configured to store its state in an S3 bucket. It will also use a DynamoDB table to perform locking.

You should set up an S3 bucket to store the state, and then create a DynamoDB table with LockID as its string primary key.

Then, configure them in a file such as backend-config.tfvars. See this page for more information.
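For example, a minimal backend-config.tfvars might look like the one below. The bucket name, table name and region are placeholders for your own values.

# backend-config.tfvars -- the values below are illustrative
bucket         = "my-terraform-state"     # S3 bucket holding the state
key            = "core/terraform.tfstate" # path of the state file within the bucket
region         = "ap-southeast-1"         # region of the bucket and DynamoDB table
dynamodb_table = "terraform-lock"         # DynamoDB table with LockID as its string primary key
encrypt        = true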

AWS Pre-requisites

  • Create a VPC with the necessary subnets (public, private, database etc.) to deploy this to. For example, you can try this module.
  • Have a domain registered with AWS Route 53 or another registrar.
  • Create an AWS hosted zone for the domain or subdomain. If the domain is registered with another registrar, its name servers must be set to AWS.
  • Use AWS Certificate Manager to request certificates for the domain and its wildcard subdomains. For example, you need to request a certificate that contains the names nomad.some.domain AND *.nomad.some.domain.
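As a sketch, a certificate covering both names from the example above can be requested with the AWS CLI (DNS validation is assumed here):

# Domain names below follow the example above
aws acm request-certificate \
    --domain-name nomad.some.domain \
    --subject-alternative-names "*.nomad.some.domain" \
    --validation-method DNS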

Certificates

You will need to generate the following certificates:

  • A Root CA
  • Vault Certificate

Refer to instructions here.

Preparing Secrets

Building AMIs

We first need to use packer to build several AMIs. You will also need to have Ansible 2.7 installed.

The list below will link to example packer scripts that we have provided. If you have additional requirements, you are encouraged to extend from these examples.

Read this on how to specify AWS credentials.

Refer to each of the directories for instructions.

Take note of the AMI IDs returned from the builds.
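For reference, each AMI is built by running Packer in the relevant directory; the template and variable file names below are illustrative, so use the ones documented in each directory:

# Run inside the directory of the AMI you are building (file names are illustrative)
packer build -var-file=vars.json packer.json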

Defining Variables

You should refer to variables.tf and then create your own variable file.

Most of the variables should be pretty straightforward and are documented inline with their descriptions. Some of the more complicated variables are described below.
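As a sketch, a variable file is just a set of key-value assignments like the ones below; the variable names and values here are purely illustrative, so use the actual names declared in variables.tf:

# vars.tfvars -- the names and values below are illustrative only
consul_ami_id = "ami-0123456789abcdef0"
nomad_ami_id  = "ami-0123456789abcdef1"
vault_ami_id  = "ami-0123456789abcdef2"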

vault_tls_key_policy_arn

The Vault Packer template and this module expect Vault to be deployed with a TLS certificate and key. The key is expected to be encrypted using a Key Management Service (KMS) Customer Managed Key (CMK).

In order for the Vault EC2 instances to be able to decrypt the keys on first run, the instances will need to be provided with the necessary IAM policy.

You will have to define the appropriate IAM policy, and then provide the ARN of the IAM policy using the vault_tls_key_policy_arn variable.

Before you can define an IAM policy, you have to define the appropriate key policy for your CMK so that access to the key can be managed via IAM policies. Refer to this document for more information.

After that is done, you can follow the example below to define the appropriate policy.

# Use this to retrieve the ARN of a KMS CMK with the alias `terraform`
data "aws_kms_alias" "terraform" {
    name = "alias/terraform"
}

# Define the policy using this data source. If you used the example `cli.json`, this should suffice
# See https://docs.aws.amazon.com/kms/latest/developerguide/iam-policies.html
data "aws_iam_policy_document" "vault_decrypt" {
    policy_id = "VaultTlsDecrypt"

    statement {
        effect = "Allow"
        actions = [
            "kms:Decrypt"
        ]

        resources = [
            "${data.aws_kms_alias.terraform.target_key_arn}"
        ]

        condition {
            test = "StringEquals"
            variable = "kms:EncryptionContext:type"
            values = ["key"]
        }

        condition {
            test = "StringEquals"
            variable = "kms:EncryptionContext:usage"
            values = ["encryption"]
        }
    }
}

resource "aws_iam_policy" "vault_decrypt" {
    name = "VaultTlsDecrypt"
    description = "Policy to allow Vault to use the KMS terraform key to decrypt key encrypting keys."
    policy = "${data.aws_iam_policy_document.vault_decrypt.json}"
}

module "core" {
    source = "..."

    vault_tls_key_policy_arn = "${aws_iam_policy.vault_decrypt.arn}"
}

Terraform

Initialize Terraform

Terraform will need to be initialized with the appropriate backend settings:

terraform init --backend-config backend-config.tfvars

Running Terraform

Assuming that you have a variable file named vars.tfvars, you can simply run terraform with:

# Preview the plan
terraform plan --var-file vars.tfvars

# Execute the plan
terraform apply --var-file vars.tfvars

Consul, Docker and DNS Gotchas

See this post for a solution.

Post Terraforming Tasks

As indicated above, the initial Terraform apply will bootstrap the cluster in a usable but unhardened manner. You will need to perform some tasks to harden it further.

Consul Gossip Traffic

Due to a limitation in how Terraform deals with count, we are unable to automatically provision the security group rules that allow Nomad (both servers and clients) and Vault instances to receive Consul Serf gossip traffic. The instances can send Serf gossip traffic out to Consul servers and other peers, but they are unable to receive any.

It is recommended that you provision the rules for Nomad and Vault instances using this Terraform module.

This allows better scaling as the number of nodes in the Consul cluster increases.
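If you choose to manage these rules yourself instead, they amount to allowing inbound Serf LAN gossip (TCP and UDP on port 8301 by default) from the Consul cluster's security group. A minimal sketch, with placeholder security group references:

# Illustrative only -- substitute the security group IDs from your own cluster
resource "aws_security_group_rule" "nomad_server_serf_tcp" {
    type                     = "ingress"
    from_port                = 8301
    to_port                  = 8301
    protocol                 = "tcp"
    security_group_id        = "${var.nomad_server_security_group_id}"  # hypothetical variable
    source_security_group_id = "${var.consul_security_group_id}"        # hypothetical variable
}

resource "aws_security_group_rule" "nomad_server_serf_udp" {
    type                     = "ingress"
    from_port                = 8301
    to_port                  = 8301
    protocol                 = "udp"
    security_group_id        = "${var.nomad_server_security_group_id}"  # hypothetical variable
    source_security_group_id = "${var.consul_security_group_id}"        # hypothetical variable
}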

Vault Initialisation and Configuration

After you have applied the Terraform plan, we need to perform some manual steps in order to set up Vault.

The helper script vault-helper.sh has instructions on what you need to do to initialise and unseal the servers.

You can use our utility Ansible playbooks to perform the tasks.

To generate an inventory for the playbooks, you can run

./vault-helper.sh -i > inventory
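For reference, initialisation and unsealing generally come down to the following Vault commands, run once against any one server and then against each server respectively; the key share values are illustrative:

# Initialise Vault once, against any one of the servers (key share values are illustrative)
vault operator init -key-shares=5 -key-threshold=3

# Unseal each server with the required number of unseal keys
vault operator unseal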

Vault Auto-unseal

Vault 1.0 and above supports auto-unseal, and this module supports its use.

In general, the steps for initialisation are the same. The only difference is that any operation (such as upgrading the servers) that requires the restarting of Vault will no longer require manual unsealing. The unseal keys are then only used for operations like recovery or generating root tokens, for example.

There are additional steps to perform when configuring auto-unseal that are not automated for you by this module:

  • Provisioning a KMS Customer Managed Key (CMK)
  • Attaching an appropriate policy to the CMK so that the IAM policies provisioned by this module can provide Vault servers access to the key
  • (Optional) Provisioning a VPC endpoint for KMS in the appropriate subnets

You might want to consider using the vault-auto-unseal helper module in combination with the core module.
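For context, AWS KMS auto-unseal corresponds to a seal stanza in the Vault server configuration along the lines of the snippet below; the values are placeholders, and the module's templates may render this differently:

# Illustrative Vault server configuration for AWS KMS auto-unseal
seal "awskms" {
    region     = "ap-southeast-1"
    kms_key_id = "12345678-abcd-1234-abcd-1234567890ab"
}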

Vault Integration with Nomad

Nomad can be integrated with Vault so that jobs can transparently retrieve tokens from Vault.

After you have initialised and unsealed Vault, you can use the nomad-vault-integration module to Terraform the required policies and settings for Vault.

Make sure you have properly configured Vault with the appropriate authentication methods so that your users can authenticate with Vault to get the necessary tokens and credentials.

The default user_data scripts for Nomad servers and clients will automatically detect that the policies have been set up and will configure themselves accordingly. To update your cluster to use the new Vault integration, simply follow the section below to update the Nomad servers first and then the clients.
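For context, the server-side integration amounts to a vault stanza in the Nomad server configuration along the lines of the snippet below; the address and role name are illustrative, and the user_data scripts handle this for you once the policies are detected:

# Illustrative Nomad server configuration for the Vault integration
vault {
    enabled          = true
    address          = "https://vault.service.consul:8200"
    create_from_role = "nomad-cluster"
}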

Nomad ACL

ACLs can be enabled for Nomad so that only users with the necessary tokens can submit jobs. This module only enables the built-in access controls provided by the ACL facility in the open source version of Nomad. The additional controls provided by Sentinel in the Enterprise version are not enabled.

After you have initialised and unsealed Vault, you can use the nomad-acl module to Terraform the required policies and settings for Vault and Nomad.

Make sure you have properly configured Vault with the appropriate authentication methods so that your users can authenticate with Vault to get the necessary tokens and credentials.

The default user_data scripts for Nomad servers and clients will automatically detect that the policies have been set up and will configure themselves accordingly. To update your cluster to use the new Nomad ACLs, simply follow the section below to update the Nomad servers first and then the clients.
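As a sketch, once everything is configured, a user would typically obtain a Nomad token from Vault with something like the command below; the mount path and role name are assumptions, so use the values configured by the module:

# Mount path and role name below are illustrative
vault read nomad/creds/developer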

SSH access via Vault

We can use Vault's SSH secrets engine to generate signed certificates to access your machines via SSH.

See the vault-ssh module for more information.
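As a rough sketch of the flow (the mount path, role, user and host below are assumptions; refer to the vault-ssh module for the actual values), you have Vault sign your public key and then present the signed certificate when connecting:

# Mount path, role name, user and host below are illustrative
vault write -field=signed_key ssh-client-signer/sign/default \
    public_key=@$HOME/.ssh/id_rsa.pub > id_rsa-cert.pub

ssh -i id_rsa-cert.pub -i ~/.ssh/id_rsa ubuntu@10.0.1.10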

Other Integrations

There are other integrations that add features to your cluster in this repository that are not mentioned in this README.

Upgrading and updating

In general, to upgrade or update the servers, you will have to update the Packer template file, build a new AMI, then update the Terraform variables with the new AMI ID. Then, you can run terraform apply to update the launch configuration.

Then, you will need to terminate the various instances for the Auto Scaling Group to start new instances with the updated launch configuration. You should do this ONE BY ONE, or at least be mindful of the quorum and the maximum number of instances that can be terminated each time. See the consensus table for clarity on the quorum values.

Automated upgrade via script

You may choose to run the Python 3 upgrade.py script, which should not require installing additional dependencies, to upgrade the instances.

Currently only the consul, nomad-server and vault services can be upgraded in this way.

Before you run the upgrade script, make sure your AWS credentials (e.g. environment variables) are set up correctly and point to the right environment for the instance upgrade. You may also choose to supply the environment variables while running the script, e.g. AWS_PROFILE=staging ./upgrade.py ....

The script will terminate instances of each service type one by one, which is the safest way to maintain quorum for each service type. Additionally, the script will check that each new instance is up and correctly connected to its respective service before continuing with the upgrade process, until all old instances of that service type have been terminated and replaced with new ones.

A fast (and furious) mode has also been added to the script, which can be turned on with the flag --fast. In this mode, the script will calculate at runtime the maximum number of instances that can be terminated for the specified service while still maintaining quorum, even if the new instances are unable to start up properly. Please take precautions when using this mode.

For consul, the command should look like this:

./upgrade.py consul --consul-addr https://consul.x.y

For nomad-server, the command should look like this:

./upgrade.py nomad-server --nomad-addr https://nomad.x.y

For vault, the command should look like this:

./upgrade.py vault \
    --consul-addr https://consul.x.y \
    --vault-ca-cert /path/to/environments/xxx/ca.pem

For more information, run

./upgrade.py -h

Manually upgrading Consul

Important: Terminate Consul instances only one at a time. Make sure the new server is healthy and has joined the cluster before continuing. If the cluster loses quorum, you might suffer data loss and have to perform outage recovery.

  1. Build your new AMI, and Terraform apply the new AMI.
  2. Terminate the instance that you would like to remove.
  3. The Consul server will exit gracefully, causing the node to become unhealthy, and the Auto Scaling Group will automatically start a new instance.
  4. Make sure the new instance started by AWS is healthy before continuing. For example, use consul operator raft list-peers.

You can use this AWS CLI command to terminate the instance:

aws autoscaling \
    terminate-instance-in-auto-scaling-group \
    --no-should-decrement-desired-capacity \
    --instance-id "xxx"

Replace xxx with the instance ID.

Manually upgrading Nomad Servers

Important: Terminate Nomad server instances only one at a time. Make sure the new server is healthy and has joined the cluster before continuing. If the cluster loses quorum, you might suffer data loss and have to perform outage recovery.

  1. Build your new AMI, and Terraform apply the new AMI.
  2. Terminate the instance that you would like to remove.
  3. The Nomad server will exit gracefully, causing the node to become unhealthy, and the Auto Scaling Group will automatically start a new instance.
  4. Make sure the new instance started by AWS is healthy before continuing. For example, use nomad server members to check whether the new instances created have joined the cluster.

You can use this AWS CLI command:

aws autoscaling \
    terminate-instance-in-auto-scaling-group \
    --no-should-decrement-desired-capacity \
    --instance-id "xxx"

Replace xxx with the instance ID.

Manually upgrading Nomad Clients

Important: These steps are recommended to minimise the outage your services might experience. In particular, if a service has only one instance running, it will definitely experience an outage. Ensure that your services have at least two instances running.

  1. Build your new AMI, and Terraform apply the new AMI.
  2. Take note of the instance IDs of the old instances that you are going to retire. You can get a list of the instance IDs with the command:
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-name ASGName \
    | jq --raw-output '.AutoScalingGroups[0].Instances[].InstanceId' \
    | tee instance-ids.txt
  3. Using Terraform or the AWS console, set the desired capacity of your Auto Scaling Group to twice the current desired value. Make sure the maximum is set high enough to accommodate the new desired value. This spins up new clients that will take over the allocations from the instances you are retiring. Alternatively, you can use the AWS CLI:
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name ASGName \
    --max-size xxx \
    --desired-capacity xxx

Wait for the new nodes to be ready before you continue.

  4. Find the Nomad node IDs for each instance. Assuming you have saved the instance IDs to instance-ids.txt and that you have kept the default configuration where the node name is the instance ID:
nomad node status -json > nodes.json
echo -n "" > node-ids.txt
while read p; do
    jq --raw-output ".[] | select (.Name == \"${p}\") | .ID" nodes.json  >> node-ids.txt
done < instance-ids.txt
  5. Set the instances you are going to retire as "ineligible" in Nomad. For example, assuming you have saved the node IDs to node-ids.txt:
while read p; do
  nomad node eligibility -disable "${p}"
done < node-ids.txt
  6. The following has to be done one instance at a time. Detach the instance from the ASG and wait for the ELB connections to drain. Make sure the connections have drained completely before continuing. Then, drain the client (see the drain example after this list).
  7. After the allocations are drained, terminate the instances. For example, assuming you have saved the instance IDs to instance-ids.txt:
aws ec2 terminate-instances \
    --instance-ids $(cat instance-ids.txt | tr '\n' ' ')
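For the drain step in item 6, a minimal sketch using the node IDs saved earlier; drain one node at a time and wait for its allocations to be rescheduled before moving on:

# Drain a single client, one at a time; replace xxx with the node ID
nomad node drain -enable -yes "xxx"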

Manually upgrading Vault

Important: Update the Vault instances only one at a time. Make sure the new instance is healthy, has joined the cluster and is unsealed before continuing.

  1. Terminate the instance and AWS will automatically start a new instance.
  2. Unseal the new instance. If you do not do so, new instances will eventually be unable to configure themselves properly. This is especially so if you have performed any post bootstrap configuration.

You can unseal the server by SSH'ing into it and running vault operator unseal with the required unseal keys. You can optionally use our utility Ansible playbooks to do so.

You can terminate instances by using this AWS CLI command:

aws autoscaling \
    terminate-instance-in-auto-scaling-group \
    --no-should-decrement-desired-capacity \
    --instance-id "xxx"

Replace xxx with the instance ID.

Inputs and Outputs

Refer to INOUT.md