Add support for spot instances (#55)
* Add support for spot instances

* Add test for spot instances
akuzminsky authored Dec 1, 2024
1 parent eb59c40 commit 931e19f
Showing 13 changed files with 327 additions and 6 deletions.
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -34,6 +34,7 @@ module "website" {
userdata = module.webserver_userdata.userdata
stickiness_enabled = true
}
```
### Security groups

@@ -45,15 +46,20 @@ The module creates two security groups: one for the load balancer, another for the backend instances.
The load balancer security group allows traffic to TCP ports 443 and `var.alb_listener_port` (80 by default).

The backend security group allows user traffic and health checks coming from the load balancer.
Also, the security group allows SSH from the VPC where the backend instances reside and from `var.ssh_cidr_block`.
It is `0.0.0.0/0` by default, but the variable lets the user restrict SSH access, for example, to the management VPC only.

Both security groups allow incoming ICMP traffic.

The user can also specify extra security groups via `var.extra_security_groups_backend`.
They will be attached to the backend instances alongside the created backend security group.
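For illustration, a sketch of wiring an extra security group into the backend instances. The monitoring group, its port, and the module source path are assumptions, not taken from this repository:

```hcl
# Hypothetical extra security group, e.g. for a metrics scraper.
resource "aws_security_group" "monitoring" {
  name_prefix = "monitoring-"
  vpc_id      = var.vpc_id

  ingress {
    description = "Scrape metrics from the internal network"
    from_port   = 9100
    to_port     = 9100
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
}

module "website" {
  source = "infrahouse/website-pod/aws" # assumed registry path
  # ... other required inputs elided ...

  ssh_cidr_block                = "10.1.0.0/16" # e.g. only the management VPC
  extra_security_groups_backend = [aws_security_group.monitoring.id]
}
```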

### Using spot instances

By default, the module launches on-demand instances only. However, if you set `var.on_demand_base_capacity`,
the ASG launches that many on-demand instances and fulfills the rest of its capacity with spot instances.
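As a minimal sketch (the capacity numbers are illustrative, the module source path is assumed, and the other required inputs are elided):

```hcl
module "website" {
  source = "infrahouse/website-pod/aws" # assumed registry path
  # ... other required inputs elided ...

  asg_min_size = 3
  # One instance stays on-demand; the remaining capacity is
  # requested as spot instances.
  on_demand_base_capacity = 1
}
```

Because the module hard-codes `on_demand_percentage_above_base_capacity = 0`, every instance beyond the base capacity is a spot request.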

## Requirements

| Name | Version |
@@ -166,6 +172,7 @@ They will be added to the backend instances alongside the created backend security group.
| <a name="input_key_pair_name"></a> [key\_pair\_name](#input\_key\_pair\_name) | SSH keypair name to be deployed in EC2 instances | `string` | n/a | yes |
| <a name="input_max_instance_lifetime_days"></a> [max\_instance\_lifetime\_days](#input\_max\_instance\_lifetime\_days) | The maximum amount of time, in \_days\_, that an instance can be in service, values must be either equal to 0 or between 7 and 365 days. | `number` | `30` | no |
| <a name="input_min_healthy_percentage"></a> [min\_healthy\_percentage](#input\_min\_healthy\_percentage) | Amount of capacity in the Auto Scaling group that must remain healthy during an instance refresh to allow the operation to continue, as a percentage of the desired capacity of the Auto Scaling group. | `number` | `100` | no |
| <a name="input_on_demand_base_capacity"></a> [on\_demand\_base\_capacity](#input\_on\_demand\_base\_capacity) | If specified, the ASG will request spot instances and this will be the minimal number of on-demand instances. | `number` | `null` | no |
| <a name="input_protect_from_scale_in"></a> [protect\_from\_scale\_in](#input\_protect\_from\_scale\_in) | Whether newly launched instances are automatically protected from termination by Amazon EC2 Auto Scaling when scaling in. | `bool` | `false` | no |
| <a name="input_root_volume_size"></a> [root\_volume\_size](#input\_root\_volume\_size) | Root volume size in EC2 instance in Gigabytes | `number` | `30` | no |
| <a name="input_service_name"></a> [service\_name](#input\_service\_name) | Descriptive name of a service that will use this VPC | `string` | `"website"` | no |
@@ -174,6 +181,7 @@ They will be added to the backend instances alongside the created backend security group.
| <a name="input_subnets"></a> [subnets](#input\_subnets) | Subnet ids where load balancer should be present | `list(string)` | n/a | yes |
| <a name="input_tags"></a> [tags](#input\_tags) | Tags to apply to instances in the autoscaling group. | `map(string)` | <pre>{<br/> "Name": "webserver"<br/>}</pre> | no |
| <a name="input_target_group_port"></a> [target\_group\_port](#input\_target\_group\_port) | TCP port that a target listens to to serve requests from the load balancer. | `number` | `80` | no |
| <a name="input_target_group_type"></a> [target\_group\_type](#input\_target\_group\_type) | Target group type: instance, ip, alb. Default is instance. | `string` | `"instance"` | no |
| <a name="input_userdata"></a> [userdata](#input\_userdata) | userdata for cloud-init to provision EC2 instances | `string` | n/a | yes |
| <a name="input_wait_for_capacity_timeout"></a> [wait\_for\_capacity\_timeout](#input\_wait\_for\_capacity\_timeout) | How much time to wait until all instances are healthy | `string` | `"20m"` | no |
| <a name="input_zone_id"></a> [zone\_id](#input\_zone\_id) | Domain name zone ID where the website will be available | `string` | n/a | yes |
24 changes: 21 additions & 3 deletions asg.tf
@@ -19,9 +19,27 @@ resource "aws_autoscaling_group" "website" {
}
triggers = ["tag"]
}
dynamic "launch_template" {
for_each = var.on_demand_base_capacity == null ? [1] : []
content {
id = aws_launch_template.website.id
version = aws_launch_template.website.latest_version
}
}
dynamic "mixed_instances_policy" {
for_each = var.on_demand_base_capacity == null ? [] : [1]
content {
instances_distribution {
on_demand_base_capacity = var.on_demand_base_capacity
on_demand_percentage_above_base_capacity = 0
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.website.id
version = aws_launch_template.website.latest_version
}
}
}
}
instance_maintenance_policy {
min_healthy_percentage = var.asg_min_healthy_percentage
2 changes: 1 addition & 1 deletion requirements.txt
@@ -5,4 +5,4 @@ myst-parser ~= 2.0
pytest ~= 7.3
pytest-timeout ~= 2.1
pytest-rerunfailures ~= 12.0
requests ~= 2.32
77 changes: 77 additions & 0 deletions test_data/test_spot/datasources.tf
@@ -0,0 +1,77 @@
data "cloudinit_config" "webserver_init" {
gzip = false
base64_encode = true

part {
content_type = "text/cloud-config"
content = join(
"\n",
[
"#cloud-config",
yamlencode(
{
"package_update" : true,
packages : [
"xinetd",
"net-tools"
]
write_files : [
{
path : "/etc/xinetd.d/http"
permissions : "0600"
content : file("${path.module}/xinetd.d.http")
},
{
path : "/usr/local/bin/httpd"
permissions : "0755"
content : file("${path.module}/httpd.sh")
}
]
runcmd : [
"systemctl start xinetd"
]
}
)
]
)
}
}

data "aws_ami" "ubuntu" {
most_recent = true

filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-${var.ubuntu_codename}-*"]
}

filter {
name = "architecture"
values = ["x86_64"]
}

filter {
name = "virtualization-type"
values = ["hvm"]
}

filter {
name = "state"
values = [
"available"
]
}

owners = ["099720109477"] # Canonical
}

data "aws_iam_policy_document" "webserver_permissions" {
statement {
actions = ["ec2:Describe*"]
resources = ["*"]
}
}

data "aws_route53_zone" "website" {
name = var.dns_zone
}
24 changes: 24 additions & 0 deletions test_data/test_spot/httpd.sh
@@ -0,0 +1,24 @@
#!/usr/bin/env bash

http_response () {
HTTP_CODE=$1
MESSAGE=${2:-Message Undefined}
length=$((${#MESSAGE} + 2))
if [[ "$HTTP_CODE" -eq 503 ]]; then
echo -en "HTTP/1.1 503 Service Unavailable\r\n"
elif [[ "$HTTP_CODE" -eq 200 ]]; then
echo -en "HTTP/1.1 200 OK\r\n"
else
echo -en "HTTP/1.1 ${HTTP_CODE} UNKNOWN\r\n"
fi
echo -en "Content-Type: text/plain\r\n"
echo -en "Connection: close\r\n"
echo -en "Content-Length: ${length}\r\n"
echo -en "\r\n"
echo -en "$MESSAGE"
echo -en "\r\n"
sleep 0.1
exit 0
}

http_response 200 "Success Message"
26 changes: 26 additions & 0 deletions test_data/test_spot/main.tf
@@ -0,0 +1,26 @@
resource "aws_key_pair" "test" {
public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDpgAP1z1Lxg9Uv4tam6WdJBcAftZR4ik7RsSr6aNXqfnTj4civrhd/q8qMqF6wL//3OujVDZfhJcffTzPS2XYhUxh/rRVOB3xcqwETppdykD0XZpkHkc8XtmHpiqk6E9iBI4mDwYcDqEg3/vrDAGYYsnFwWmdDinxzMH1Gei+NPTmTqU+wJ1JZvkw3WBEMZKlUVJC/+nuv+jbMmCtm7sIM4rlp2wyzLWYoidRNMK97sG8+v+mDQol/qXK3Fuetj+1f+vSx2obSzpTxL4RYg1kS6W1fBlSvstDV5bQG4HvywzN5Y8eCpwzHLZ1tYtTycZEApFdy+MSfws5vPOpggQlWfZ4vA8ujfWAF75J+WABV4DlSJ3Ng6rLMW78hVatANUnb9s4clOS8H6yAjv+bU3OElKBkQ10wNneoFIMOA3grjPvPp5r8dI0WDXPIznJThDJO5yMCy3OfCXlu38VDQa1sjVj1zAPG+Vn2DsdVrl50hWSYSB17Zww0MYEr8N5rfFE= aleks@MediaPC"
}

module "lb" {
source = "../../"
providers = {
aws = aws
aws.dns = aws
}
service_name = "website"
subnets = var.lb_subnet_ids
ami = data.aws_ami.ubuntu.id
backend_subnets = var.backend_subnet_ids
asg_name = var.asg_name
asg_min_size = 2
on_demand_base_capacity = 1
internet_gateway_id = var.internet_gateway_id
zone_id = data.aws_route53_zone.website.zone_id
dns_a_records = var.dns_a_records
key_pair_name = aws_key_pair.test.key_name
userdata = data.cloudinit_config.webserver_init.rendered
health_check_type = "ELB"
instance_profile_permissions = data.aws_iam_policy_document.webserver_permissions.json
instance_role_name = var.instance_role_name
}
23 changes: 23 additions & 0 deletions test_data/test_spot/outputs.tf
@@ -0,0 +1,23 @@
output "network_subnet_public_ids" {
value = var.lb_subnet_ids
}

output "network_subnet_private_ids" {
value = var.backend_subnet_ids
}

output "network_subnet_all_ids" {
value = concat(var.backend_subnet_ids, var.lb_subnet_ids)
}

output "asg_name" {
value = module.lb.asg_name
}

output "instance_profile_name" {
value = module.lb.instance_profile_name
}

output "load_balancer_dns_name" {
value = module.lb.load_balancer_dns_name
}
12 changes: 12 additions & 0 deletions test_data/test_spot/providers.tf
@@ -0,0 +1,12 @@
provider "aws" {
assume_role {
role_arn = var.role_arn
}
region = var.region
default_tags {
tags = {
"created_by" : "infrahouse/terraform-aws-website-pod" # GitHub repository that created a resource
}

}
}
12 changes: 12 additions & 0 deletions test_data/test_spot/terraform.tf
@@ -0,0 +1,12 @@
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.11"
}
cloudinit = {
source = "hashicorp/cloudinit"
version = "~> 2.3"
}
}
}
13 changes: 13 additions & 0 deletions test_data/test_spot/variables.tf
@@ -0,0 +1,13 @@
variable "region" {}
variable "role_arn" {}
variable "dns_a_records" {
default = ["", "www", "bogus-test-stuff"]
}
variable "dns_zone" {}
variable "ubuntu_codename" {}
variable "asg_name" { default = null }

variable "backend_subnet_ids" {}
variable "lb_subnet_ids" {}
variable "internet_gateway_id" {}
variable "instance_role_name" { default = null }
17 changes: 17 additions & 0 deletions test_data/test_spot/xinetd.d.http
@@ -0,0 +1,17 @@
# default: on
# description: xinetdhttpservice
service xinetdhttpservice
{
# Inspired by: https://github.com/rglaue/xinetd_bash_http_service/blob/master/xinetdhttpservice_config
disable = no
flags = REUSE
socket_type = stream
type = UNLISTED
port = 80
wait = no
user = nobody
server = /usr/local/bin/httpd
log_on_failure += USERID
only_from = 0.0.0.0/0
per_source = UNLIMITED
}
85 changes: 85 additions & 0 deletions tests/test_spot.py
@@ -0,0 +1,85 @@
import json
from os import path as osp
from pprint import pformat
from textwrap import dedent

import pytest
from infrahouse_toolkit.terraform import terraform_apply

from tests.conftest import (
TEST_ZONE,
REGION,
UBUNTU_CODENAME,
TRACE_TERRAFORM,
TEST_ROLE_ARN,
TEST_TIMEOUT,
wait_for_instance_refresh,
LOG,
)


@pytest.mark.timeout(TEST_TIMEOUT)
def test_lb(
service_network,
ec2_client,
route53_client,
elbv2_client,
autoscaling_client,
keep_after,
):
subnet_public_ids = service_network["subnet_public_ids"]["value"]
subnet_private_ids = service_network["subnet_private_ids"]["value"]
internet_gateway_id = service_network["internet_gateway_id"]["value"]

terraform_dir = "test_data/test_spot"

with open(osp.join(terraform_dir, "terraform.tfvars"), "w") as fp:
fp.write(
dedent(
f"""
region = "{REGION}"
role_arn = "{TEST_ROLE_ARN}"
dns_zone = "{TEST_ZONE}"
ubuntu_codename = "{UBUNTU_CODENAME}"
lb_subnet_ids = {json.dumps(subnet_public_ids)}
backend_subnet_ids = {json.dumps(subnet_private_ids)}
internet_gateway_id = "{internet_gateway_id}"
"""
)
)

with terraform_apply(
terraform_dir,
destroy_after=not keep_after,
json_output=True,
enable_trace=TRACE_TERRAFORM,
) as tf_output:
asg_name = tf_output["asg_name"]["value"]
wait_for_instance_refresh(asg_name, autoscaling_client)
response = autoscaling_client.describe_auto_scaling_groups(
AutoScalingGroupNames=[
asg_name,
],
)
LOG.debug(
"describe_auto_scaling_groups(%s): %s",
asg_name,
pformat(response, indent=4),
)

healthy_instance = None
for instance in response["AutoScalingGroups"][0]["Instances"]:
LOG.debug("Evaluating instance %s", pformat(instance, indent=4))
if instance["LifecycleState"] == "InService":
healthy_instance = instance
break
assert healthy_instance, f"Could not find a healthy instance in ASG {asg_name}"
healthy_instance_count = len(
[
i
for i in response["AutoScalingGroups"][0]["Instances"]
if i["LifecycleState"] == "InService"
]
)
assert healthy_instance_count == 2
6 changes: 6 additions & 0 deletions variables.tf
@@ -267,6 +267,12 @@ variable "service_name" {
default = "website"
}

variable "on_demand_base_capacity" {
description = "If specified, the ASG will request spot instances and this will be the minimal number of on-demand instances."
type = number
default = null
}

variable "ssh_cidr_block" {
description = "CIDR range that is allowed to SSH into the backend instances. Format is a.b.c.d/<prefix>."
type = string
