Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 963] Moves ECS services into private subnets #1014

Merged
merged 18 commits into from
Jan 17, 2024
Merged

Conversation

coilysiren
Copy link
Collaborator

@coilysiren coilysiren commented Jan 12, 2024

Summary

Fixes #963

This PR will be split into 3 major stages:

Stage 1: Create private subnets

The existing networking infrastructure is in its "default" (AWS provided) state. This state does not meet the standards of our ideal network layer design in a few ways. But the most notable one is: the default networking infrastructure lacks private subnets. You cannot remove the public IP addresses from the ECS instances without a first creating private subnets. So I'm requesting code review for Stage 1, backfilling the private subnets into the default VPC. This also includes attaching the VPC endpoints onto those private subnets.

Deploying this will require me to deploy the subnets from my branch, before I merge. I've talked about this with @aplybeah. Essentially, I need to (temporarily) ignore "only deploy from main" rule in order to resolve #963 in a timely manner. The alternative would be for me to take several days to deploy an entirely new networking layer from scratch, which feels like too much work for this task / our timeline.

Due to the above, there will be a long running (eg. for a few days) terraform state diff on the infra/networks/ folder.

Stage 2: Creating a "staging" site inside the private subnets

When I create the private subnets, I won't be able to be sure that they are totally working for the sake of deploying our application. There's a chance this could be a breaking change from the point of view of the application. So it requires testing, specifically testing in the form of creating a 3rd simpler ECS application. I'm calling this application "staging", since "dev" and "prod" are already in use. I'll deploy this application to the private subnets to confirm that the application is working. I'll then keep the application around so that we can use it for other purposes later.

Stage 3: Move all sites into the private subnets

After confirming that staging works, I'll make sure the terraform configuration for dev and prod are using the new subnets as well. I'll deploy to dev and prod after I merge. This will not change the load balancer URL, but it might cause a few minutes of downtime. I expect that any potential downtime will be 5 minutes at most, enough time for the new app containers to start up.

Testing

Deploy the networking layer changes:

$ cd infra/networks/
$ terraform apply

Deploy the frontend application (with the latest build as of Jan 16th)

$ cd infra/frontend/service
$ terraform init -backend-config=staging.s3.tfbackend -reconfigure
$ terraform apply \
    -var="environment_name=staging" \
    -var="image_tag=2698a01eca3a7373ae4ddeec78271108f50234fd"

Deploy the database

$ cd infra/api/database
$ terraform init -backend-config=staging.s3.tfbackend -reconfigure
$ terraform apply \
    -var="environment_name=staging"

Deploy the backend application (with the latest build as of Jan 16th)

$ cd infra/api/service
$ terraform init -backend-config=staging.s3.tfbackend -reconfigure
$ terraform apply \
    -var="environment_name=staging" \
    -var="image_tag=2698a01eca3a7373ae4ddeec78271108f50234fd"

Future Work

In the future, I would like to move to the network architecture described in the Nava infra template. That design specifies a separate network for every environment. This would be an improvement over our current design, where there is one network for all environments. The fact that there's only 1 network is what makes testing stage 1 so complicated.

This PR includes a new staging app, but does not (yet) add that staging app to Github Actions. I intend to do that at a later date!

@coilysiren coilysiren changed the title Backfill private subnets [Issue 963] Moves ECS services into private subnets Jan 12, 2024
@coilysiren coilysiren self-assigned this Jan 12, 2024
@coilysiren coilysiren marked this pull request as ready for review January 12, 2024 22:02
Copy link
Collaborator

@jamesbursa jamesbursa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

infra/networks/subnet-backfill.tf Outdated Show resolved Hide resolved
infra/networks/subnet-backfill.tf Outdated Show resolved Hide resolved
infra/networks/subnet-backfill.tf Outdated Show resolved Hide resolved
infra/networks/subnet-backfill.tf Outdated Show resolved Hide resolved
Comment on lines +11 to +12
name = "tag:subnet_type"
values = ["private"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes from deploying in every subnet, to only deploying into the private subnets

# TODO(https://github.com/navapbc/template-infra/issues/152) set assign_public_ip = false after using private subnets
# checkov:skip=CKV_AWS_333:Switch to using private subnets
assign_public_ip = true
assign_public_ip = false
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is the whole reason I'm here

Comment on lines +28 to +30
# S3 and DynamoDB use Gateway VPC endpoints. All other services use Interface VPC endpoints
interface_vpc_endpoints = toset([for aws_service in local.aws_service_integrations : aws_service if !contains(["s3", "dynamodb"], aws_service)])
gateway_vpc_endpoints = toset([for aws_service in local.aws_service_integrations : aws_service if contains(["s3", "dynamodb"], aws_service)])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern was pulled from the Nava template: https://github.com/navapbc/template-infra/

resource "aws_vpc_endpoint" "aws_service" {
for_each = local.aws_service_integrations

vpc_id = data.aws_vpc.default.id
service_name = "com.amazonaws.${data.aws_region.current.name}.${each.key}"
vpc_endpoint_type = "Interface"
security_group_ids = [aws_security_group.aws_services[0].id]
subnet_ids = data.aws_subnets.default.ids
subnet_ids = [for subnet in aws_subnet.backfill_private : subnet.id]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: https://betagrantsgov.slack.com/archives/C05TSL64VUH/p1705438485101009

I needed to deploy the VPC endpoints exclusively into the private subnets

@@ -70,9 +70,20 @@ variable "vpc_id" {
description = "Uniquely identifies the VPC."
}

variable "subnet_ids" {
variable "public_subnet_ids" {
Copy link
Collaborator Author

@coilysiren coilysiren Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these changes were all pulled from the Nava template. Here's the diff between our version and main:

navapbc/template-infra@fe5c7cd...main

@@ -2,9 +2,14 @@
import itertools
from operator import itemgetter
import os
import json
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coilysiren
Copy link
Collaborator Author

I'm going to merge as-is since I've got two approvals! I'll watch the dev deploy to make sure it works 🙏🏼

@coilysiren coilysiren merged commit 174d6e8 into main Jan 17, 2024
9 checks passed
@coilysiren coilysiren deleted the backfill-subnets branch January 17, 2024 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: Update ECS Settings so That ECS.2 Check Passes
3 participants