-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue 963] Moves ECS services into private subnets #1014
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Co-authored-by: James Bursa <[email protected]>
Co-authored-by: James Bursa <[email protected]>
Co-authored-by: James Bursa <[email protected]>
Co-authored-by: James Bursa <[email protected]>
name = "tag:subnet_type" | ||
values = ["private"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes from deploying in every subnet, to only deploying into the private subnets
# TODO(https://github.com/navapbc/template-infra/issues/152) set assign_public_ip = false after using private subnets | ||
# checkov:skip=CKV_AWS_333:Switch to using private subnets | ||
assign_public_ip = true | ||
assign_public_ip = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is the whole reason I'm here
# S3 and DynamoDB use Gateway VPC endpoints. All other services use Interface VPC endpoints | ||
interface_vpc_endpoints = toset([for aws_service in local.aws_service_integrations : aws_service if !contains(["s3", "dynamodb"], aws_service)]) | ||
gateway_vpc_endpoints = toset([for aws_service in local.aws_service_integrations : aws_service if contains(["s3", "dynamodb"], aws_service)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern was pulled from the Nava template: https://github.com/navapbc/template-infra/
resource "aws_vpc_endpoint" "aws_service" { | ||
for_each = local.aws_service_integrations | ||
|
||
vpc_id = data.aws_vpc.default.id | ||
service_name = "com.amazonaws.${data.aws_region.current.name}.${each.key}" | ||
vpc_endpoint_type = "Interface" | ||
security_group_ids = [aws_security_group.aws_services[0].id] | ||
subnet_ids = data.aws_subnets.default.ids | ||
subnet_ids = [for subnet in aws_subnet.backfill_private : subnet.id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: https://betagrantsgov.slack.com/archives/C05TSL64VUH/p1705438485101009
I needed to deploy the VPC endpoints exclusively into the private subnets
@@ -70,9 +70,20 @@ variable "vpc_id" { | |||
description = "Uniquely identifies the VPC." | |||
} | |||
|
|||
variable "subnet_ids" { | |||
variable "public_subnet_ids" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these changes were all pulled from the Nava template. Here's the diff between our version and main:
@@ -2,9 +2,14 @@ | |||
import itertools | |||
from operator import itemgetter | |||
import os | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge as-is since I've got two approvals! I'll watch the |
Summary
Fixes #963
This PR will be split into 3 major stages:
Stage 1: Create private subnets
The existing networking infrastructure is in its "default" (AWS provided) state. This state does not meet the standards of our ideal network layer design in a few ways. But the most notable one is: the default networking infrastructure lacks private subnets. You cannot remove the public IP addresses from the ECS instances without a first creating private subnets. So I'm requesting code review for Stage 1, backfilling the private subnets into the default VPC. This also includes attaching the VPC endpoints onto those private subnets.
Deploying this will require me to deploy the subnets from my branch, before I merge. I've talked about this with @aplybeah. Essentially, I need to (temporarily) ignore "only deploy from main" rule in order to resolve #963 in a timely manner. The alternative would be for me to take several days to deploy an entirely new networking layer from scratch, which feels like too much work for this task / our timeline.
Due to the above, there will be a long running (eg. for a few days) terraform state diff on the
infra/networks/
folder.Stage 2: Creating a "staging" site inside the private subnets
When I create the private subnets, I won't be able to be sure that they are totally working for the sake of deploying our application. There's a chance this could be a breaking change from the point of view of the application. So it requires testing, specifically testing in the form of creating a 3rd simpler ECS application. I'm calling this application "staging", since "dev" and "prod" are already in use. I'll deploy this application to the private subnets to confirm that the application is working. I'll then keep the application around so that we can use it for other purposes later.
Stage 3: Move all sites into the private subnets
After confirming that staging works, I'll make sure the terraform configuration for dev and prod are using the new subnets as well. I'll deploy to dev and prod after I merge. This will not change the load balancer URL, but it might cause a few minutes of downtime. I expect that any potential downtime will be 5 minutes at most, enough time for the new app containers to start up.
Testing
Deploy the networking layer changes:
Deploy the frontend application (with the latest build as of Jan 16th)
Deploy the database
Deploy the backend application (with the latest build as of Jan 16th)
Future Work
In the future, I would like to move to the network architecture described in the Nava infra template. That design specifies a separate network for every environment. This would be an improvement over our current design, where there is one network for all environments. The fact that there's only 1 network is what makes testing stage 1 so complicated.
This PR includes a new
staging
app, but does not (yet) add that staging app to Github Actions. I intend to do that at a later date!