Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cli): cannot hotswap ECS task definitions containing certain intrinsics #26404

Merged
merged 24 commits into from
Aug 17, 2023

Conversation

tmokmss
Copy link
Contributor

@tmokmss tmokmss commented Jul 18, 2023

Reason for this change

Currently an ECS task definition cannot be hotswapped in many cases, for example when it contains references to certain AWS resources as environment variables. This PR removes the limitation and make hotswaps possible in much more various situations.

Specifically, if there is any token that is not part of the following, we cannot hotswap a task definition:

  1. Ref functions, which will be resolved as physical resource ids
  2. GetAtt functions for some resources and attributes (manually supported one by one, code)
  3. several simple CFn functions (join, split, select, sub)
  4. parameters like AWS::AccountId, Region, Partition, or UrlSuffix

Although it is not supported much, using GetAtt function in a task definition is very common (imagine you reference other resource's property as an environment variable). This PR allows to hotswap a task definition even if it contains these tokens.

Solution

To hotswap a task definition, we need to construct a task definition to call registerTaskDefinition API. For this, we have to evaluate CloudFormation template locally to resolve all the intrinsics in the template. However, some intrinsics such as Fn::GetAtt is not fully supported by CDK CLI because we have to manually support them for each AWS resource type.

The reason why some task definitions are unhotswappable is that there are such intrinsics in the template and the CDK fails to evaluate it locally. So the basic idea to overcome this limitation in this PR is that we don't try to evaluate it locally, but we fetch the latest task definition already deployed and get the required values from it.

Here's how we can implement the idea.

How we determine if changes to a task definition can be hotswapped

In the hotswap process, we have to decide whether the change can be hotswapped or not. Now we can hotswap the task definition if

  1. there are only changes in ContainerDefinitions field, and all the fields in the task definition can be evaluated locally. (original behavior) OR,
  2. there are only changes in ContainerDefinitions field, and all the updated field can be evaluated locally (added in this PR).

The first condition can actually be included in the second condition, but for now I keep it as-is to avoid the possibility of breaking the existing behavior.

If the second condition is true, we fetch the latest task definition from AWS account, override the updated fields, and register a new task definition to update a service. By this way, we don't have to evaluate tokens in unchanged fields locally, allowing to use hotswap in much more situations.

How we compare the old and new task definition

Here is an example task definition:

{
    "ContainerDefinitions": [
        {
            "Environment": [
                {
                    "Name": "VPC_ID",
                    "Value": {
                        "Fn::GetAtt": [
                            "Vpc8378EB38",
                            "CidrBlock"
                        ]
                    }
                }
            ],
            "Essential": true,
            "Image": "nginx:stable",
            "Name": "EcsApp"
        }
    ],
    "Cpu": "256",
    "Family": "myteststackTask289798EC",
    "Memory": "512",
    "NetworkMode": "awsvpc",
    "RequiresCompatibilities": [
        "FARGATE"
    ],
    "TaskRoleArn": {
        "Fn::GetAtt": [
            "TaskTaskRoleE98524A1",
            "Arn"
        ]
    }
}

We compare the old and new task definition in the following steps:

  1. Check if there are only changes in ContainerDefinitions field. If not, we cannot hotswap.
  2. Try evaluateCfnExpression on the containerDefinitons. If it can be evaluated, proceed to hotswap. If not, proceed to step 3.
  3. Check if the length of ContainerDefinitions is the same. If not, we cannot hotswap.
  4. For each container definition, deep-compare each key (e.g. Environment, Image, Name, etc)
  5. For each key, if there is any diff in the corresponding value, try evaluateCfnExpression on the value. If the evaluation fails, we cannot hotswap.
  6. After checking all the keys and there is no field that cannot be hotswapped, proceed to hotswap.

Imagine if there is a change only in Image field (container image tag) but Environment field contains unsupported intrinsics (e.g. "Fn::GetAtt": ["Vpc8378EB38", "CidrBlock"]). In the previous CDK CLI we cannot hotswap it due to an evaluation error. We can now hotswap it because we don't have to evaluate the Environment field when it has no diffs.

Closes #25563


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the admired-contributor [Pilot] contributed between 13-24 PRs to the CDK label Jul 18, 2023
@aws-cdk-automation aws-cdk-automation requested a review from a team July 18, 2023 15:16
@github-actions github-actions bot added effort/medium Medium work item – several days of effort feature-request A feature should be added or improved. p1 labels Jul 18, 2023
Copy link
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@aws-cdk-automation aws-cdk-automation added the pr/needs-cli-test-run This PR needs CLI tests run against it. label Jul 18, 2023
@tmokmss
Copy link
Contributor Author

tmokmss commented Jul 18, 2023

Exemption Request: hotswap code does not support integration tests. fwiw I tested it manually by deploying a stack with the below code:

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { StringParameter } from 'aws-cdk-lib/aws-ssm';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'my-test-stack');

const vpc = new ec2.Vpc(stack, 'Vpc', {
  subnetConfiguration: [
    {
      name: 'Public',
      subnetType: ec2.SubnetType.PUBLIC,
      cidrMask: 18,
    },
    {
      name: 'Isolated',
      subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
      cidrMask: 18,
    },
  ],
});

const cluster = new ecs.Cluster(stack, 'EcsCluster', {
  vpc,
});

const taskDefinition = new ecs.FargateTaskDefinition(stack, 'Task', {});

taskDefinition.addContainer('EcsApp', {
  // change latest to stable and run hotswap
  image: ecs.ContainerImage.fromRegistry('nginx:latest'),
  environment: {
    // Previously we couldn't hotswap the task definition referencing properties like vpcCidrBlock.
    VPC_ID: vpc.vpcCidrBlock,
  },
});

cdk.Tags.of(taskDefinition).add('Key', 'Value');

new ecs.FargateService(stack, 'Service', {
  cluster,
  taskDefinition,
  assignPublicIp: true,
});

Place the file to somewhere like packages/@aws-cdk-testing/framework-integ/test/aws-ecs/test/hotswap.ts, run tsc, and run npx cdk -a test/aws-ecs/test/hotswap.js deploy.

@aws-cdk-automation aws-cdk-automation added the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Jul 18, 2023
@aws-cdk-automation aws-cdk-automation added the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Jul 18, 2023
@corymhall corymhall added the pr-linter/exempt-integ-test The PR linter will not require integ test changes label Jul 19, 2023
@aws-cdk-automation aws-cdk-automation added pr/needs-maintainer-review This PR needs a review from a Core Team Member and removed pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. labels Jul 19, 2023
@kaizencc kaizencc removed the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Jul 21, 2023
@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Jul 21, 2023
@kaizencc
Copy link
Contributor

Running the cli integ tests right now, just so I can exempt you from the linter

@tmokmss
Copy link
Contributor Author

tmokmss commented Jul 21, 2023

@kaizencc Thank you very much!

By the way, how about removing the cli integ test check from the PR linter? I believe the PR linter is for a contributor to validate if the PR is ready for review, but as for the cli integ check, there is nothing they can do to make the linter pass, which is a bit frustrating. I think the need to run the cli integ test can be expressed with another label.

@kaizencc kaizencc added pr-linter/cli-integ-tested Assert that any CLI changes have been integ tested and removed pr/needs-cli-test-run This PR needs CLI tests run against it. labels Jul 21, 2023
@aws-cdk-automation aws-cdk-automation dismissed their stale review July 21, 2023 21:03

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@aws-cdk-automation aws-cdk-automation added the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Jul 21, 2023
@aws-cdk-automation aws-cdk-automation added the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 10, 2023
@tmokmss
Copy link
Contributor Author

tmokmss commented Aug 10, 2023

@rix0rrr I added/fixed unit tests and now it's ready for review :) Hope you can check this, thanks.

@rix0rrr
Copy link
Contributor

rix0rrr commented Aug 14, 2023

I like this a lot, thanks! Ready to ship it, but final check: did you do manual testing on the latest revision with all the requested changes in it?

Cheers!

@rix0rrr rix0rrr self-assigned this Aug 14, 2023
@rix0rrr rix0rrr temporarily deployed to test-pipeline August 14, 2023 11:46 — with GitHub Actions Inactive
@tmokmss
Copy link
Contributor Author

tmokmss commented Aug 14, 2023

@rix0rrr Thanks :) Yes, I confirmed that I can successfully hotswap this CDK code: #26404 (comment)

rix0rrr
rix0rrr previously approved these changes Aug 16, 2023
@aws-cdk-automation
Copy link
Collaborator

➡️ PR build request submitted to test-main-pipeline ⬅️

A maintainer must now check the pipeline and add the pr-linter/cli-integ-tested label once the pipeline succeeds.

@mergify
Copy link
Contributor

mergify bot commented Aug 16, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 16, 2023
@mergify
Copy link
Contributor

mergify bot commented Aug 16, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot dismissed rix0rrr’s stale review August 17, 2023 05:24

Pull request has been modified.

@tmokmss
Copy link
Contributor Author

tmokmss commented Aug 17, 2023

@rix0rrr mergify was failing for some reason:

Mergify doesn't have permission to update

For security reasons, Mergify can't update this pull request. Try updating locally.
GitHub response: refusing to allow a GitHub App to create or update workflow .github/workflows/close-stale-prs.yml without workflows permission
err-code: 5E909

I guess merging with main manually will fix this issue (because updating workflow files from mergify is prohibited?). Can you approve it again? Thanks 🙏

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 754a08e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@aws-cdk-automation aws-cdk-automation added the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 17, 2023
@mergify
Copy link
Contributor

mergify bot commented Aug 17, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit 6d315b8 into aws:main Aug 17, 2023
@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 17, 2023
tmokmss added a commit to tmokmss/aws-cdk that referenced this pull request Sep 29, 2023
mergify bot pushed a commit that referenced this pull request Oct 12, 2023
…tain intrinsics" (#27358)

Closes #27343

From v2.93.0 (#26404), we hotswap ECS task definition in the following steps:

1. Calculate patch from old/new CFn template
2. Apply the patch to the task definition fetched from describeTaskDefinition API
3. register the new task definition and update services

The root cause of the issue #27343 is the order of containerDefinitions array is somehow shuffled when deployed to ECS, so the patch calculated from CFn template becomes invalid.

For example, when the containerDefinitions in a CFn template is like below:

```json
    "ContainerDefinitions": [
     {
      "Name": "main",
      "Image": "imageA"
     },
     {
      "Name": "sidecar",
      "Image": "imageB"
     }
    ],
```

the deployed task definition can sometimes become like this:

```json
    "ContainerDefinitions": [
     {
      "Name": "sidecar",
      "Image": "imageB"
     },
     {
      "Name": "main",
      "Image": "imageA"
     }
    ],
```

This makes a patch calculated from CFn template diff completely invalid. We can sort both CFn template and the response of describeTaskDefinition API in a deterministic order, but it is still unreliable because there can be more arrays whose order will be shuffled. [The response of describeTaskDefinition](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTaskDefinition.html#API_DescribeTaskDefinition_ResponseSyntax) has many array fields, and it is not documented if they may be shuffled or not.

I guess we should completely abandon this approach, because it cannot be reliable enough. I have an idea for more reliable approach, but at least it should be reverted asap as it's breaking the ECS hotswap feature.

I'm really sorry for me not being aware with this behavior 🙏 


----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
admired-contributor [Pilot] contributed between 13-24 PRs to the CDK effort/medium Medium work item – several days of effort feature-request A feature should be added or improved. p1 pr-linter/cli-integ-tested Assert that any CLI changes have been integ tested pr-linter/exempt-integ-test The PR linter will not require integ test changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli: allow hotswap of ECS task definitions that contain tokens
6 participants