(aws-logs): log retention update can race against the target lambda's log group creation resulting in OperationAbortedException #15709
Comments
Easiest way to fix this is to update the Lambda to be resilient against this occurring.
How would you recommend approaching this? Would adding the error code to this exception help? Or a global retry?

I would say retry on the specific error.
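Retrying on the specific error can be sketched as a small helper that rethrows everything except the named retryable error. This is an illustrative sketch, not the actual CDK provider code; `retryOnError`, `maxAttempts`, and `delayMs` are hypothetical names.

```typescript
// Hypothetical helper: retry an async operation only when it fails with a
// specific error name (e.g. 'OperationAbortedException'); rethrow otherwise.
async function retryOnError<T>(
  fn: () => Promise<T>,
  errorName: string,
  maxAttempts = 5,
  delayMs = 100,
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      // Only the specific, retryable error is retried; anything else bubbles up.
      if ((err as Error).name !== errorName) {
        throw err;
      }
      // Simple linear backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
    }
  }
  throw lastError;
}
```

A global retry, by contrast, risks masking genuinely non-retryable failures, which is why retrying only on the named error is the safer choice here.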
Please, could someone review the proposed fix? We have 2 bugs in `packages/@aws-cdk/aws-logs/lib/log-retention-provider/index.ts`.
Fixes: #15709

When creating a Lambda function with log retention, CDK actually creates two Lambda functions. The second function alters the log retention of the first function's log group as well as the retention of its own log group. Because log group creation is asynchronous, the log retention function tries to pre-create both log groups to guarantee it has an object to work on. If a normal execution of the first function creates the related log group at the same time, an "OperationAbortedException: ... Please retry" error is returned. The existing code handles this situation for the log retention function's own log group but not for the first function's. This fix adds the retry pattern to the general log group creation code.

The existing code also had a second bug: if OperationAbortedException is hit, the error is hidden, but the retention policy is skipped and never actually applied. This fix addresses that bug as well.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
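The corrected flow the PR describes (retry creation on the race, and never silently skip applying retention) can be sketched against a minimal client interface. `LogsClient` and `ensureRetention` are hypothetical names for illustration; the real provider calls the CloudWatch Logs API directly.

```typescript
// Minimal interface standing in for the CloudWatch Logs calls the provider makes.
interface LogsClient {
  createLogGroup(logGroupName: string): Promise<void>;
  putRetentionPolicy(logGroupName: string, retentionInDays: number): Promise<void>;
}

// Sketch of the fixed flow: retry creation when it races against the Lambda
// service creating the same group, then always apply the retention policy.
async function ensureRetention(
  client: LogsClient,
  logGroupName: string,
  retentionInDays: number,
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await client.createLogGroup(logGroupName);
      break; // created successfully
    } catch (err) {
      const name = (err as Error).name;
      if (name === 'ResourceAlreadyExistsException') break; // group already exists: fine
      if (name !== 'OperationAbortedException' || attempt === maxAttempts) throw err;
      // Lost the race against the Lambda service creating the same group: retry.
    }
  }
  // Apply the retention policy unconditionally, instead of silently skipping
  // it when creation raced (the second bug described above).
  await client.putRetentionPolicy(logGroupName, retentionInDays);
}
```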
Includes fix for aws/aws-cdk#15709

* bump cdk to v1.124.0 (includes fix for aws/aws-cdk#15709)
* update unit tests
* match aws-cdk version in CI
@steved we still have this error when adding logRetention. CDK version is 1.124.0.
I am also experiencing this issue when using
My team is hitting this with version
We have some bash scripting that allows us to deploy multiple CDK apps in parallel that share some base infrastructure, and we are very consistently hitting this issue. It actually seems worse with the new fix: before, it would fail silently and let the deployment continue, but now it fails the whole CDK deployment because the Lambda function throws errors instead of swallowing them. I haven't captured logs from the Lambda yet, but I am seeing failures like this in the
This leads me to believe the retry logic isn't being hit at all, since the code for that looks like it should throw a new error with the message
We're also getting this error with
Here's the stack we're getting:
We've been hitting this as well, `"@aws-cdk/core": "1.126.0"`

EDIT: Same problem with
I've opened a new ticket, since many people have been reporting that the issue still seems to be happening. I believe I've made some findings about why it might still be causing problems: #17546
When a Lambda function runs, a log group and stream are created if they do not already exist. When creating a CDK Lambda function, the log retention update can then race with this background log group creation, resulting in an OperationAbortedException error.

Reproduction Steps
Hard to reproduce reliably, as it relies on a race condition between AWS and the log retention lambda.
What did you expect to happen?
The log retention policy is properly applied.
What actually happened?
Environment
Other
AWS log group creation CloudTrail event:
Log retention API call:
#2237 seems to be a fix for the same potential error on the actual log retention lambda's log group, but not the target.