(stepfunctions-tasks): AWS Batch integration with container overrides breaks with token values #19546

Closed
jstag711 opened this issue Mar 24, 2022 · 8 comments · Fixed by #19578
Assignees
Labels
@aws-cdk/aws-stepfunctions-tasks bug This issue is a bug. cause/tokens We did not consider token values effort/small Small work item – less than a day of effort p1

Comments

jstag711 commented Mar 24, 2022

What is the problem?

I was eagerly awaiting 1.149 as it appeared to fix the issue that prevents container overrides of memory and cpu:

#18993
#19298

However, now I get the following error:

{
  "resourceType": "batch",
  "resource": "submitJob.sync",
  "error": "Batch.ClientException",
  "cause": "Value -1.888154589709072e+289 for type MEMORY in resourceRequirement is not valid. Provide a valid number as input. (Service: AWSBatch; Status Code: 400; Error Code: ClientException; Request ID: dcd8c27a-baac-414a-9b64-da5733cc1344; Proxy: null)"
}

Reproduction Steps

This is the code that sets up the container overrides

BatchContainerOverrides containerOverrides = BatchContainerOverrides.builder()
        .command(JsonPath.listAt("$." + SOLVER_TASK_PARAMETERS + "." + COMMAND))
        .vcpus(JsonPath.numberAt("$." + SOLVER_TASK_PARAMETERS + "." + CPU))
        .memory(Size.mebibytes(JsonPath.numberAt("$." + SOLVER_TASK_PARAMETERS + "." + MEMORY)))
        .build();

After I updated to 1.149 there is still no resourceRequirements field on BatchContainerOverrides. Based on the fix it appears that these values would just be copied from the top-level vcpus and memory.

Previously I was receiving the warning that my container overrides were being ignored, but the job would still execute.

What did you expect to happen?

I expected that with this update my container overrides would be respected.

What actually happened?

{
  "resourceType": "batch",
  "resource": "submitJob.sync",
  "error": "Batch.ClientException",
  "cause": "Value -1.888154589709072e+289 for type MEMORY in resourceRequirement is not valid. Provide a valid number as input. (Service: AWSBatch; Status Code: 400; Error Code: ClientException; Request ID: dcd8c27a-baac-414a-9b64-da5733cc1344; Proxy: null)"
}

CDK CLI Version

1.149.0

Framework Version

No response

Node.js Version

12.22.11

OS

MacOS 10.15.7

Language

Java

Language Version

Java 11

Other information

No response

@jstag711 jstag711 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 24, 2022
@github-actions github-actions bot added the @aws-cdk/aws-batch Related to AWS Batch label Mar 24, 2022
@kaizencc kaizencc changed the title AWS Batch step function integration with container overrides: v 1.149.0 breaks my integration (stepfunctions-tasks): AWS Batch integration with container overrides: v 1.149.0 breaks my integration Mar 24, 2022
@kaizencc kaizencc added p1 and removed needs-triage This issue or PR still needs to be triaged. @aws-cdk/aws-stepfunctions-tasks labels Mar 24, 2022
@kaizencc kaizencc added effort/small Small work item – less than a day of effort @aws-cdk/aws-stepfunctions-tasks and removed @aws-cdk/aws-batch Related to AWS Batch labels Mar 24, 2022
kaizencc (Contributor) commented

Hi @jstag711, I'm hesitant to say this is a regression at the moment. Previously your values were being ignored. Now that they are not, it's hard to say if they were invalid before, or if there is a bug in the code.

It looks like the number currently provided to memory (-1.888154589709072e+289) is not what you are expecting. Can you tell me what you are expecting Size.mebibytes(JsonPath.numberAt("$." + SOLVER_TASK_PARAMETERS + "." + MEMORY)) to be?

jstag711 commented Mar 24, 2022

So if you look at the first link I sent, the values were (emphasis on the quotation marks) "ignored" - they were in fact read and then flagged as ignored. This is very weird behavior IMO, the warning text essentially says "we see you setting this value, we read it, and we're not going to use it".

I expect the value to be 61440.

Does the fix not work if the values are JsonPath values? I can attempt to set fixed values there and see if the behavior remains the same.

peterwoodworth (Contributor) commented

Yeah what we need to figure out is what value is getting passed into here:

Size.mebibytes(JsonPath.numberAt("$." + SOLVER_TASK_PARAMETERS + "." + MEMORY))

If you can log both JsonPath.numberAt("$." + SOLVER_TASK_PARAMETERS + "." + MEMORY) and the full expression, then we'd at least be able to figure out at which stage the bug is occurring.

jstag711 commented Mar 24, 2022

The value is 61440. The input to the step has not changed.

Also I'm not sure what you mean by log, that's part of the step function definition. I can show you the input if you want to verify my json path is valid.

peterwoodworth (Contributor) commented

When I pass in a hardcoded 61440, this is the relevant part of the DefinitionString synthesized for the State Machine

"ContainerOverrides\":{\"ResourceRequirements\":[{\"Type\":\"MEMORY\",\"Value\":\"61440\"},{\"Type\":\"VCPU\",\"Value\":\"2\"}]}}}

What value are you seeing in your synthesized definition string?
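
One way to answer that question quickly is to scan the synthesized definition for values that look like leaked number tokens. The following is a minimal sketch, not part of the CDK: the function name and the pattern are hypothetical, and the pattern is only a heuristic (a negative number in exponent notation with a three-digit exponent). The two sample strings reuse the value from the error message and the hardcoded fragment shown above.

```typescript
// Illustrative helper, not a CDK API: flags values in a synthesized
// definition that look like stringified number tokens.
// Heuristic (assumption): a negative number with a 3-digit positive exponent.
const LEAKED_NUMBER_TOKEN = /-\d\.\d+e\+\d{3}/g;

function findLeakedNumberTokens(definition: string): string[] {
  // String.prototype.match with a global regex returns all matches or null.
  return definition.match(LEAKED_NUMBER_TOKEN) ?? [];
}

// Value taken from the Batch.ClientException message in this issue.
const brokenFragment = '{"Type":"MEMORY","Value":"-1.888154589709072e+289"}';
// Fragment synthesized from hardcoded numbers.
const goodFragment = '{"Type":"MEMORY","Value":"61440"},{"Type":"VCPU","Value":"2"}';

console.log(findLeakedNumberTokens(brokenFragment));
console.log(findLeakedNumberTokens(goodFragment));
```

A check like this could run against the `cdk synth` output in a unit test, so a leaked token fails the build instead of failing the job at run time.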

jstag711 commented Mar 24, 2022

Pasting the relevant portion:

"ContainerOverrides":{"Command.$":"$.solverTaskParameters.cmd","ResourceRequirements":[{"Type":"MEMORY","Value":"-1.888154589709072e+289"},{"Type":"VCPU","Value":"-1.8881545897090718e+289"}]},"Timeout":{"AttemptDurationSeconds":900}}}}}

This is what it was prior to the upgrade:

"ContainerOverrides":{"Command.$":"$.solverTaskParameters.cmd","Memory.$":"$.solverTaskParameters.memory","Vcpus.$":"$.solverTaskParameters.cpu"},"Timeout":{"AttemptDurationSeconds":3300}}}}}

So can confirm the template is synthesized incorrectly. When I change the JsonPath params to raw numbers, I get the right thing:

"","ContainerOverrides":{"Command.$":"$.solverTaskParameters.cmd","ResourceRequirements":[{"Type":"MEMORY","Value":"10"},{"Type":"VCPU","Value":"10"}]},"Timeout":{"AttemptDurationSeconds":900}}}}}"

peterwoodworth (Contributor) commented

I think I see what's happening here.

JsonPath.numberAt() returns a token. We stringify the token's encoded numeric value directly, which cannot work when the value is meant to be a JsonPath reference.

What's interesting is that the same applies to GpuCount:

Value: `${containerOverrides.gpuCount}`,

Has no one run into this error with that before?

We need to consider the case where the value is a token
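
To make the failure mode concrete, here is a toy model of the problem. It is a sketch under stated assumptions, not the CDK's implementation: the `tokens` map, the `numberAt` stand-in, the encoding formula, and both rendering functions are hypothetical. The only things taken from this thread are the template-literal stringification and the `.$` key suffix that appears in the pre-upgrade definition.

```typescript
// Toy model only: every name here is hypothetical and does not mirror
// the CDK internals.
const tokens = new Map<number, string>(); // placeholder number -> JsonPath

function numberAt(path: string): number {
  // Each token gets a distinct, very large negative double as its placeholder.
  const encoded = -(1.8881545897 + tokens.size * 1e-10) * 1e289;
  tokens.set(encoded, path);
  return encoded;
}

// Naive rendering: a template literal turns the placeholder into digits,
// and the link back to the JsonPath is lost.
function naiveValue(n: number): Record<string, string> {
  return { Value: `${n}` };
}

// Token-aware rendering: check for a placeholder before stringifying and
// emit a JsonPath field (".$" suffix) when one is found.
function tokenAwareValue(n: number): Record<string, string> {
  const path = tokens.get(n);
  return path !== undefined ? { "Value.$": path } : { Value: `${n}` };
}

const memory = numberAt("$.solverTaskParameters.memory");
console.log(naiveValue(memory));      // a huge negative number as a string
console.log(tokenAwareValue(memory)); // a JsonPath reference
console.log(tokenAwareValue(61440));  // plain numbers are stringified as before
```

The design point is that the check has to happen before the number is turned into a string. Once the placeholder has been stringified, nothing downstream knows it was ever a token.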

@kaizencc kaizencc added the cause/tokens We did not consider token values label Mar 24, 2022
@kaizencc kaizencc changed the title (stepfunctions-tasks): AWS Batch integration with container overrides: v 1.149.0 breaks my integration (stepfunctions-tasks): AWS Batch integration with container overrides breaks with token values Mar 24, 2022
rix0rrr added a commit that referenced this issue Mar 26, 2022
Number tokens are encoded as a range of very large negative numbers (for
example: -1.888154589709072e+289). When these are naively stringified,
the `resolve()` method doesn't recognize and translate them anymore,
and these numbers end up in the target template in a confusing way.

However, recognizing them is actually not that hard and can be done
using a regex. We can then do the token resolution appropriately, making
it so that construct authors do not have to call
`Tokenization.stringifyNumber()` anymore in order to support
stringification of number values.

Fixes #19546.
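
The regex approach described in this commit message can be sketched as follows. This is a simplified illustration under assumptions, not the code from the fix: the registry, the encoding, the pattern, and the function names are stand-ins chosen for the example.

```typescript
// Sketch under assumptions: registry, encoding, and pattern are illustrative
// stand-ins, not the CDK's actual token map or regex.
const registry = new Map<number, string>(); // placeholder number -> resolved value

function registerNumberToken(resolvesTo: string): number {
  // Distinct, very large negative doubles serve as placeholders.
  const encoded = -(1.8881545897 + registry.size * 1e-10) * 1e289;
  registry.set(encoded, resolvesTo);
  return encoded;
}

// Matches the string form of the placeholder doubles produced above.
const STRINGIFIED_TOKEN = /-1\.\d+e\+289/g;

function resolveStringified(text: string): string {
  // Parse each match back to a number and look it up. Number-to-string-to-number
  // round-trips exactly in JavaScript, so the map lookup finds the original key.
  return text.replace(STRINGIFIED_TOKEN, (match) => registry.get(Number(match)) ?? match);
}

const memoryToken = registerNumberToken("$.solverTaskParameters.memory");
// A construct author naively stringifies the token...
const rendered = `{"Type":"MEMORY","Value":"${memoryToken}"}`;
// ...and resolution still recovers what the token stands for.
console.log(resolveStringified(rendered));
```

Matches that are not registered placeholders are left untouched, so an ordinary large negative number in the text is not rewritten by accident.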
@mergify mergify bot closed this as completed in #19578 Apr 1, 2022
mergify bot pushed a commit that referenced this issue Apr 1, 2022
Fixes #19546, closes #19550.


This was referenced Apr 7, 2022
mergify bot added a commit that referenced this issue Apr 7, 2022
See [CHANGELOG](https://github.com/aws/aws-cdk/blob/bump/1.152.0/CHANGELOG.md)

For convenience, extracted the relevant CHANGELOG entry:

## [1.152.0](v1.151.0...v1.152.0) (2022-04-06)


### Features

* **cfnspec:** cloudformation spec v63.0.0 ([#19679](#19679)) ([dba96a9](dba96a9))
* **cfnspec:** cloudformation spec v65.0.0 ([#19745](#19745)) ([796fc64](796fc64))
* **cli:** add --build option ([#19663](#19663)) ([eb9b8e2](eb9b8e2)), closes [#19667](#19667)
* **cli:** preview of `cdk import` ([#17666](#17666)) ([4f12209](4f12209))
* **core:** throw error when stack name exceeds max length ([#19725](#19725)) ([1ffd45e](1ffd45e))
* **eks:** add k8s v1.22 ([#19756](#19756)) ([9a518c5](9a518c5))
* **opensearch:** Add latest Opensearch Version 1.2 ([#19749](#19749)) ([a2ac36e](a2ac36e))
* add new integration test runner ([#19754](#19754)) ([1b4d010](1b4d010))
* **eks:** alb-controller v2.4.1 ([#19653](#19653)) ([1ec08df](1ec08df))
* **lambda:** add support for ephemeral storage ([#19552](#19552)) ([f1d9b6a](f1d9b6a)), closes [#19605](#19605)
* **s3:** EventBridge bucket notifications ([#18614](#18614)) ([d8e602b](d8e602b)), closes [#18076](#18076)
* **synthetics:** new puppeteer 3.5 runtime ([#19673](#19673)) ([ce2b91b](ce2b91b)), closes [#19634](#19634)


### Bug Fixes

* **aws_applicationautoscaling:** Add missing members to PredefinedMetric enum ([#18978](#18978)) ([75a6fa7](75a6fa7)), closes [#18969](#18969)
* **cli:** apps with many resources scroll resource output offscreen ([#19742](#19742)) ([053d22c](053d22c)), closes [#19160](#19160)
* **cli:** support attributes of DynamoDB Tables for hotswapping ([#19620](#19620)) ([2321ece](2321ece)), closes [#19421](#19421)
* **cloudwatch:** automatic metric math label cannot be suppressed ([#17639](#17639)) ([7fa3bf2](7fa3bf2))
* **codedeploy:** add name validation for Application, Deployment Group and Deployment Configuration ([#19473](#19473)) ([9185042](9185042))
* **codedeploy:** the Service Principal is wrong in isolated regions ([#19729](#19729)) ([7e9a43d](7e9a43d)), closes [#19399](#19399)
* **core:** `Fn.select` incorrectly short-circuits complex expressions ([#19680](#19680)) ([7f26fad](7f26fad))
* **core:** detect and resolve stringified number tokens ([#19578](#19578)) ([7d9ab2a](7d9ab2a)), closes [#19546](#19546) [#19550](#19550)
* **core:** reduce CFN template indent size to save bytes ([#19656](#19656)) ([fd63ca3](fd63ca3))
* **ecs:** 'desiredCount' and 'ephemeralStorageGiB' cannot be tokens ([#19453](#19453)) ([c852239](c852239)), closes [#16648](#16648)
* **ecs:** remove unnecessary error when adding volume to external task definition ([#19774](#19774)) ([5446ded](5446ded)), closes [#19259](#19259)
* **iam:** policies aren't minimized as far as possible ([#19764](#19764)) ([876ed8a](876ed8a)), closes [#19751](#19751)
* **logs:** Faulty Resource Policy Generated ([#19640](#19640)) ([1fdf122](1fdf122)), closes [#17544](#17544)
mergify bot added a commit that referenced this issue Apr 7, 2022
See [CHANGELOG](https://github.com/aws/aws-cdk/blob/bump/2.20.0/CHANGELOG.md)

For convenience, extracted the relevant CHANGELOG entry:

## [2.20.0](v2.19.0...v2.20.0) (2022-04-07)


### Features

* **cfnspec:** cloudformation spec v63.0.0 ([#19679](#19679)) ([dba96a9](dba96a9))
* **cfnspec:** cloudformation spec v65.0.0 ([#19745](#19745)) ([796fc64](796fc64))
* **cli:** add --build option ([#19663](#19663)) ([eb9b8e2](eb9b8e2)), closes [#19667](#19667)
* **cli:** preview of `cdk import` ([#17666](#17666)) ([4f12209](4f12209))
* **core:** throw error when stack name exceeds max length ([#19725](#19725)) ([1ffd45e](1ffd45e))
* **eks:** add k8s v1.22 ([#19756](#19756)) ([9a518c5](9a518c5))
* **opensearch:** Add latest Opensearch Version 1.2 ([#19749](#19749)) ([a2ac36e](a2ac36e))
* add new integration test runner ([#19754](#19754)) ([1b4d010](1b4d010))
* **eks:** alb-controller v2.4.1 ([#19653](#19653)) ([1ec08df](1ec08df))
* **lambda:** add support for ephemeral storage ([#19552](#19552)) ([f1d9b6a](f1d9b6a)), closes [#19605](#19605)
* **s3:** EventBridge bucket notifications ([#18614](#18614)) ([d8e602b](d8e602b)), closes [#18076](#18076)


### Bug Fixes

* **aws_applicationautoscaling:** Add missing members to PredefinedMetric enum ([#18978](#18978)) ([75a6fa7](75a6fa7)), closes [#18969](#18969)
* **cli:** apps with many resources scroll resource output offscreen ([#19742](#19742)) ([053d22c](053d22c)), closes [#19160](#19160)
* **cli:** support attributes of DynamoDB Tables for hotswapping ([#19620](#19620)) ([2321ece](2321ece)), closes [#19421](#19421)
* **cloudwatch:** automatic metric math label cannot be suppressed ([#17639](#17639)) ([7fa3bf2](7fa3bf2))
* **codedeploy:** add name validation for Application, Deployment Group and Deployment Configuration ([#19473](#19473)) ([9185042](9185042))
* **codedeploy:** the Service Principal is wrong in isolated regions ([#19729](#19729)) ([7e9a43d](7e9a43d)), closes [#19399](#19399)
* **core:** `Fn.select` incorrectly short-circuits complex expressions ([#19680](#19680)) ([7f26fad](7f26fad))
* **core:** detect and resolve stringified number tokens ([#19578](#19578)) ([7d9ab2a](7d9ab2a)), closes [#19546](#19546) [#19550](#19550)
* **core:** reduce CFN template indent size to save bytes ([#19656](#19656)) ([fd63ca3](fd63ca3))
* **ecs:** 'desiredCount' and 'ephemeralStorageGiB' cannot be tokens ([#19453](#19453)) ([c852239](c852239)), closes [#16648](#16648)
* **ecs:** remove unnecessary error when adding volume to external task definition ([#19774](#19774)) ([5446ded](5446ded)), closes [#19259](#19259)
* **iam:** policies aren't minimized as far as possible ([#19764](#19764)) ([876ed8a](876ed8a)), closes [#19751](#19751)
* **logs:** Faulty Resource Policy Generated ([#19640](#19640)) ([1fdf122](1fdf122)), closes [#17544](#17544)
StevePotter pushed a commit to StevePotter/aws-cdk that referenced this issue Apr 27, 2022
Fixes aws#19546, closes aws#19550.

