fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type #29651

Closed
wants to merge 4 commits into from

Conversation

wafuwafu13
Contributor

@wafuwafu13 wafuwafu13 commented Mar 29, 2024

Issue # (if applicable)

#29262

Reason for this change

When an INFERENTIA or TRAINIUM instance type is used, https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml is applied to the cluster, but the Pods go into CrashLoopBackOff (detailed log in #29262 (comment)).

The upstream source of the current YAML, https://github.com/aws-neuron/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml, now returns "File not found". The current file cites it as:

# source: https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml

Description of changes

Description of how you validated changes

  • Pass unit tests
  • Pass integ tests

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added p2 valued-contributor [Pilot] contributed between 6-12 PRs to the CDK labels Mar 29, 2024
@aws-cdk-automation aws-cdk-automation requested a review from a team March 29, 2024 10:24
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment


The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@wafuwafu13 wafuwafu13 changed the title from "fix(aws-eks): Pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type" to "fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type" Mar 29, 2024
private addNeuronDevicePluginRbac() {
  if (!this._neuronDevicePluginRbacClusterRole) {
    const clusterRoleFileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin-rbac-cluster-role.yaml'), 'utf8');
    const sanitizedClusterRole = YAML.parse(clusterRoleFileContents);
Contributor Author

@wafuwafu13 wafuwafu13 Mar 29, 2024


With parseAllDocuments, I would not need to split k8s-neuron-device-plugin-rbac.yml into three files. However, its return type differs from the return type of parse, so the addManifest function cannot handle the parsed YAML.
I think splitting k8s-neuron-device-plugin-rbac.yml into three files and using parse is the simplest solution.
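
For comparison, here is a minimal sketch of a third option: keep one multi-document file and split it into per-document strings before parsing. This is not what the PR does. `splitYamlDocuments` is a hypothetical helper, and the manifest below is an illustrative stand-in rather than the real contents of k8s-neuron-device-plugin-rbac.yml. The split is deliberately naive: it treats any line consisting only of `---` as a separator and ignores `...` end markers, directives, and `---` lines inside block scalars.

```typescript
// Hypothetical helper (not part of aws-cdk-lib or the `yaml` package):
// splits a multi-document YAML stream into one string per document.
// Assumption: documents are separated by lines that contain only '---'.
function splitYamlDocuments(stream: string): string[] {
  return stream
    .split(/^---[ \t]*$/m)                     // cut at separator lines
    .filter((doc) => doc.trim().length > 0);   // drop empty fragments
}

// Illustrative stand-in for a multi-document RBAC manifest (names are made up).
const rbacStream = [
  'kind: ClusterRole',
  'metadata:',
  '  name: example-role',
  '---',
  'kind: ServiceAccount',
  'metadata:',
  '  name: example-account',
  '---',
  'kind: ClusterRoleBinding',
  'metadata:',
  '  name: example-binding',
].join('\n');

const documents = splitYamlDocuments(rbacStream);
console.log(`found ${documents.length} documents`);
```

Each resulting string could then be passed to `YAML.parse` separately, which keeps the value in the same shape that the existing code passes along after `parse`. Splitting the source into three files, as proposed above, avoids relying on this string handling at all.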

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 91507a4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@wafuwafu13
Contributor Author

Exemption Request: I updated integ.eks-inference-nodegroup and integ.eks-inference

@aws-cdk-automation aws-cdk-automation added pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. labels Mar 29, 2024
This was referenced Apr 1, 2024
@aws-cdk-automation
Collaborator

This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@shikha372 shikha372 self-assigned this Apr 22, 2024
@aws-cdk-automation
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Apr 27, 2024
@aws-cdk-automation aws-cdk-automation removed the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Apr 27, 2024
@aws-cdk-automation
Collaborator

The pull request linter fails with the following errors:

❌ Fixes must contain a change to an integration test file and the resulting snapshot.

PRs must pass status checks before we can provide a meaningful review.

If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing Exemption Request and/or Clarification Request.

✅ An exemption request has been submitted. Please wait for a maintainer's review.

Labels
closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. p2 pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. valued-contributor [Pilot] contributed between 6-12 PRs to the CDK
3 participants