[ecs] [ContainerPullError]: CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref #2493

Open
tremendoustj opened this issue Dec 8, 2024 · 1 comment

@tremendoustj

We are getting the error below and have already checked everything that is usually suggested for it:

CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref .dkr.ecr.us-west-2.amazonaws.com/otel-collector:latest@sha256:863****9ac: unexpected status from HEAD request to https://.dkr.ecr.us-west-2.amazonaws.com/v2/otel-collector/blobs/sha256:863*****9ac: 403 Forbidden
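
For anyone triaging the same message, here is a minimal sketch that splits the failing reference into its parts. The `@sha256:...` suffix after the tag is the useful detail: it shows the pull is being made by digest rather than by tag alone. The function name `parse_pull_error` and the regular expression are illustrative assumptions, not part of any AWS tooling, and the pattern only covers references shaped like the one above.

```python
import re

# Hypothetical helper: matches "failed to resolve ref <repo>[:tag][@sha256:...]".
REF_PATTERN = re.compile(
    r"failed to resolve ref (?P<repo>[^\s:@]+)"
    r"(?::(?P<tag>[^\s:@]+))?"
    r"(?:@(?P<digest>sha256:[^\s:]+))?"
)


def parse_pull_error(message):
    """Return registry, repository, tag and digest from a pull error, or None."""
    match = REF_PATTERN.search(message)
    if match is None:
        return None
    # Everything before the first "/" is the registry host.
    registry, _, repository = match.group("repo").partition("/")
    return {
        "registry": registry,
        "repository": repository,
        "tag": match.group("tag"),
        "digest": match.group("digest"),  # None when the pull was by tag only
    }


if __name__ == "__main__":
    error = (
        "CannotPullContainerError: pull image manifest has been retried 1 time(s): "
        "failed to resolve ref .dkr.ecr.us-west-2.amazonaws.com/otel-collector"
        ":latest@sha256:863****9ac: unexpected status from HEAD request"
    )
    print(parse_pull_error(error))
```

If `digest` comes back non-empty while the task definition only names a tag, the digest was added on the ECS side, not by the task definition.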

The AWS gateway endpoint for S3 is configured correctly.
The AWS dkr.ecr and ecr.api endpoints are associated with our subnets.
The ECR repository is private and has the relevant permissions to allow pulls from the other accounts where the ECS tasks are running.
The ECS tasks' IAM role has the correct permissions, and we use the exact same role for all the ECS tasks.
The network configuration also seems to be correct.
What I have observed is that ECS is pulling the image with the right tag but the wrong SHA256 digest: 863*9ac. This digest doesn't exist in our repo, probably because the latest tag has been overridden and now points to an image with a different digest.
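
One way to confirm that the digest is really gone is to ask ECR for it directly. The sketch below assumes a boto3 ECR client (for example `boto3.client("ecr")`) is passed in and uses its `batch_get_image` call; the helper name `digest_exists` and its parameters are hypothetical, and the full digest has to be supplied because the value above is masked.

```python
def digest_exists(ecr_client, repository, digest, registry_id=None):
    """Return True if ECR still holds an image with this digest.

    ecr_client is assumed to be a boto3 ECR client. registry_id is the
    account ID that owns the repository, needed for cross-account lookups.
    """
    kwargs = {
        "repositoryName": repository,
        "imageIds": [{"imageDigest": digest}],
    }
    if registry_id:
        kwargs["registryId"] = registry_id
    response = ecr_client.batch_get_image(**kwargs)
    # Missing images are reported under "failures" instead of "images".
    return len(response.get("images", [])) > 0
```

Calling it with the repository name `otel-collector` and the digest taken from the error message should return `False` if the image behind the old digest has been deleted.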

The task definition points only to the latest tag, and we don't hardcode the SHA256 digest anywhere. Why, then, is ECS trying to pull the image by that specific digest? That is what I can't figure out and need help with from the AWS community.

The tasks had been running fine for a long time.

Then some tasks started restarting on their own and were not able to pull one of the sidecar images.

This caused the tasks to fail and emit the SERVICE_TASK_START_IMPAIRED event, which we trigger an alarm on.

After receiving the alert, we found its cause: ECS is trying to use a SHA256 digest for the sidecar image, and that image was modified recently.

Our task definitions don't hardcode the digest, yet that is what always gets pulled.

A forced redeployment of the services fixed it, but that shouldn't be necessary, since we hardcode the tag and not the digest.

@tremendoustj tremendoustj added the Proposed Community submitted issue label Dec 8, 2024
@vibhav-ag vibhav-ag self-assigned this Dec 9, 2024
@pallymore
Member

The task definition points only to the latest tag, and we don't hardcode the SHA256 digest anywhere. Why, then, is ECS trying to pull the image by that specific digest?

This is likely the issue. ECS now enforces version consistency on container images by default: when a deployment starts, it resolves each image tag to a digest and uses that digest for every task in the deployment. If the original image is gone, new tasks won't be able to pull it. See the announcement here: https://aws.amazon.com/blogs/containers/announcing-software-version-consistency-for-amazon-ecs-services/

Using the :latest tag may seem harmless (or even convenient), but it may hurt your service in the long run. For example, it'll be harder to roll back or debug when something goes wrong.

From here I can see a few options:

  • Move away from the :latest tag. Use explicit versions instead and update the task definition for each deployment (there are tools to automate this). This is what I'd recommend.
  • Always use Force new deployment when updating the service. A forced new deployment invalidates the current task set, and the agent will look up the new sha256 digest for the image. This can be costly and slow, since all tasks will be replaced.
  • Opt out of the feature by updating the task definition and setting versionConsistency to disabled for each image that uses the :latest tag. See: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#ContainerDefinition-versionconsistency
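
To illustrate the third option, here is a rough sketch that takes a task definition as a plain dict and sets `versionConsistency` to `disabled` on every container whose image uses the `:latest` tag, explicitly or by leaving the tag off. The helper name `disable_version_consistency` and the sample image names are assumptions for illustration; only the `versionConsistency` field and its `disabled` value come from the documentation linked above.

```python
import copy


def disable_version_consistency(task_definition, tag="latest"):
    """Return a copy with versionConsistency disabled for images on `tag`."""
    updated = copy.deepcopy(task_definition)
    for container in updated.get("containerDefinitions", []):
        image = container.get("image", "")
        if "@" in image:
            # Already pinned to a digest in the task definition; leave it alone.
            continue
        # Look at the last path segment only, so a registry port is not
        # mistaken for a tag.
        _, has_tag, image_tag = image.rsplit("/", 1)[-1].partition(":")
        if not has_tag or image_tag == tag:
            container["versionConsistency"] = "disabled"
    return updated


if __name__ == "__main__":
    # Hypothetical task definition with one pinned image and one :latest sidecar.
    task_def = {
        "family": "example-service",
        "containerDefinitions": [
            {"name": "app", "image": "example/app:1.4.2"},
            {"name": "otel-collector", "image": "example/otel-collector:latest"},
        ],
    }
    for container in disable_version_consistency(task_def)["containerDefinitions"]:
        print(container["name"], container.get("versionConsistency", "(default)"))
```

The returned dict would then need to be registered as a new task definition revision (for example through `register_task_definition` on a boto3 ECS client) and the service updated to use it. Containers pinned to an explicit version are left on the default behaviour, which fits the first option in the list.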
