Request: provide versioned docker images in ghcr/dockerhub #1086

Closed
merryHunter opened this issue Jul 4, 2022 · 13 comments
Labels: bug, ci-gitlab, cml-image, cml-runner, duplicate, question

Comments

@merryHunter

merryHunter commented Jul 4, 2022

Hi! Currently, the base CML Docker images are rebuilt from the latest code and pushed every day to e.g. docker://ghcr.io/iterative/cml:0-dvc2-base1 or https://hub.docker.com/r/iterativeai/cml/tags. That means there is no way to roll back to a previous version. Unfortunately, recent changes affected our cloud training pipelines and we had to adjust them.

In my opinion, it would be beneficial to have stable, fixed-version Docker images. That would ensure that once we pull one of them, there is no chance that something has been updated or broken.
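
Until versioned tags exist, one possible workaround is to reference an image by its content digest instead of a moving tag. The sketch below is only an illustration under stated assumptions: the helper names `resolve_digest` and `is_pinned_by_digest` are hypothetical, and it assumes that Docker is available and that the registry keeps serving a manifest after its tag has moved on, which is not guaranteed.

```shell
# Hypothetical helpers; the names are illustrative and not part of CML.

# Print the digest-qualified reference (repo@sha256:...) that a moving tag
# currently points to. Requires a working Docker daemon.
resolve_digest() {
  docker pull --quiet "$1" >/dev/null &&
    docker inspect --format '{{index .RepoDigests 0}}' "$1"
}

# Succeed only if the image reference ends in a sha256 digest, so that a
# pipeline can refuse to run with an unpinned image.
is_pinned_by_digest() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

For example, `resolve_digest ghcr.io/iterative/cml:0-dvc2-base1` could be run once on a known-good day and the printed reference stored in the CI configuration, with `is_pinned_by_digest` used as a guard at the start of the job.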

@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 4, 2022

@merryHunter, are you using GitHub Actions? In that case, you can pin an exact CML version by using the following setup step instead of a container:

- uses: iterative/setup-cml@v1
  with:
    version: 0.14.0

Additionally, this will allow you to use any container image or just remove it altogether.

@merryHunter
Author

Hi @0x2b3bfa0, thanks for the references to the issues! No, we are using GitLab CI, which is why we cannot use that at the moment.

@0x2b3bfa0
Member

Then, you can try installing one of our binary releases:

curl --location https://github.com/iterative/cml/releases/download/v0.16.1/cml-linux-x64 --output /usr/bin/cml && chmod a+x $_
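
For a CI job, the same install could be wrapped so that the version lives in one place. This is a minimal sketch, not an official installer: the function names are hypothetical, and it assumes the `cml-linux-x64` asset name shown above is also used by the release being requested. It passes `--location` because GitHub serves release assets through a redirect, and `--fail` so that an HTTP error does not leave an error page saved as the binary.

```shell
# Hypothetical helpers; the names are illustrative and not part of CML.

# Build the download URL for a given release tag, e.g. v0.16.1.
# Assumes the asset is named cml-linux-x64 for that release.
cml_release_url() {
  printf 'https://github.com/iterative/cml/releases/download/%s/cml-linux-x64\n' "$1"
}

# Download the pinned binary and make it executable.
# $1: release tag; $2: destination path (defaults to /usr/bin/cml).
install_cml() {
  dest="${2:-/usr/bin/cml}"
  curl --fail --location --silent --show-error \
    "$(cml_release_url "$1")" --output "$dest" &&
    chmod a+x "$dest"
}
```

A job would then call `install_cml v0.16.1` in its setup step.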

@merryHunter
Author

I am not aware of the recent changes, but to give a bit more context on the problem we faced: we have a CI pipeline in GitLab where a cml-runner launches training on AWS with a startup script that mounts EFS to access the data. For no apparent reason, our startup script started to fail silently while the cml runner job succeeded. After debugging and looking into the script logs on the EC2 instance, we saw errors like E: Could not get lock /var/lib/dpkg/lock – open (11: Resource temporarily unavailable) E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?, as we had an apt-get update instruction. We fixed it by adding a sleep command at the beginning, which gives the other process time to finish installing its software. Still, nothing prevents something similar from happening in the future with the current approach to tagging Docker images.

We also have problems with passing down the environment variable 'DOCKER_SHM_SIZE=4g', but that's another issue.
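
As an alternative to a fixed sleep, a startup script could poll the dpkg lock from the error above until it is free. The following is a rough sketch under assumptions: `wait_for_lock` is a hypothetical helper, it relies on `fuser` (from the psmisc package) being installed, and it cannot rule out another process grabbing the lock between the check and the next command.

```shell
# Hypothetical helper; the name is illustrative.
# Wait until no process has the given lock file open, or give up after a
# timeout. $1: lock file path; $2: timeout in seconds (defaults to 300).
wait_for_lock() {
  lock_file="$1"
  timeout="${2:-300}"
  waited=0
  # fuser exits with status 0 while at least one process has the file open.
  # If fuser is missing or the file does not exist, the loop ends at once.
  while fuser "$lock_file" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $lock_file" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A script would use it as `wait_for_lock /var/lib/dpkg/lock && apt-get update`; the lock path is taken as a parameter because different systems may use different lock files.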

@0x2b3bfa0
Member

Thank you for the detailed description of the issue. 🙏

Pinning CML might not suffice to solve this issue, as it depends internally on https://github.com/iterative/terraform-provider-iterative (unpinned) to provision cloud instances. Moreover, the machine images aren't pinned either.

@0x2b3bfa0
Member

The provided startup script runs synchronously. Therefore, your issue can only (?) be caused by an ongoing automatic upgrade. 🤔

@merryHunter
Author

Exactly, that's the problem with the software upgrade we identified. However, it only shows that, since CML depends on TPI (which is a new tool, by the way) and TPI can change in unexpected ways, it would be really great to have at least major releases tagged on Docker Hub. I have read the threads and I see that choosing a tag naming scheme is a hard decision, yet the problem remains.

@dacbd
Contributor

dacbd commented Jul 4, 2022

There are hidden cml runner options you can use to pin the TPI and CML versions for your created instance.

cml runner ... \
    --cml-version="v0.15.2" \
    --tpi-version="= 0.10.18" \
...

@dacbd added the cml-runner, discussion, and ci-gitlab labels Jul 5, 2022
@casperdcl
Contributor

> fixed it by adding sleep command in the beginning

@merryHunter we just did the same in iterative/terraform-provider-iterative#621 (what cml runner uses under-the-hood) so you don't have to :)

@merryHunter
Author

@casperdcl @0x2b3bfa0 that's an amazing patch! :) Glad our issue helped identify the problem. Then let's close this issue.

@0x2b3bfa0
Member

For those who are looking for proper, production-grade container images: there aren't any.

See also

@0x2b3bfa0 self-assigned this Jul 7, 2022
@0x2b3bfa0 added the bug, duplicate, question, and cml-image labels and removed the discussion label Jul 7, 2022
@dacbd
Contributor

dacbd commented Oct 11, 2022 via email


4 participants