Rule 0 - Don't use docker #96
I'm going to disagree here - reproducibility is a continuum, and using a Docker image (e.g., stored in a registry with care taken so the image isn't purged) is more reproducible than not, and definitely needing to rebuild is less reproducible than having the image already built. You could then easily pull down to a read-only SIF (Singularity container) to have a more time-worthy artifact.
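For illustration, a minimal sketch of that workflow - the image reference and filenames below are only placeholders, not something taken from this thread:

```sh
# Pull an image from a Docker registry and flatten it into a single read-only
# SIF file that can be checksummed and archived alongside the analysis.
singularity pull analysis.sif docker://python:3.11-slim

# Record a fingerprint of the artifact for later verification.
sha256sum analysis.sif
```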
@vsoch thank you for your quick reply! Why is an existing, externally stored Docker image more reproducible than one that is built automatically? In the best case the image is somehow reproducible by someone (maybe it is not even documented how to do so), given that its hidden dependencies are as well. But in practice it is likely that this image, or one of its dependencies, relies on some external service that has since changed or disappeared. How could you take care of any registry outside your control? codehaus.org purged everything (nowadays not even the domain exists); two years ago nobody could imagine that this would ever happen. Maven and nodejs packages were deleted after new maintainers took over or bought packages, and in some cases malicious content was even provided instead. Remember when "left-pad" was deleted? It was in the news everywhere. This broke all builds using left-pad, and every build that broke obviously was not reproducible.
It's not a binary state - reproducible or not reproducible - it's a continuum that ranges from perhaps poorly reproducible to "best effort" reproducible (and I'd argue perfect reproducibility is more of a pipe dream).
Rebuilding an image requires all remote installs, etc. to still be present. As you noted, this isn't always reliable. Instead, pulling a pre-built container at least promises to get the previously built layers. Is it perfect? Of course not - as you noted even registries can go away. But retrieving the same layers and containers that someone used for an analysis is slightly more reproducible (in my opinion) than building afresh and not being sure you have the same software, updates, etc. And of course you would want to pull by digest and not tag, which is a moving target.
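To make the "digest, not tag" point concrete (the digest value below is a placeholder - in practice you would record the real one when the image is first pulled):

```sh
# A tag is a moving target: this can resolve to different content tomorrow.
docker pull ubuntu:22.04

# Look up the content-addressed digest the tag currently points to...
docker inspect --format '{{index .RepoDigests 0}}' ubuntu:22.04

# ...and pin to it. The digest here is a placeholder, not a real one.
docker pull ubuntu@sha256:0000000000000000000000000000000000000000000000000000000000000000
```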
You cannot. You must be resilient to the likely migration and change. E.g., in the CI world we've jumped around from Travis to Circle, from Circle to GitHub workflows, and I'm sure I'll need to jump again. It's part of the "scrappy and resilient researcher / research software engineer" life - we take advantage of what is available to us at a particular time, and when the time comes (a service goes away) we refactor for a different one.
@vsoch thank you for your detailed answer. Ahh, I see. If an image has remote installs, it is not reproducible and thus must be stored as source (a local copy). Additionally, it might become impossible to maintain: if, for example, a small bug needs to be fixed in ten years, let's say the zlib overflow bug, it can get almost impossible - you won't find the needed packages to install, nor the sources, and so on... If you also rely on external build systems, you cannot know whether anything is reproducible at all; maybe GitHub uses a secret Microsoft technology to make things look as if they work, to keep you locked in - and your scripts fail on any other infrastructure, who knows. If such a service goes away, your reproducibility goes away: if your package builds only on GitHub, it can be impossible to produce the same result elsewhere - at the very least you have to modify the input, e.g. the Dockerfile, and since Docker cannot produce binary-identical output from the same input, it can be hard to prove that the new output even functionally reproduces the old one...
I would be very interested to see an image that is used ten years later. Software is living, in a way - the libraries that are valued and used will be updated (and have new containers) and the people that use them will follow suit. I much prefer older software dying/going away and the ecosystem continuing to change and grow. Again, part of being a research software engineer is flexibility and portability. When something goes away, we move and find another way. This "perfectly reproducible and reliable" state that you speak of is a fantasy.
@vsoch I already had to build such old software to (help others) prove certain aspects. Actually, it was easier with, let's say, 20-year-old software; with "modern" technologies it seems to get more and more difficult. We use Docker with certain specific rules to keep our build process predictable (full reproducibility is still a long way off). That's why I posted my comments here: my experiences and the conclusions/rules derived from them differ essentially on some points, or are even contrary (for example, my 1st rule is: "during the build, do not use anything from the internet").
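As an aside, one way such a rule can be enforced mechanically (the image name and Dockerfile path below are placeholders, not from this thread) is to disable networking for the build, so any step that tries to reach the internet fails immediately:

```sh
# Build with networking disabled: every RUN step that tries to download
# something fails loudly instead of silently depending on a remote resource.
docker build --network=none -t myproject/builder:local -f Dockerfile.offline .
```

All dependencies then have to be brought into the build context beforehand, e.g. COPYed in from a local, version-controlled mirror.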
I would like to add a “Rule 0” to “Writing Dockerfiles for Reproducible Artifacts”, and it is:
“Do not use docker”.
Docker makes it hard to be reproducible: the ecosystem (Docker Hub, tutorials…) keeps changing and is hard to “burn onto a DVD and put into a safe” (e.g. for escrow). Whatever starts with “apt update” cannot be reproducible by definition (it “depends on the internet”), and this is very common in Docker communities.
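To illustrate the “apt update” point (package names here are just examples): the same unpinned install resolves to whatever the mirrors happen to serve on the day of the build, so two builds run at different times can silently differ.

```sh
# Unpinned: resolves to "today's" package versions - a different result next month.
apt-get update && apt-get install -y --no-install-recommends zlib1g

# At minimum, record what was actually installed so the drift is visible later.
dpkg-query -W -f='${Package}=${Version}\n' > /build-manifest.txt
```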
Even if all the input data is reproducible, the Docker images still are not, because Docker has no way to omit or fix the timestamps included in the built artifacts, so we can at most be “functionally reproducible”, which is hard to prove. (Bit-for-bit reproducibility is easy to prove: just compute a secure hash of the input and of the result; for the same input the result must have the same hash - it is reproducible if and only if the hashes match.)
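A rough sketch of that check, assuming a build context and image tags that are placeholders only - and with Docker the two output hashes will normally differ because of the embedded timestamps, which is exactly the problem described above:

```sh
# Hash the inputs.
sha256sum Dockerfile build-context.tar

# Build twice without cache, export each result, and hash the outputs.
docker build --no-cache -t repro-test:a . && docker save repro-test:a | sha256sum
docker build --no-cache -t repro-test:b . && docker save repro-test:b | sha256sum

# Identical input hashes but differing output hashes => at best "functionally
# reproducible", not bit-for-bit reproducible.
```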