-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download: Improved container priority and resolution #2364
Download: Improved container priority and resolution #2364
Conversation
…s solution was prone to false positives and negatives.
…ne string which I abused as comment.
…w class are stateful, i.e. if the function is called without a parameter that was provided in a previous call, then it is still present.
Codecov Report
@@ Coverage Diff @@
## dev #2364 +/- ##
==========================================
+ Coverage 72.76% 72.80% +0.04%
==========================================
Files 78 78
Lines 8889 8888 -1
==========================================
+ Hits 6468 6471 +3
+ Misses 2421 2417 -4
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only some small comments
valid_containers = list(filter(either_url_or_docker.match, container_value_defs)) | ||
|
||
if valid_containers: | ||
cleaned_matches = cleaned_matches + valid_containers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand, what's the difference between cleaned_matches
and valid_containers
? Are they obtained with the same regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleaned matches
is the list (line 784) that collects all valid_containers
from the loop over the raw findings (line 796).
this_container = docker_match.group(0) | ||
break | ||
# all implemented options exhausted. Nothing left to be done: | ||
log.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be warning? It doesn't stop execution, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Until 2.8, the download would exit and fail if a container image download failed. But that meant that pipelines could not be downloaded with container images, if just one image failed for whatever reason. I briefly discussed this with mashehu, and we agreed that the friendlier user experience would be to continue the download.
Like so, the user only has to manually troubleshoot one image download, in contrast to somehow obtaining 40 images or so one by one. Therefore, this changed behaviour was implemented with release 2.9.
nf_core/download.py
Outdated
""" | ||
d = {} | ||
for c in container_list: | ||
log.info(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a message? For example:
log.info(c) | |
log.info(f"Evaluating container {c}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ehm...no, because that log.info()
statement was never meant to stay in there, as it just clutters the output. Excellent catch, I have deleted it now!
That was a leftover of a debugging process, when I failed to understand why certain images showed up in the container_list
even if they were not submitted to the function. This happened because the function retained the state from a previous execution, due to my oversight regarding the distinction between Class and Static methods. I assumed the behaviour of a Static method without explicitly defining it as one.)
# | ||
@with_temporary_folder | ||
@mock.patch("nf_core.utils.fetch_wf_config") | ||
def test_find_container_images_modules(self, tmp_path, mock_fetch_wf_config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will be very useful, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're welcome!
It will benefit me as well, as I won't be surprised by a sudden new module syntax. Once we settle on a new template syntax and modify the modules create
and lint
accordingly, we can also directly include one in this test to ensure that downloads can also resolve the container paths accurately.
This PR improves the code in Download that detects and resolves container images in modules and configs.
Previously, a loop was used to prioritize http:// downloads over docker URIs. For example, from
the strings
'singularity'
,'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0'
and'biocontainers/fastqc:0.11.9--0'
were extracted into a list. Then we iterated over this list and broke the loop as soon as a match with a URL regex or Docker regex was found.This approach was flaky, because
'singularity'
or'apptainer'
are valid Docker URIs. Thus, unless it was explicitly ignored as a false positive, it prevented further evaluation and detection of the real image paths. Now, evaluation continues in any case, which should make it more resilient for further syntax updates. Furthermore, the old approach also relied on the convention that the URL would always be listed before the Docker URI.The old approach failed entirely for early DSL2, where a separate container declaration was used:
For each container declaration, a new loop was created, such that it was not possible to prioritize the direct download over the Docker URI for early DSL2. Therefore, each image was once downloaded and once pulled.
This PR now changes the approach to first gather all Docker URIs and download URLs from all modules and configs into a joint, global list. Subsequently, a new helper function
prioritize_direct_download()
removes all Docker URIs from the list for which also direct download URLs are contained. Consolidating container URIs/URLs from different modules is now possible for the first time.In addition to the improved function, a new test has been devised that runs on mock modules in the tests/data directory. So any future module standard should be reflected by a mock module in that folder in the future.
PR checklist
CHANGELOG.md
is updateddocs
is updated