-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate missing bowtie2 index files #164
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think this improves your workflow, then please merge this, but let me explain why I usually don’t include the indexing step in the workflow anymore. I used to do this, but there are a couple of problems.
Often, the reference and index are external to the workflow, i.e. in some other, often shared directory. That directory may be read only, so the workflow cannot generate the index directly in that directory. If the directory is writable, concurrently running instances of the workflow might try to generate the index at the same time, and the final index could be corrupt. Writing into locations that should be considered input would also be unexpected behavior as far as the user is concerned.
The alternative (that you chose) is to use an index if it exists, but to generate one in a run-specific folder if it doesn’t. This avoids the issues arising from treating input as output, but forces index generation for every newly setup pipeline run. In the general case, I think this is a waste of resources and the tradeoff I liked better was to just require the index to exist, that is, to treat the index is an input to the pipeline.
As I said, if this makes things simpler for your specific case, then please go ahead; the way you implemented it is a good one.
if not path.with_name(path.name + ext).exists(): | ||
raise FileNotFoundError( | ||
f"Bowtie2 index file missing for '{path}'\n- " | ||
+ str(path) + ext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s a bit inconsistent to mix an f-string with string concatenation. I’d recommend using only an f-string. You should also remove the \n-
part, which only made sense in the previous version where the -
was used as a bullet, but now this doesn’t make sense as you complain only about a single file.
f"Bowtie2 index file missing for '{path}'\n- " | ||
+ str(path) + ext | ||
) | ||
return(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no parentheses around the returned value.
8 | ||
run: | ||
if params.index: | ||
with open(str(log), "w") as logf: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems more like debugging output, do you think this should be in here?
Thanks for the quick feedback! I agree with your concerns about in-situ index generation, that's why I put it as run-specific output. This is a feature with a little bit of a broader audience in mind, for our own use it is not so relevant because we have the config files already set up in a shared environment and this almost never changes. I was trying to streamline running Regarding waste of resources, I came to the conclusion that more advanced users would notice this and less advanced ones would prefer to generate them within the workflow. But maybe it's better to force users to save computing time. A possible alternative would be to only run index generation if the user explicitly asks for it, as in I like the bowtie2-index path explicit in the A totally different course of action could be having a single installation-wide The main advantage I see is that one would only need to touch the references' paths once when installing minute, and maybe when/if a new reference is added, which is not so frequent in general. The disadvantage is that this configuration would probably need to be edited through the CLI wrapping code, because where the global configuration is should be transparent to the user, I think, so maybe something like a |
Yes, that was the tradeoff I mentioned (but didn’t explain explicitly): Having a separate index generation step is "safer" but less convenient.
This is a good option. There’s actually no need to invoke Snakemake, you can just run the command directly (via subprocess.run), so it could be Perhaps it’s also nice to check whether the index files exist within
Absolutely, that makes sense.
I added something similar to IgDiscover: Most of the configuration is run-specific, but there can also be a The Note that this is a user-specific configuration, so every minute user would have to copy the initial configuration from somewhere. Or the user-specific configuration file could have a single configuration setting that points to the actual system-wide configuration ... Thinking even further: If you only need to configure references and indexes, you could have a single configuration setting that is the name of a directory with the FASTA references and their indices. Then only one person needs to add a new reference to that file and index it to make it available to everyone. One minor usability problem could be that it’s possibly a bit intransparent where the index comes from: Imagine someone trying out minute, creating a configuration file, but then they never use it and install it again a year later, at which point they get an error message because the configuration no longer reflects where they store their references and indices. This is easy to solve by letting the pipeline clearly print out from where it reads the configuration and indices.
With the directory idea, And I cannot resist mentioning it: strobealign needs only 2.5 minutes to index GRCh38, so the discussion about wasted CPU time would become moot ... |
Thanks for your insights :) I just left this a bit on standby because I am interested in evaluating strobealign, so as you say there is no point in updating this if eventually we move towards using it, so there might be another PR in that direction in which case I will close this one. The discussion about where to config the references, even though related I think it would belong to another PR as well. |
This PR includes functionality related to resolve #162 . Mainly:
bowtie2_build
rule that creates the bowtie2 index files if they are missing from theminute.yaml
file or not found where the user specified they were.bowtie2_index
field altogether in theminute.yaml
file so these changes are compatible with previous config files (although this missing field will trigger a bowtie2_build rule instead of searching for them in the same folder as the fasta files were).There is always the question on whether it is better to crash if index files are not valid instead of generating them by default, but in the amount of computation of a pipeline run, an accidental extra index building rule does not feel like too much over computing.
Indexes are kept as output in case one wants to reuse them. If they existed they are linked under
bowtie2_indexes
directory.