Generate missing bowtie2 index files #164

cnluzon · 2023-03-03T15:17:38Z

This PR includes functionality related to resolve #162 . Mainly:

Make bowtie2 index files location explicit instead of inferring where it is from the reference fasta file. This makes configuration a bit more verbose but it is more flexible for structures like illumina iGenomes downloads where indexes are already built in a different directory than the fasta reference files.
Added a bowtie2_build rule that creates the bowtie2 index files if they are missing from the minute.yaml file or not found where the user specified they were.
Allow for missing bowtie2_index field altogether in the minute.yaml file so these changes are compatible with previous config files (although this missing field will trigger a bowtie2_build rule instead of searching for them in the same folder as the fasta files were).

There is always the question on whether it is better to crash if index files are not valid instead of generating them by default, but in the amount of computation of a pipeline run, an accidental extra index building rule does not feel like too much over computing.

Indexes are kept as output in case one wants to reuse them. If they existed they are linked under bowtie2_indexes directory.

marcelm

If you think this improves your workflow, then please merge this, but let me explain why I usually don’t include the indexing step in the workflow anymore. I used to do this, but there are a couple of problems.

Often, the reference and index are external to the workflow, i.e. in some other, often shared directory. That directory may be read only, so the workflow cannot generate the index directly in that directory. If the directory is writable, concurrently running instances of the workflow might try to generate the index at the same time, and the final index could be corrupt. Writing into locations that should be considered input would also be unexpected behavior as far as the user is concerned.

The alternative (that you chose) is to use an index if it exists, but to generate one in a run-specific folder if it doesn’t. This avoids the issues arising from treating input as output, but forces index generation for every newly setup pipeline run. In the general case, I think this is a waste of resources and the tradeoff I liked better was to just require the index to exist, that is, to treat the index is an input to the pipeline.

As I said, if this makes things simpler for your specific case, then please go ahead; the way you implemented it is a good one.

marcelm · 2023-03-13T14:38:20Z

minute/__init__.py

+        if not path.with_name(path.name + ext).exists():
+            raise FileNotFoundError(
+                f"Bowtie2 index file missing for '{path}'\n- "
+                + str(path) + ext


It’s a bit inconsistent to mix an f-string with string concatenation. I’d recommend using only an f-string. You should also remove the \n- part, which only made sense in the previous version where the - was used as a bullet, but now this doesn’t make sense as you complain only about a single file.

marcelm · 2023-03-13T14:38:36Z

minute/__init__.py

+                f"Bowtie2 index file missing for '{path}'\n- "
+                + str(path) + ext
+            )
+    return(path)


There should be no parentheses around the returned value.

marcelm · 2023-03-13T14:41:29Z

minute/Snakefile

+        8
+    run:
+        if params.index:
+            with open(str(log), "w") as logf:


This seems more like debugging output, do you think this should be in here?

cnluzon · 2023-03-13T16:30:42Z

Thanks for the quick feedback! I agree with your concerns about in-situ index generation, that's why I put it as run-specific output.

This is a feature with a little bit of a broader audience in mind, for our own use it is not so relevant because we have the config files already set up in a shared environment and this almost never changes. I was trying to streamline running minute as much as possible. One can argue that a user that is already working with these types of workflows is not totally alien to a task like generating a bowtie2 index, and they would only need to do it once, so maybe it is not extremely necessary functionality.

Regarding waste of resources, I came to the conclusion that more advanced users would notice this and less advanced ones would prefer to generate them within the workflow. But maybe it's better to force users to save computing time. A possible alternative would be to only run index generation if the user explicitly asks for it, as in minute run --build-index, and error otherwise.

I like the bowtie2-index path explicit in the minute.yaml because that makes it more flexible in terms of location of shared resources (even though one could also easily work around this for instance with symbolic links).

A totally different course of action could be having a single installation-wide minute.yaml config file with the reference and indexes paths, instead of a per-run config file. I am not too familiar with python setup configuration, it seems a very feasible thing to do. I am more interested in your insights on the possible shortcomings of this.

The main advantage I see is that one would only need to touch the references' paths once when installing minute, and maybe when/if a new reference is added, which is not so frequent in general. The disadvantage is that this configuration would probably need to be edited through the CLI wrapping code, because where the global configuration is should be transparent to the user, I think, so maybe something like a minute add-reference command would be needed. I would be interested in this approach because it is kind of annoying to either keep linking/copying around minute.yaml files on all the runs that point to the same references and configurations, and the rest of run-specific configuration values we have are just literals that could be set as CLI parameters with defaults. This is not mutually exclusive with the minute run --build-index option stated above, but it has the same result in reducing the amount of setup required to launch a minute run.

marcelm · 2023-03-13T21:16:20Z

Regarding waste of resources, I came to the conclusion that more advanced users would notice this and less advanced ones would prefer to generate them within the workflow.
But maybe it's better to force users to save computing time.

Yes, that was the tradeoff I mentioned (but didn’t explain explicitly): Having a separate index generation step is "safer" but less convenient.

A possible alternative would be to only run index generation if the user explicitly asks for it, as in minute run --build-index, and error otherwise.

This is a good option. There’s actually no need to invoke Snakemake, you can just run the command directly (via subprocess.run), so it could be minute build-index or so. Or print out the exact bowtie2-build command the user needs to run.

Perhaps it’s also nice to check whether the index files exist within minute run before calling Snakemake. This would allow for a cleaner error message.

I like the bowtie2-index path explicit in the minute.yaml because that makes it more flexible in terms of location of shared resources (even though one could also easily work around this for instance with symbolic links).

Absolutely, that makes sense.

A totally different course of action could be having a single installation-wide minute.yaml config file with the reference and indexes paths, instead of a per-run config file. I am not too familiar with python setup configuration, it seems a very feasible thing to do. I am more interested in your insights on the possible shortcomings of this.

I added something similar to IgDiscover: Most of the configuration is run-specific, but there can also be a ~/.config/igdiscover.yaml file that is shared by all runs.

The $XDG_CONFIG_HOME and similar variables are a standard that would be good to follow if you implement this. Feel free to copy the code from IgDiscover.

Note that this is a user-specific configuration, so every minute user would have to copy the initial configuration from somewhere. Or the user-specific configuration file could have a single configuration setting that points to the actual system-wide configuration ...

Thinking even further: If you only need to configure references and indexes, you could have a single configuration setting that is the name of a directory with the FASTA references and their indices. Then only one person needs to add a new reference to that file and index it to make it available to everyone.

One minor usability problem could be that it’s possibly a bit intransparent where the index comes from: Imagine someone trying out minute, creating a configuration file, but then they never use it and install it again a year later, at which point they get an error message because the configuration no longer reflects where they store their references and indices. This is easy to solve by letting the pipeline clearly print out from where it reads the configuration and indices.

[...]
This is not mutually exclusive with the minute run --build-index option stated above, but it has the same result in reducing the amount of setup required to launch a minute run.

With the directory idea, minute run --build-index could become a minute update-index command that just re-generates the index for newly added references.

And I cannot resist mentioning it: strobealign needs only 2.5 minutes to index GRCh38, so the discussion about wasted CPU time would become moot ...

cnluzon · 2023-03-21T09:36:56Z

Thanks for your insights :) I just left this a bit on standby because I am interested in evaluating strobealign, so as you say there is no point in updating this if eventually we move towards using it, so there might be another PR in that direction in which case I will close this one.

The discussion about where to config the references, even though related I think it would belong to another PR as well.

cnluzon added 4 commits March 3, 2023 14:08

Add Bowtie2 index to config.yaml

1756a4e

Generate missing bowtie2 indexes

23f1803

Allow for missing bowtie2_index field in config.yaml

2362bc8

Version bump + changelog

7cffcd0

cnluzon marked this pull request as ready for review March 3, 2023 15:29

cnluzon added 3 commits March 3, 2023 17:42

bowtie2_build output moved to log

3bc205d

Allow for empty string as bowtie2_index in minute.yaml

a7bee70

Add threads to bowtie2_build rule

2e03641

cnluzon requested a review from marcelm March 13, 2023 13:23

marcelm reviewed Mar 13, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate missing bowtie2 index files #164

Generate missing bowtie2 index files #164

cnluzon commented Mar 3, 2023

marcelm left a comment

marcelm Mar 13, 2023

marcelm Mar 13, 2023

marcelm Mar 13, 2023

cnluzon commented Mar 13, 2023

marcelm commented Mar 13, 2023

cnluzon commented Mar 21, 2023

Generate missing bowtie2 index files #164

Are you sure you want to change the base?

Generate missing bowtie2 index files #164

Conversation

cnluzon commented Mar 3, 2023

marcelm left a comment

Choose a reason for hiding this comment

marcelm Mar 13, 2023

Choose a reason for hiding this comment

marcelm Mar 13, 2023

Choose a reason for hiding this comment

marcelm Mar 13, 2023

Choose a reason for hiding this comment

cnluzon commented Mar 13, 2023

marcelm commented Mar 13, 2023

cnluzon commented Mar 21, 2023