Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate missing bowtie2 index files #164

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Generate missing bowtie2 index files #164

wants to merge 7 commits into from

Conversation

cnluzon
Copy link
Collaborator

@cnluzon cnluzon commented Mar 3, 2023

This PR includes functionality related to resolve #162 . Mainly:

  • Make bowtie2 index files location explicit instead of inferring where it is from the reference fasta file. This makes configuration a bit more verbose but it is more flexible for structures like illumina iGenomes downloads where indexes are already built in a different directory than the fasta reference files.
  • Added a bowtie2_build rule that creates the bowtie2 index files if they are missing from the minute.yaml file or not found where the user specified they were.
  • Allow for missing bowtie2_index field altogether in the minute.yaml file so these changes are compatible with previous config files (although this missing field will trigger a bowtie2_build rule instead of searching for them in the same folder as the fasta files were).

There is always the question on whether it is better to crash if index files are not valid instead of generating them by default, but in the amount of computation of a pipeline run, an accidental extra index building rule does not feel like too much over computing.

Indexes are kept as output in case one wants to reuse them. If they existed they are linked under bowtie2_indexes directory.

@cnluzon cnluzon marked this pull request as ready for review March 3, 2023 15:29
@cnluzon cnluzon requested a review from marcelm March 13, 2023 13:23
Copy link
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think this improves your workflow, then please merge this, but let me explain why I usually don’t include the indexing step in the workflow anymore. I used to do this, but there are a couple of problems.

Often, the reference and index are external to the workflow, i.e. in some other, often shared directory. That directory may be read only, so the workflow cannot generate the index directly in that directory. If the directory is writable, concurrently running instances of the workflow might try to generate the index at the same time, and the final index could be corrupt. Writing into locations that should be considered input would also be unexpected behavior as far as the user is concerned.

The alternative (that you chose) is to use an index if it exists, but to generate one in a run-specific folder if it doesn’t. This avoids the issues arising from treating input as output, but forces index generation for every newly setup pipeline run. In the general case, I think this is a waste of resources and the tradeoff I liked better was to just require the index to exist, that is, to treat the index is an input to the pipeline.

As I said, if this makes things simpler for your specific case, then please go ahead; the way you implemented it is a good one.

if not path.with_name(path.name + ext).exists():
raise FileNotFoundError(
f"Bowtie2 index file missing for '{path}'\n- "
+ str(path) + ext
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a bit inconsistent to mix an f-string with string concatenation. I’d recommend using only an f-string. You should also remove the \n- part, which only made sense in the previous version where the - was used as a bullet, but now this doesn’t make sense as you complain only about a single file.

f"Bowtie2 index file missing for '{path}'\n- "
+ str(path) + ext
)
return(path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no parentheses around the returned value.

8
run:
if params.index:
with open(str(log), "w") as logf:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more like debugging output, do you think this should be in here?

@cnluzon
Copy link
Collaborator Author

cnluzon commented Mar 13, 2023

Thanks for the quick feedback! I agree with your concerns about in-situ index generation, that's why I put it as run-specific output.

This is a feature with a little bit of a broader audience in mind, for our own use it is not so relevant because we have the config files already set up in a shared environment and this almost never changes. I was trying to streamline running minute as much as possible. One can argue that a user that is already working with these types of workflows is not totally alien to a task like generating a bowtie2 index, and they would only need to do it once, so maybe it is not extremely necessary functionality.

Regarding waste of resources, I came to the conclusion that more advanced users would notice this and less advanced ones would prefer to generate them within the workflow. But maybe it's better to force users to save computing time. A possible alternative would be to only run index generation if the user explicitly asks for it, as in minute run --build-index, and error otherwise.

I like the bowtie2-index path explicit in the minute.yaml because that makes it more flexible in terms of location of shared resources (even though one could also easily work around this for instance with symbolic links).

A totally different course of action could be having a single installation-wide minute.yaml config file with the reference and indexes paths, instead of a per-run config file. I am not too familiar with python setup configuration, it seems a very feasible thing to do. I am more interested in your insights on the possible shortcomings of this.

The main advantage I see is that one would only need to touch the references' paths once when installing minute, and maybe when/if a new reference is added, which is not so frequent in general. The disadvantage is that this configuration would probably need to be edited through the CLI wrapping code, because where the global configuration is should be transparent to the user, I think, so maybe something like a minute add-reference command would be needed. I would be interested in this approach because it is kind of annoying to either keep linking/copying around minute.yaml files on all the runs that point to the same references and configurations, and the rest of run-specific configuration values we have are just literals that could be set as CLI parameters with defaults. This is not mutually exclusive with the minute run --build-index option stated above, but it has the same result in reducing the amount of setup required to launch a minute run.

@marcelm
Copy link
Collaborator

marcelm commented Mar 13, 2023

Regarding waste of resources, I came to the conclusion that more advanced users would notice this and less advanced ones would prefer to generate them within the workflow.
But maybe it's better to force users to save computing time.

Yes, that was the tradeoff I mentioned (but didn’t explain explicitly): Having a separate index generation step is "safer" but less convenient.

A possible alternative would be to only run index generation if the user explicitly asks for it, as in minute run --build-index, and error otherwise.

This is a good option. There’s actually no need to invoke Snakemake, you can just run the command directly (via subprocess.run), so it could be minute build-index or so. Or print out the exact bowtie2-build command the user needs to run.

Perhaps it’s also nice to check whether the index files exist within minute run before calling Snakemake. This would allow for a cleaner error message.

I like the bowtie2-index path explicit in the minute.yaml because that makes it more flexible in terms of location of shared resources (even though one could also easily work around this for instance with symbolic links).

Absolutely, that makes sense.

A totally different course of action could be having a single installation-wide minute.yaml config file with the reference and indexes paths, instead of a per-run config file. I am not too familiar with python setup configuration, it seems a very feasible thing to do. I am more interested in your insights on the possible shortcomings of this.

I added something similar to IgDiscover: Most of the configuration is run-specific, but there can also be a ~/.config/igdiscover.yaml file that is shared by all runs.

The $XDG_CONFIG_HOME and similar variables are a standard that would be good to follow if you implement this. Feel free to copy the code from IgDiscover.

Note that this is a user-specific configuration, so every minute user would have to copy the initial configuration from somewhere. Or the user-specific configuration file could have a single configuration setting that points to the actual system-wide configuration ...

Thinking even further: If you only need to configure references and indexes, you could have a single configuration setting that is the name of a directory with the FASTA references and their indices. Then only one person needs to add a new reference to that file and index it to make it available to everyone.

One minor usability problem could be that it’s possibly a bit intransparent where the index comes from: Imagine someone trying out minute, creating a configuration file, but then they never use it and install it again a year later, at which point they get an error message because the configuration no longer reflects where they store their references and indices. This is easy to solve by letting the pipeline clearly print out from where it reads the configuration and indices.

[...]
This is not mutually exclusive with the minute run --build-index option stated above, but it has the same result in reducing the amount of setup required to launch a minute run.

With the directory idea, minute run --build-index could become a minute update-index command that just re-generates the index for newly added references.

And I cannot resist mentioning it: strobealign needs only 2.5 minutes to index GRCh38, so the discussion about wasted CPU time would become moot ...

@cnluzon
Copy link
Collaborator Author

cnluzon commented Mar 21, 2023

Thanks for your insights :) I just left this a bit on standby because I am interested in evaluating strobealign, so as you say there is no point in updating this if eventually we move towards using it, so there might be another PR in that direction in which case I will close this one.

The discussion about where to config the references, even though related I think it would belong to another PR as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make minute generate bowtie2 index if no index is found
2 participants