Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BWA_MEM and BWAMEM2_MEM pass too many indexs #262

Closed
edmundmiller opened this issue Mar 8, 2021 · 20 comments
Closed

BWA_MEM and BWAMEM2_MEM pass too many indexs #262

edmundmiller opened this issue Mar 8, 2021 · 20 comments
Labels
bug Something isn't working

Comments

@edmundmiller
Copy link
Contributor

Description of the bug

When using these modules with aws-igenomes the mem fails, because the below lines pick up multiple indexes

INDEX=`find -L ./ -name "*.amb" | sed 's/.amb//'`

This line was introduced by @drpatelh

Some more background

s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rpac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.rsa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.5.x/genome.fa.sa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.amb
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.ann
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.bwt
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.pac
s3://ngi-igenomes/igenomes/Staphylococcus_aureus_NCTC_8325/NCBI/2006-02-13/Sequence/BWAIndex/version0.6.0/genome.fa.sa

Note the version0.6.0 and version0.5.x

Steps to reproduce

Steps to reproduce the behaviour:

I don't really have good steps to reproduce inside of modules right now.

@edmundmiller edmundmiller added the bug Something isn't working label Mar 8, 2021
@maxulysse
Copy link
Member

maxulysse commented Mar 9, 2021

The way they do it in the GATK4 pipelines is to pass the reference.fasta file name as a string.
We could pass the value as an additional entry, or have that in the meta map.

cf: https://github.com/gatk-workflows/gatk4-data-processing/blob/944403438198d9981adc4d244a3e8b70c4bf774a/processing-for-variant-discovery-gatk4.wdl#L319

@drpatelh
Copy link
Member

drpatelh commented Mar 9, 2021

Ok. Deleted the last comment because it was getting messy but I see the problem now...unless we default to find the first instance of the index in that structure? Is that possible? Realistically, nesting indices like that is the actual problem here and should be avoided in my opinion...is this only a problem for some genomes?

@maxulysse
Copy link
Member

I'm sure we can hack the find line., but wouldn't it be easier and less hacky to have an extra value?

@drpatelh
Copy link
Member

drpatelh commented Mar 9, 2021

We could but how would having an extra fasta file being staged here help? We would still have the same issues with paths no? I have to say the biggest advantage of passing around and using find is that you can provide the entire index in a directory (tarred or untarred) and the pipeline will deal with it appropriately without having to worry about genome fasta prefixes? e.g. here

@maxulysse
Copy link
Member

don't want to stage an extra fasta file, just the name

@drpatelh
Copy link
Member

drpatelh commented Mar 9, 2021

Yup, but that name would still be found 3 times in the example structure above?

@maxulysse
Copy link
Member

No, because I would only pass BWAIndex/genome.fa.

To be honest I would only pass genome.fa, and use {} to cover all extension from the indices... but that's another debate...

@drpatelh
Copy link
Member

drpatelh commented Mar 9, 2021

Ok. Then I think we will still need some sort of hack in order to use tarred directories containing indices? That is where the find is the most useful because it finds the appropriate files for you. I agree we need to do something about this though.

@maxulysse
Copy link
Member

I'm guessing the UNTAR_STAR_INDEX process (or any similar) could give out the indices, that we can collect to use as only one channel, and the value for the genome.fa or whatever string it should be.
So it could work

@edmundmiller
Copy link
Contributor Author

I think it could be handled in the prepare genome workflow. We could also just throw a cut in there to take the first genome it finds.

@maxulysse
Copy link
Member

What do you think of something like that:
maxulysse/nf-core_sarek@35956ab

@edmundmiller
Copy link
Contributor Author

I think that could work.

To be honest though I don't know why the INDEX was introduced, so if someone could link that discussion I might be more help.

@maxulysse
Copy link
Member

I think that could work.

To be honest though I don't know why the INDEX was introduced, so if someone could link that discussion I might be more help.

I think it was @drpatelh

@drpatelh
Copy link
Member

Yep, I did it without permission because it made sense at the time and it solved quite a few problems with passing the index around and standardising the use of compressed directories too. To be fair, the only edge case is the one you pointed out @emiller88 where you can have different indices nested in the same directory - I suspect this scenario is going to be extremely rare and if so then using the index in the top-level directory is a good shout or the user just re-organises their indices 🤷🏽

As we have discussed at length already, passing the fasta file in addition to the index is not a satisfactory solution because users can also call their custom indices whatever they want which means using the base name of the fasta isn't sufficient. This is why using a find made sense to deal with whatever people chose to call the index files.

I will have another play with this whilst try to re-prep the rnaseq pipeline for a release.

@maxulysse
Copy link
Member

I think we can have a bwaindex.first.basename() or something like that to get the actual fasta name, and add that to the map in case of user specified indices

@drpatelh
Copy link
Member

Ok. So if you are able to get a working prototype together where we don't have to pass the fasta file to the alignment process and where we can use .tar.gz directories for the index then that would be great @maxulysse

@maxulysse
Copy link
Member

I'll look into it.

@maxulysse
Copy link
Member

OK, so I have #1222
I think it should work with the .tar.gz extraction if we change the output from "dir" to "dir/*" as I did in the index processes

@maxulysse
Copy link
Member

cc #627

@SPPearce
Copy link
Contributor

Closing as outdated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants