Version 3.0.3 of CARD breaks prepareref #278
Comments
You could try adding the --cdhit_max_memory argument to prepareref (see https://github.com/sanger-pathogens/ariba/wiki/Task:-prepareref) and see if that makes any difference - i.e. bump the memory up for cd-hit.
I don't think this is a memory issue (see below). It looks like the main solution is to recompile cdhit with the specific flag (see weizhongli/cdhit#26), but this is awkward when there are existing docker images etc.
Ran into the same thing. There is apparently an element in the database that's longer than the maximum sequence length that CD-HIT allows. Recompiling CD-HIT yourself with MAX_SEQ=1000000 will fix your issue.
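Before recompiling anything, it can help to confirm which record is the problem. The following is a minimal sketch, not part of ARIBA or CD-HIT: it scans FASTA lines and reports every sequence longer than a limit you supply. The function names `read_fasta` and `sequences_over_limit` are hypothetical, and the limit in the demo is a toy value, so pass whatever maximum your CD-HIT build was compiled with.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequences_over_limit(lines: Iterable[str], max_length: int) -> Dict[str, int]:
    """Return {name: length} for every sequence longer than max_length."""
    return {
        name: len(seq)
        for name, seq in read_fasta(lines)
        if len(seq) > max_length
    }


if __name__ == "__main__":
    example = [">short_gene", "ACGT", "ACGT", ">huge_record", "A" * 50]
    # 20 is a toy limit for this demo, not CD-HIT's real limit.
    print(sequences_over_limit(example, max_length=20))
```

To run it on a real reference file, pass an open file handle instead of the list, e.g. `sequences_over_limit(open("ref.fa"), max_length=...)`, where `ref.fa` is a placeholder name.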
I'll look at changing the docker image next week to fix this. Thanks for the info.
Looks like there is a 1860132bp sequence in there. I'd like to dig into this some more when I have time. I suspect a better fix may be to impose a max length on all reference sequences (at the moment there is only a max length for genes), which would result in that huge sequence getting removed.
Hi Martin. Agreed - fixing in Ariba would be the best way, else we will be effectively forcing users to manually compile cdhit whenever they want to install Ariba and use CARD. To fix the problem at hand it looks like we just need a length check in the _get_from_card method (ref_genes_getter.py). The length to check could be overridden by a cmd-line arg if necessary. Removed sequences can be shown in the getref log. I can do the change unless you want to check it out a bit more?
@fmaguire - thanks for referencing that issue. Good to know CARD are aware. @kpepper - sounds like a good plan, great if you can do it. I think the cutoff should be applied when prepareref is run, that way it covers everything. This already happens for genes (see https://github.com/sanger-pathogens/ariba/blob/master/ariba/reference_data.py#L28). I'm guessing the same default as the genes would work: 10k? Maybe worth checking the lengths of the non-coding seqs in the datasets like CARD etc?
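The length cutoff being proposed here can be sketched roughly as follows. This is an illustration of the idea only, not ARIBA's actual code: `filter_by_length` and the log line format are hypothetical, and the 10000 upper bound is just the "10k" figure floated in this thread.

```python
from typing import Dict, List, Tuple


def filter_by_length(
    seqs: Dict[str, str], min_length: int, max_length: int
) -> Tuple[Dict[str, str], List[str]]:
    """Split sequences into those kept and a log of those removed.

    A sequence is kept when min_length <= len(seq) <= max_length.
    """
    kept = {}
    log_lines = []
    for name, seq in seqs.items():
        length = len(seq)
        if length < min_length:
            log_lines.append(f"{name}\tremoved\ttoo short: {length} < {min_length}")
        elif length > max_length:
            log_lines.append(f"{name}\tremoved\ttoo long: {length} > {max_length}")
        else:
            kept[name] = seq
    return kept, log_lines


if __name__ == "__main__":
    # Toy non-coding sequences; names and lengths are made up.
    noncoding = {
        "tiny": "ACG",
        "typical_rRNA": "A" * 1500,
        "whole_genome": "A" * 20000,
    }
    kept, log_lines = filter_by_length(noncoding, min_length=6, max_length=10000)
    print(sorted(kept))
    for line in log_lines:
        print(line)
```

Returning the log lines alongside the kept sequences means nothing disappears silently: every removed record can be written to a log file for the user to inspect.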
@martinghunt Okay.
Fix for issue #278 Version 3.0.3 of CARD breaks prepareref
This has been fixed in CARD now. However, release v2.1.4.3 includes two new prepareref arguments that filter on the length of non-coding sequences, so if something similar happens again it can be bypassed by adjusting the filtering accordingly. The new arguments are called --min_noncoding_length and --max_noncoding_length. We can already filter on gene lengths using --min_gene_length and --max_gene_length. Any filtered non-coding sequences will be shown in a new prepareref log file called 01.filter.check_noncoding.log.
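As a rough sketch of how the new arguments might be passed, the helper below assembles a prepareref command line without running it. Only --min_noncoding_length and --max_noncoding_length come from the comment above; the `-f`/`-m`/output-directory layout is an assumption based on typical ARIBA usage, so check `ariba prepareref --help` for your version. The helper name, file names, and the 10000 value are placeholders.

```python
import shlex
from typing import List, Optional


def build_prepareref_command(
    fasta: str,
    metadata: str,
    outdir: str,
    min_noncoding_length: Optional[int] = None,
    max_noncoding_length: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `ariba prepareref` (hypothetical helper).

    The -f/-m/outdir layout is an assumption; verify it against
    `ariba prepareref --help` before relying on it.
    """
    cmd = ["ariba", "prepareref", "-f", fasta, "-m", metadata]
    if min_noncoding_length is not None:
        cmd += ["--min_noncoding_length", str(min_noncoding_length)]
    if max_noncoding_length is not None:
        cmd += ["--max_noncoding_length", str(max_noncoding_length)]
    cmd.append(outdir)
    return cmd


if __name__ == "__main__":
    cmd = build_prepareref_command(
        "card.fa", "card.tsv", "card.prepareref", max_noncoding_length=10000
    )
    # Print a shell-safe version of the command; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(cmd))
```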
Details below. Fixing the version to V3.0.2 with the same setup works.