
Version 3.0.3 of CARD breaks prepareref #278

Closed
JezSw opened this issue Aug 8, 2019 · 9 comments

Comments


JezSw commented Aug 8, 2019

Details below. Fixing the version to V3.0.2 with the same setup works.

jez@ubuntu:~$ docker run -i -t sangerpathogens/ariba /bin/bash
root@038c21b11f21:/# ariba getref card out.card
Getting available CARD versions
Downloading "https://card.mcmaster.ca/download" and saving as "download.html" ... done
Found versions:
2.0.0	https://card.mcmaster.ca/download/0/broadstreet-v2.0.0.tar.gz
2.0.1	https://card.mcmaster.ca/download/0/broadstreet-v2.0.1.tar.gz
2.0.2	https://card.mcmaster.ca/download/0/broadstreet-v2.0.2.tar.gz
2.0.3	https://card.mcmaster.ca/download/0/broadstreet-v2.0.3.tar.gz
3.0.0	https://card.mcmaster.ca/download/0/broadstreet-v3.0.0.tar.gz
3.0.1	https://card.mcmaster.ca/download/0/broadstreet-v3.0.1.tar.gz
3.0.2	https://card.mcmaster.ca/download/0/broadstreet-v3.0.2.tar.gz
3.0.3	https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
Getting version 3.0.3
Working in temporary directory /out.card.download
Downloading data from card: https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
syscall: wget -O card.tar.gz https://card.mcmaster.ca/download/0/broadstreet-v3.0.3.tar.gz
...finished downloading
Extracted json data file ./card.json. Reading its contents...
Found 2700 records in the json file. Analysing...
Extracted data and written ARIBA input files

Finished. Final files are:
	/out.card.fa
	/out.card.tsv

You can use them with ARIBA like this:
ariba prepareref -f /out.card.fa -m /out.card.tsv output_directory

If you use this downloaded data, please cite:
"The Comprehensive Antibiotic Resistance Database", McArthur et al 2013, PMID: 23650175
and in your methods say that version 3.0.3 of the database was used
root@038c21b11f21:/# ariba prepareref -f out.card.fa -m out.card.tsv out.card.prepareref
The following command failed with exit code 139
/usr/bin/cdhit-est -i /out.card.prepareref/02.cdhit.noncoding.fa -o /tmp.run_cd-hit.31d170aj/cdhit -c 0.9 -T 1 -s 0.0 -d 0 -bak 1

The output was:

================================================================
Program: CD-HIT, V4.6 (+OpenMP), Aug 14 2016, 11:49:02
Command: /usr/bin/cdhit-est -i
         /out.card.prepareref/02.cdhit.noncoding.fa -o
         /tmp.run_cd-hit.31d170aj/cdhit -c 0.9 -T 1 -s 0.0 -d 0
         -bak 1

Started: Thu Aug  8 11:05:07 2019
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 79

Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!

longest and shortest : 1860132 and 1441
Total letters: 2019859
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 2M
Buffer          : 1 X 408M = 408M
Table           : 1 X 16M = 16M
Miscellaneous   : 4M
Total           : 431M

Table limit with the given memory limit:
Max number of representatives: 40000
Max number of word counting entries: 46092845

comparing sequences from          0  to         79
Segmentation fault (core dumped)


kpepper commented Aug 8, 2019

You could try adding the --cdhit_max_memory argument to prepareref (see https://github.com/sanger-pathogens/ariba/wiki/Task:-prepareref) and see if that makes any difference - i.e. bump the memory up for cd-hit.


JezSw commented Aug 8, 2019

I don't think this is a memory issue (see below). It looks like the main solution is to recompile cdhit with the specific flag (see weizhongli/cdhit#26), but this is awkward when there are existing Docker images etc.

root@6b1996c8c7d4:/# ariba prepareref --cdhit_max_memory 15000 -f out.card.fa -m out.card.tsv out.card.prepareref
The following command failed with exit code 139
/usr/bin/cdhit-est -i /out.card.prepareref/02.cdhit.noncoding.fa -o /tmp.run_cd-hit.0nerjr5y/cdhit -c 0.9 -T 1 -s 0.0 -d 0 -bak 1 -M 15000

The output was:

================================================================
Program: CD-HIT, V4.7 (+OpenMP), Jul 01 2017, 08:43:07
Command: /usr/bin/cdhit-est -i
         /out.card.prepareref/02.cdhit.noncoding.fa -o
         /tmp.run_cd-hit.0nerjr5y/cdhit -c 0.9 -T 1 -s 0.0 -d 0
         -bak 1 -M 15000

Started: Thu Aug  8 14:14:46 2019
================================================================
                            Output                              
----------------------------------------------------------------
total seq: 79

Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!


lrvdijk commented Aug 8, 2019

Ran into the same thing.

There is apparently an element in the database that's longer than the maximum sequence length that CD-HIT allows.

Recompiling CD-HIT yourself with MAX_SEQ=10000000 will fix the issue (the longest sequence here is 1,860,132 bp, which is over CD-HIT's compiled-in limit).


kpepper commented Aug 9, 2019

I'll look at doing a change to the docker image next week, to fix this. Thanks for the info.

martinghunt (Collaborator) commented

Looks like there is a 1,860,132 bp sequence in there ("longest and shortest : 1860132 and 1441")?! ARIBA won't like assembling that - it will take far too long.

I'd like to dig into this some more when I have time. I suspect a better fix may be to impose a max length on all reference sequences (at the moment there is only a max length for genes), which would result in that huge sequence getting removed.

@kpepper
Copy link
Member

kpepper commented Aug 12, 2019

Hi Martin. Agreed - fixing in Ariba would be the best way; otherwise we will effectively be forcing users to compile cdhit manually whenever they want to install Ariba and use CARD. To fix the problem at hand, it looks like we just need a length check in the _get_from_card method (ref_genes_getter.py). The length to check could be overridden by a command-line argument if necessary. Removed sequences can be shown in the getref log. I can do the change unless you want to check it out a bit more?

martinghunt (Collaborator) commented

@fmaguire - thanks for referencing that issue. Good to know CARD are aware.

@kpepper - sounds like a good plan, great if you can do it. I think the cutoff should be applied when prepareref is run, that way it covers everything. This already happens for genes (see https://github.com/sanger-pathogens/ariba/blob/master/ariba/reference_data.py#L28)

I'm guessing the same default as the genes would work: 10k? Maybe worth checking the lengths of the non-coding seqs in datasets like CARD etc.
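The check being discussed could look roughly like this (a hedged sketch of the idea only; the function name, the `(name, sequence)` record shape, and the default bounds are made up for illustration and are not ARIBA's actual reference_data.py code):

```python
def filter_noncoding_by_length(records, min_len=10, max_len=10000,
                               log_path=None):
    """Split records into (kept, removed) lists by sequence length.

    records: iterable of (name, sequence) pairs.
    max_len defaults to the 10k gene cutoff suggested above;
    both bounds are illustrative assumptions.
    """
    kept, removed = [], []
    for name, seq in records:
        target = kept if min_len <= len(seq) <= max_len else removed
        target.append((name, seq))
    # Record what was dropped, so users can see it in a log file,
    # as proposed for the getref/prepareref output.
    if log_path is not None:
        with open(log_path, "w") as log:
            for name, seq in removed:
                log.write(f"{name}\tlength={len(seq)}\tremoved\n")
    return kept, removed
```

With this kind of filter, the 1,860,132 bp CARD record would land in `removed` under any default close to the 10k gene cutoff, so prepareref would never hand it to cdhit-est.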


kpepper commented Aug 20, 2019

@martinghunt Okay.

kpepper added a commit that referenced this issue Aug 23, 2019
kpepper added a commit that referenced this issue Aug 30, 2019
Fix for issue #278 Version 3.0.3 of CARD breaks prepareref

kpepper commented Aug 30, 2019

This has been fixed in CARD now. However, release v2.1.4.3 includes two new prepareref arguments, --min_noncoding_length and --max_noncoding_length, which filter on the length of non-coding sequences; should something similar happen again, it can be bypassed by adjusting the filtering accordingly. (Gene lengths could already be filtered with --min_gene_length and --max_gene_length.) Any filtered non-coding sequences are listed in a new prepareref log file called 01.filter.check_noncoding.log.

kpepper closed this as completed Aug 30, 2019