[Question] Createindex command takes a huge amount of time #962

Open
nicoceres opened this issue Feb 24, 2025 · 2 comments

Comments

@nicoceres

Hi,
It is my first time running mmseqs.

I'm at the stage where I want to index one of my target DBs, namely BFD (>700 GB).

The log file looks like this:

createindex path_to_mmseqs_db/bfd/bfd path_to_my_local_tmp

MMseqs Version:          	14-7e284+ds-1+b2
Seed substitution matrix 	aa:VTML80.out,nucl:nucleotide.out
k-mer length             	0
Alphabet size            	aa:21,nucl:5
Compositional bias       	1
Compositional bias       	1
Max sequence length      	65535
Max results per query    	300
Mask residues            	1
Mask residues probability	0.9
Mask lower case residues 	0
Spaced k-mers            	1
Spaced k-mer pattern     	
Sensitivity              	7.5
k-score                  	seq:0,prof:0
Check compatible         	0
Search type              	0
Split database           	0
Split memory limit       	0
Verbosity                	3
Threads                  	32
Min codons in orf        	30
Max codons in length     	32734
Max orf gaps             	2147483647
Contig start mode        	2
Contig end mode          	2
Orf start mode           	1
Forward frames           	1,2,3
Reverse frames           	1,2,3
Translation table        	1
Translate orf            	0
Use all table starts     	false
Offset of numeric ids    	0
Create lookup            	0
Compressed               	0
Add orf stop             	false
Overlap between sequences	0
Sequence split mode      	1
Header split mode        	0
Strand selection         	1
Remove temporary files   	false

indexdb ../../data/mmseqs_alphafold_db/bfd/bfd ../../data/mmseqs_alphafold_db/bfd/bfd --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --alph-size aa:21,nucl:5 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score seq:0,prof:0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 32 

Target split mode. Searching through 29 splits
Estimated memory consumption: 551G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Index table: counting k-mers
[=================================================================] 88.79M 12m 13s 185ms
Index table: Masked residues: 273185904
Index table: fill
[=================================================================

and it has stayed like this for a while (days!).

The output folder looks like this:

Image

When I look at the RAM of the machine I use for the calculation, I get this:
Image

Is there a problem, in your opinion?

Thanks in advance for your advice.

@milot-mirdita
Member

There is no further output? This looks very broken. Can you please check whether this issue is resolved in the latest release (17)?

@nicoceres
Author

No further output.
I'll check release 17, thanks for your answer.
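
For anyone comparing a log like the one above against the RAM on their machine or against a newer release, here is a minimal sketch that pulls out the version, the split count, and the estimated memory figure. The function name `summarize_index_log` is hypothetical, and the patterns assume the exact line wording shown in the log posted in this issue; other MMseqs2 versions may word these lines differently.

```python
import re


def summarize_index_log(log_text):
    """Extract version, split count and estimated memory from a createindex log.

    Assumption: the log uses the same line wording as the one posted in this issue.
    Values are returned as strings, or None when a line is not found.
    """
    patterns = {
        "version": r"MMseqs Version:\s*(\S+)",
        "splits": r"Searching through (\d+) splits",
        "estimated_memory": r"Estimated memory consumption:\s*(\S+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        summary[key] = match.group(1) if match else None
    return summary


# Sample lines copied from the log in this issue
sample = """MMseqs Version:          \t14-7e284+ds-1+b2
Target split mode. Searching through 29 splits
Estimated memory consumption: 551G
"""

print(summarize_index_log(sample))
```

Running the same function on the log from release 17 would make it easy to see whether the split count or the memory estimate changes between versions.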
