mmseqs prefilter: performance issues on Mac ARM #939

Open
jackroddy opened this issue Jan 21, 2025 · 0 comments

jackroddy commented Jan 21, 2025

When running mmseqs prefilter on Mac ARM, I've noticed some performance issues.

Here's a summary of what I've noticed:

  • this issue doesn't seem to happen on Linux, but my Linux machine is much more powerful than my laptop, so I can't say for sure
  • it happens consistently when a k-mer length of 7 is used (parameter -k 7), even with very small searches (e.g. 5,000 queries vs. 5,000 targets)
  • for our use case, we have been running with -k 0, which usually selects a k-mer length of 6; when it selects 7 instead, we run into the performance issues
  • for now, we can pass an explicit -k 6 to avoid the slowdown, but it would be nice to be able to use other k-mer lengths
  • when performance degrades, memory and CPU usage end up at roughly half of what is predicted and expected
  • my initial tests used MMseqs2 v15, but I see the same issue with -k 7 on the latest release (v17)
  • with -k 0, v17 now seems to choose a k-mer length of 6 on my laptop instead of 7
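
The interim workaround from the list above (pinning -k 6 instead of letting -k 0 pick a length) can be wrapped in a small helper so it is applied consistently. This is only a sketch: the helper name `build_prefilter_cmd` and its defaults are my own, and the flags are just the ones used in the runs in this report.

```python
import shlex


def build_prefilter_cmd(query_db, target_db, result_db, kmer_len=6,
                        k_score=80, min_ungapped_score=15, max_seqs=1000,
                        threads=8):
    """Build an `mmseqs prefilter` argv with an explicit k-mer length.

    Hypothetical helper, not part of MMseqs2. kmer_len defaults to 6 so the
    automatic choice (-k 0) cannot select 7 on the affected machine.
    """
    if kmer_len == 0:
        raise ValueError("kmer_len=0 lets MMseqs2 choose; pass an explicit value")
    return [
        "mmseqs", "prefilter", query_db, target_db, result_db,
        "-k", str(kmer_len),
        "--k-score", str(k_score),
        "--min-ungapped-score", str(min_ungapped_score),
        "--max-seqs", str(max_seqs),
        "--threads", str(threads),
    ]


cmd = build_prefilter_cmd("queryDB", "targetDB", "prefilterDB")
print(shlex.join(cmd))  # the command line, ready to inspect or run
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is on the PATH.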

You should be able to download my benchmarks here

I've been running some tests on my laptop (18 GB memory, Mac ARM M3 Pro) and on a Linux workstation (128 GB memory, Intel i9 14900K).

Here's some of the test runs I've done:

MMseqs2 v15 | 5,000 queries vs. 5,000,000 targets | macos - 18g memory | arm M3 pro

These tests were run on my laptop.

-k 0 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

With -k 0 (the default, so it does not appear in the command below), MMseqs2 chose a k-mer size of 7 and estimated roughly 20 hours (ETA 19h 30m) to complete the first of 4 database splits.

[jack@manami tmp (dev)]$ time mmseqs prefilter queryDB targetDB prefilterDB --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:           	6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix       	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix  	aa:VTML80.out,nucl:nucleotide.out
Sensitivity               	4
k-mer length              	0
Target search mode        	0
k-score                   	seq:80,prof:80
Alphabet size             	aa:21,nucl:5
Max sequence length       	65535
Max results per query     	1000
Split database            	0
Split mode                	2
Split memory limit        	0
Coverage threshold        	0
Coverage mode             	0
Compositional bias        	1
Compositional bias        	1
Diagonal scoring          	true
Exact k-mer matching      	0
Mask residues             	1
Mask residues probability 	0.9
Mask lower case residues  	0
Minimum diagonal score    	15
Selected taxa             	
Include identical seq. id.	false
Spaced k-mers             	1
Preload mode              	0
Pseudo count a            	substitution:1.100,context:1.400
Pseudo count b            	substitution:4.100,context:5.800
Spaced k-mer pattern      	
Local temporary path      	
Threads                   	8
Compressed                	0
Verbosity                 	3

Query database size: 5000 type: Aminoacid
Target split mode. Searching through 4 splits
Estimated memory consumption: 13G
Target database size: 5050000 type: Aminoacid
Process prefiltering step 1 of 4

Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 1.27M 3m 21s 337ms
Index table: Masked residues: 6750387
Index table: fill
[=================================================================] 100.00% 1.27M 18m 47s 317ms
Index statistics
Entries:          477737682
DB size:          12499 MB
Avg k-mer size:   0.373233
Top 10 k-mers
    RAARQGG	3256
    LLNPKRH	2641
    VGPGTST	2338
    LTKSGGV	1370
    LTKAGGV	1269
    TTGGNLL	1106
    KGGEGLV	1086
    KGGPGLV	961
    LELVGYV	796
    EDAHGDN	686
Time for index table init: 0h 22m 21s 129ms
k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 4)
Query db start 1 to 5000
Target db start 1 to 1270722
[>                                                                ] 1.00% 51 eta 19h 30m 45s

\\ *******************
\\ process killed here
\\ *******************

real	44m31.212s
user	11m15.878s
sys	144m50.782s

Interestingly, when I ran it again, it started to run a bit faster, estimating about 13 hours to completion for the first split.

[jack@manami tmp (dev)]$ time mmseqs prefilter queryDB targetDB prefilterDB -k 0 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB -k 0 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:           	6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix       	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix  	aa:VTML80.out,nucl:nucleotide.out
Sensitivity               	4
k-mer length              	0
Target search mode        	0
k-score                   	seq:80,prof:80
Alphabet size             	aa:21,nucl:5
Max sequence length       	65535
Max results per query     	1000
Split database            	0
Split mode                	2
Split memory limit        	0
Coverage threshold        	0
Coverage mode             	0
Compositional bias        	1
Compositional bias        	1
Diagonal scoring          	true
Exact k-mer matching      	0
Mask residues             	1
Mask residues probability 	0.9
Mask lower case residues  	0
Minimum diagonal score    	15
Selected taxa             	
Include identical seq. id.	false
Spaced k-mers             	1
Preload mode              	0
Pseudo count a            	substitution:1.100,context:1.400
Pseudo count b            	substitution:4.100,context:5.800
Spaced k-mer pattern      	
Local temporary path      	
Threads                   	8
Compressed                	0
Verbosity                 	3

Query database size: 5000 type: Aminoacid
Target split mode. Searching through 4 splits
Estimated memory consumption: 13G
Target database size: 5050000 type: Aminoacid
Process prefiltering step 1 of 4

Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 1.27M 11s 922ms
Index table: Masked residues: 6750387
Index table: fill
[=================================================================] 100.00% 1.27M 5m 51s 733ms
Index statistics
Entries:          477737682
DB size:          12499 MB
Avg k-mer size:   0.373233
Top 10 k-mers
    RAARQGG	3256
    LLNPKRH	2641
    VGPGTST	2338
    LTKSGGV	1370
    LTKAGGV	1269
    TTGGNLL	1106
    KGGEGLV	1086
    KGGPGLV	961
    LELVGYV	796
    EDAHGDN	686
Time for index table init: 0h 6m 12s 589ms
k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 4)
Query db start 1 to 5000
Target db start 1 to 1270722
[>                                                                ] 1.00% 51 eta 13h 9m 44s

\\ *******************
\\ process killed here
\\ *******************

real	15m27.988s
user	6m38.978s
sys	50m15.786s

In both of these runs, I noticed that the actual memory usage was roughly half of the amount predicted by MMseqs, and the CPU usage stayed at ~400% instead of the ~800% I would expect when using 8 threads.

-k 6 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

When running with an explicit -k 6, there are no performance issues at all:

[jack@manami tmp (dev)]$ time mmseqs prefilter queryDB targetDB prefilterDB -k 6 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB -k 6 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:           	6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix       	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix  	aa:VTML80.out,nucl:nucleotide.out
Sensitivity               	4
k-mer length              	6
Target search mode        	0
k-score                   	seq:80,prof:80
Alphabet size             	aa:21,nucl:5
Max sequence length       	65535
Max results per query     	1000
Split database            	0
Split mode                	2
Split memory limit        	0
Coverage threshold        	0
Coverage mode             	0
Compositional bias        	1
Compositional bias        	1
Diagonal scoring          	true
Exact k-mer matching      	0
Mask residues             	1
Mask residues probability 	0.9
Mask lower case residues  	0
Minimum diagonal score    	15
Selected taxa             	
Include identical seq. id.	false
Spaced k-mers             	1
Preload mode              	0
Pseudo count a            	substitution:1.100,context:1.400
Pseudo count b            	substitution:4.100,context:5.800
Spaced k-mer pattern      	
Local temporary path      	
Threads                   	8
Compressed                	0
Verbosity                 	3

Query database size: 5000 type: Aminoacid
Target split mode. Searching through 2 splits
Estimated memory consumption: 8G
Target database size: 5050000 type: Aminoacid
Process prefiltering step 1 of 2

Index table k-mer threshold: 80 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 2.53M 15s 731ms
Index table: Masked residues: 13593872
Index table: fill
[=================================================================] 100.00% 2.53M 18s 798ms
Index statistics
Entries:          958627140
DB size:          5973 MB
Avg k-mer size:   14.978549
Top 10 k-mers
    HGTNKF	6745
    TSGGGV	6671
    LLNPDR	6105
    LGGGKT	5880
    TTGGGV	4852
    DGAGDN	3685
    KPGTTY	3438
    ILNPDR	3202
    VLNPDR	3148
    TTGGGT	3109
Time for index table init: 0h 0m 38s 159ms
k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 2)
Query db start 1 to 5000
Target db start 1 to 2531088
[=================================================================] 100.00% 5.00K 9m 48s 278ms

6730.260374 k-mers per position
45483853 DB matches per sequence
4895 overflows
589 sequences passed prefiltering per query sequence
589 median result list length
0 sequences with 0 size result lists
Time for merging to prefilterDB_tmp_0: 0h 0m 0s 12ms
Time for merging to prefilterDB_tmp_0_tmp: 0h 0m 0s 55ms
Process prefiltering step 2 of 2

Index table k-mer threshold: 80 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 2.52M 16s 573ms
Index table: Masked residues: 13708463
Index table: fill
[=================================================================] 100.00% 2.52M 16s 903ms
Index statistics
Entries:          958633088
DB size:          5973 MB
Avg k-mer size:   14.978642
Top 10 k-mers
    HGTNKF	6810
    TSGGGV	6794
    LLNPDR	6060
    LGGGKT	5924
    TTGGGV	4886
    DGAGDN	3695
    KPGTTY	3538
    VLNPDR	3146
    RLTKGS	3143
    TSGGGT	2559
Time for index table init: 0h 0m 36s 866ms
k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 2 of 2)
Query db start 1 to 5000
Target db start 2531089 to 5050000
[=================================================================] 100.00% 5.00K 9m 2s 617ms

6730.260374 k-mers per position
45484151 DB matches per sequence
4895 overflows
589 sequences passed prefiltering per query sequence
589 median result list length
0 sequences with 0 size result lists
Time for merging to prefilterDB_tmp_1: 0h 0m 0s 17ms
Time for merging to prefilterDB_tmp_1_tmp: 0h 0m 0s 40ms
Merging 2 target splits to prefilterDB
Preparing offsets for merging: 0h 0m 0s 3ms
[=================================================================] 100.00% 5.00K 0s 194ms
Time for merging to prefilterDB: 0h 0m 0s 13ms
Time for merging target splits: 0h 0m 0s 237ms
Time for merging to prefilterDB_tmp: 0h 0m 0s 33ms
Time for processing: 0h 20m 8s 521ms

real	20m8.557s
user	148m36.677s
sys	3m37.955s
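
The ~400% CPU figure mentioned earlier can be cross-checked against the `time` output of the three v15 laptop runs above, since average CPU utilisation is (user + sys) / real. The sketch below does that arithmetic with the values copied from the logs; the function names are my own. It also reports the share of CPU time spent in the kernel (sys). Note that the two -k 0 runs were killed early, so their numbers cover index construction plus only the start of the scoring step.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '44m31.212s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def cpu_stats(real, user, sys_):
    """Return (average cores busy, fraction of CPU time spent in sys)."""
    real_s, user_s, sys_s = map(to_seconds, (real, user, sys_))
    busy = user_s + sys_s
    return busy / real_s, sys_s / busy


# (real, user, sys) copied from the three laptop runs above
runs = {
    "-k 0 (chose k=7), first run": ("44m31.212s", "11m15.878s", "144m50.782s"),
    "-k 0 (chose k=7), second run": ("15m27.988s", "6m38.978s", "50m15.786s"),
    "-k 6": ("20m8.557s", "148m36.677s", "3m37.955s"),
}
for label, times in runs.items():
    cores, sys_share = cpu_stats(*times)
    print(f"{label}: {cores * 100:.0f}% CPU, {sys_share * 100:.0f}% of it in sys")
```

This gives about 350% and 370% CPU for the two -k 0 runs against about 760% for -k 6. In the slow runs close to 90% of the CPU time is sys time, whereas in the -k 6 run it is about 2%.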

MMseqs2 v15 | 5,000 queries vs. 5,000,000 targets | linux - 128g memory | intel i9 14900kf

These tests were run on my linux workstation.

-k 0 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

Interestingly, with the default -k 0, MMseqs2 automatically chooses a k-mer size of 6 on my Linux machine, and the prefilter runs with no problems.

[jack@kei tmp]$ time mmseqs prefilter queryDB targetDB prefilterDB --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:                 6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix             aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
Sensitivity                     4
k-mer length                    0
Target search mode              0
k-score                         seq:80,prof:80
Alphabet size                   aa:21,nucl:5
Max sequence length             65535
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Coverage threshold              0
Coverage mode                   0
Compositional bias              1
Compositional bias              1
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Minimum diagonal score          15
Selected taxa
Include identical seq. id.      false
Spaced k-mers                   1
Preload mode                    0
Pseudo count a                  substitution:1.100,context:1.400
Pseudo count b                  substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads                         8
Compressed                      0
Verbosity                       3

Query database size: 5000 type: Aminoacid
Estimated memory consumption: 16G
Target database size: 5050000 type: Aminoacid
Index table k-mer threshold: 80 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 5.05M 27s 84ms
Index table: Masked residues: 27302335
Index table: fill
[=================================================================] 100.00% 5.05M 29s 533ms
Index statistics
Entries:          1917260228
DB size:          11458 MB
Avg k-mer size:   29.957191
Top 10 k-mers
    HGTNKF      13555
    TSGGGV      13465
    LLNPDR      12165
    LGGGKT      11804
    TTGGGV      9738
    DGAGDN      7380
    KPGTTY      6976
    VLNPDR      6294
    RLTKGS      6234
    TSGGGT      5074
Time for index table init: 0h 1m 0s 487ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 5000
Target db start 1 to 5050000
[=================================================================] 100.00% 5.00K 13m 35s 287ms

6730.260374 k-mers per position
90968004 DB matches per sequence
4895 overflows
1000 sequences passed prefiltering per query sequence
1000 median result list length
0 sequences with 0 size result lists
Time for merging to prefilterDB: 0h 0m 0s 1ms
Time for processing: 0h 14m 39s 691ms

real    14m39.700s
user    116m44.519s
sys     0m4.440s

-k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

I then forced -k 7 and found that the prefilter also runs without problems here, although it takes about 35 minutes rather than about 15.

[jack@kei tmp]$ time mmseqs prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:                 6f45232ac8daca14e354ae320a4359056ec524c2
Substitution matrix             aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix        aa:VTML80.out,nucl:nucleotide.out
Sensitivity                     4
k-mer length                    7
Target search mode              0
k-score                         seq:80,prof:80
Alphabet size                   aa:21,nucl:5
Max sequence length             65535
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Coverage threshold              0
Coverage mode                   0
Compositional bias              1
Compositional bias              1
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   1
Mask residues probability       0.9
Mask lower case residues        0
Minimum diagonal score          15
Selected taxa
Include identical seq. id.      false
Spaced k-mers                   1
Preload mode                    0
Pseudo count a                  substitution:1.100,context:1.400
Pseudo count b                  substitution:4.100,context:5.800
Spaced k-mer pattern
Local temporary path
Threads                         8
Compressed                      0
Verbosity                       3

Query database size: 5000 type: Aminoacid
Estimated memory consumption: 25G
Target database size: 5050000 type: Aminoacid
Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 5.05M 27s 170ms
Index table: Masked residues: 27302335
Index table: fill
[=================================================================] 100.00% 5.05M 30s 779ms
Index statistics
Entries:          1911040173
DB size:          20700 MB
Avg k-mer size:   1.493000
Top 10 k-mers
    RAARQGG     13310
    LLNPKRH     10560
    VGPGTST     9543
    LTKSGGV     5571
    LTKAGGV     5106
    TTGGNLL     4631
    KGGEGLV     4449
    KGGPGAV     4006
    KGGPGLV     3995
    LELVGYV     3260
Time for index table init: 0h 1m 3s 914ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 5000
Target db start 1 to 5050000
[=================================================================] 100.00% 5.00K 34m 14s 264ms

130477.492231 k-mers per position
92674480 DB matches per sequence
4896 overflows
1000 sequences passed prefiltering per query sequence
1000 median result list length
0 sequences with 0 size result lists
Time for merging to prefilterDB: 0h 0m 0s 0ms
Time for processing: 0h 35m 27s 604ms

real    35m27.605s
user    282m17.467s
sys     0m5.550s
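
As a baseline, the two Linux runs above can be compared directly. The sketch below parses the `Time for processing` values and uses the `k-mers per position` statistics, all copied from those two logs, and prints the ratios. The helper name is my own. The point is only to show what -k 7 costs on a machine where it behaves normally.

```python
import re


def duration_seconds(text):
    """Parse an MMseqs2-style duration such as '0h 14m 39s 691ms'."""
    m = re.fullmatch(r"(\d+)h (\d+)m (\d+)s (\d+)ms", text)
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds, millis = map(int, m.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


# Values copied from the Linux logs above (5,000 queries vs. 5,050,000 targets)
k6 = {"processing": "0h 14m 39s 691ms", "kmers_per_pos": 6730.260374}
k7 = {"processing": "0h 35m 27s 604ms", "kmers_per_pos": 130477.492231}

time_ratio = duration_seconds(k7["processing"]) / duration_seconds(k6["processing"])
kmer_ratio = k7["kmers_per_pos"] / k6["kmers_per_pos"]
print(f"-k 7 / -k 6 processing time:      {time_ratio:.2f}x")
print(f"-k 7 / -k 6 k-mers per position:  {kmer_ratio:.1f}x")
```

On the Linux workstation -k 7 takes roughly 2.4 times as long as -k 6 while generating roughly 19 times as many k-mers per position at the same --k-score.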

MMseqs2 v17 | 5,000 queries vs. 5,000,000 targets | macos - 18g memory | arm M3 pro

Last week, @milot-mirdita suggested that I try the v16 release to see if this was somehow related to some prefilter memory issues that were addressed in the new release. I noticed that v17 was released over the past few days, so I did the same test on my laptop with v17.

-k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

It seems that the issue still persists on the latest release:

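[jack@manami tmp (dev)]$ time mmseqs prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8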

MMseqs Version:                    	b804fbe384e6f6c9fe96322ec0e92d48bccd0a42
Substitution matrix                	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix           	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                        	4
k-mer length                       	7
Target search mode                 	0
k-score                            	seq:80,prof:80
Alphabet size                      	aa:21,nucl:5
Max sequence length                	65535
Max results per query              	1000
Split database                     	0
Split mode                         	2
Split memory limit                 	0
Coverage threshold                 	0
Coverage mode                      	0
Compositional bias                 	1
Compositional bias                 	1
Diagonal scoring                   	true
Exact k-mer matching               	0
Mask residues                      	1
Mask residues probability          	0.9
Mask lower case residues           	0
Mask lower letter repeating N times	0
Minimum diagonal score             	15
Selected taxa                      	
Include identical seq. id.         	false
Spaced k-mers                      	1
Preload mode                       	0
Pseudo count a                     	substitution:1.100,context:1.400
Pseudo count b                     	substitution:4.100,context:5.800
Spaced k-mer pattern               	
Local temporary path               	
Threads                            	8
Compressed                         	0
Verbosity                          	3

Query database size: 5000 type: Aminoacid
Target split mode. Searching through 4 splits
Estimated memory consumption: 13G
Target database size: 5050000 type: Aminoacid
Process prefiltering step 1 of 4

Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 1.27M 1m 13s 806ms
Index table: Masked residues: 6750387
Index table: fill
[=================================================================] 100.00% 1.27M 10m 11s 424ms
Index statistics
Entries:          477737682
DB size:          12499 MB
Avg k-mer size:   0.373233
Top 10 k-mers
    RAARQGG	3256
    LLNPKRH	2641
    VGPGTST	2338
    LTKSGGV	1370
    LTKAGGV	1269
    TTGGNLL	1106
    KGGEGLV	1086
    KGGPGLV	961
    LELVGYV	796
    EDAHGDN	686
Time for index table init: 0h 11m 36s 635ms
k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 4)
Query db start 1 to 5000
Target db start 1 to 1270722
^C                                                                ] 1.00% 51 eta 16h 44m 45s

\\ *******************
\\ process killed here
\\ *******************

real	25m29.049s
user	8m11.868s
sys	83m34.387s

MMseqs2 v17 | 5,000 queries vs. 100,000 targets | macos - 18g memory | arm M3 pro

Next, I decided I'd use a much smaller target database (100,000 targets).

-k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

It seems to still happen:

prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:                    	b804fbe384e6f6c9fe96322ec0e92d48bccd0a42
Substitution matrix                	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix           	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                        	4
k-mer length                       	7
Target search mode                 	0
k-score                            	seq:80,prof:80
Alphabet size                      	aa:21,nucl:5
Max sequence length                	65535
Max results per query              	1000
Split database                     	0
Split mode                         	2
Split memory limit                 	0
Coverage threshold                 	0
Coverage mode                      	0
Compositional bias                 	1
Compositional bias                 	1
Diagonal scoring                   	true
Exact k-mer matching               	0
Mask residues                      	1
Mask residues probability          	0.9
Mask lower case residues           	0
Mask lower letter repeating N times	0
Minimum diagonal score             	15
Selected taxa                      	
Include identical seq. id.         	false
Spaced k-mers                      	1
Preload mode                       	0
Pseudo count a                     	substitution:1.100,context:1.400
Pseudo count b                     	substitution:4.100,context:5.800
Spaced k-mer pattern               	
Local temporary path               	
Threads                            	8
Compressed                         	0
Verbosity                          	3

Query database size: 5000 type: Aminoacid
Estimated memory consumption: 10G
Target database size: 100000 type: Aminoacid
Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 100.00K 3s 954ms
Index table: Masked residues: 384579
Index table: fill
[=================================================================] 100.00% 100.00K 6s 233ms
Index statistics
Entries:          34004753
DB size:          9960 MB
Avg k-mer size:   0.026566
Top 10 k-mers
    RAARQGG	155
    LLNPKRH	99
    VGPGTST	98
    LTKSGGV	57
    KGGPGLV	54
    KGGEGLV	45
    LTKSGGL	39
    TTGGNLL	35
    LGTEDLL	34
    DLAPELL	33
Time for index table init: 0h 0m 16s 777ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 5000
Target db start 1 to 100000
^C                                                                ] 1.00% 51 eta 3h 8m 57s

\\ *******************
\\ process killed here
\\ *******************

real	2m25.866s
user	1m23.136s
sys	8m9.646s

MMseqs2 v17 | 5,000 queries vs. 5,000 targets | macos - 18g memory | arm M3 pro

Finally, I figured I'd try a tiny target database (5,000 targets).

-k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000

Seems like the issue still persists even at this tiny database size:

[jack@manami tmp (dev)]$ time mmseqs prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8
prefilter queryDB targetDB prefilterDB -k 7 --k-score 80 --min-ungapped-score 15 --max-seqs 1000 --threads 8

MMseqs Version:                    	b804fbe384e6f6c9fe96322ec0e92d48bccd0a42
Substitution matrix                	aa:blosum62.out,nucl:nucleotide.out
Seed substitution matrix           	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                        	4
k-mer length                       	7
Target search mode                 	0
k-score                            	seq:80,prof:80
Alphabet size                      	aa:21,nucl:5
Max sequence length                	65535
Max results per query              	1000
Split database                     	0
Split mode                         	2
Split memory limit                 	0
Coverage threshold                 	0
Coverage mode                      	0
Compositional bias                 	1
Compositional bias                 	1
Diagonal scoring                   	true
Exact k-mer matching               	0
Mask residues                      	1
Mask residues probability          	0.9
Mask lower case residues           	0
Mask lower letter repeating N times	0
Minimum diagonal score             	15
Selected taxa                      	
Include identical seq. id.         	false
Spaced k-mers                      	1
Preload mode                       	0
Pseudo count a                     	substitution:1.100,context:1.400
Pseudo count b                     	substitution:4.100,context:5.800
Spaced k-mer pattern               	
Local temporary path               	
Threads                            	8
Compressed                         	0
Verbosity                          	3

Query database size: 5000 type: Aminoacid
Estimated memory consumption: 10G
Target database size: 5000 type: Aminoacid
Index table k-mer threshold: 80 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 5.00K 0s 917ms
Index table: Masked residues: 25109
Index table: fill
[=================================================================] 100.00% 5.00K 0s 553ms
Index statistics
Entries:          1842327
DB size:          9776 MB
Avg k-mer size:   0.001439
Top 10 k-mers
    LLNPKRH	16
    ILNPKRH	10
    VGPGTST	8
    KGGPGLV	7
    ARIVRQG	5
    SDLGDFI	5
    LTKAGGI	5
    TKTPPFL	5
    DLAPELL	5
    TKTPPLL	5
Time for index table init: 0h 0m 8s 52ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 80
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 5000
Target db start 1 to 5000
^C                                                                ] 1.00% 51 eta 2h 46m 0s

real	2m9.257s
user	1m11.676s
sys	8m15.870s
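
One pattern in the logs is worth making explicit: at k = 7 the reported `DB size` hardly depends on the number of targets (9776 MB for 5,000 targets, 9960 MB for 100,000). The sketch below checks a simple model against every `DB size` value in this report: a fixed table of 20^k eight-byte slots plus six bytes per index entry, truncated to whole MiB. The constants 20, 8 and 6 are assumptions fitted to these logs, not taken from the MMseqs2 source, so treat this as an observation about the numbers rather than a description of the implementation.

```python
MIB = 1024 * 1024


def modelled_db_size_mib(k, entries, alphabet=20, slot_bytes=8, entry_bytes=6):
    """Fitted model (assumed constants): fixed per-k-mer table plus per-entry payload."""
    return (alphabet ** k * slot_bytes + entries * entry_bytes) // MIB


# (k, "Entries", reported "DB size" in MB), copied from the logs in this report
observed = [
    (7, 477737682, 12499),   # macOS, split 1 of 4 of the 5.05M-target DB
    (6, 958627140, 5973),    # macOS, split 1 of 2 of the 5.05M-target DB
    (6, 1917260228, 11458),  # Linux, full 5.05M-target DB
    (7, 1911040173, 20700),  # Linux, full 5.05M-target DB
    (7, 34004753, 9960),     # macOS, 100,000 targets
    (7, 1842327, 9776),      # macOS, 5,000 targets
]
for k, entries, reported in observed:
    print(f"k={k} entries={entries} reported={reported} "
          f"modelled={modelled_db_size_mib(k, entries)}")

# Part of the modelled size that does not depend on the target database
print("fixed part at k=7:", modelled_db_size_mib(7, 0), "MiB")
print("fixed part at k=6:", modelled_db_size_mib(6, 0), "MiB")
```

Under this model roughly 9.5 GiB of the k = 7 index is independent of the target database, against roughly 0.5 GiB at k = 6, which fits the 10G memory estimate for the 5,000-target run on an 18 GB laptop. Whether that footprint is what drives the slowdown is not something these logs can confirm.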

Let me know if I can provide any more information to help figure out what's going on here.
