predicting memory usage #1577

Open
chubukov opened this issue Jun 9, 2022 · 5 comments

Comments

@chubukov

chubukov commented Jun 9, 2022

Is there a way to reliably predict the memory needed for STARsolo?

Even with --outSAMtype BAM Unsorted I find it hard to predict the memory usage that will be required. Since I'm running this in a scheduler environment where I need to specify and manage the maximum memory, this causes issues. Many jobs run all the way through mapping, only to get killed at the counting step because of excess memory demand.

Thanks!

@chubukov
Author

chubukov commented Jun 12, 2022

Here's an empirical estimation

This is with human genome (Cell Ranger annotation) and the full set of metrics

--soloFeatures Gene GeneFull GeneFull_ExonOverIntron GeneFull_Ex50pAS
--soloMultiMappers Unique PropUnique Uniform Rescue EM
--outSAMtype BAM Unsorted

[image: empirical memory usage estimates]

@alexdobin
Owner

Hi Victor,

thanks for the interesting empirical results.
Estimating the exact amount of RAM may be hard, as it depends not just on the number of reads, but also on the number of UMIs, cells, genes expressed, etc.

The main contributors:
For counting matrices:
4 (read index) + 4 (UMI) + 4 (gene) = 12 bytes per read
For BAM CB/UB output:
8 (CB) + 4 (UMI) = 12 bytes per read

This should result in 24 bytes per read, while you are observing ~40 bytes per read for samples with more than 1 billion reads.
Could you send me Log.out files for 2 runs with the highest RAM consumption with ~1B and ~2B reads?

Thanks!
Alex
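
The per-read accounting in the comment above can be turned into a rough back-of-the-envelope estimator. This is only a sketch under stated assumptions: the function and constant names are hypothetical, the byte counts are the ones quoted in the comment (12 for counting plus 12 for BAM CB/UB), and the result covers only these per-read arrays, not the genome index or any other STAR allocations.

```python
# Hypothetical helper names; byte counts are taken from the comment above.
COUNTING_BYTES_PER_READ = 4 + 4 + 4  # read index + UMI + gene
BAM_TAG_BYTES_PER_READ = 8 + 4       # CB + UMI


def solo_read_memory_bytes(n_reads, bam_tags=True, bytes_per_read=None):
    """Estimate memory used by the per-read arrays only.

    Pass bytes_per_read to override the theoretical figure, e.g. with an
    empirically observed value such as the ~40 bytes per read reported here.
    """
    if bytes_per_read is None:
        bytes_per_read = COUNTING_BYTES_PER_READ
        if bam_tags:
            bytes_per_read += BAM_TAG_BYTES_PER_READ
    return n_reads * bytes_per_read


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


if __name__ == "__main__":
    for n_reads in (1_000_000_000, 2_000_000_000):
        expected = to_gb(solo_read_memory_bytes(n_reads))
        observed = to_gb(solo_read_memory_bytes(n_reads, bytes_per_read=40))
        print(f"{n_reads:>13,} reads: expected {expected:.0f} GB, "
              f"at ~40 bytes/read {observed:.0f} GB")
```

At 1 billion reads the expected 24 bytes per read come to 24 GB, against 40 GB at the observed ~40 bytes per read, so the discrepancy grows by roughly 16 GB per billion reads.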

@chubukov
Author

chubukov commented Jun 17, 2022

Absolutely, thanks

https://gist.github.com/chubukov/b5ad260173dbc1fc6f0a0ca2b9328f63
https://gist.github.com/chubukov/3666a7c03e5459b71e24f41559c3ccb2

Additional question -- I'm running some tests now, but how would you expect this memory consumption to change with sorted BAM output?

Update: adding sorting took a sample job from ~84 GB to ~97 GB.

@alexdobin
Owner

Hi Victor,

thanks for the files!
I am starting to think that some of the arrays are not deallocated, so it uses much more memory than it should. I will check it and get back to you.

@chubukov
Author

@alexdobin I was hoping you would say something like that! :) Looking forward to testing a patch when you're able to look into it.
