Memory usage for large files #432
So IIUC they are using a VM with 125GB of RAM and trying to load a 50GB CSV file. That should be enough to hold both the file contents and the tape in RAM.

Is there a way to "retry" the benchmark to see if it's better on the latest 0.5.3 release? The most recent release should be a little better on memory pressure, but I'd like to know if we need to dig in further.

Apparently they ran the benchmarks on July 29th (so with recent package versions), and they still run out of memory.

I think we can basically make a PR to apply any changes we want at https://github.com/h2oai/db-benchmark/blob/a095028c4c9809304f0d2356e5fefce63194a03d/juliadf/groupby-juliadf.jl#L23. They shouldn't object, since the benchmark is about grouping, not CSV reading.
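A memory-conscious version of that read line might look like the sketch below. This is an assumption-laden illustration, not the benchmark's actual code: the file name and column names are placeholders, and keyword availability depends on the CSV.jl version pinned by the benchmark.

```julia
using CSV, DataFrames

# Hypothetical replacement for the read line in groupby-juliadf.jl.
# `types` skips type detection, and `pool = true` stores repeated strings
# as pooled columns, which can cut memory substantially for grouping keys.
df = CSV.read("G1_1e9_1e2_0_0.csv", DataFrame;
              types = Dict(:id1 => String, :id2 => String, :v1 => Int),
              pool = true)
```

Whether these particular options are the most memory-efficient combination for the 50GB case would need to be measured on the benchmark machine.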
Oh, and do you think that, without changing that line, #510 may improve memory use enough that it may now work? We don't really need reading the data to be fast (it's not included in the benchmark), it just has to work.

Yes - in general, in the H2O benchmark, we should change to the option that is most memory efficient for the largest data set case (it does not have to be the fastest). And any PR we would submit there should be accepted.

H2O benchmark still produces "out of memory" due to CSV.File. See https://github.com/h2oai/db-benchmark/blob/7e178c1d2fb9102c8b12ac201f883981254a9df6/benchplot-dict.R#L171. @quinnj - is anything still fixable here? Thank you! CC @nalimilan

I've filed h2oai/db-benchmark#119 to pass types explicitly, let's see whether that fixes the problem.
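Passing types explicitly means the parser never has to re-parse a column after a type-detection miss, which is one suspected source of the memory spikes. A minimal sketch (file name and column schema are made up for illustration):

```julia
using CSV, DataFrames

# Supplying `types` up front disables per-column type inference.
# The vector form assigns one type per column, in order.
f = CSV.File("data.csv";
             types = [String, String, Int, Int, Float64])
df = DataFrame(f)
```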
This should be much better on master; a good optimization we should look into is passing the exact # of rows via
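The keyword meant above is cut off in the comment, so the following is only a guess at the idea: counting rows first (cheap relative to parsing) and handing the parser an exact row count, so output vectors can be sized once. `limit` is the CSV.jl keyword closest to this; whether it is the option intended here, or whether it preallocates rather than merely caps the row count, is an assumption.

```julia
using CSV, DataFrames

path = "data.csv"                # placeholder path
nrows = countlines(path) - 1     # subtract the header line
df = CSV.read(path, DataFrame; limit = nrows)
```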
👍 Let us hope this makes 50GB H2O benchmarks go through!
Following h2oai/db-benchmark#85 (comment) we have an issue in H2O benchmarks for large files.

@quinnj - do you think it is related to a conversion using `DataFrame`? The error seems to indicate that we have a problem earlier.