Memory usage for large files #432

Closed
bkamins opened this issue May 10, 2019 · 12 comments

@bkamins (Member) commented May 10, 2019

Following h2oai/db-benchmark#85 (comment), we have an issue in the H2O benchmarks for large files.

@quinnj - do you think it is related to the conversion to a DataFrame? The error seems to indicate that the problem occurs earlier.

@nalimilan (Member) commented

So IIUC they are using a VM with 125GB of RAM and trying to load a 50GB CSV file. That should be enough to hold both the file contents and the tape in RAM.

@quinnj (Member) commented May 22, 2019

Is there a way to "retry" the benchmark to see if it does better on the latest 0.5.3 release? The most recent release should be a bit easier on memory, but I'd like to know whether we need to dig in further.
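
For reference, a minimal sketch (not from the benchmark scripts themselves) of pinning CSV.jl to the 0.5.3 release before re-running:

```julia
# Pin CSV.jl to the 0.5.3 release so the benchmark runs against it.
using Pkg
Pkg.add(PackageSpec(name = "CSV", version = "0.5.3"))
```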

@bkamins (Member Author) commented May 22, 2019

If you want to reproduce the benchmark data exactly, the instructions are in the Reproduce section (though producing large files with a structure similar to the one described there should be good enough :)) - I have never tried to reproduce the steps described in the official guide.
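
If exact reproduction isn't needed, a rough generator sketch like the following could produce a large file of similar shape. The column names and cardinalities below are illustrative assumptions, not the official generator's schema:

```julia
# Generate a groupby-style CSV of roughly the benchmark's shape:
# low-cardinality string keys, integer keys, and numeric value columns.
using CSV, DataFrames, Random

function make_groupby_csv(path::AbstractString, n::Integer; k::Integer = 100)
    ids = ["id" * string(i, pad = 3) for i in 1:k]   # low-cardinality string keys
    df = DataFrame(
        id1 = rand(ids, n),
        id2 = rand(ids, n),
        id3 = rand(1:max(1, n ÷ k), n),              # high-cardinality integer key
        v1  = rand(1:5, n),
        v2  = rand(1:15, n),
        v3  = round.(100 .* rand(n); digits = 6),
    )
    CSV.write(path, df)
end

make_groupby_csv("G1_test.csv", 10_000_000)  # scale n up to approach 50GB
```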

@nalimilan (Member) commented

Apparently they ran the benchmarks on July 29th (so with recent package versions), and they still ran out of memory.

@quinnj (Member) commented Oct 14, 2019

@bkamins, do you know if there's a way to provide the types explicitly for the H2O benchmark? If so, that would drastically improve the memory footprint with #510 merged.
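
For illustration, a minimal sketch of passing types explicitly to CSV.File; the file name and column names here are assumptions, not the benchmark's actual schema:

```julia
# Supplying types up front lets CSV.jl skip type detection and allocate
# concretely typed columns from the start.
using CSV

f = CSV.File("data.csv";
             types = Dict(:id1 => String, :id2 => String, :id3 => Int,
                          :v1 => Int, :v2 => Int, :v3 => Float64))
```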

@nalimilan (Member) commented Oct 15, 2019

I think we can basically make a PR to apply any changes we want at https://github.com/h2oai/db-benchmark/blob/a095028c4c9809304f0d2356e5fefce63194a03d/juliadf/groupby-juliadf.jl#L23. They shouldn't object since the benchmark is about grouping, not CSV reading.

@nalimilan (Member) commented

Oh, and do you think that, without changing that line, #510 might improve memory use enough that it now works? We don't really need reading the data to be fast (it's not included in the benchmark); it just has to work.

@bkamins (Member Author) commented Oct 15, 2019

Yes - in general, in the H2O benchmark we should switch to the option that is most memory-efficient for the largest data set case (it does not have to be the fastest). And any PR we submit there should be accepted.

@bkamins (Member Author) commented Nov 14, 2019

The H2O benchmark still produces "out of memory" errors due to CSV.File. See https://github.com/h2oai/db-benchmark/blob/7e178c1d2fb9102c8b12ac201f883981254a9df6/benchplot-dict.R#L171 for the detailed list of offending tests.

@quinnj - is anything still fixable here? Thank you!

CC @nalimilan

@nalimilan (Member) commented

I've filed h2oai/db-benchmark#119 to pass types explicitly; let's see whether that fixes the problem.

@quinnj (Member) commented Jun 25, 2020

This should be much better on master. A good optimization we should look into is passing the exact number of rows via limit=rows to ensure the row-estimation mechanism doesn't over-estimate too much (and hence over-allocate).
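
A minimal sketch of that idea, assuming the row count is known ahead of time; the file name and count below are illustrative:

```julia
using CSV

nrows = 1_000_000_000                     # known row count of the file (illustrative)
f = CSV.File("data.csv"; limit = nrows)   # keeps the row estimate from exceeding nrows
```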

quinnj closed this as completed Jun 25, 2020

@bkamins (Member Author) commented Jun 25, 2020

👍

Let's hope this makes the 50GB H2O benchmarks go through!
