Memory usage for large files #432

Closed
bkamins opened this issue May 10, 2019 · 12 comments

@bkamins (Member) commented May 10, 2019

Following h2oai/db-benchmark#85 (comment), we have an issue in the H2O benchmarks for large files.

@quinnj - do you think it is related to the conversion to a DataFrame? The error seems to indicate that the problem occurs earlier.

@nalimilan (Member) commented

So IIUC they are using a VM with 125GB of RAM and trying to load a 50GB CSV file. That should be enough to hold both the file contents and the tape in RAM.

@quinnj (Member) commented May 22, 2019

Is there a way to "retry" the benchmark to see if it does better on the latest 0.5.3 release? The most recent release should be a bit easier on memory, but I'd like to know whether we need to dig in further.
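
For reference, a minimal sketch (not from the benchmark scripts themselves) of pinning CSV.jl to the 0.5.3 release before re-running:

```julia
# Pin CSV.jl to the 0.5.3 release so the benchmark runs against it.
using Pkg
Pkg.add(PackageSpec(name = "CSV", version = "0.5.3"))
```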

@bkamins (Member Author) commented May 22, 2019

If you want to reproduce the benchmark data exactly, the instructions are in the Reproduce section (though producing large files with a structure similar to the one described there should be good enough :)) - I have never tried to reproduce the steps described in the official guide.
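
If exact reproduction isn't needed, a rough generator sketch like the following could produce a large file of similar shape. The column names and cardinalities below are illustrative assumptions, not the official generator's schema:

```julia
# Generate a groupby-style CSV of roughly the benchmark's shape:
# low-cardinality string keys, integer keys, and numeric value columns.
using CSV, DataFrames, Random

function make_groupby_csv(path::AbstractString, n::Integer; k::Integer = 100)
    ids = ["id" * string(i, pad = 3) for i in 1:k]   # low-cardinality string keys
    df = DataFrame(
        id1 = rand(ids, n),
        id2 = rand(ids, n),
        id3 = rand(1:max(1, n ÷ k), n),              # high-cardinality integer key
        v1  = rand(1:5, n),
        v2  = rand(1:15, n),
        v3  = round.(100 .* rand(n); digits = 6),
    )
    CSV.write(path, df)
end

make_groupby_csv("G1_test.csv", 10_000_000)  # scale n up to approach 50GB
```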

@nalimilan (Member) commented

Apparently they ran the benchmarks on July 29th (so with recent package versions), and they still ran out of memory.

@quinnj (Member) commented Oct 14, 2019

@bkamins, do you know if there's a way to provide the types explicitly for the H2O benchmark? If so, that would drastically improve the memory footprint with #510 merged.
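
For illustration, a minimal sketch of passing types explicitly to CSV.File; the file name and column names here are assumptions, not the benchmark's actual schema:

```julia
# Supplying types up front lets CSV.jl skip type detection and allocate
# concretely typed columns from the start.
using CSV

f = CSV.File("data.csv";
             types = Dict(:id1 => String, :id2 => String, :id3 => Int,
                          :v1 => Int, :v2 => Int, :v3 => Float64))
```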

@nalimilan (Member) commented Oct 15, 2019

I think we can basically make a PR to apply any changes we want at https://github.com/h2oai/db-benchmark/blob/a095028c4c9809304f0d2356e5fefce63194a03d/juliadf/groupby-juliadf.jl#L23. They shouldn't object since the benchmark is about grouping, not CSV reading.

@nalimilan (Member) commented

Oh, and do you think that, without changing that line, #510 might improve memory use enough that it now works? We don't really need reading the data to be fast (it's not included in the benchmark); it just has to work.

@bkamins (Member Author) commented Oct 15, 2019

Yes - in general, in the H2O benchmark we should switch to the option that is most memory-efficient for the largest data set case (it does not have to be the fastest). And any PR we submit there should be accepted.

@bkamins (Member Author) commented Nov 14, 2019

The H2O benchmark still produces "out of memory" errors due to CSV.File. See https://github.com/h2oai/db-benchmark/blob/7e178c1d2fb9102c8b12ac201f883981254a9df6/benchplot-dict.R#L171 for the detailed list of offending tests.

@quinnj - is anything still fixable here? Thank you!

CC @nalimilan

@nalimilan (Member) commented

I've filed h2oai/db-benchmark#119 to pass types explicitly; let's see whether that fixes the problem.

@quinnj (Member) commented Jun 25, 2020

This should be much better on master. A good optimization we should look into is passing the exact number of rows via limit=rows to ensure the row-estimation mechanism doesn't over-estimate too much (and hence over-allocate).
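
A minimal sketch of that idea, assuming the row count is known ahead of time; the file name and count below are illustrative:

```julia
using CSV

nrows = 1_000_000_000                     # known row count of the file (illustrative)
f = CSV.File("data.csv"; limit = nrows)   # keeps the row estimate from exceeding nrows
```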

quinnj closed this as completed Jun 25, 2020

@bkamins (Member Author) commented Jun 25, 2020

👍

Let's hope this makes the 50GB H2O benchmarks go through!
