
a few tweaks to speed up data generation #4164

Merged: 3 commits into dmlc:master on Feb 21, 2019
Conversation

@rongou (Contributor) commented Feb 19, 2019

With the default parameters, the data generation time is reduced from 47.674 seconds to 2.027 seconds on my machine.

But I still run out of memory when converting to DMatrix if I try to generate too large a dataset. Need to investigate further.

@RAMitchell (Member) commented:
What if we just use a completely randomly generated X/y matrix e.g. numpy.random.rand? It will be fast to generate and still generate trees. It will be more or less useless for checking accuracy but for speed it should be fine.
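A minimal sketch of this suggestion (the shapes, dtypes, and names below are illustrative, not taken from the PR): draw the feature matrix and labels directly from NumPy's random generators instead of calling make_classification().

```python
import numpy as np

# Sketch of the "completely random X/y" idea: uniform features, random
# binary labels. The real benchmark uses millions of rows; the sizes
# here are kept small for illustration.
rows, cols = 10_000, 100
X = np.random.rand(rows, cols).astype(np.float32)  # dense random features
y = np.random.randint(0, 2, size=rows)             # random 0/1 labels
```

As noted above, trees trained on such data are useless for measuring accuracy, but generation is essentially as fast as NumPy can fill memory, which is enough for a speed benchmark.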

@rongou (Contributor, Author) commented Feb 20, 2019

Is that more or less what make_classification() is doing? After these tweaks the time to generate the data seems reasonable, but I still run out of memory on my desktop (32 GB RAM) if I attempt too large a dataset.

@RAMitchell (Member) commented:
make_classification() does something a little more complicated. From memory, I think it generates a set of Gaussian clusters and assigns points to those clusters. It's not very memory-efficient or fast.
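A rough sketch, in plain NumPy, of the Gaussian-cluster idea described above (this is a simplification of what sklearn's make_classification actually does; all sizes and parameters here are illustrative): pick one centroid per class and sample each point from a normal distribution around its class centroid.

```python
import numpy as np

# Simplified Gaussian-cluster data generation: one centroid per class,
# points sampled from a normal distribution around their centroid.
rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 2, 10, 500

centroids = rng.uniform(-1, 1, size=(n_classes, n_features))
X = np.vstack([
    centroids[c] + rng.normal(scale=0.5, size=(n_per_class, n_features))
    for c in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)
```

Even this simplified version allocates per-cluster intermediate arrays before stacking, which hints at why the full sklearn implementation is slower and more memory-hungry than a single random-matrix fill.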

@rongou (Contributor, Author) commented Feb 20, 2019

I think there is some value in checking the accuracy.

Anyway, with the last commit, I can now reliably crash training on my laptop (32 GB RAM, Quadro P1000 with 4 GB):

$ python benchmark_tree.py --rows 13000000 --columns 100 --test_size 0.01
Generating dataset: 13000000 rows * 100 columns
0.01/0.99 test/train split
Generate Time: 48.709909200668335 seconds
DMatrix Start
DMatrix Time: 5.693941593170166 seconds
Training with 'gpu_hist'
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory
Aborted (core dumped)
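A back-of-the-envelope estimate (not from the PR) suggests why this configuration overflows a 4 GB GPU: the raw feature matrix alone for 13M rows by 100 columns exceeds the card's memory before any training state is allocated.

```python
# Rough memory footprint of the dense feature matrix, before DMatrix
# conversion or any gpu_hist working memory is accounted for.
rows, cols = 13_000_000, 100
bytes_fp32 = rows * cols * 4   # float32
bytes_fp64 = rows * cols * 8   # float64 (NumPy's default dtype)
print(bytes_fp32 / 2**30)      # roughly 4.8 GiB
print(bytes_fp64 / 2**30)      # roughly 9.7 GiB
```

Either dtype already exceeds the 4 GB Quadro P1000, so an out-of-memory abort from thrust during gpu_hist training is consistent with this arithmetic.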

@rongou (Contributor, Author) commented Feb 20, 2019

@RAMitchell it's a good idea to switch to random numpy arrays. :) Please take another look.

@RAMitchell RAMitchell merged commit 8e0a08f into dmlc:master Feb 21, 2019
@rongou rongou deleted the benchmark-tree-speedup branch February 21, 2019 18:05
@lock lock bot locked as resolved and limited conversation to collaborators May 22, 2019