
a few tweaks to speed up data generation #4164

Merged: 3 commits into dmlc:master on Feb 21, 2019
Conversation

@rongou (Contributor) commented Feb 19, 2019

With the default parameters, the data generation time is reduced from 47.674 seconds to 2.027 seconds on my machine.

But I still run out of memory when converting to DMatrix if I try to generate too large a dataset. Need to investigate further.

@RAMitchell (Member) commented:
What if we just use a completely randomly generated X/y matrix e.g. numpy.random.rand? It will be fast to generate and still generate trees. It will be more or less useless for checking accuracy but for speed it should be fine.
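A minimal sketch of this suggestion (the shapes, dtypes, and names below are illustrative, not taken from the PR): draw the feature matrix and labels directly from NumPy's random generators instead of calling make_classification().

```python
import numpy as np

# Sketch of the "completely random X/y" idea: uniform features, random
# binary labels. The real benchmark uses millions of rows; the sizes
# here are kept small for illustration.
rows, cols = 10_000, 100
X = np.random.rand(rows, cols).astype(np.float32)  # dense random features
y = np.random.randint(0, 2, size=rows)             # random 0/1 labels
```

As noted above, trees trained on such data are useless for measuring accuracy, but generation is essentially as fast as NumPy can fill memory, which is enough for a speed benchmark.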

@rongou (Contributor, Author) commented Feb 20, 2019

Is that more or less what make_classification() is doing? After these tweaks the time to generate the data seems reasonable, but I still run out of memory on my desktop (32 GB RAM) if I attempt too large a dataset.

@RAMitchell (Member) commented:
make_classification() does something a little more complicated. From memory, I think it generates a set of Gaussian clusters and assigns points to those clusters. It's not very memory-efficient or fast.
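A rough sketch, in plain NumPy, of the Gaussian-cluster idea described above (this is a simplification of what sklearn's make_classification actually does; all sizes and parameters here are illustrative): pick one centroid per class and sample each point from a normal distribution around its class centroid.

```python
import numpy as np

# Simplified Gaussian-cluster data generation: one centroid per class,
# points sampled from a normal distribution around their centroid.
rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 2, 10, 500

centroids = rng.uniform(-1, 1, size=(n_classes, n_features))
X = np.vstack([
    centroids[c] + rng.normal(scale=0.5, size=(n_per_class, n_features))
    for c in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)
```

Even this simplified version allocates per-cluster intermediate arrays before stacking, which hints at why the full sklearn implementation is slower and more memory-hungry than a single random-matrix fill.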

@rongou (Contributor, Author) commented Feb 20, 2019

I think there is some value in checking the accuracy.

Anyway, with the last commit, I can now reliably crash training on my laptop (32 GB RAM, Quadro P1000 with 4 GB):

$ python benchmark_tree.py --rows 13000000 --columns 100 --test_size 0.01
Generating dataset: 13000000 rows * 100 columns
0.01/0.99 test/train split
Generate Time: 48.709909200668335 seconds
DMatrix Start
DMatrix Time: 5.693941593170166 seconds
Training with 'gpu_hist'
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: out of memory
Aborted (core dumped)
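A back-of-the-envelope estimate (not from the PR) suggests why this configuration overflows a 4 GB GPU: the raw feature matrix alone for 13M rows by 100 columns exceeds the card's memory before any training state is allocated.

```python
# Rough memory footprint of the dense feature matrix, before DMatrix
# conversion or any gpu_hist working memory is accounted for.
rows, cols = 13_000_000, 100
bytes_fp32 = rows * cols * 4   # float32
bytes_fp64 = rows * cols * 8   # float64 (NumPy's default dtype)
print(bytes_fp32 / 2**30)      # roughly 4.8 GiB
print(bytes_fp64 / 2**30)      # roughly 9.7 GiB
```

Either dtype already exceeds the 4 GB Quadro P1000, so an out-of-memory abort from thrust during gpu_hist training is consistent with this arithmetic.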

@rongou (Contributor, Author) commented Feb 20, 2019

@RAMitchell it's a good idea to switch to random numpy arrays. :) Please take another look.

@RAMitchell RAMitchell merged commit 8e0a08f into dmlc:master Feb 21, 2019
@rongou rongou deleted the benchmark-tree-speedup branch February 21, 2019 18:05
@lock lock bot locked as resolved and limited conversation to collaborators May 22, 2019