
Incremental reading of a big dataset #64

Closed
pommedeterresautee opened this issue Mar 24, 2015 · 6 comments

@pommedeterresautee
Contributor

@wush978 would you be interested if I implemented a function to do what I described in my last message on this issue: dmlc/xgboost#56?

With the data.table package it would be easy to do, and it might be useful to some.

Kind regards,
Michaël

@wush978
Owner

wush978 commented Mar 24, 2015

Do you mean the partial loading of the file?

I think a better way would be to create a new package for incremental learning, so that people can find it more easily. What do you think?

@pommedeterresautee
Contributor Author

Yep, partial loading. But it would be so simple to implement that I am not sure it deserves its own package.

Basically it is a for loop: inside it, something like fread reads the CSV in parts, each part is hashed, and the parts are merged, so the entire dataset with all its columns is never in memory at once.

I don't think there is anything to gain from implementing the CSV parsing ourselves. Regarding formats other than CSV, I don't know what the possibilities are.

So one little function in FeatureHashing should be feasible. What do you think?
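Something along these lines is what I have in mind. This is only a rough sketch: the helper name, the chunking parameters and the call to hashed.model.matrix are illustrative and would need adapting:

```r
library(data.table)
library(Matrix)
library(FeatureHashing)

# Rough sketch: hash a large CSV chunk by chunk so the full raw dataset
# (with all its columns) is never in memory at once. The formula,
# hash.size and chunk.size are placeholders to adapt.
hash_csv_in_chunks <- function(path, formula, hash.size = 2^20, chunk.size = 1e5) {
  header  <- names(fread(path, nrows = 0L))
  n_rows  <- fread(path, select = 1L)[, .N]           # count the data rows once
  offsets <- seq(0L, n_rows - 1L, by = chunk.size)
  parts <- lapply(offsets, function(off) {
    chunk <- fread(path, skip = off + 1L, nrows = chunk.size,
                   header = FALSE, col.names = header)
    hashed.model.matrix(formula, chunk, hash.size = hash.size)
  })
  do.call(rbind, parts)   # the pieces are sparse, so the merged matrix stays small
}
```

The resulting sparse matrix could then be passed to xgb.DMatrix as usual.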

Kind regards,
Michaël

@pommedeterresautee
Contributor Author

Just got an idea when re-reading your message. Maybe it is what you meant.

In xgboost, you can continue training from a previously learned model. The method I described in my previous message still requires having all observations in memory at once, just with fewer variables, right?

What if we learn on the first part, unload it from memory, then improve the model by reading the second part, and so on? It would be very similar to Vowpal Wabbit but with gradient boosting.
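Roughly, assuming the chunks have already been hashed into sparse matrices (for instance with something like the helper sketched above), and relying on xgb.train being able to continue from an earlier booster through its xgb_model argument; the file list, labels and parameters below are only illustrative:

```r
library(xgboost)

# Sketch: fit on the first chunk, then keep improving the same booster on
# later chunks, never holding more than one chunk in memory at a time.
# `chunk_files` and `load_hashed_chunk()` are hypothetical placeholders.
params <- list(objective = "binary:logistic", max_depth = 6, eta = 0.1)
model  <- NULL
for (f in chunk_files) {
  chunk  <- load_hashed_chunk(f)   # list(data = sparse matrix, label = numeric vector)
  dtrain <- xgb.DMatrix(chunk$data, label = chunk$label)
  model  <- xgb.train(params = params, data = dtrain, nrounds = 50,
                      xgb_model = model)   # NULL on the first pass, then continued
  rm(chunk, dtrain); gc()          # free the current chunk before reading the next
}
```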

I don't know how the gain of each branch will be computed for the trees built on the second part of the dataset compared to the first part.

As gradient boosting is a negative-gradient method, the gain decreases monotonically with each new tree compared to the previous one. That is probably because most of the model is built by the first trees, and after that it is all about details. But maybe there is something else to take into account.

What do you think?

Kind regards,
Michaël

@wush978
Owner

wush978 commented Mar 24, 2015

If the package collects many existing incremental algorithms in R and provides a consistent interface, I think it deserves to be its own package. IMO, a package should focus on its purpose, because that makes it easier for users to find what they want and easier for the maintainer to maintain. In fact, I have implemented some of these algorithms (logistic regression and a neural network with a kind of adaptive SGD) and collected them in a separate package.

IMO, partial loading crosses that line, so I think it should be put in a new package. If it is too simple to deserve a package, then we should leave it to the users.

@formwork
Contributor

There may be a compromise option here that would be useful for people who aren't especially familiar with feature hashing, i.e. we could emphasise these possibilities in the documentation but not actually add new functionality to the FeatureHashing package.

For example, I was planning to emphasise in the sentiment analysis tutorial that the feature hashing approach means you do not need to read the training and test datasets at the same time in order to build a document-term matrix (there is a small sketch of this after the list below). That's not possible with the usual text processing packages. We could also explain that users could even read the training dataset in parts and gradually build a complete sparse binary/count representation of documents and terms, close to Michaël's first suggestion here.

We could even describe feature hashing + xgboost for sentiment analysis as being a halfway step towards a pure online learner like Vowpal Wabbit:

  • feature hashing can easily handle seeing new features even if the data arrives in parts
  • the linear learner in xgboost is fast because it uses gradient descent
  • at the moment, the combination requires having all of the hashed matrix in memory, so it's not a fully online learner, but in many situations this is unlikely to be a problem as the data is sparse by that stage
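
For the first point, a minimal sketch of what the tutorial could show, assuming hashed.model.matrix() is called with the same formula and hash.size on both files (the file names and the sentiment label column are illustrative):

```r
library(data.table)
library(FeatureHashing)

train <- fread("train.csv")
test  <- fread("test.csv")   # may contain terms never seen in the training data

hash.size <- 2^20
# Columns are determined by the hash, not by the levels observed in each file,
# so the two matrices line up without reading train and test together.
X_train <- hashed.model.matrix(~ . - sentiment, train, hash.size = hash.size)
X_test  <- hashed.model.matrix(~ .,             test,  hash.size = hash.size)
```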

@pommedeterresautee
Contributor Author

@Lewis-C I like your approach!
