
Incremental reading of a big dataset #64

Closed
pommedeterresautee opened this issue Mar 24, 2015 · 6 comments

@pommedeterresautee
Contributor

@wush978 would you be interested if I implemented a function to do what I described in my last message on this issue: dmlc/xgboost#56?

With the data.table package it would be easy to do, and it might be useful to some.

Kind regards,
Michaël

@wush978
Owner

wush978 commented Mar 24, 2015

Do you mean the partial loading of the file?

I think a better way would be to create a new package for incremental learning, so that people can find it more easily. What do you think?

@pommedeterresautee
Contributor Author

Yep, partial loading. But it would be so simple to implement that I am not sure it deserves its own package.

Basically it is a for loop: inside it, something like fread reads the CSV in parts, each part is hashed, and the parts are merged, so the entire dataset with all its columns is never in memory at once.

I don't think there is anything to gain from implementing the CSV parsing ourselves. Regarding formats other than CSV, I don't know what the possibilities are.

So one little function in FeatureHashing should be feasible. What do you think?
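Something along these lines is what I have in mind. This is only a rough sketch: the helper name, the chunking parameters and the call to hashed.model.matrix are illustrative and would need adapting:

```r
library(data.table)
library(Matrix)
library(FeatureHashing)

# Rough sketch: hash a large CSV chunk by chunk so the full raw dataset
# (with all its columns) is never in memory at once. The formula,
# hash.size and chunk.size are placeholders to adapt.
hash_csv_in_chunks <- function(path, formula, hash.size = 2^20, chunk.size = 1e5) {
  header  <- names(fread(path, nrows = 0L))
  n_rows  <- fread(path, select = 1L)[, .N]           # count the data rows once
  offsets <- seq(0L, n_rows - 1L, by = chunk.size)
  parts <- lapply(offsets, function(off) {
    chunk <- fread(path, skip = off + 1L, nrows = chunk.size,
                   header = FALSE, col.names = header)
    hashed.model.matrix(formula, chunk, hash.size = hash.size)
  })
  do.call(rbind, parts)   # the pieces are sparse, so the merged matrix stays small
}
```

The resulting sparse matrix could then be passed to xgb.DMatrix as usual.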

Kind regards,
Michaël

@pommedeterresautee
Contributor Author

Just got an idea when re-reading your message. Maybe it is what you meant.

In xgboost, you can continue training from a previously learned model. The method I described in my previous message still requires having all observations in memory at once, just with fewer variables, right?

What if we learn on the first part, unload it from memory, then improve the model by reading the second part, and so on? It would be very similar to Vowpal Wabbit but with gradient boosting.
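Roughly, assuming the chunks have already been hashed into sparse matrices (for instance with something like the helper sketched above), and relying on xgb.train being able to continue from an earlier booster through its xgb_model argument; the file list, labels and parameters below are only illustrative:

```r
library(xgboost)

# Sketch: fit on the first chunk, then keep improving the same booster on
# later chunks, never holding more than one chunk in memory at a time.
# `chunk_files` and `load_hashed_chunk()` are hypothetical placeholders.
params <- list(objective = "binary:logistic", max_depth = 6, eta = 0.1)
model  <- NULL
for (f in chunk_files) {
  chunk  <- load_hashed_chunk(f)   # list(data = sparse matrix, label = numeric vector)
  dtrain <- xgb.DMatrix(chunk$data, label = chunk$label)
  model  <- xgb.train(params = params, data = dtrain, nrounds = 50,
                      xgb_model = model)   # NULL on the first pass, then continued
  rm(chunk, dtrain); gc()          # free the current chunk before reading the next
}
```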

I don't know how the gain of each branch will be computed for the trees built on the second part of the dataset compared to the first part.

As gradient boosting is a negative-gradient method, the gain decreases monotonically with each new tree compared to the previous one. That is probably because most of the model is built by the first trees, and after that it is all about details. But maybe there is something else to take into account.

What do you think?

Kind regards,
Michaël

@wush978
Owner

wush978 commented Mar 24, 2015

If the package collects many existing incremental algorithms in R and provides a consistent interface, I think it deserves to be its own package. IMO, a package should focus on its purpose, because that makes it easier for users to find what they want and easier for the maintainer to maintain. In fact, I have implemented some of these algorithms (logistic regression and a neural network with a kind of adaptive SGD) and collected them in a separate package.

IMO, partial loading crosses that line, so I think it should be put in a new package. If it is too simple to deserve a package, then we should leave it to the users.

@formwork
Contributor

There may be a compromise option here that would be useful for people who aren't especially familiar with feature hashing, i.e. we could emphasise these possibilities in the documentation but not actually add new functionality to the FeatureHashing package.

For example, I was planning to emphasise in the sentiment analysis tutorial that the feature hashing approach means you do not need to read the training and test datasets at the same time in order to build a document-term matrix (there is a small sketch of this after the list below). That's not possible with the usual text processing packages. We could also explain that users could even read the training dataset in parts and gradually build a complete sparse binary/count representation of documents and terms, close to Michaël's first suggestion here.

We could even describe feature hashing + xgboost for sentiment analysis as being a halfway step towards a pure online learner like Vowpal Wabbit:

  • feature hashing can easily handle seeing new features even if the data arrives in parts
  • the linear learner in xgboost is fast because it uses gradient descent
  • at the moment, the combination requires having all of the hashed matrix in memory, so it's not a fully online learner, but in many situations this is unlikely to be a problem as the data is sparse by that stage
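
For the first point, a minimal sketch of what the tutorial could show, assuming hashed.model.matrix() is called with the same formula and hash.size on both files (the file names and the sentiment label column are illustrative):

```r
library(data.table)
library(FeatureHashing)

train <- fread("train.csv")
test  <- fread("test.csv")   # may contain terms never seen in the training data

hash.size <- 2^20
# Columns are determined by the hash, not by the levels observed in each file,
# so the two matrices line up without reading train and test together.
X_train <- hashed.model.matrix(~ . - sentiment, train, hash.size = hash.size)
X_test  <- hashed.model.matrix(~ .,             test,  hash.size = hash.size)
```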

@pommedeterresautee
Contributor Author

@Lewis-C I like your approach!
