Incremental reading of a big dataset #64
Do you mean partial loading of the file? I think a better way would be to create a new package for incremental learning, so that people could find it more easily. What do you think?
Yep, partial loading. But it would be so simple to implement that I am not sure it deserves its own package. Basically it's a for loop: inside it, use something like fread to read a CSV in parts, hash each part, and merge the parts, so that the entire raw dataset with all its columns is never in memory. I don't think there is anything to gain by implementing the CSV parsing ourselves. Regarding formats other than CSV, I don't know the possibilities. So one little function in FeatureHashing should be feasible. What do you think? Kind regards,
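A minimal sketch of that loop in R might look like the following. The file name `big.csv`, the chunk size, the `~ .` formula, and the hash size are all placeholder assumptions, and a real implementation would need to handle the edge case where the row count is an exact multiple of the chunk size:

```r
library(data.table)
library(FeatureHashing)
library(Matrix)

chunk_size <- 1e5
# read only the header line to recover column names
header <- names(fread("big.csv", nrows = 0))

parts <- list()
skip <- 1  # skip the header row when reading the first chunk
repeat {
  chunk <- fread("big.csv", skip = skip, nrows = chunk_size,
                 header = FALSE, col.names = header)
  # hash this chunk into a sparse matrix; columns are aligned across
  # chunks because the column space is fixed by hash.size
  parts[[length(parts) + 1]] <-
    hashed.model.matrix(~ ., data = chunk, hash.size = 2^20)
  skip <- skip + nrow(chunk)
  if (nrow(chunk) < chunk_size) break  # last (partial) chunk reached
}

# full hashed design matrix; the raw data was never fully in memory
X <- do.call(rbind, parts)
```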
Just got an idea when re-reading your message; maybe it is what you meant. In XGBoost, you can continue a previous learning run. The method I described in my previous message still requires having all the observations in memory, just with fewer variables, right? What if instead we learn on the first part, unload that part from memory, and then improve the model by reading the second part, and so on? It would be very similar to Vowpal Wabbit, but with gradient boosting. I don't know how the gain of each tree branch would be computed on the trees built from the second part of the dataset compared to the first part. As gradient boosting is a negative-gradient method, the gain decreases monotonically with each new tree compared to the previous one; this is probably because most of the model is built in the first trees, and after that it's all about details. But maybe there is something else to take into account. What do you think? Kind regards,
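A rough sketch of this continued-training idea with the xgboost R package, assuming two hashed chunks `X1`/`X2` with labels `y1`/`y2` (placeholder names) and arbitrary parameters:

```r
library(xgboost)

params <- list(objective = "binary:logistic", eta = 0.1, max_depth = 6)

# fit the first rounds on chunk 1, then free it from memory
dtrain1 <- xgb.DMatrix(X1, label = y1)
bst <- xgb.train(params, dtrain1, nrounds = 50)
rm(dtrain1, X1); gc()

# resume from the existing model via xgb_model: the new trees are fit
# to the gradient of the loss evaluated on chunk 2 only
dtrain2 <- xgb.DMatrix(X2, label = y2)
bst <- xgb.train(params, dtrain2, nrounds = 50, xgb_model = bst)
```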
If the package collected many existing incremental algorithms in R and provided a consistent interface, I think it would deserve its own package. IMO, a package should focus on its purpose, because that makes it easier for users to find what they want and easier for maintainers to maintain. In fact, I have implemented some of these algorithms (logistic regression and a neural network with a kind of adaptive SGD) and collected them in a separate package. IMO, partial loading crosses that line, so I think it should go in a new package. If it is too easy to deserve a package, then we should leave it to the users.
There may be a compromise option here that would be useful for people who aren't especially familiar with feature hashing: we could emphasise these possibilities in the documentation without actually adding new functionality to the FeatureHashing package. For example, I was planning to emphasise in the sentiment analysis tutorial that the feature hashing approach means you do not need to read the training and test datasets at the same time in order to build a document-term matrix. That's not possible with the usual text processing packages. We could also explain that users could even read the training dataset in parts and gradually build a complete sparse binary/count representation of documents and terms, close to Michael's first suggestion here. We could even describe feature hashing + xgboost for sentiment analysis as a halfway step towards a pure online learner like Vowpal Wabbit.
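A small illustration of that point, assuming a character column named `text` and space-delimited tokens (both placeholders): two batches of documents hashed independently still land in the same column space, so the training and test matrices can be built at different times with no shared vocabulary.

```r
library(FeatureHashing)

train <- data.frame(text = c("good movie", "bad plot"),
                    stringsAsFactors = FALSE)
test  <- data.frame(text = c("good plot"), stringsAsFactors = FALSE)

# split() tokenises the text column; hash.size fixes the column space
m_train <- hashed.model.matrix(~ split(text, delim = " "),
                               data = train, hash.size = 2^16)
m_test  <- hashed.model.matrix(~ split(text, delim = " "),
                               data = test,  hash.size = 2^16)

ncol(m_train) == ncol(m_test)  # TRUE: the matrices are directly compatible
```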
@Lewis-C I like your approach!
@wush978 would you be interested if I implemented a function to do what I wrote in my last message of this issue: dmlc/xgboost#56? With the data.table package it would be easy to do, and it may be useful to some. Kind regards,
Michaël