Is it possible to update a model with new data without retraining the model from scratch? #3055
In #2495, I said incremental training was "impossible". A little clarification is in order.
Hope it helps!
@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google Hangouts session to discuss?
A tentative guess: using an online algorithm for tree construction may do what you want. See this paper for instance.
This paper is interesting too: it presents a way to find good splits without having all the data.
Right. I think we should also check the checksum of the dataset.
How do you get the checksum of the dataset? A content hash?
Yes. We can simply use an LRC checksum.
Isn't it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.
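For illustration, a minimal sketch of what such a dataset fingerprint could look like (using SHA-256 here rather than LRC for simplicity; the helper name is made up, not part of xgboost):

```python
import hashlib

def dataset_checksum(path, chunk_size=1 << 20):
    """Content hash of a dataset file, computed in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Store the checksum next to the saved model; before continuing training,
# verify that the dataset has not changed since the model was saved.
checksum_at_save = dataset_checksum("train.libsvm")
# ... later, before training continuation ...
if dataset_checksum("train.libsvm") != checksum_at_save:
    raise ValueError("Dataset changed since the model was trained; "
                     "training continuation may behave unexpectedly.")
```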
@CodingCat Indeed, at minimum we need to warn the user not to change the dataset for training continuation. That said, I just found a small warning in the CLI example, which says:

```
../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model
```

Clearly we need to do a better job of making this warning more prominent.
@hcho3 While it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement, and I wouldn't implement any hard restrictions on it. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you wouldn't get there (for the reasons you have described) by incrementally learning with either separate parts of the dataset or cumulatively increasing data.

However, there are applications where training continuation on new data makes good practical sense. E.g., when you get some new data that is related but exhibits some sort of "concept drift", there is often a good chance that by taking an old model learned on old data as "prior knowledge", and adapting it to the new data by training continuation on that new data, you would get a better-performing model for future data resembling the new data than by training from scratch, either on this new data alone or on a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).
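As a concrete illustration of this adaptation scenario, here is a minimal sketch using the `xgb_model` argument of `xgb.train` in the Python package (synthetic data; parameter values are placeholders):

```python
import numpy as np
import xgboost as xgb

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

# Synthetic stand-ins for the "old" and the "new" (drifted) data.
rng = np.random.RandomState(42)
X_old, y_old = rng.randn(500, 5), rng.randint(0, 2, 500)
X_new, y_new = rng.randn(200, 5), rng.randint(0, 2, 200)

# Train the "prior knowledge" model on the old data.
model_old = xgb.train(params, xgb.DMatrix(X_old, label=y_old),
                      num_boost_round=100)

# Adapt to the new data by continuing from the old model: new trees are
# appended on top of the existing ones, which stay unchanged.
model_adapted = xgb.train(params, xgb.DMatrix(X_new, label=y_new),
                          num_boost_round=20, xgb_model=model_old)
```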
@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
Firstly, there is a paper about using a random forest to initialise your GBM model, getting better final results than either RF or GBM alone, and in fewer rounds. I cannot find it, however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...

Secondly, sometimes it is more important to train a model quickly. I have been working on some time-series problems where I have been doing transfer learning with LSTMs. I train the base model on generic historical data and then use transfer learning for fine-tuning on specific live data. It would take too long to train a full new model on live data, even though ideally I would. I think the same could be true of using xgboost. I.e., 95% of the model's optimal prediction is better than no prediction.
@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow continuous data "drifts" would be harder. But for strongly trending drifts, even that random forest method might not work well, and some forecasting elements would have to be utilized. A lot would depend on the situation.

Also, a weak spot of boosted-tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not so significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin-off of the histogram algorithm.

As for why the predictive performance on future data that is similar to the "new data" is sometimes worse for a model trained over a combined "old data" + "new data" dataset, compared to a training continuation on the "new data": this is because the former model is optimized over the whole combined dataset, and that might happen at the expense of the "new data" when that "new data" is somewhat different and relatively small.
I thought incremental training with minibatches of data (just like SGD) is kind of equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training lifecycle, or once every iteration?
I also need to use incremental learning. I've read all the links mentioned above. However, I'm still confused. #1686
This particular answer got close to proper incremental learning. Still, as I learned from hcho3, GBM has limited capacity for updates without seeing all the data from the start. For example, it can easily update leaves but has difficulty altering splits.
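For reference, a minimal sketch of that "update leaves but not splits" mode, via the `refresh` updater with `process_type=update` (synthetic data; parameter values are illustrative):

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X_old, y_old = rng.randn(500, 5), rng.randint(0, 2, 500)
X_new, y_new = rng.randn(500, 5), rng.randint(0, 2, 500)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}
model = xgb.train(params, xgb.DMatrix(X_old, label=y_old), num_boost_round=50)

# Re-estimate the statistics and leaf values of the existing 50 trees on the
# new data; no split is changed, so the tree structure stays fixed.
refresh_params = dict(params,
                      process_type="update",  # modify existing trees, add none
                      updater="refresh",      # refresh node stats / leaf values
                      refresh_leaf=1)         # update leaf values too, not only stats
refreshed = xgb.train(refresh_params, xgb.DMatrix(X_new, label=y_new),
                      num_boost_round=50,     # must not exceed the trees in xgb_model
                      xgb_model=model)
```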
I need to update a model with new data without retraining the model from scratch. That is, incremental training for cases when not all the data is available right away.
This problem is similar to the "can't fit data in memory" problem, which was raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters `process_type` and `updater`. The FAQ suggests using external memory via `cacheprefix`, but this assumes I have all the data ready. The solution in #1686 uses several iterations over the entire data.
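(For context, the external-memory mode works by appending a cache-file prefix to the data path; a minimal sketch, assuming a libsvm file `train.libsvm` exists on disk:)

```python
import xgboost as xgb

# The '#' suffix tells xgboost to stream the libsvm file from disk,
# using 'dtrain.cache' as the prefix for its on-disk cache files.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
params = {"objective": "binary:logistic", "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=10)
```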
Another related issue is #2970, in particular #2970 (comment). I tried `'process_type': 'update'`, but it throws the initial error mentioned in that issue; without it, the model gives inconsistent results. I tried various combinations of parameters for `train` in Python, and `train` keeps building the model from scratch or doing something else. Here are the examples. In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:
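(A minimal sketch of the kind of comparison described, with synthetic data as a stand-in and compressed to two models: one trained on all data at once, one trained incrementally in batches with the same total number of rounds:)

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for "data arriving in batches".
rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
y = (X[:, 0] + rng.randn(1000) * 0.1 > 0).astype(int)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 3}

# Model A: all data at once, 20 rounds.
dall = xgb.DMatrix(X, label=y)
model_full = xgb.train(params, dall, num_boost_round=20)

# Model B: two batches of 10 rounds each, continuing via xgb_model,
# so the total number of boosting rounds matches model_full.
d1 = xgb.DMatrix(X[:500], label=y[:500])
d2 = xgb.DMatrix(X[500:], label=y[500:])
m1 = xgb.train(params, d1, num_boost_round=10)
model_inc = xgb.train(params, d2, num_boost_round=10, xgb_model=m1)

# The gap this issue is about: the two models' predictions differ, because
# each incremental batch only sees part of the data when splits are chosen.
diff = np.abs(model_full.predict(dall) - model_inc.predict(dall)).mean()
print("mean |pred_full - pred_incremental| =", diff)
```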
Here I'm looking to minimize the difference between the first and the fourth models. But it keeps jumping up and down, even with equal total boosting rounds in both methods.
Is there a canonical way to update models with newly arriving data alone?
Environment

`xgboost`: 0.7.post3

Similar issues
Contributors saying new-data training was impossible at the time of writing: