Is it possible to update a model with new data without retraining the model from scratch? #3055

Closed
antontarasenko opened this issue Jan 21, 2018 · 16 comments

Comments

@antontarasenko

antontarasenko commented Jan 21, 2018

I need to update a model with new data without retraining the model from scratch. That is, incremental training for cases when not all the data is available right away.

This problem is similar to the "can't fit data in memory" problem, which was raised before in #56, #163, #244. But that was 2-3 years ago, and I see some changes in the available parameters, namely process_type and updater. The FAQ suggests using external memory via cacheprefix, but this assumes I have all the data ready.
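
For context, the external-memory route looks roughly like this (file name and parameters are placeholders): the '#' suffix makes xgboost stream the libsvm file through an on-disk cache instead of holding everything in RAM, but the full dataset still has to exist up front.

import xgboost as xgb

params = {'max_depth': 3, 'eta': 0.1}  # placeholder parameters
# '#dtrain.cache' is the cache prefix: xgboost builds an external-memory cache on disk
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
bst = xgb.train(params, dtrain, num_boost_round=10)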

The solution in #1686 uses several iterations over the entire data.

Another related issue is #2970, in particular #2970 (comment). I tried 'process_type': 'update' but it throws the initial error mentioned in that issue. Without it, the model gives inconsistent results.

I tried various combinations of parameters for train in Python, but train keeps building the model from scratch or doing something else unexpected. Here are the examples.

In a nutshell, this is what works (sometimes) and needs feedback from more experienced members of the community:

import xgboost as xgb
from sklearn.metrics import mean_squared_error

# train, train_1, train_2, test are xgb.DMatrix objects built elsewhere;
# y_test holds the test labels and params is the shared booster parameter dict

print('Full')
bst_full = xgb.train(dtrain=train, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_full.predict(test)))

print('Subset 1')
bst_1 = xgb.train(dtrain=train_1, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_1.predict(test)))

print('Subset 2')
bst_2 = xgb.train(dtrain=train_2, params=params)
print(mean_squared_error(y_true=y_test, y_pred=bst_2.predict(test)))

print('Subset 1 updated with subset 2')
bst_1u2 = xgb.train(dtrain=train_1, params=params)
bst_1u2 = xgb.train(dtrain=train_2, params=params, xgb_model=bst_1u2)
print(mean_squared_error(y_true=y_test, y_pred=bst_1u2.predict(test)))

Here I'm looking to minimize the difference between the first and the fourth models, but it keeps jumping up and down, even when the total number of boosting rounds is the same in both methods.
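
To make the comparison concrete, this is what I mean by equal total boosting rounds (the round counts here are only an example):

# single batch: 20 rounds on the full data
bst_full = xgb.train(dtrain=train, params=params, num_boost_round=20)

# incremental: 10 rounds on subset 1, then 10 more rounds continuing on subset 2
bst_inc = xgb.train(dtrain=train_1, params=params, num_boost_round=10)
bst_inc = xgb.train(dtrain=train_2, params=params, num_boost_round=10, xgb_model=bst_inc)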

Is there a canonical way to update models with newly arriving data alone?

Environment

  • Python: 3.6
  • xgboost: 0.7.post3

Similar issues

Contributors saying new-data training was impossible at the time of writing:

@hcho3
Collaborator

hcho3 commented Jan 22, 2018

In #2495, I said incremental training was "impossible". A little clarification is in order.

  • As Tianqi pointed out in Incremental Loads #56, tree construction algorithms currently depend on the availability of the whole data to choose optimal splits.
  • In addition, the gradient boosting algorithm used in XGBoost was formulated with a batch assumption, i.e. adding a new tree should each time reduce the training loss over the whole training data.
  • The "training continuation" feature (with xgb_model) thus does not do what many would think it does. One gets undefined behavior when xgb.train is asked to train further on a dataset different from the one used to train the model given in xgb_model. The behavior is "undefined" in the sense that the underlying algorithm makes no guarantee that the loss over (old data) + (new data) would be in any way reduced. Observe that the trees in the existing ensemble had no knowledge of the new incoming data. [EDIT: see @khotilov's comment below to learn about situations where training continuation with different data would make sense.]
  • One way out of this conundrum is to use the random forest approach: keep the old trees around, fit a new set of trees on the new data only, and then combine the old and new trees in a random forest. This is rather unsatisfactory, since you're throwing away the main benefits of boosted trees over random forests (e.g. a more compact model, lower bias, etc.).
  • Another way is to allow the old trees to be modified. The "training continuation" feature does NOT do this. On the other hand, the incremental training example in incremental learning, partial_fit like sklearn? #1686 does modify the old trees in the light of new data. The example makes several passes over the data (old and new) to ensure that all trees receive updates that reflect all the data.
  • So for now, your hope appears to lie in the option 'process_type': 'update' (see the sketch after this list). I think it is an experimental feature, so proceed at your own risk. To use the feature, make sure to install the latest XGBoost (0.7.post3). The feature is currently quite limited, in that you are not allowed to modify the tree structure; only leaf values will be updated.
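
A rough sketch of that route (parameter values are illustrative; dtrain_old/dtrain_new stand in for DMatrix objects built from your old and new data, and params is your usual booster parameter dict):

import xgboost as xgb

# initial model, trained on the old data
bst = xgb.train(params, dtrain_old, num_boost_round=50)

# refresh the existing trees using the new data: the tree structure is kept,
# only node statistics and leaf values are recomputed
refresh_params = dict(params,
                      process_type='update',
                      updater='refresh',
                      refresh_leaf=True)

# num_boost_round should not exceed the number of trees already in bst
bst_refreshed = xgb.train(refresh_params, dtrain_new,
                          num_boost_round=50, xgb_model=bst)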

Hope it helps!

@hcho3
Collaborator

hcho3 commented Jan 22, 2018

@antontarasenko Actually, I'm curious about the whole quest behind "incremental learning": what it means and why it is sought after. Can we schedule a brief Google hangout session to discuss?

@hcho3
Collaborator

hcho3 commented Jan 22, 2018

A vain guess: using an online algorithm for tree construction may do what you want. See this paper for instance.
Two limitations:

  • You'd need to assume that your data stream doesn't have any concept drift.
  • The gradient boosting algorithm needs to be reformulated using noisy samples rather than the whole training data. This paper by Friedman does so to some extent, although there the reformulation is used simply to reduce overfitting.

This paper is interesting too: it presents a way to find good splits without having all the data.

@CodingCat
Member

@Yunni The first item in @hcho3's reply reminds me of something about the newly added checkpoint feature in Spark.

We should have something blocking the user from using a different training dataset with this feature, to guarantee correctness.

@Yunni
Contributor

Yunni commented Jan 23, 2018

Right. I think we should check boosterType as well. We can put a metadata file which contains the boosterType and a checksum of the dataset. Sounds good?

@CodingCat
Member

How do you get the checksum of the dataset? A content hash?

@Yunni
Contributor

Yunni commented Jan 23, 2018

Yes. We can simply use an LRC checksum.
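
Roughly something like this on the Python side (an illustration only; the helper names are made up, and a generic content hash is used here rather than an LRC specifically):

import hashlib
import json

def dataset_checksum(path):
    # content hash of the training data file; any stable hash would do
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def write_checkpoint_meta(meta_path, booster_type, data_path):
    with open(meta_path, 'w') as f:
        json.dump({'boosterType': booster_type,
                   'dataChecksum': dataset_checksum(data_path)}, f)

def check_checkpoint_meta(meta_path, booster_type, data_path):
    with open(meta_path) as f:
        meta = json.load(f)
    if (meta['boosterType'] != booster_type
            or meta['dataChecksum'] != dataset_checksum(data_path)):
        raise ValueError('booster type or training data changed; '
                         'refusing to resume from this checkpoint')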

@CodingCat
Member

CodingCat commented Jan 23, 2018

Isn’t it time-consuming to calculate a hash? Maybe we can simply add reminders in the comments.

@hcho3
Collaborator

hcho3 commented Jan 23, 2018

@CodingCat Indeed, at minimum we need to warn the user not to change the dataset for training continuation.

That said, I just found a small warning in the CLI example, which says

Continue from Existing Model
If you want to continue boosting from an existing model, say 0002.model, use

../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model

xgboost will load from 0002.model and continue boosting for 2 rounds, and save output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function. [Emphasis mine]

Clearly we need to do a better job of making this warning more prominent.

@khotilov
Member

@hcho3 while it's true that in some specific application contexts it makes sense to restrict training continuation to the same data, I wouldn't make it a blanket statement and wouldn't implement any hard restrictions on that. Yes, when you have a large dataset and your goal is to achieve optimal performance on the whole dataset, you wouldn't get it (for the reasons you have described) by incrementally learning with either separate parts of the dataset or cumulatively increasing data.

However, there are applications where training continuation on new data makes good practical sense. E.g., suppose you get some new data that is related but has some sort of "concept drift". By taking an old model learned on the old data as "prior knowledge" and adapting it to the new data by training continuation on that new data, you often have a good chance of getting a model that performs better on future data resembling the new data than a model trained from scratch, whether on this new data only or on a combined sample of old + new data. Sometimes you don't even have access to the old data anymore, or cannot combine it with your new data (e.g., for legal reasons).
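
In code, that last scenario looks roughly like this (file names, parameters, and round counts are placeholders): the old data is no longer available, only the saved model is, and boosting simply continues on the new data alone.

import xgboost as xgb

params = {'max_depth': 3, 'eta': 0.1}                 # whatever was used originally
dnew = xgb.DMatrix('new_data.libsvm')                 # the only data still at hand

old_model = xgb.Booster(model_file='old_model.bin')   # "prior knowledge" learned on old data
adapted = xgb.train(params, dnew, num_boost_round=20, xgb_model=old_model)
adapted.save_model('adapted_model.bin')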

@hcho3
Collaborator

hcho3 commented Jan 26, 2018

@khotilov I stand corrected. Calling training continuation "undefined behavior" was a sweeping generalization, if what you have described is true.
I have a question for you: how does training continuation with boosting fare when it comes to handling concept drift? I read papers where the authors use random forests to handle concept drift, with a sliding window to deprecate old trees. (For an example, see this paper.)

@JoshuaC3

JoshuaC3 commented Feb 5, 2018

Firstly, there is a paper about using a random forest to initialise your GBM model to get better final results than just RF or GBM alone, and in fewer rounds. I cannot find it however :( This seems like a similar concept, except you are using a different GBM to initialise. I guess the other main difference is that it is on another set of data...

Secondly, sometimes it is more important to train a model quickly. I have been working on some time series problems where I have been doing transfer learning with LSTMs. I train the base model on generic historical data and then use transfer learning for the fine-tuning on specific live data. It would take too long to train a full new model on the live data, even though ideally I would. I think the same could be true of xgboost, i.e. 95% of the model's optimal prediction is better than no prediction.

@khotilov
Member

khotilov commented Feb 9, 2018

@hcho3 While my mention of "concept drift" was in a broad sense, boosting continuation would likely do better with "concept shifts" (when new data has some differences but is expected to remain stable). Picking up slow, continuous data "drifts" would be harder. But for strong trending drifts, even that random forest method might not work well, and some forecasting elements would have to be utilized. A lot would depend on the situation.

Also, a weak spot for boosted tree learners is that they are greedy feature-by-feature partitioners, so they might not pick up well on the kinds of changes where the univariate effects are not so significant, e.g., when only interactions change. It might be rather useful if we could add some sort of limited look-ahead functionality to xgboost. E.g., in https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/kdd13.pdf they have a bivariate scan step that I think might work well as a spin on the histogram algorithm.

As for why the predictive performance on future data that is similar to the "new data" is sometimes worse for a model trained over a combined "old data" + "new data" dataset than for training continuation on the "new data": the former model is optimized over the whole combined dataset, and that might happen at the expense of the "new data" when that "new data" is somewhat different and relatively small.

@liujxing

I thought incremental training with minibatches of data (just like SGD) would be roughly equivalent to subsampling the rows at each iteration. Is the subsampling in XGBoost performed only once for the whole training lifecycle, or once every iteration?

@benyaminelc90

I also need to use incremental learning. I've read all the links mentioned above; however, I'm confused.
So, is there any version of XGBoost that can retrain a trained xgb model on a newly received data point or batch of data?
I've found the links below, which addressed this issue before the date of this post. Don't they work? Can't we do incremental learning with them? What's the problem with them?

#1686
#484
https://stackoverflow.com/questions/38079853/how-can-i-implement-incremental-training-for-xgboost?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
#2495

@antontarasenko
Author

@benyaminelc90

This particular answer got close to proper incremental learning:

Still, as I learned from hcho3, GBM has limited capacity for updates without seeing all the data from the start. For example, it can easily update leaves but has difficulties with altering splits.

@tqchen tqchen closed this as completed Jul 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018