DaskXGBoostClassifier Tree Method Hist NaN Value Training Bug #9271
Comments
Thank you for the quick fix!
Thank you for raising the issue!
We're seeing the exact same error messages on 1.7.3, but upgrading to 1.7.6 does not eliminate the problem. In addition, it's intermittent. Just confirming that 1.7.6 is supposed to include the fix.
@droctothorpe The error message is not very useful, actually; it just means one of the workers dropped out during training, but it doesn't show why the worker dropped out. Could you please share the worker log and open a new issue? Also, please try the latest XGBoost.
Thanks, @trivialfis! Much appreciated. 🙏 Testing against 1.7.6 and 2.1.0. Will create a new issue if the problem resurfaces.
Turns out it was the cluster autoscaler trying to bin-pack workers, and the specific XGBoost computation (RFE) not being resilient to worker loss. Adding
@zazulam suggested working on a contribution to improve XGBoost's resilience in the face of Dask worker outages. Do you have a sense of the scale of that lift, @trivialfis? I'm guessing that if it weren't a giant PITA, it probably would have been solved by now 🙃.
Thank you for bringing this up, @droctothorpe @zazulam. We have been thinking about it for a while now and have been trying to engage with the Dask developers (dask/distributed#8624; see the related issues there as well). I think we can handle some exceptions in XGBoost, like OOM, without Dask intervening, but for more severe failures like a killed worker, there has to be some cooperation between the distributed framework and the application. Another option is to simply restart the training without the problematic worker (losing its data) after a timeout or some other signal. I have been thinking about making these options a user-provided policy (callback). In the latest XGBoost, we raised the timeout threshold significantly due to imbalanced data sizes in some clusters, but we plan to introduce a user parameter in the next release.
Having said that, we are open to suggestions. Feel free to raise a new issue for discussion.
Issue Description
When using the DaskXGBoostClassifier in Python with distributed Dask and tree_method="hist", the following scenario causes an error:
If there are no NaNs, or there is at least one NaN in some column in every partition, then there is no issue.
If there is a NaN in at least one column in at least one partition, but not in at least one column in every partition, then training fails.
When it fails, one of two errors occurs: Allreduce failed or Poll timeout.
.../rabit/include/rabit/internal/utils.h:86: Allreduce failed coroutine
.../rabit/include/rabit/internal/socket.h:170: Poll timeout
How to Reproduce
This can be reproduced easily with n_partitions equal to n_workers (both greater than 1) and any dataset containing at least one NaN but fewer NaNs in total than n_partitions.
Once such a dataset is made and shown to fail, it can be modified to succeed again, either by removing the NaNs or by adding additional NaNs in any column and verifying that the NaNs end up spread across every partition.
Additionally, I tested forcing the NaN into the first partition in case a schema was being inferred, but this did not solve the issue.
The following is code to reproduce the issue with a dummy dataset:
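A minimal sketch of such a reproduction, assuming a local dask.distributed cluster with two workers; the dummy dataset, column names, and hyperparameters are illustrative rather than taken from the original report, and the estimator used is xgboost.dask.DaskXGBClassifier:

```python
# Hypothetical reproduction sketch: two workers, two partitions, and a single
# NaN confined to one partition (fewer NaNs than partitions).
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from xgboost.dask import DaskXGBClassifier

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Dummy dataset: 100 rows, 3 numeric features, binary target.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
    df["target"] = rng.integers(0, 2, size=100)

    # Put exactly one NaN near the top of the frame, so after splitting into
    # two partitions only the first partition contains a missing value.
    df.loc[0, "f0"] = np.nan

    # n_partitions == n_workers > 1, total NaNs < n_partitions.
    ddf = dd.from_pandas(df, npartitions=2)
    X = ddf[["f0", "f1", "f2"]]
    y = ddf["target"]

    clf = DaskXGBClassifier(tree_method="hist", n_estimators=10)
    clf.client = client  # optional; otherwise the current client is used
    clf.fit(X, y)  # expected to fail with "Allreduce failed" or "Poll timeout"
```

Spreading the NaNs so that every partition contains at least one (for example, also setting a value in the second half of the frame to NaN before partitioning) should let the same script train successfully, matching the behaviour described above.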