lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704 #2793
Comments
I think the latest master branch will not produce this error anymore. But this is still a potential bug in the GPU learner. ping @huanzhang12 |
On master
|
it is still a GPU bug. |
@guFalcon @huanzhang12 FYI, we are tracking a major accuracy issue with the latest lightgbm compared to before. This is just a heads-up; perhaps it's related to this issue. But we'll post a separate issue once we have a moment to generate an MRE. |
Thanks @pseudotensor, can the accuracy issue be reproduced on CPU? |
BTW, maybe this is related: #2811 |
I think this may be fixed by #2811 too. |
So on the latest master branch, the CPU version is okay, while the GPU version fails? |
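To make the CPU-vs-GPU question above concrete, here is a minimal sketch of such a comparison; the data, parameters, and seed are placeholders rather than anything from this issue, and the second call assumes a GPU-enabled build:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data; the datasets reported in this thread are much larger.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(scale=0.1, size=10_000) > 0).astype(int)

def fit(device_type):
    # Identical settings apart from the device, to check whether a failure
    # or accuracy gap is GPU-specific.
    params = {
        "objective": "binary",
        "device_type": device_type,  # "cpu" or "gpu"
        "num_leaves": 63,
        "seed": 1,
        "verbose": -1,
    }
    return lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)

cpu_model = fit("cpu")
gpu_model = fit("gpu")  # requires a GPU build; this is where the reported error appears
print(np.abs(cpu_model.predict(X) - gpu_model.predict(X)).max())
```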
@guolinke correct, stack trace of the error:
|
Just letting you know that I'm unable to reproduce the issue with the dataset originally provided, but it's easily reproducible with the data from #2813 |
@guolinke I'm trying to track down an issue where, after upgrading to the latest master branch in mmlspark, I am seeing a similar error - any recommendations for code/commits I should look into to investigate what might be the root cause? [LightGBM] [Warning] Set TCP_NODELAY failed 20/02/29 00:35:01 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur |
Could you try running it with only one node? |
@guolinke amazing insight! I tried 1 node instead of 2 and almost all of my tests passed (except 1 test that depends on the number of nodes, which is expected). Here is the output from the same test as above (except it was successful): [LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric= |
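For reference, a rough sketch of the single-node check discussed above, assuming the mmlspark Python API; the import path, dataset path, and column names are assumptions and may differ between mmlspark versions:

```python
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier  # import path assumed; varies by version

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("train.parquet")  # placeholder dataset path

# mmlspark runs one LightGBM worker per partition, so coalescing to a single
# partition approximates "1 node" and avoids the distributed code path.
single_partition_df = train_df.coalesce(1)

model = LightGBMClassifier(
    labelCol="label",        # assumed column names
    featuresCol="features",
).fit(single_partition_df)
```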
Note this is from this commit on 2/21 (both failing and successful runs): |
@imatiach-msft you can try the commit (509c2e5) and its parent (bc7bc4a) |
@imatiach-msft could you share the data (and config) with me for debugging? |
@guolinke I'm running the mmlspark scala tests; maybe I can try to create an example that you can easily run? The first test that failed is below, but I tried several others and they failed as well: The compressed file with most datasets used in mmlspark can be found here: |
@shiyu1994 can you help to investigate this too? |
Still happens in version 3.0
|
Ok. |
@shiyu1994 @guolinke FYI my issue was resolved when I upgraded after my fix #3110 , but it sounds like others are still encountering issues similar to what I had |
I have this issue with the CPU learner, not GPU. I got it after upgrading from 2.3.1 to 3.0.0; it makes every test with a tiny testing dataset fail for exactly the same reason: lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 630. |
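For context, a minimal sketch of the kind of tiny-dataset CPU training being described; the data and parameters are synthetic placeholders and are not guaranteed to hit the split that triggers the left_count check:

```python
import numpy as np
import lightgbm as lgb

# Tiny synthetic dataset standing in for the "tiny testing dataset" above.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)

params = {
    "objective": "binary",
    "min_data_in_leaf": 1,  # tiny datasets usually need this lowered to split at all
    "min_data_in_bin": 1,
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```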
@diditforlulz273 could you try the latest master branch? |
@guolinke I have just built it from the latest master branch, and it still fails. I'll try to separate out a minimum reproducible example and create an issue then. |
+1, this bug makes LightGBM on GPU useless. It still happens to me on the latest master. |
Hi, I apologize if my questions are a bit out of bounds. |
It's unfortunate that a known issue of this severity has been left open for over 1.5 years. The error affects every other attempt to train on GPUs when using the latest 'stable' bits in Business Division (Dynamics). I can help with a business case from inside Microsoft to push this if necessary. My alias: dakowalc. Thanks |
Thank you @nightflight-dk. Actually, we have rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated. Refer to PR #4528. |
Great to hear the GPU acceleration is under further development @guolinke. I have just tested the code from PR #4528; unfortunately, it's affected by the same bug, triggering the same assert error in the serial_tree_learner (even in data-parallel execution, device=cuda / device=gpu). |
cc @shiyu1994 for above bug. |
I will double-check that. But the new CUDA tree learners reuse no training logic from the old serial tree learner or the old CUDA tree learner: when a new CUDA tree learner is used, only the initialization code in serial_tree_learner.cpp is executed, and it will not touch the check which raises the error in this issue. Since the errors here come from the source code of the old CUDA tree learner and the training part of the serial tree learner, I think it is not likely that the new CUDA version would result in the same bug. |
@nightflight-dk Thanks for the testing. It would be really appreciated if the error log of the new CUDA version could be provided. :) |
In addition, no distributed training is supported with the new CUDA versions in the PRs so far. So if distributed training is enabled, it will fall back to the old CUDA version. |
@shiyu1994 @guolinke after disabling distributed training (tree_learner: serial), the latest bits from PR #4528 finish training without issues. Moreover, GPU utilization appears dramatically improved (mean up to ca. 50%, from 2%). Well done. |
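For reference, a hedged sketch of the working configuration described above; the data and the remaining parameters are placeholders, and running it assumes a build that includes the new CUDA learner from #4528:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 50))
y = rng.integers(0, 2, size=5_000)

params = {
    "objective": "binary",
    "tree_learner": "serial",  # single-machine training, as in the workaround above
    "device_type": "cuda",     # new CUDA learner; "gpu" selects the older OpenCL learner
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```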
@nightflight-dk Thanks for having a trial. Since #4528 is a very large PR, we plan to decompose it into several parts, and merge them one by one. We expect to finish the merge process by the end of this month. |
Since there hasn't been any activity for a year, I would like to bring this topic up again. I'm on version 3.3.3, Python, training on GPU on Windows. The issue has been bugging me for the past 2 days. The data set is 500k rows with 1500 features. There seems to be some correlation with crashes when
the code is:
With a few changes here and there, the exception is: |
I tried using different split ratios (0.19/0.20/0.21) - that does not seem to fix anything, and I also experimented with the amount of data (600_000/600_001/200_001). Nothing seems to help fix the issue. Can this fix be expected in the next major release? I see that the topic is still active. |
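Since the actual script was not included, here is a purely hypothetical sketch of the sweep over split ratios described above, on a scaled-down synthetic dataset; none of this is the reporter's code:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Scaled-down synthetic stand-in for the ~500k x 1500 dataset described above.
rng = np.random.default_rng(7)
X = rng.normal(size=(50_000, 150)).astype(np.float32)
y = rng.integers(0, 2, size=50_000)

params = {"objective": "binary", "device_type": "gpu", "verbose": -1}

for test_size in (0.19, 0.20, 0.21):  # split ratios the report says were tried
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)
    train_set = lgb.Dataset(X_tr, label=y_tr)
    booster = lgb.train(
        params,
        train_set,
        valid_sets=[lgb.Dataset(X_te, label=y_te, reference=train_set)],
        num_boost_round=100,
    )
```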
I built the docker image with this dockerfile.gpu, and I encountered this issue too.
|
version: 2.3.2
script and pickle file:
lgbm_histbug.zip
@sh1ng I need help seeing if this is fixed in an even later master.