lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704 #2793
Comments
I think the latest master branch will not produce this error anymore. But this is still a potential bug in the GPU learner. ping @huanzhang12 |
On master
|
it is still a GPU bug. |
@guFalcon @huanzhang12 FYI, we are tracking a major accuracy issue with the latest lightgbm compared to before. This is just a heads-up; perhaps it's related to this issue. But we'll post a separate issue once we have a moment to generate an MRE. |
Thanks @pseudotensor, can the accuracy issue be reproduced on CPU? |
BTW, maybe this is related: #2811 |
I think this may be fixed by #2811 too. |
So on the latest master branch, the CPU version is okay, while the GPU version fails? |
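To make the CPU-vs-GPU question above concrete, here is a minimal sketch of such a comparison; the data, parameters, and seed are placeholders rather than anything from this issue, and the second call assumes a GPU-enabled build:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data; the datasets reported in this thread are much larger.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(scale=0.1, size=10_000) > 0).astype(int)

def fit(device_type):
    # Identical settings apart from the device, to check whether a failure
    # or accuracy gap is GPU-specific.
    params = {
        "objective": "binary",
        "device_type": device_type,  # "cpu" or "gpu"
        "num_leaves": 63,
        "seed": 1,
        "verbose": -1,
    }
    return lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)

cpu_model = fit("cpu")
gpu_model = fit("gpu")  # requires a GPU build; this is where the reported error appears
print(np.abs(cpu_model.predict(X) - gpu_model.predict(X)).max())
```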
@guolinke correct, stack trace of the error:
|
Just letting you know that I'm unable to reproduce the issue with the dataset originally provided, but it's easily reproducible with the data from #2813 |
@guolinke I'm trying to track down an issue where, after upgrading to the latest master branch in mmlspark, I am seeing a similar error - any recommendations for code/commits I should look into to investigate what might be the root cause? [LightGBM] [Warning] Set TCP_NODELAY failed 20/02/29 00:35:01 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur |
Could you try running it with only one node? |
@guolinke amazing insight! I tried 1 node instead of 2 and almost all of my tests passed (except 1 test that depends on the number of nodes, which is expected). Here is the output from the same test as above (except it was successful): [LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric= |
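For reference, a rough sketch of the single-node check discussed above, assuming the mmlspark Python API; the import path, dataset path, and column names are assumptions and may differ between mmlspark versions:

```python
from pyspark.sql import SparkSession
from mmlspark.lightgbm import LightGBMClassifier  # import path assumed; varies by version

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("train.parquet")  # placeholder dataset path

# mmlspark runs one LightGBM worker per partition, so coalescing to a single
# partition approximates "1 node" and avoids the distributed code path.
single_partition_df = train_df.coalesce(1)

model = LightGBMClassifier(
    labelCol="label",        # assumed column names
    featuresCol="features",
).fit(single_partition_df)
```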
Note this is from this commit on 2/21 (both failing and successful runs): |
@imatiach-msft you can try the commit (509c2e5) and its parent (bc7bc4a) |
@imatiach-msft could you share the data (and config) with me for debugging? |
@guolinke I'm running the mmlspark scala tests; maybe I can try to create an example that you can easily run? The first test that failed is below, but I tried several others and they failed as well: The compressed file with most datasets used in mmlspark can be found here: |
@shiyu1994 can you help to investigate this too? |
Still happens in version 3.0
|
Ok. |
@shiyu1994 @guolinke FYI my issue was resolved when I upgraded after my fix #3110 , but it sounds like others are still encountering issues similar to what I had |
I have this issue with the CPU learner, not GPU. I got it after upgrading from 2.3.1 to 3.0.0; it makes every test with a tiny testing dataset fail for exactly the same reason: lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 630. |
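For context, a minimal sketch of the kind of tiny-dataset CPU training being described; the data and parameters are synthetic placeholders and are not guaranteed to hit the split that triggers the left_count check:

```python
import numpy as np
import lightgbm as lgb

# Tiny synthetic dataset standing in for the "tiny testing dataset" above.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)

params = {
    "objective": "binary",
    "min_data_in_leaf": 1,  # tiny datasets usually need this lowered to split at all
    "min_data_in_bin": 1,
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```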
@diditforlulz273 could you try the latest master branch? |
@guolinke I have just built it from the latest master branch, and it still fails. I'll try to separate out a minimum reproducible example and create an issue then. |
+1, this bug makes LightGBM on GPU useless. It still happens to me on the latest master. |
Hi, I apologize if my questions are a bit out of bounds. |
It's unfortunate that a known issue of this severity has been left open for over 1.5 years. The error affects every other attempt to train on GPUs when using the latest 'stable' bits in Business Division (Dynamics). I can help with a business case from inside Microsoft to push this if necessary. My alias: dakowalc. Thanks |
Thank you @nightflight-dk. Actually, we have rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated. Refer to PR #4528. |
Great to hear the GPU acceleration is under further development @guolinke. I have just tested the code from PR #4528; unfortunately, it's affected by the same bug, triggering the same assert error in the serial_tree_learner (even in data-parallel execution, device=cuda / device=gpu). |
cc @shiyu1994 for above bug. |
I will double-check that. But the new CUDA tree learners reuse no training logic from the old serial tree learner or the old CUDA tree learner: when a new CUDA tree learner is used, only the initialization code in serial_tree_learner.cpp is executed, and it will not touch the check which raises the error in this issue. Since the errors here come from the source code of the old CUDA tree learner and the training part of the serial tree learner, I think it is not likely that the new CUDA version would result in the same bug. |
@nightflight-dk Thanks for the testing. It would be really appreciated if the error log of the new CUDA version could be provided. :) |
In addition, no distributed training is supported with the new CUDA versions in the PRs so far. So if distributed training is enabled, it will fall back to the old CUDA version. |
@shiyu1994 @guolinke after disabling distributed training (tree_learner: serial), the latest bits from PR #4528 finish training without issues. Moreover, GPU utilization appears dramatically improved (mean up to ca. 50%, from 2%). Well done. |
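For reference, a hedged sketch of the working configuration described above; the data and the remaining parameters are placeholders, and running it assumes a build that includes the new CUDA learner from #4528:

```python
import numpy as np
import lightgbm as lgb

# Placeholder data standing in for the real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 50))
y = rng.integers(0, 2, size=5_000)

params = {
    "objective": "binary",
    "tree_learner": "serial",  # single-machine training, as in the workaround above
    "device_type": "cuda",     # new CUDA learner; "gpu" selects the older OpenCL learner
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```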
@nightflight-dk Thanks for having a trial. Since #4528 is a very large PR, we plan to decompose it into several parts, and merge them one by one. We expect to finish the merge process by the end of this month. |
Since there hasn't been any activity for a year, I would like to bring this topic up again. I'm on version 3.3.3, Python, training on GPU on Windows. The issue has been bugging me for the past 2 days. The data set is 500k rows with 1500 features. There seems to be some correlation with crashes when
the code is:
With a few changes here and there, the exception is: |
I tried using different split ratios (0.19/0.20/0.21) - that does not seem to fix anything, and I also experimented with the amount of data (600_000/600_001/200_001). Nothing seems to help fix the issue. Can this fix be expected in the next major release? I see that the topic is still active. |
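Since the actual script was not included, here is a purely hypothetical sketch of the sweep over split ratios described above, on a scaled-down synthetic dataset; none of this is the reporter's code:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Scaled-down synthetic stand-in for the ~500k x 1500 dataset described above.
rng = np.random.default_rng(7)
X = rng.normal(size=(50_000, 150)).astype(np.float32)
y = rng.integers(0, 2, size=50_000)

params = {"objective": "binary", "device_type": "gpu", "verbose": -1}

for test_size in (0.19, 0.20, 0.21):  # split ratios the report says were tried
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)
    train_set = lgb.Dataset(X_tr, label=y_tr)
    booster = lgb.train(
        params,
        train_set,
        valid_sets=[lgb.Dataset(X_te, label=y_te, reference=train_set)],
        num_boost_round=100,
    )
```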
I built the docker image with this dockerfile.gpu, and I encountered this issue too.
|
version: 2.3.2
script and pickle file:
lgbm_histbug.zip
@sh1ng I need help seeing if this is fixed in an even later master.