Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task #3679
you can use larger
I use
Do you mean specifying different weights for different samples? If so, I do not use this in the example we are discussing here. Thank you very much for the rapid reply!
If no sample weight and with
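For context on the "sample weight" question above: per-sample weights in the LightGBM Python API are passed when constructing the dataset. A minimal sketch with made-up data, not code from this thread:

```python
import numpy as np
import lightgbm as lgb

# Illustration of per-sample weights; values are arbitrary, not from this thread.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)
w = np.random.rand(1000)  # one non-negative weight per sample

train_set = lgb.Dataset(X, label=y, weight=w)
booster = lgb.train({"objective": "regression", "verbosity": -1},
                    train_set, num_boost_round=10)
```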
Although I didn't conduct the same experiment with a smaller fraction of my dataset, I tried bagging fraction before (perhaps with a different set of features) and if I remember correctly it did not result in the above exception. Will try it again, thank you.
I appreciate the suggestion! I'm already using this param since your helpful advice in #3654 :)
Interesting, I guess there may be a bug.
Yes, I do use missing value handling. Will try with
I tried running the same script as before (so without that setting). Any other suggestions in relation to how this can be fixed are very welcome.
@ch3rn0v did you use categorical features?
@guolinke, nope, all features are numerical.
Is it possible to provide a reproducible example, with a subset of features (or even a subset of rows), so that we can debug with it?
I'll start an internal discussion about this, but I doubt any particular data or even a piece of it will be shared. In the meantime I ran a few other tests.
Interestingly, the error still happens even with
A potential bug in histogram offset assignment may cause this error. I will create a PR for this.
@ch3rn0v Can you please try https://github.com/shiyu1994/LightGBM/tree/fix-3679 to see if the same error occurs?
Hello @shiyu1994, appreciate your rapid response! Do I understand it correctly that the only way to try this version is to do this: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux? And if so, will I be able to remove this temporary version later? Will it result in any conflict if another version is already installed? Thanks in advance.
@ch3rn0v Yes, you have to install the python package by building from the source files as in the link. If you are using the Python API, you can create a new conda environment and install the package from the branch there, so it won't conflict with your existing installation. You may also install the python package from the branch directly. If you want to recover a standard released python package of LightGBM afterwards, just reinstall the released version with pip.
I tried a few different ways to install this version in a new conda env. Alas, none of them worked. For instance,
And yes, I did. I also don't happen to have ... Would you please suggest any other steps I can take right now, or should I just obtain ...
Can you please install
The steps to install the python package from source code are:
While I'd be able to test this locally, it'll only make sense to run the experiment on a remote machine that has enough processing power, and I'm unable to install it there. I can still run tests locally, but I don't have a dataset tiny enough that still allows me to reproduce the bug with the current (3.1.1) release. Apologies for not returning to you with more meaningful feedback.
The last step should be
if you'd like the Python package installation to pick up the already compiled dynamic library. @shiyu1994 Could you please transfer your changes from your fork to this repository? I believe you have enough rights to do this as a collaborator. Then we can trigger Azure Pipelines to build a Python wheel file with your changes. And after that @ch3rn0v will be able to install the patched version with a simple
Another option will be to simply find the current LightGBM installation folder and rename
Same issue occurred in GPU LightGBM. In my case, if I do not use both ..., the error does not occur. Hope this bug gets fixed soon.
#3694 is opened to potentially fix these errors, but it is only related to the CPU version. We need further investigation if the errors are not fully eliminated after this PR is merged.
@shiyu1994 Just for your information: I also got a similar error, then I googled and found this issue. I am using LightGBM 3.1.1 (the version that I can install with "pip3 install lightgbm"). I got the following error at some point:
I saw #3694 had been merged. Therefore, I compiled the latest version from GitHub master and it currently works. My data is also private and cannot be shared. Sorry about that.
Hi @guolinke, I hit the same problem:
Happens when trying to use
Hi @pseudotensor, are you using the released version of LightGBM or building from source?
Building from source like:
(the --gpu etc. options aren't really needed since --precompile is there). Note that I only started seeing this problem when upgrading from 2.2.4 to master. I'm trying to repro the event seen in our Jenkins testing, but so far no luck.
For me, updating
I'm using 3.2.1, and still get this error, but it only seems to happen when bagging_fraction < 1.0.
@mshivers Thanks for using LightGBM. Are you using the CPU for training when encountering the error? It would be really appreciated if you could provide a reproducible example for the error.
Hi @shiyu1994, I'm using CPUs. I've managed to reproduce the error just using randomly generated data. I'm on a corporate network that restricts data upload; however, when I run the script below, it usually only takes a few minutes before it throws the error:
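The reproducer script itself was not preserved in this extract. The sketch below only illustrates the kind of setup described (randomly generated regression data with bagging_fraction < 1.0); the parameter values are guesses, not the ones actually used:

```python
import numpy as np
import lightgbm as lgb

# Illustrative sketch only; the original reproducer script is not preserved in this thread.
# Per the discussion, the key ingredient is bagging_fraction < 1.0 on random data.
rng = np.random.RandomState(0)
X = rng.normal(size=(100_000, 50))
y = rng.normal(size=100_000)

params = {
    "objective": "regression",
    "bagging_fraction": 0.5,   # error was reported only with bagging_fraction < 1.0
    "bagging_freq": 1,
    "num_leaves": 255,
    "min_data_in_leaf": 1,
    "verbosity": -1,
}

# Re-run training with different seeds until the check fails
# ("Check failed: (best_split_info.right_count) > (0) ...").
for seed in range(100):
    params["seed"] = seed
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
```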
@mshivers Thanks! Given that reproducible example, we should reopen this issue. I'll investigate it further in the next few days.
Same bug for 3.2.1.
Possibly this is related to an ill-conditioned problem.
What is the status of this bug?
Probably with some
What can be behind such a difference?
@kabartay We are investigating this bug. Progress will be posted once we have findings. Thanks for your patience!
Please don't close this issue. I tried all the solutions mentioned here, but none of them worked. It is very interesting that the error only happens when my GPU (A100-40G) loads more than 17 GB of memory. For more details, please check #4946.
Hello again. I happened to stumble upon this issue again. This time it's LightGBM v3.3.1 (OS is Ubuntu 20.04.3 LTS). I use the data from this competition. At the moment the data can be accessed here: https://drive.google.com/drive/folders/1pJgHq-xo0LNCmVxEWmnGX4zMBv8t48VX (see data_training.zip). I can't share the specific preprocessing steps and features I make, but at least the source data is public. One of the classes is extremely rare, perhaps this can be part of the reason. I can also say that another pipeline applies different preprocessing steps and computes different features, and there the error doesn't occur. When both sets of features are combined, the error does happen again. UPD: This time the approach used in #3603 (increasing min_child_weight to a positive number) worked.
Clear split info buffer in cegb_ before every iteration (fix partially #3679) (#5164)
* clear split info buffer in cegb_ before every iteration
* check nullable of cegb_ in serial_tree_learner.cpp
* add a test case for checking the split buffer in CEGB
* switch to Threading::For instead of raw OpenMP
* apply review suggestions
* apply review comments
* remove device cpu
I set
A similar error happens when I run the GPU build, while it works fine on CPU. I tried different envs and LightGBM versions. So confused.
UPD: Setting min_child_weight to 1 solved the problems, for both left_count and right_count errors.
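For readers skimming to the end: the workaround in the last two comments amounts to raising min_child_weight (an alias of min_sum_hessian_in_leaf) above its default. A minimal sketch using the value quoted above:

```python
import lightgbm as lgb

# Workaround sketch based on the comment above: raise min_child_weight
# (alias of min_sum_hessian_in_leaf) so that near-empty splits are rejected earlier.
params = {
    "objective": "regression",
    "min_child_weight": 1,  # value reported above; tune for your own data
}
# booster = lgb.train(params, train_set)  # train_set is your own lgb.Dataset
```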
How are you using LightGBM?
Environment info
Steps to reproduce
`Check failed: (best_split_info.right_count) > (0) at [...]`

Sometimes it says `left_count` instead of `right_count`. Other times it doesn't occur at all, depending on the features I use.
Other details
Apparently this is the start of the piece of code initiating the exception: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L652.
I tried setting `min_data_in_leaf` to a value greater than zero. It helps sometimes, but not reliably. Same with `feature_fraction`. I also tried changing `min_sum_hessian_in_leaf`, to no avail. Also tried setting `min_data_in_leaf` and `min_sum_hessian_in_leaf` simultaneously, no difference.

This (or a similar) issue is mentioned a few times here:

None of them suggests an approach that allowed me to avoid these exceptions. Would you please share any ideas on how to fix this, or at least explain why this issue happens at all? If I understand correctly, one could simply trim the split leading to this error and stop branching further. Please correct me if I'm wrong. Thank you.
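For concreteness, the mitigations listed above correspond to parameters along these lines; the values below are illustrative placeholders, not the configuration actually used in the report:

```python
import lightgbm as lgb

# Sketch of the mitigations described above; values are placeholders, not the reporter's.
params = {
    "objective": "regression",
    "min_data_in_leaf": 20,            # "helps sometimes, but not reliably"
    "feature_fraction": 0.8,           # same observation as min_data_in_leaf
    "min_sum_hessian_in_leaf": 1e-2,   # tried "to no avail" per the report
}
# booster = lgb.train(params, lgb.Dataset(X, label=y))
```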