Avoiding Exception "Check failed: (best_split_info.right_count) > (0) at ..." with a regression task #3679
you can use larger
I use
Do you mean specifying different weights for different samples? If so, I do not use this in the example we are discussing here. Thank you very much for the rapid reply!
If no sample weight and with
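For context on the "sample weight" question above: per-sample weights in the LightGBM Python API are passed when constructing the dataset. A minimal sketch with made-up data, not code from this thread:

```python
import numpy as np
import lightgbm as lgb

# Illustration of per-sample weights; values are arbitrary, not from this thread.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)
w = np.random.rand(1000)  # one non-negative weight per sample

train_set = lgb.Dataset(X, label=y, weight=w)
booster = lgb.train({"objective": "regression", "verbosity": -1},
                    train_set, num_boost_round=10)
```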
Although I didn't conduct the same experiment with a smaller fraction of my dataset, I tried bagging fraction before (perhaps with a different set of features) and if I remember correctly it did not result in the above exception. Will try it again, thank you.
I appreciate the suggestion! I'm already using this param since your helpful advice in #3654 :)
Interesting, I guess there may be a bug.
Yes, I do use missing value handling. Will try with
I tried running the same script as before (so without that setting). Any other suggestions in relation to how this can be fixed are very welcome.
@ch3rn0v did you use categorical features?
@guolinke, nope, all features are numerical.
Is it possible to provide a reproducible example, with a subset of features (or even a subset of rows), so that we can debug with it?
I'll start an internal discussion about this, but I doubt any particular data or even a piece of it will be shared. In the meantime I ran a few other tests.
Interestingly, the error still happens even with
A potential bug in histogram offset assignment may cause this error. I will create a PR for this.
@ch3rn0v Can you please try https://github.com/shiyu1994/LightGBM/tree/fix-3679 to see if the same error occurs?
Hello @shiyu1994, appreciate your rapid response! Do I understand it correctly that the only way to try this version is to do this: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux? And if so, will I be able to remove this temporary version later? Will it result in any conflict if another version is already installed? Thanks in advance.
@ch3rn0v Yes, you have to install the python package by building from the source files as in the link. If you are using the Python API, you can create a new conda environment and install the package from the branch there, so it won't conflict with your existing installation. You may also install the python package from the branch directly. If you want to recover a standard released python package of LightGBM afterwards, just reinstall the released version with pip.
I tried a few different ways to install this version in a new conda env. Alas, none of them worked. For instance,
And yes, I did. I also don't happen to have ... Would you please suggest any other steps I can take right now, or should I just obtain ...
Can you please install
The steps to install the python package from source code are:
While I'd be able to test this locally, it'll only make sense to run the experiment on a remote machine that has enough processing power, and I'm unable to install it there. I can still run tests locally, but I don't have a dataset tiny enough that still allows me to reproduce the bug with the current (3.1.1) release. Apologies for not returning to you with more meaningful feedback.
The last step should be
if you'd like the Python package installation to pick up the already compiled dynamic library. @shiyu1994 Could you please transfer your changes from your fork to this repository? I believe you have enough rights to do this as a collaborator. Then we can trigger Azure Pipelines to build a Python wheel file with your changes. And after that @ch3rn0v will be able to install the patched version with a simple
Another option will be to simply find the current LightGBM installation folder and rename
Same issue occurred in GPU LightGBM. In my case, if I do not use both ..., the error does not occur. Hope this bug gets fixed soon.
#3694 is opened to potentially fix these errors, but it is only related to the CPU version. We need further investigation if the errors are not fully eliminated after this PR is merged.
@shiyu1994 Just for your information: I also got a similar error, then I googled and found this issue. I am using LightGBM 3.1.1 (the version that I can install with "pip3 install lightgbm"). I got the following error at some point:
I saw #3694 had been merged. Therefore, I compiled the latest version from GitHub master and it currently works. My data is also private and cannot be shared. Sorry about that.
Hi @guolinke, I hit the same problem:
Happens when trying to use
Hi @pseudotensor, are you using the released version of LightGBM or building from source?
Building from source like:
(the --gpu etc. options aren't really needed since --precompile is there). Note that I only started seeing this problem when upgrading from 2.2.4 to master. I'm trying to repro the event seen in our Jenkins testing, but so far no luck.
For me, updating
I'm using 3.2.1, and still get this error, but it only seems to happen when bagging_fraction < 1.0.
@mshivers Thanks for using LightGBM. Are you using the CPU for training when encountering the error? It would be really appreciated if you could provide a reproducible example for the error.
Hi @shiyu1994, I'm using CPUs. I've managed to reproduce the error just using randomly generated data. I'm on a corporate network that restricts data upload; however, when I run the script below, it usually only takes a few minutes before it throws the error:
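The reproducer script itself was not preserved in this extract. The sketch below only illustrates the kind of setup described (randomly generated regression data with bagging_fraction < 1.0); the parameter values are guesses, not the ones actually used:

```python
import numpy as np
import lightgbm as lgb

# Illustrative sketch only; the original reproducer script is not preserved in this thread.
# Per the discussion, the key ingredient is bagging_fraction < 1.0 on random data.
rng = np.random.RandomState(0)
X = rng.normal(size=(100_000, 50))
y = rng.normal(size=100_000)

params = {
    "objective": "regression",
    "bagging_fraction": 0.5,   # error was reported only with bagging_fraction < 1.0
    "bagging_freq": 1,
    "num_leaves": 255,
    "min_data_in_leaf": 1,
    "verbosity": -1,
}

# Re-run training with different seeds until the check fails
# ("Check failed: (best_split_info.right_count) > (0) ...").
for seed in range(100):
    params["seed"] = seed
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
```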
@mshivers Thanks! Given that reproducible example, we should reopen this issue. I'll investigate it further in the next few days.
Same bug for 3.2.1.
Possibly this is related to an ill-conditioned problem.
What is the status of this bug?
Probably with some
What can be behind such a difference?
@kabartay We are investigating this bug. Progress will be posted once we have findings. Thanks for your patience!
Please don't close this issue. I tried all the solutions mentioned here, but none of them worked. It is very interesting that the error only happens when my GPU (A100-40G) loads more than 17 GB of memory. For more details, please check #4946.
Hello again. I happened to stumble upon this issue again. This time it's LightGBM v3.3.1 (OS is Ubuntu 20.04.3 LTS). I use the data from this competition. At the moment the data can be accessed here: https://drive.google.com/drive/folders/1pJgHq-xo0LNCmVxEWmnGX4zMBv8t48VX (see data_training.zip). I can't share the specific preprocessing steps and features I make, but at least the source data is public. One of the classes is extremely rare, perhaps this can be part of the reason. I can also say that another pipeline applies different preprocessing steps and computes different features, and there the error doesn't occur. When both sets of features are combined, the error does happen again. UPD: This time the approach used in #3603 (increasing min_child_weight to a positive number) worked.
Clear split info buffer in cegb_ before every iteration (fix partially #3679) (#5164)
* clear split info buffer in cegb_ before every iteration
* check nullable of cegb_ in serial_tree_learner.cpp
* add a test case for checking the split buffer in CEGB
* switch to Threading::For instead of raw OpenMP
* apply review suggestions
* apply review comments
* remove device cpu
I set
A similar error happens when I run the GPU build, while it works fine on CPU. I tried different envs and LightGBM versions. So confused.
UPD: Setting min_child_weight to 1 solved the problems, for both left_count and right_count errors.
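For readers skimming to the end: the workaround in the last two comments amounts to raising min_child_weight (an alias of min_sum_hessian_in_leaf) above its default. A minimal sketch using the value quoted above:

```python
import lightgbm as lgb

# Workaround sketch based on the comment above: raise min_child_weight
# (alias of min_sum_hessian_in_leaf) so that near-empty splits are rejected earlier.
params = {
    "objective": "regression",
    "min_child_weight": 1,  # value reported above; tune for your own data
}
# booster = lgb.train(params, train_set)  # train_set is your own lgb.Dataset
```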
How are you using LightGBM?
Environment info
Steps to reproduce
`Check failed: (best_split_info.right_count) > (0) at [...]`

Sometimes it says `left_count` instead of `right_count`. Other times it doesn't occur at all, depending on the features I use.
Other details
Apparently this is the start of the piece of code initiating the exception: https://github.com/microsoft/LightGBM/blob/master/src/treelearner/serial_tree_learner.cpp#L652.
I tried setting `min_data_in_leaf` to a value greater than zero. It helps sometimes, but not reliably. Same with `feature_fraction`. I also tried changing `min_sum_hessian_in_leaf`, to no avail. Also tried setting `min_data_in_leaf` and `min_sum_hessian_in_leaf` simultaneously, no difference.

This (or a similar) issue is mentioned a few times here:

None of them suggests an approach that allowed me to avoid these exceptions. Would you please share any ideas on how to fix this, or at least explain why this issue happens at all? If I understand correctly, one could simply trim the split leading to this error and stop branching further. Please correct me if I'm wrong. Thank you.
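For concreteness, the mitigations listed above correspond to parameters along these lines; the values below are illustrative placeholders, not the configuration actually used in the report:

```python
import lightgbm as lgb

# Sketch of the mitigations described above; values are placeholders, not the reporter's.
params = {
    "objective": "regression",
    "min_data_in_leaf": 20,            # "helps sometimes, but not reliably"
    "feature_fraction": 0.8,           # same observation as min_data_in_leaf
    "min_sum_hessian_in_leaf": 1e-2,   # tried "to no avail" per the report
}
# booster = lgb.train(params, lgb.Dataset(X, label=y))
```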