Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Creating multiple boosters from file results in broken scores. #3778

Closed
AlbertoEAF opened this issue Jan 17, 2021 · 19 comments · Fixed by #4056
Closed

[bug] Creating multiple boosters from file results in broken scores. #3778

AlbertoEAF opened this issue Jan 17, 2021 · 19 comments · Fixed by #4056
Labels

Comments

@AlbertoEAF
Copy link
Contributor

AlbertoEAF commented Jan 17, 2021

Using the master branch, if I create 2 boosters from file using the C API, I get broken scores from all but the first created booster:

BoosterHandle bh1, bh2;  

LGBM_BoosterCreateFromModelfile("./LightGBM_model.txt", &num_iterations, &bh1); // comment this & you get good results
LGBM_BoosterCreateFromModelfile("./LightGBM_model.txt", &num_iterations, &bh2);

// Score is only correct if I predict with bh1, or predict with bh2 but don't initialize bh1 before initializing bh2:
predict(bh2, values             , NUM_FEATURES, num_iterations, &out_len, &ref_scores[0]); // score changes depending on bh2 being the first Booster created or not.


Meaning, only the first booster created works properly.

I'm on Ubuntu 18.04, x86_64.

Thank you

@AlbertoEAF AlbertoEAF changed the title Creating multiple boosters from file results in broken scores. [bug] Creating multiple boosters from file results in broken scores. Jan 18, 2021
@StrikerRUS StrikerRUS added the bug label Jan 18, 2021
@shiyu1994
Copy link
Collaborator

shiyu1994 commented Feb 19, 2021

@AlbertoEAF It seems that the predict function is not from the C API ? I failed to reproduce the problem with LGBM_BoosterPredictForFile. Loading two models with LGBM_BoosterCreateFromModelfile and then predict with LGBM_BoosterPredictForFile, the two boosters produce consistent results. So could you provide more details about predict function ? Or provide a complete reproducible example ? Thanks.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Feb 19, 2021

Yes @shiyu1994 it's from this test code: https://gist.github.com/AlbertoEAF/5972db15a27c294bab65b97e1bc4c315 :)

If you modify that gist like the snippet above this happens.

@shiyu1994
Copy link
Collaborator

Sorry @AlbertoEAF , I still cannot reproduce the error. At least, for the attached LightGBM_model.txt file and the following code snippet, using the first and the second booster handle produce exactly the same result.

#include <iostream>
#include <stdio.h>
#include <math.h>
#include <vector>
#include <thread>
#include <ctime>
#include <cstring>
#include "LightGBM/c_api.h"

using namespace std;

#define FMT_HEADER_ONLY
#include "LightGBM/../../external_libs/fmt/include/fmt/format.h"

inline void predict(BoosterHandle handle,
            const void* data,            
            int32_t ncol,                                    
            int num_iterations,            
            int64_t* out_len,
            double* out_result) {
  if (0 != LGBM_BoosterPredictForMatSingleRow(
    handle, 
    data, 
    C_API_DTYPE_FLOAT64,
    ncol, 
    1, // is_row_major
    C_API_PREDICT_NORMAL, 
    0, // start_iteration
    num_iterations, 
    "", 
    out_len, 
    out_result)) {
      throw std::exception();
  }
}

int main(int /*argc*/, char** /*argv*/) {
  BoosterHandle boosterHandle1, boosterHandle2;
  int num_iterations;
  LGBM_BoosterCreateFromModelfile("./LightGBM_model.txt", &num_iterations, &boosterHandle1);
  LGBM_BoosterCreateFromModelfile("./LightGBM_model.txt", &num_iterations, &boosterHandle2);
  cout << "Model iterations " << num_iterations<< "\n";

  double values[] = { 
      0.0, 45.44200,-30.74976,31.78587,4.63569,-15.14894,0.23370,-11.97968,-9.59708,6.48111,-8.89073,4.02405,-2.28873,17.90204,1377.12225,
      1762.04257,947.34430,562.28140,524.26811,361.25410,514.53507,247.70126,399.78880,205.36521,211.86870,44.88971,-299.63923,-227.64169,
      7.34210,-85.80057,16.81211,-41.18581,-12.00858,-41.92117,-26.52312,-12.19765,89.61104,-266.07834,-230.31223,-168.29609,38.40371,31.84016,
      28.98071,-98.73447,26.46565,-30.23272,-1.34637,253.36945,61.73424,16.15765,185.24868,75.57244,55.17367,-41.08869,15.86438,16.03692,-142.57968,
      59.74796,-151.46761,-12.70612,-104.99242,23.10514,47.00179,-13.59894,-79.19045,-27.63745,38.94605,-55.40585,54.73527,15.10521,-3.79596,390.47720,
      17.64018,-68.64827,-62.43611,-31.96095,33.90780,-181.37609,139.84078,-129.48841,76.23826,-8.84046,-0.15439,137.44210,77.54739,-4.22875,-61.92657,
      -33.52722,-3.86253,36.42400,7.17309,
      0.0, 45.44200,-30.74976,31.78587,4.63569,-15.14894,0.23370,-11.97968,-9.59708,6.48111,-8.89073,4.02405,-2.28873,17.90204,1377.12225,
      1762.04257,947.34430,562.28140,524.26811,361.25410,514.53507,247.70126,399.78880,205.36521,211.86870,44.88971,-299.63923,-227.64169,
      7.34210,-85.80057,16.81211,-41.18581,-12.00858,-41.92117,-26.52312,-12.19765,89.61104,-266.07834,-230.31223,-168.29609,38.40371,31.84016,
      28.98071,-98.73447,26.46565,-30.23272,-1.34637,253.36945,61.73424,16.15765,185.24868,75.57244,55.17367,-41.08869,15.86438,16.03692,-142.57968,
      59.74796,-151.46761,-12.70612,-104.99242,23.10514,47.00179,-13.59894,-79.19045,-27.63745,38.94605,-55.40585,54.73527,15.10521,-3.79596,390.47720,
      17.64018,-68.64827,-62.43611,-31.96095,33.90780,-181.37609,139.84078,-129.48841,76.23826,-8.84046,-0.15439,137.44210,77.54739,-4.22875,-61.92657,
      -33.52722,-3.86253,36.42400,7.17309
  };
  const size_t NROWS=2;
  const int NUM_FEATURES = 91;
  double ref_scores[NUM_FEATURES * NROWS];

  int64_t dummy_out_len;

  predict(boosterHandle1, values, NUM_FEATURES, num_iterations, &dummy_out_len, &ref_scores[0]);
  fmt::print("Ref scores: {:.6g}\n", ref_scores[0]);
  predict(boosterHandle2, values, NUM_FEATURES, num_iterations, &dummy_out_len, &ref_scores[0]);
  fmt::print("Ref scores: {:.6g}\n", ref_scores[0]);
}

Could you please provide the LightGBM_model.txt file which produces the error? Thanks!
LightGBM_model.txt

@shiyu1994
Copy link
Collaborator

@AlbertoEAF Could you please recheck this issue and provide a reproducible case? We need to fix this before next release. Thank you! :)

@guolinke
Copy link
Collaborator

gentle ping @AlbertoEAF , for this issue blocks the release of 3.2.0.

@AlbertoEAF
Copy link
Contributor Author

Hi there :),

Unfortunately I cannot provide that model file as it is sensitive, neither do I have much time on my hands to try to produce a simpler example this week. I can try it in the weekend though since it's urgent.

@AlbertoEAF
Copy link
Contributor Author

Hello @shiyu1994 and @guolinke I managed to reproduce it. I created a new branch https://github.com/feedzai/LightGBM/tree/test-double-model-load which you can use (has the model and the profile code).

To test:

# Compile
mkdir build && cd build && cmake -DBUILD_PROFILING_TESTS=ON

# Go to main repo folder & run
cd ..
./lightgbm_profile_single_row_predict

you should get the following outputs, when they should be the same for both boosterHandles 1 & 2:

image

and if you comment the creation of boosterHandle1 and its predictions, boosterHandle2 starts spitting the results of boosterHandle1 in the screenshot above.

@shiyu1994
Copy link
Collaborator

@AlbertoEAF Thanks for your effort. It seems that the file profiling/profile_single_row_predict.cpp is missing from the branch. Can you please upload it ?

@AlbertoEAF
Copy link
Contributor Author

Done! Sorry for that @shiyu1994 :)

@shiyu1994
Copy link
Collaborator

shiyu1994 commented Mar 9, 2021

@AlbertoEAF I think I found the problem. Is your model file produced by an earlier version of LightGBM (earlier than the merge of linear_tree in #3299)? If so, the field is_linear= is missing from the model file. And when loaded by the latest version of LightGBM, the is_linear_ member of tree will be undefined, which causes undefined prediction behavior.

To fix this, let's set the is_linear_ filed to false when it is absent from the model file, for backward compatibility.

I'll open a PR for this.

@AlbertoEAF
Copy link
Contributor Author

Wow, great catch @shiyu1994

Yes the model is older than that.

Thank you :)

@cyfdecyf
Copy link
Contributor

I've also been biten by this bug when loading a model file without is_linear= set with the latest master code.

Since there's version= in the model file, I'd like to suggest adding a minior version number which increases whenever the model file format changes. When loading model file and found version mismatch, print a warning message to users to indicate it's possible to get wrong result.

@shiyu1994
Copy link
Collaborator

@cyfdecyf Thanks for the information. Does the error still happens after #4056 is merged ? I thought it should be already solved.

@cyfdecyf
Copy link
Contributor

@shiyu1994 The error is fixed now. I'm planning to send a PR this morning and found that you've already fixed the problem.

I noticed predict result would change if I specify max_bin or label_column during prediction which is not expected. It took me several days to realize this is a memory related bug. I've tried both AddressSanitize and Valgrind to catch memory related errors on Linux but with no luck.

  • For AddressSanitize, it's weird prediction result becomes all 0. But I remember it can't find uninitialized read so won't be helpful in this case.
  • For valgrind, it didn't catch the uninitialized read either. Maybe I'm using it in the wrong way.

@shiyu1994
Copy link
Collaborator

@cyfdecyf How did you specify max_bin or label_column when prediction? Is the inconsistency of prediction results related to this bug?

@cyfdecyf
Copy link
Contributor

@shiyu1994 The inconsistency is definitely related to this bug. I tried the following cases for my prediction:

  1. Specify no max_bin, max_bin=63, max_bin=255 in config, the latter two has same results. It only matters whether you specify max_bin or not.
  2. Specify label_column on cmdline, in config, on cmdline and also in config, the result differs.

@shiyu1994
Copy link
Collaborator

@cyfdecyf Could you provide a reproducible example? Thanks.

StrikerRUS added a commit that referenced this issue Mar 25, 2021
* [docs]Add alt text on images

* Update docs/GPU-Windows.rst

Co-authored-by: James Lamb <[email protected]>

* Update docs/GPU-Windows.rst

Co-authored-by: James Lamb <[email protected]>

* Apply suggestions from code review

Co-authored-by: James Lamb <[email protected]>

* Apply suggestions from code review

Co-authored-by: James Lamb <[email protected]>

* Merge main branch commit updates (#1)

* [docs] Add alt text to image in Parameters-Tuning.rst (#4035)

* [docs] Add alt text to image in Parameters-Tuning.rst

Add alt text to Leaf-wise growth image, as part of #4028

* Update docs/Parameters-Tuning.rst

Co-authored-by: James Lamb <[email protected]>

Co-authored-by: James Lamb <[email protected]>

* [ci] [R-package] upgrade to R 4.0.4 in CI (#4042)

* [docs] update description of deterministic parameter (#4027)

* update description of deterministic parameter to require using with force_row_wise or force_col_wise

* Update include/LightGBM/config.h

Co-authored-by: Nikita Titov <[email protected]>

* update docs

Co-authored-by: Nikita Titov <[email protected]>

* [dask] Include support for init_score (#3950)

* include support for init_score

* use dataframe from init_score and test difference with and without init_score in local model

* revert refactoring

* initial docs. test between distributed models with and without init_score

* remove ranker from tests

* test value for root node and change docs

* comma

* re-include parametrize

* fix incorrect merge

* use single init_score and the booster_ attribute

* use np.float64 instead of float

* [ci] ignore untitle Jupyter notebooks in .gitignore (#4047)

* [ci] prevent getting incompatible dask and distributed versions (#4054)

* [ci] prevent getting incompatible dask and distributed versions

* Update .ci/test.sh

Co-authored-by: Nikita Titov <[email protected]>

* empty commit

Co-authored-by: Nikita Titov <[email protected]>

* [ci] fix R CMD CHECK note about example timings (fixes #4049) (#4055)

* [ci] fix R CMD CHECK note about example timings (fixes #4049)

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* empty commit

Co-authored-by: Nikita Titov <[email protected]>

* [ci] add CMake + R 3.6 test back (fixes #3469) (#4053)

* [ci] add CMake + R 3.6 test back (fixes #3469)

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* Update .ci/test_r_package_windows.ps1

* -Wait and remove rtools40

* empty commit

Co-authored-by: Nikita Titov <[email protected]>

* [dask] include multiclass-classification task in tests (#4048)

* include multiclass-classification task and task_to_model_factory dicts

* define centers coordinates. flatten init_scores within each partition for multiclass-classification

* include issue comment and fix linting error

* Update index.rst (#4029)

Add alt text to logo image

Co-authored-by: James Lamb <[email protected]>

* [dask] raise more informative error for duplicates in 'machines' (fixes #4057) (#4059)

* [dask] raise more informative error for duplicates in 'machines'

* uncomment

* avoid test failure

* Revert "avoid test failure"

This reverts commit 9442bdf.

* [dask] add tutorial documentation (fixes #3814, fixes #3838) (#4030)

* [dask] add tutorial documentation (fixes #3814, fixes #3838)

* add notes on saving the model

* quick start examples

* add examples

* fix timeouts in examples

* remove notebook

* fill out prediction section

* table of contents

* add line back

* linting

* isort

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* move examples under python-guide

* remove unused pickle import

Co-authored-by: Nikita Titov <[email protected]>

* set 'pending' commit status for R Solaris optional workflow (#4061)

* [docs] add Yu Shi to repo maintainers (#4060)

* Update FAQ.rst

* Update CODEOWNERS

* set is_linear_ to false when it is absent from the model file (fix #3778) (#4056)

* Add CMake option to enable sanitizers and build gtest (#3555)

* Add CMake option to enable sanitizer

* Set up gtest

* Address reviewer's feedback

* Address reviewer's feedback

* Update CMakeLists.txt

Co-authored-by: Nikita Titov <[email protected]>

Co-authored-by: Nikita Titov <[email protected]>

* added type hint (#4070)

* [ci] run Dask examples on CI (#4064)

* Update Parallel-Learning-Guide.rst

* Update test.sh

* fix path

* address review comments

* [python-package] add type hints on Booster.set_network() (#4068)

* [python-package] add type hints on Booster.set_network()

* change behavior

* [python-package] Some mypy fixes (#3916)

* Some mypy fixes

* address James' comments

* Re-introduce pass in empty classes

* Update compat.py

Remove extra lines

* [dask] [ci] fix flaky network-setup test (#4071)

* [tests][dask] simplify code in Dask tests (#4075)

* simplify Dask tests code

* enable CI

* disable CI

* Revert "[ci] prevent getting incompatible dask and distributed versions (#4054)" (#4076)

This reverts commit 4e9c976.

* Fix parsing of non-finite values (#3942)

* Fix index out-of-range exception generated by BaggingHelper on small datasets.

Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

* Update goss.hpp

* Update goss.hpp

* Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

* Fix incorrect upstream merge

* Add link to LightGBM.NET

* Fix indenting to 2 spaces

* Dummy edit to trigger CI

* Dummy edit to trigger CI

* remove duplicate functions from merge

* Fix parsing of non-finite values.  Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match.  No attempt to optimise string allocations in this implementation since it is usually rarely invoked.

* Dummy commit to trigger CI

* Also handle -nan in double parsing method

* Update include/LightGBM/utils/common.h

Remove trailing whitespace to pass linting tests

Co-authored-by: Nikita Titov <[email protected]>

Co-authored-by: matthew-peacock <[email protected]>
Co-authored-by: Guolin Ke <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>

* [dask] remove unused imports from typing (#4079)

* Range check for DCG position discount lookup (#4069)

* Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data.

* Change debug logging location so that we can print the data file name as well.

* Revert "Change debug logging location so that we can print the data file name as well."

This reverts commit 3981b34.

* Add data file name to debug logging.

* Move log line to a place where it is output even when query IDs are read from a separate file.

* Also add the out-of-range check to rank metrics.

* Perform check after number of queries is initialized.

* Update

* [ci] upgrade R CI scripts to work on Ubuntu 20.04 (#4084)

* [ci] install additional LaTeX packages in R CI jobs

* update autoconf version

* bump upper limit on package size to 100

* [SWIG] Add streaming data support + cpp tests (#3997)

* [feature] Add ChunkedArray to SWIG

* Add ChunkedArray
* Add ChunkedArray_API_extensions.i
* Add SWIG class wrappers

* Address some review comments

* Fix linting issues

* Move test to tests/test_ChunkedArray_manually.cpp

* Add test note

* Move ChunkedArray to include/LightGBM/utils/

* Declare more explicit types of ChunkedArray in the SWIG API.

* Port ChunkedArray tests to googletest

* Please C++ linter

* Address StrikerRUS' review comments

* Update SWIG doc & disable ChunkedArray<int64_t>

* Use CHECK_EQ instead of assert

* Change include order (linting)

* Rename ChunkedArray -> chunked_array files

* Change header guards

* Address last comments from StrikerRUS

* store all CMake files in one place (#4087)

* v3.2.0 release (#3872)

* Update VERSION.txt

* update appveyor.yml and configure

* fix Appveyor builds

Co-authored-by: James Lamb <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>
Co-authored-by: StrikerRUS <[email protected]>

* [ci] Bump version for development (#4094)

* Update .appveyor.yml

* Update cran-comments.md

* Update VERSION.txt

* update configure

Co-authored-by: James Lamb <[email protected]>

* [ci] fix flaky Azure Pipelines jobs (#4095)

* Update test.sh

* Update setup.sh

* Update .vsts-ci.yml

* Update test.sh

* Update setup.sh

* Update .vsts-ci.yml

* Update setup.sh

* Update setup.sh

Co-authored-by: Subham Agrawal <[email protected]>
Co-authored-by: James Lamb <[email protected]>
Co-authored-by: shiyu1994 <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>
Co-authored-by: jmoralez <[email protected]>
Co-authored-by: marcelonieva7 <[email protected]>
Co-authored-by: Philip Hyunsu Cho <[email protected]>
Co-authored-by: Deddy Jobson <[email protected]>
Co-authored-by: Alberto Ferreira <[email protected]>
Co-authored-by: mjmckp <[email protected]>
Co-authored-by: matthew-peacock <[email protected]>
Co-authored-by: Guolin Ke <[email protected]>
Co-authored-by: ashok-ponnuswami-msft <[email protected]>
Co-authored-by: StrikerRUS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

Co-authored-by: James Lamb <[email protected]>
Co-authored-by: Subham Agrawal <[email protected]>
Co-authored-by: shiyu1994 <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>
Co-authored-by: jmoralez <[email protected]>
Co-authored-by: marcelonieva7 <[email protected]>
Co-authored-by: Philip Hyunsu Cho <[email protected]>
Co-authored-by: Deddy Jobson <[email protected]>
Co-authored-by: Alberto Ferreira <[email protected]>
Co-authored-by: mjmckp <[email protected]>
Co-authored-by: matthew-peacock <[email protected]>
Co-authored-by: Guolin Ke <[email protected]>
Co-authored-by: ashok-ponnuswami-msft <[email protected]>
Co-authored-by: StrikerRUS <[email protected]>
@cyfdecyf
Copy link
Contributor

@shiyu1994 Spent sometime to construct a simple example. I'm not able to reproduce the exact symptoms I encounted with this simple example, only num_threads will impact the prediction result.

Here's the steps:

Build 2 lightgbm binaries with commit fcfd413 and its previous commit d90a16d

  • The bug is introduced in commit fcfd413
  • Name the 2 binaries as lightgbm.fcfd4132 and lightgbm.d90a16d5
  • I'm compiling the binary on Ubuntu 18.04 with g++ 7.5.0

Using the regression example:

cd examples/regression
ln -s ../../lightgbm.fcfd413 lightgbm.buggy
ln -s ../../lightgbm.d90a16d lightgbm.old

./lightgbm.old config=train.conf
./lightgbm.old config=predict.conf num_threads=1 deterministic=true output_result=old-1.txt
./lightgbm.old config=predict.conf num_threads=8 deterministic=true output_result=old-8.txt
./lightgbm.old config=predict.conf num_threads=16 deterministic=true output_result=old-16.txt
./lightgbm.buggy config=predict.conf num_threads=1 deterministic=true output_result=buggy-1.txt
./lightgbm.buggy config=predict.conf num_threads=8 deterministic=true output_result=buggy-8.txt
./lightgbm.buggy config=predict.conf num_threads=16 deterministic=true output_result=buggy-16.txt
md5sum old-*.txt buggy-*.txt

Here's the md5sum output on my server:

08888ca63a5500113856f2d6759aba50  old-1.txt
08888ca63a5500113856f2d6759aba50  old-8.txt
08888ca63a5500113856f2d6759aba50  old-16.txt
4e51234584f2a4c19a587cf298e8d792  buggy-1.txt
843753a4c9589969619c3c8e45f42010  buggy-8.txt
51e5da0cafaaf50ac763fcf71c28ee87  buggy-16.txt

If I change the train step to ./lightgbm.buggy config=train.conf (with the model file contianing is_linear=0), then all prediction results will be the same.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants