# Reloading dataset broken with init_model #6144
Thanks for the excellent report! Sorry for the long delay in responding; this project is struggling from a lack of maintainer availability.

I was able to reproduce this on the most recent commit. Built the library like this (on an M2 Mac, with Python 3.11.7):

```shell
cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh install --precompile
```

And ran your example code. Saw exactly the same error you did.

I see the problem. When you provide an `init_model`, the Python package tries to use it to generate predictions on the training data (`python-package/lightgbm/basic.py`, lines 2042 to 2046 at 5dfe716). That code has logic along the lines of "if the data is a string, treat it as a path to a text data file" (lines 1150 to 1163, and lines 263 to 266, at 5dfe716).

So this error comes from the fact that, as of this writing, LightGBM's prediction routines (in Python, R, and C) do not support generating predictions on an already-constructed `Dataset`. #4546 is the main feature request tracking that work, along with the prior discussions about adding `predict()` support on the `Dataset`. #5191 could also help in the Python package specifically, as an inefficient workaround.
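To make the failure mode concrete, here is a simplified sketch of that dispatch logic. This is a hypothetical paraphrase, not LightGBM's actual code (the real logic is in `basic.py` at the lines referenced above), and the function names are invented for illustration:

```python
# Hypothetical paraphrase of the predict dispatch; not LightGBM's
# actual implementation (see basic.py for the real code).
def _predict_for_init_score(data):
    if isinstance(data, str):
        # A string is assumed to be a path to a *text* data file
        # (e.g. CSV/TSV/LibSVM). A binary Dataset file lands in this
        # branch and fails to parse, producing the reported error.
        return _predict_from_text_file(data)
    # In-memory arrays take a different path; there is no branch that
    # accepts an already-constructed Dataset object.
    return _predict_from_array(data)

def _predict_from_text_file(path):
    # Stand-in for the C-level routine that parses text data files.
    raise ValueError(f"cannot parse '{path}' as a text data file")

def _predict_from_array(arr):
    # Stand-in for the in-memory prediction routine.
    return [0.0] * len(arr)
```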
Until #4546 is resolved, the best workaround I can think of is to do something like the following:
```python
import numpy as np
import lightgbm as lgb
np.random.seed(0)
X, y = np.random.normal(size=(10_000, 20)), np.random.normal(size=(10_000,))
params = {
"verbose": -1,
"seed": 1,
"num_iterations": 10,
"bagging_freq": 1,
"bagging_fraction": 0.5
}
dataset_bin = "dataset.bin"
model_txt = "model.txt"
# save the raw training data
np.save("data.npy", X)
np.save("label.npy", y)
# train a model and save it
ds = lgb.Dataset(X, label=y, params=params)
model = lgb.train(params, train_set=ds)
model.save_model(model_txt)
model.num_trees()
# 10
# save the Dataset in binary format
ds.save_binary(dataset_bin)
# clear everything out of memory, to simulate stopping this
# process and starting a new one
del ds
del model
del X
del y
# load the Dataset and raw training data
X = np.load("data.npy")
y = np.load("label.npy")
ds = lgb.Dataset(data=dataset_bin, params=params)
# create a new Dataset, using the bin mappings from the original one
ds2 = lgb.Dataset(
data=X,
label=y,
reference=ds
)
# continue training
model = lgb.train(params, train_set=ds2, init_model=model_txt)
model.num_trees()
# 20
```

That's inefficient relative to being able to just use the binary `Dataset` file directly. BUT... this should at least be faster than reconstructing a new `Dataset` from the raw data, since the bin mappings from the original `Dataset` are reused via `reference=ds`.
Realized today that there was an earlier issue documenting exactly the same thing (but in a different way, and with fewer details provided). I've closed that in favor of keeping the discussion here; see #4311 (comment).
## Description

`Dataset` has a `save_binary` function, and the docstring for the `data` argument in `Dataset` suggests that is where you should pass the path to this dataset. However, I cannot get this to work correctly in combination with an `init_model`. My goal here is to save both the dataset binary and the model so I can continue training later without reconstructing either the model or the dataset.
## Reproducible example

Here is the setup, similar to my other bug report. Loading and training without `init_model` goes fine. But then with `init_model` it fails; the stack trace suggests that the `init_model` step tries to read the dataset to create initial predictions but doesn't seem to be able to understand that it is a binary file.
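The original snippets are not included in this excerpt; the following is a minimal sketch consistent with the description (the `dataset.bin` and `model.txt` file names are assumptions, mirroring the workaround above):

```python
import numpy as np
import lightgbm as lgb

# Train once and save both the binary Dataset and the model.
X, y = np.random.normal(size=(1_000, 5)), np.random.normal(size=(1_000,))
params = {"verbose": -1, "num_iterations": 5}
ds = lgb.Dataset(X, label=y, params=params)
booster = lgb.train(params, train_set=ds)
ds.save_binary("dataset.bin")
booster.save_model("model.txt")

# Reloading the binary Dataset and training WITHOUT init_model works.
ds2 = lgb.Dataset(data="dataset.bin", params=params)
lgb.train(params, train_set=ds2)

# Reloading and training WITH init_model fails: the init_model step
# tries to parse "dataset.bin" as a text data file.
ds3 = lgb.Dataset(data="dataset.bin", params=params)
lgb.train(params, train_set=ds3, init_model="model.txt")
```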
## Environment info

LightGBM 4.0.0