External Memory Version #244
Not yet tested, but it should be interesting to check the behavior of out-of-core learning + feature hashing. For R: |
I just converted my h5 files to libsvm format and tried it out, but I keep getting an error. I'm unsure whether it's related to the external memory support or to my libsvm-formatted files. My data has quite a few NaNs, it's a multiclass classification problem, and the train file is ~2GB. I have the following code using the wrapper.
And I keep getting the error when I run the function:
I'm trying to trace the code, but it's proving to be difficult. When I loaded the matrix in-memory, this function worked fine. I think my format is correct:
I hope I can get this working soon so I can handle large files easily. |
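For reference, external-memory mode is enabled by appending a cache-file suffix to the data path. A minimal sketch with the Python wrapper (file names and parameters are illustrative, not the reporter's actual code):

import xgboost as xgb

# Everything after '#' names the on-disk cache file that xgboost creates
# and pages data through; 'train.libsvm' is a placeholder path.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
param = {'objective': 'multi:softmax', 'num_class': 5}  # assumed multiclass setup
bst = xgb.train(param, dtrain, num_boost_round=10)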
Please remove the NaNs directly from the dataset; in libsvm format, simply not including a feature marks it as missing. So the first line will be
|
Thank you. I thought missing in libsvm format equated to zero. So, is there then no difference between zero and nan in this format, or do I need to specify zero fields directly? There are some fields where zero and nan have different meanings. |
If you think zero and nan mean different things, you can specify zero explicitly. |
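To illustrate the distinction with hypothetical values (label first, then index:value pairs; an index that is simply omitted is treated as missing):

1 0:0.8 3:2.5 10:1
0 1:0.3 3:0 10:2

In the second line, feature 3 is explicitly zero, while feature 0 does not appear and is therefore missing.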
Attempting to fit a gbm to the following DMatrix results in identical error rates on the train and eval sets at every iteration:
Fitting the same dataset without the #**.cache tag does not. Maybe some wires are being crossed in the caching? This was pulled from the most recent GitHub master. The rest of the params are here:
|
@o1lo01ol1o Is it possible to give a complete code example and dataset to reproduce the error? Thanks |
@tqchen Unfortunately I can't share the dataset. I can tell you that prior to the code I posted, missing values were filled with -999 and the numpy arrays were saved to libsvm using the |
Hello everyone,
What is the best approach to discovering the source of the problem and trying to fix it? BTW, though probably irrelevant, these are my parameters:
|
@erfannoury Thanks for trying the external memory version! I do need your help in finding the source of the problem. Please try to locate where the segfault happens. The code is likely to be around here: https://github.com/dmlc/xgboost/blob/master/src/io/page_fmatrix-inl.hpp#L322 You can try to trace it, or add prints around the clause. The C++ CLI version might be easier for debugging in such a case. It could also be a parsing error; the current parser does not parse NaN (see the earlier posts in this thread), so please check whether NaN appears in the dumped svm file. Thanks for using xgboost and trying this new feature! |
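A quick way to run that check on the dumped file (a sketch; the path is a placeholder):

# Scan a libsvm text file for 'nan' tokens, which the parser cannot handle
with open('train.libsvm') as f:
    for lineno, line in enumerate(f, 1):
        if 'nan' in line.lower():
            print('nan found on line', lineno)
            break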
@erfannoury also remember to delete the old col.blob file before your next run. It may also be helpful to try a subset of your data and see if the problem persists
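For example, a small cleanup sketch (the col.blob name comes from the comment above; the glob pattern is an assumption):

import glob, os

# Delete stale external-memory cache files before re-running
for path in glob.glob('*col.blob*'):
    os.remove(path)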
@tqchen I'm now working on this and trying to find where the segfault happens, though I'm a bit busy these days. As soon as I find a clue, I'll let you know. Also, there are no NaN values in the dumped svm file. Thank you for this great and powerful library. It would be great if I could be of any help. |
@erfannoury Thanks for doing this! I really appreciate it. If you can grab a minimum dataset(try less and less until the problem disappear) that reproduces the segfault, it might be easier to find where things went wrong. |
@tqchen I'm also running into issues here. In R I got an error related to "unknown updater:grow_histmaker"; in Python, just a segfault. My dataset is 50GB+, but I managed to extract a piece with 100k instances that is 13MB compressed. How can I send it to you? |
@lucaseustaquio you can send it to my UW email listed on my homepage. Thanks |
@tqchen Unfortunately, I have been busy lately and I haven't yet managed to find the problem. However, I'm working on it. |
@erfannoury Thanks for the catch, I pushed a fix to this |
@tqchen just as a reminder that I'm working on this issue. 😃 |
@erfannoury Really sorry for being slow to respond; I was occupied recently. Normally I do not use a debugger either; I use printf to narrow things down. Very primitive, but sometimes effective when a debugger is not available (e.g. in a distributed setting). Let me know if you have any findings. Thanks! |
@tqchen it's ok. |
Hi tqchen,
# feature.names are defined ..
tra <- train[, feature.names]
dval <- xgb.DMatrix(data = data.matrix(tra[h, ]), label = log(train$Sales + 1)[h])
dtrain <- xgb.DMatrix(data = data.matrix(tra[-h, ]), label = log(train$Sales + 1)[-h])
xgb.DMatrix.save(dval, '..\\data\\xgb.DMatrix.dval')
xgb.DMatrix.save(dtrain, '..\\data\\xgb.DMatrix.dtrain')
dval <- xgb.DMatrix(data = '..\\data\\xgb.DMatrix.dval#cache')
dtrain <- xgb.DMatrix(data = '..\\data\\xgb.DMatrix.dtrain#cache')
watchlist <- list(val = dval, train = dtrain)
param <- list(objective = "reg:linear",
              booster = "gbtree",
              eta = 0.02,           # 0.06, # 0.01
              max_depth = 500,      # changed from default of 8
              subsample = 1,        # 0.7
              colsample_bytree = 1  # 0.7
)
clf <- xgb.train(params = param,
                 data = dtrain,
                 nrounds = 5000,
                 verbose = 0,
                 early.stop.round = 5,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMPSE)
I get the message:
|
The external memory version was disabled in the standard R release, mainly to meet the restriction of the CRAN standard of strict C++98. We are looking into enabling this in R as well; for now, you can try it in the Python version. |
@tqchen thanks for your answer ! |
I was curious whether R is expected to get this external memory version anytime soon. Building xgboost in Python is not an option for me. Thanks!
@tqchen On Windows, I get the error |
I created an external-memory workflow using the libsvm format with a cache file, and it works great for normal training. However, when using CV I don't get the expected caching behavior. Looking at the code, this might be because of the slice call in the mknfold method, which creates new DMatrices from the original (cached) DMatrix. Thanks a lot, Joris
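One possible workaround, pending a fix (a sketch under the assumption that the data can be pre-split into per-fold libsvm files; names and parameters are illustrative): run the folds manually so each fold gets its own cached DMatrix instead of an in-memory slice.

import xgboost as xgb

params = {'objective': 'binary:logistic'}  # illustrative
for k in range(5):
    # Each fold reads its own pre-split libsvm file with its own cache,
    # avoiding the in-memory slice taken by xgb.cv.
    dtrain = xgb.DMatrix('train_fold%d.libsvm#train%d.cache' % (k, k))
    dvalid = xgb.DMatrix('valid_fold%d.libsvm#valid%d.cache' % (k, k))
    bst = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dvalid, 'valid')])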
The beta version of external-memory xgboost is now ready; see https://github.com/dmlc/xgboost/blob/master/doc/external_memory.md
I am looking for people to try it out.