
gpu_id error in R when it's non-zero (xgboost 0.81.0.1) #3850

Closed

joegaotao opened this issue Nov 1, 2018 · 9 comments
@joegaotao

I have installed the newest multi-GPU xgboost from source (0.81.0.1), and the server has 8 GPUs. But when I change gpu_id to a non-zero value, I get errors. The parameter n_gpus works fine. The test code:

library('xgboost')
# Simulate N x p random matrix with some binomial response dependent on pp columns
set.seed(111)
N <- 1000000
p <- 50
pp <- 25
X <- matrix(runif(N * p), ncol = p)
betas <- 2 * runif(pp) - 1
sel <- sort(sample(p, pp))
m <- X[, sel] %*% betas - 1 + rnorm(N)
y <- rbinom(N, 1, plogis(m))

tr <- sample.int(N, N * 0.75)
dtrain <- xgb.DMatrix(X[tr,], label = y[tr])
dtest <- xgb.DMatrix(X[-tr,], label = y[-tr])
wl <- list(train = dtrain, test = dtest)

# An example of running 'gpu_hist' algorithm
# which is
# - similar to the 'hist'
# - the fastest option for moderately large datasets
# - current limitations: max_depth < 16, does not implement guided loss
# You can use tree_method = 'gpu_exact' for another GPU accelerated algorithm,
# which is slower, more memory-hungry, but does not use binning.
param <- list(objective = 'reg:logistic', eval_metric = 'auc', subsample = 0.5, nthread = 4,
  max_bin = 64, tree_method = 'gpu_hist', gpu_id = 1)
pt <- proc.time()
bst_gpu <- xgb.train(param, dtrain, watchlist = wl, nrounds = 50)
proc.time() - pt

The error:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [12:58:22] /home/tgao/repos/xgboost/include/xgboost/./../../src/common/common.h:200: Check failed: Contains(device) 

Stack trace returned 10 entries:
[bt] (0) /home/tgao/rpkg/xgboost/libs/xgboost.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f6d5395a0da]
[bt] (1) /home/tgao/rpkg/xgboost/libs/xgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f6d5395abd8]
[bt] (2) /home/tgao/rpkg/xgboost/libs/xgboost.so(+0x32cf45) [0x7f6d53bbaf45]
[bt] (3) /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x3f) [0x7f6e00a48cbf]
[bt] (4) /home/tgao/rpkg/xgboost/libs/xgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LogisticRegression>::GetGradient(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*)+0x765) [0x7f6d53bc4db5]
[bt] (5) /home/tgao/rpkg/xgboost/libs/xgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x1cb) [0x7f6d53a1adab]
[bt] (6) /home/tgao/rpkg/xgboost/libs/xgboost.so(XGBoosterUpdateOneIter+0x48) [0x7f6d5397c128]
[bt] (7) /home/tgao/rpkg/xgboost/libs/xgboost.so(XGBoosterUpdateOneIter_R+0x3c) [0x7f6d53ae6e8c]
[bt] (8) /usr/lib/R/lib/libR.so(+0x127ad1) [0x7f6e027e5ad1]
[bt] (9) /usr/lib/R/lib/libR.so(Rf_eval+0x380) [0x7f6e027ef430]

sessionInfo():

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_0.81.0.1

loaded via a namespace (and not attached):
[1] compiler_3.5.1    magrittr_1.5      Matrix_1.2-14     tools_3.5.1       stringi_1.2.4     grid_3.5.1        data.table_1.11.8 lattice_0.20-35
@hcho3
Collaborator

hcho3 commented Nov 1, 2018

@joegaotao Did you compile with -DUSE_NCCL=1?

@joegaotao
Author

@hcho3 I just compiled with -DUSE_NCCL=ON according to the doc. Should I change it to -DUSE_NCCL=1?

@hcho3
Collaborator

hcho3 commented Nov 1, 2018

Got it. -DUSE_NCCL=ON is equivalent to -DUSE_NCCL=1.

@trivialfis
Member

@joegaotao Hi, what do you mean by

Parameter n_gpus is ok

Does it mean that after specifying n_gpus = -1 or n_gpus = 8, XGBoost runs as expected? Or does it mean something else?

I will take a look tomorrow. By default XGBoost uses 1 GPU, which leads to my initial guess that gpu_id has to be 0 since you are only using 1 GPU.

@joegaotao
Author

joegaotao commented Nov 1, 2018

@trivialfis Yes, I tested n_gpus = -1 or n_gpus = 8 with the default gpu_id (= 0), and XGBoost ran as expected. But I want to run multiple R sessions, each using a specific GPU, and I found I can't change gpu_id to another device; instead, all R sessions use the default first GPU, leading to out-of-memory errors.
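(Not from this thread, but a workaround worth noting while gpu_id is broken: since each process picks its GPU at launch anyway, you can restrict which physical device the CUDA runtime exposes to each R session via the CUDA_VISIBLE_DEVICES environment variable. Inside each session the pinned GPU is then enumerated as device 0, so gpu_id can stay at its default. A minimal shell sketch, where the echo placeholder stands in for a real `Rscript train_model.R` invocation:)

```shell
# Hypothetical sketch: launch two independent sessions, each pinned to a
# different physical GPU. CUDA_VISIBLE_DEVICES limits which devices the
# CUDA runtime exposes, so within each session the pinned GPU is device 0.
# Replace the echo placeholder with e.g.: Rscript train_model.R
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "session 1 sees GPU(s): $CUDA_VISIBLE_DEVICES"'
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "session 2 sees GPU(s): $CUDA_VISIBLE_DEVICES"'
```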

@hcho3
Collaborator

hcho3 commented Nov 1, 2018

@trivialfis Is this issue a release blocker?

@trivialfis
Member

I will try to look into it tomorrow. Addressing it requires some reconsideration of how we specify GPUs, plus adding proper tests, so a "quick fix" might not be possible.

But please note that even once this is fixed, users still can't change their GPU within the same process without deleting the loaded data.

@hcho3
Collaborator

hcho3 commented Nov 1, 2018

Got it. We may want to add this as a known issue then.

@trivialfis
Member

Addressed in #3851 .
