Unable to run recent CUDA-enabled docker image against GPU #1584

Closed
xrd opened this issue Dec 13, 2022 · 4 comments

Comments


xrd commented Dec 13, 2022

I'm trying to use the latest docker image to run a neural network example on GPU/CUDA.

Environment

  • Elixir & Erlang/OTP versions (elixir --version): Erlang/OTP 24 [erts-12.3.2.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit] [x86_64-pc-linux-gnu]
  • Operating system: Linux (NixOS)
  • How have you started Livebook: docker
  • Livebook version (use git rev-parse HEAD if running with mix): v0.8.0
  • Browsers that reproduce this bug (the more the merrier): Firefox
  • Include what is logged in the browser console: see below
  • Include what is logged to the server console: see below

Current behavior

Start the Docker container with GPU access enabled:

sudo docker run -p 8080:8080 -p 8081:8081 --gpus all --pull always -e LIVEBOOK_PASSWORD="securesecret" livebook/livebook
latest: Pulling from livebook/livebook
Digest: sha256:a61ce1bfa5fb17b43b77590af2c77a7207c337f2a267fe4c5c95379b24299d08
Status: Image is up to date for livebook/livebook:latest
[Livebook] Application running at http://0.0.0.0:8080/

Go to http://localhost:8080.

Go to settings, set XLA_TARGET to cuda118.
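As a quick sanity check (a minimal sketch, assuming the Livebook setting is exposed to the notebook runtime as an environment variable), the value can be read back from a cell:

# Should return "cuda118" if the setting from the Livebook UI
# reached the runtime's environment.
System.get_env("XLA_TARGET")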

Create a new notebook. Add the Mix install code:

Mix.install(
  [
    {:kino_bumblebee, "~> 0.1.0"},
    {:exla, "~> 0.4.1"}
  ],
  config: [nx: [default_backend: EXLA.Backend, client: :cuda]]
)

The debug output shows it downloading the CUDA-enabled version of XLA for Linux:

...
Generated tokenizers app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 29 files (.ex)
Generated nx app

19:28:42.502 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it

19:28:56.426 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz into /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/deps/exla/cache
g++ -fPIC -I/usr/local/lib/erlang/erts-12.3.2.2/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_ON_UNIX=1 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
==> kino
Compiling 37 files (.ex)
...
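
Before adding the smart cell, a minimal sketch to confirm the config took effect (the #=> line is illustrative and assumes the install above succeeded):

# With the config above, Nx should report EXLA as the default backend,
# and inspecting a small tensor should name EXLA.Backend as well.
Nx.default_backend()
#=> {EXLA.Backend, []}

Nx.tensor([1.0, 2.0, 3.0]) |> Nx.multiply(2)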

Once :ok is displayed, add a "Neural Network" smart cell, select txt2image, and click evaluate.

It sees the GPU, and then fails with:

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warn] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

On the server console we see this:


17:47:46.749 [debug] Downloading NIF from https://github.com/elixir-nx/tokenizers/releases/download/v0.2.0/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz

17:47:47.206 [debug] NIF cached at /home/livebook/.cache/rustler_precompiled/precompiled_nifs/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/c25e041b25205711d20d9e43d9305779/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

17:47:50.313 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it
--2022-12-13 17:47:50--  https://github.com/elixir-nx/xla/releases/download/v0.4.1/xla_extension-x86_64-linux-cuda118.tar.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-12-13 17:47:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190627018 (182M) [application/octet-stream]
Saving to: ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’

0K .......... .......... .......... .......... ..........  0% 3.64M 50s
...
186150K .........                                             100%  646M=5.6s

2022-12-13 17:47:56 (32.4 MB/s) - ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’ saved [190627018/190627018]


17:47:56.424 [info] Successfully downloaded the XLA archive
2022-12-13 17:48:27.986283: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 17:48:27.986309: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

17:49:22.248 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

17:49:22.248 [info] XLA service 0x7f09380093c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

17:49:22.248 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

17:49:22.248 [info] Using BFC allocator.

17:49:22.248 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

17:49:34.263 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:00:34.297 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

2022-12-13 18:01:08.594294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 18:01:08.594336: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory
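
The warning suggests pointing XLA at a CUDA installation explicitly. A minimal sketch of doing that from a notebook cell, before anything compiles (this assumes a CUDA toolkit actually exists at /usr/local/cuda inside the container; it will not help if the image ships no CUDA at all):

# Tell XLA's JIT where to look for nvvm/libdevice and ptxas.
# The /usr/local/cuda path is an assumption about the container layout.
System.put_env("XLA_FLAGS", "--xla_gpu_cuda_data_dir=/usr/local/cuda")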

Expected behavior

It should let me use the GPU.

NB: I do have a GPU:

$ nvidia-smi 
Tue Dec 13 19:21:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P2    22W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+


Also, I can run other docker images against the GPU:

$ git remote -v
origin  https://github.com/fboulnois/stable-diffusion-docker.git (fetch)
origin  https://github.com/fboulnois/stable-diffusion-docker.git (push)
$ ./build.sh dev
$ python -c "import torch; print(torch.cuda.is_available())"
True
josevalim (Contributor) commented

The docker image is not compiled with CUDA, but we just released a cuda tag. Can you please try it instead?


xrd commented Dec 13, 2022

@josevalim Yes, thank you. I saw the recent PR land on main, so I assumed it would be in the regular docker image. I'll try that and report back.


xrd commented Dec 13, 2022

@josevalim Thank you. It works but then runs out of memory.

I tried the mixed-precision example from elixir-nx/bumblebee#101 (comment), but I'm not sure I converted the code correctly; it is not working yet. I'll keep experimenting. Thanks for your assistance!

repository_id = "CompVis/stable-diffusion-v1-4"
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
##
policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
## (the policy is applied to the VAE's Axon model)
##
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

{:ok, featurizer} =
  Bumblebee.load_featurizer({:hf, repository_id, subdir: "feature_extractor"})

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 3,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 5],
    defn_options: [compiler: EXLA]
  )
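
For reference, a hedged usage sketch for the serving above (the prompt is illustrative; each entry in output.results carries the generated image as a tensor):

# Run the pipeline on a prompt and render the result(s) in Livebook.
output = Nx.Serving.run(serving, "a watercolor painting of a lighthouse")

for result <- output.results do
  Kino.Image.new(result.image)
end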

josevalim (Contributor) commented

Yeah, we have done zero optimizations, so we hope there is a bunch to gain as we explore it!
