Unable to run recent CUDA-enabled docker image against GPU #1584

Closed
xrd opened this issue Dec 13, 2022 · 4 comments

Comments


xrd commented Dec 13, 2022

I'm trying to use the latest docker image to run a neural network example on GPU/CUDA.

Environment

  • Elixir & Erlang/OTP versions (elixir --version): Erlang/OTP 24 [erts-12.3.2.2] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit] [x86_64-pc-linux-gnu]
  • Operating system: Linux (NixOS)
  • How have you started Livebook: docker
  • Livebook version (use git rev-parse HEAD if running with mix): v0.8.0
  • Browsers that reproduce this bug (the more the merrier): Firefox
  • Include what is logged in the browser console: see below
  • Include what is logged to the server console: see below

Current behavior

Start the Docker container with GPU access enabled:

sudo docker run -p 8080:8080 -p 8081:8081 --gpus all --pull always -e LIVEBOOK_PASSWORD="securesecret" livebook/livebook
latest: Pulling from livebook/livebook
Digest: sha256:a61ce1bfa5fb17b43b77590af2c77a7207c337f2a267fe4c5c95379b24299d08
Status: Image is up to date for livebook/livebook:latest
[Livebook] Application running at http://0.0.0.0:8080/

Go to http://localhost:8080.

Go to settings, set XLA_TARGET to cuda118.
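As a quick sanity check (a minimal sketch, assuming the Livebook setting is exposed to the notebook runtime as an environment variable), the value can be read back from a cell:

# Should return "cuda118" if the setting from the Livebook UI
# reached the runtime's environment.
System.get_env("XLA_TARGET")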

Create a new notebook. Add the Mix install code:

Mix.install(
  [
    {:kino_bumblebee, "~> 0.1.0"},
    {:exla, "~> 0.4.1"}
  ],
  config: [nx: [default_backend: EXLA.Backend, client: :cuda]]
)

The debug output shows it downloading the CUDA-enabled version of XLA for Linux:

...
Generated tokenizers app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 29 files (.ex)
Generated nx app

19:28:42.502 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it

19:28:56.426 [info] Successfully downloaded the XLA archive
==> exla
Unpacking /home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz into /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/deps/exla/cache
g++ -fPIC -I/usr/local/lib/erlang/erts-12.3.2.2/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++17 -w -DLLVM_ON_UNIX=1 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
==> kino
Compiling 37 files (.ex)
...
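
Before adding the smart cell, a minimal sketch to confirm the config took effect (the #=> line is illustrative and assumes the install above succeeded):

# With the config above, Nx should report EXLA as the default backend,
# and inspecting a small tensor should name EXLA.Backend as well.
Nx.default_backend()
#=> {EXLA.Backend, []}

Nx.tensor([1.0, 2.0, 3.0]) |> Nx.multiply(2)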

Once :ok is displayed, add a "Neural Network" smart cell, select txt2image, and click evaluate.

It sees the GPU, and then fails with:

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warn] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

On the server console we see this:


17:47:46.749 [debug] Downloading NIF from https://github.com/elixir-nx/tokenizers/releases/download/v0.2.0/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz

17:47:47.206 [debug] NIF cached at /home/livebook/.cache/rustler_precompiled/precompiled_nifs/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/c25e041b25205711d20d9e43d9305779/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

17:47:50.313 [info] Found a matching archive (xla_extension-x86_64-linux-cuda118.tar.gz), going to download it
--2022-12-13 17:47:50--  https://github.com/elixir-nx/xla/releases/download/v0.4.1/xla_extension-x86_64-linux-cuda118.tar.gz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-12-13 17:47:50--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/404832884/dbb643d0-4c9d-4f21-beba-85d4d39c63b0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221213%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221213T174750Z&X-Amz-Expires=300&X-Amz-Signature=9e1175537694e7126aec7c055bb527074e146feb35b611dbf363e27173a3ba69&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=404832884&response-content-disposition=attachment%3B%20filename%3Dxla_extension-x86_64-linux-cuda118.tar.gz&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190627018 (182M) [application/octet-stream]
Saving to: ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’

0K .......... .......... .......... .......... ..........  0% 3.64M 50s
...
186150K .........                                             100%  646M=5.6s

2022-12-13 17:47:56 (32.4 MB/s) - ‘/home/livebook/.cache/xla/0.4.1/cache/download/xla_extension-x86_64-linux-cuda118.tar.gz’ saved [190627018/190627018]


17:47:56.424 [info] Successfully downloaded the XLA archive
2022-12-13 17:48:27.986283: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 17:48:27.986309: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

17:49:22.248 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

17:49:22.248 [info] XLA service 0x7f09380093c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

17:49:22.248 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

17:49:22.248 [info] Using BFC allocator.

17:49:22.248 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

17:49:34.263 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [info] Start cannot spawn child process: No such file or directory

17:49:34.451 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:00:34.297 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.14.2-erts-12.3.2.2/ebbef9fe980d37896f70eb44794d54a7/_build/dev/lib/tokenizers/priv/native/libex_tokenizers-v0.2.0-nif-2.16-x86_64-unknown-linux-gnu.so

2022-12-13 18:01:08.594294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-13 18:01:08.594336: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

18:01:22.741 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

18:01:22.741 [info] XLA service 0x7fa250017de0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

18:01:22.741 [info]   StreamExecutor device (0): NVIDIA GeForce GTX 1060 6GB, Compute Capability 6.1

18:01:22.741 [info] Using BFC allocator.

18:01:22.741 [info] XLA backend allocating 5659833139 bytes on device 0 for BFCAllocator.

18:01:23.571 [warning] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [info] Start cannot spawn child process: No such file or directory

18:01:23.762 [warning] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
[FATAL] tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454 ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.

18:01:23.762 [info] Start cannot spawn child process: No such file or directory
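
The warning suggests pointing XLA at a CUDA installation explicitly. A minimal sketch of doing that from a notebook cell, before anything compiles (this assumes a CUDA toolkit actually exists at /usr/local/cuda inside the container; it will not help if the image ships no CUDA at all):

# Tell XLA's JIT where to look for nvvm/libdevice and ptxas.
# The /usr/local/cuda path is an assumption about the container layout.
System.put_env("XLA_FLAGS", "--xla_gpu_cuda_data_dir=/usr/local/cuda")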

Expected behavior

It should let me use the GPU.

NB: I do have a GPU:

$ nvidia-smi 
Tue Dec 13 19:21:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P2    22W / 120W |     15MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14189      G   ...xorg-server-1.20.14/bin/X        9MiB |
|    0   N/A  N/A     14217      G   ...hell-43.1/bin/gnome-shell        2MiB |
+-----------------------------------------------------------------------------+


Also, I can run other docker images against the GPU:

$ git remote -v
origin  https://github.com/fboulnois/stable-diffusion-docker.git (fetch)
origin  https://github.com/fboulnois/stable-diffusion-docker.git (push)
$ ./build.sh dev
$ python -c "import torch; print(torch.cuda.is_available())"
True
josevalim (Contributor) commented

The docker image is not compiled with CUDA, but we just released a cuda tag. Can you please try it instead?


xrd commented Dec 13, 2022

@josevalim Yes, thank you. I saw the recent PR land on main, so I assumed it would be in the regular docker image. I'll try that and report back.


xrd commented Dec 13, 2022

@josevalim Thank you. It works but then runs out of memory.

I tried the mixed-precision example from elixir-nx/bumblebee#101 (comment), but I'm not sure I converted the code correctly; it is not working yet. I'll keep experimenting. Thanks for your assistance!

repository_id = "CompVis/stable-diffusion-v1-4"
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
##
policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

##
## Added according to the comment in bumblebee issue 101
## (the policy is applied to the VAE's Axon model)
##
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})

{:ok, featurizer} =
  Bumblebee.load_featurizer({:hf, repository_id, subdir: "feature_extractor"})

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 3,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 5],
    defn_options: [compiler: EXLA]
  )
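
For reference, a hedged usage sketch for the serving above (the prompt is illustrative; each entry in output.results carries the generated image as a tensor):

# Run the pipeline on a prompt and render the result(s) in Livebook.
output = Nx.Serving.run(serving, "a watercolor painting of a lighthouse")

for result <- output.results do
  Kino.Image.new(result.image)
end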

josevalim (Contributor) commented

Yeah, we have done zero optimizations, so we hope there is a bunch to gain as we explore it!
