Single configurable Dockerfile with multi-stage build + external bazel cache #41

vidlb · 2020-12-29T19:44:35Z

In response to issue #36 , this PR should considerably ease the docker build process, mostly thanks to build arguments and bazel remote cache. You can now build different docker images with custom base image, versions / branches using build arguments, the Dockerfile should be compatible with any branch. You can control env / flags and apt dependencies in external dedicated files.

Thanks to bazel remote cache you can build many docker images without having to rebuild TF (I believe you're going to LOVE it, bazel is such a nightmare !). Now you could just stop and restart a build without loosing progress !
You can choose either CPU or GPU with the only mandatory argument (BASE_IMG), cuda support is enabled if it is found in /usr/local/cuda-*.

I also added some little extras, like passwordless sudo, GUI-related packages option, CPU_RATIO arg to control number of jobs, args to keep TF and/or OTB sources, RTX 3000 support (CUDA compute capability 8.6) and more useful options - see the multibuild.sh script and README.md for available arguments and examples.
All files are now installed in /opt/otbtf (pip wheel as well, in /opt/otbtf/lib/python3.8).

Several changes were made in order to build with latest TF 2.4 and CUDA 11 drivers (ubuntu20.04 with python3.8, no more --noincompatible_do_not_split_linking_cmdline option, bazel and protobuf versions bumps), and since TF 2.3 (I believe) model path is tensorflow::tstring instead of std::string , so I made the fix in order to build, and ran some tests with my old CNN workflow, but I'm not sure this is enough, you may need to apply that fix elsewhere in the repo ? check this commit.
You can pull from our gitlab repo if you want to check it out !

Mission accomplished, it significantly reduced images size (in this example without GUI and OTB sources only for -dev) :
r1.7 is with the other, more complicated 'bruteforce' method.
Div by 2 for GPU, by 3 for CPU !

P.S.
Sorry for the mess with all those small commits ... I just push too easily and there were so many typos ^^

remicres · 2021-01-11T23:04:21Z

Hi @vidlb I just saw you MR. It is just amazing ! Can't wait to test that tomorrow !

vidlb · 2021-01-12T10:20:39Z

@remicres haaa finally =D
Hope it builds ! Please tell me how it goes... I didn't try with v2.2 or 2.3 but it should also work, with the right bazel version !

What I found really confusing is the protobuf version ! With v2.4, pip installed python protoc v3.14, but it seems I needed to build v3.9.2 in order to run TF... WTF ?

Also I hope the numpy version problem is not really one. This is the only thing that could require to modify the dockerfile, in order to set a maximum version when installing it with pip, if you want to build older TF branches.

But it won't change the fact that GDAL is installed with ubuntu's numpy (which is outdated), I guess it is a quite a mess if you try to import gdal, tf and otbApplication at the same time.... But it seems ok right now, or at least it worked for the last docker RUN statement.

remicres · 2021-01-12T17:40:01Z

GPU image built,uncompressed 15Gb instead of 22Gb, well done!
I am giving a try with the CPU image.
I guess when you push the image in the registry, it is compressed again?

vidlb · 2021-01-12T18:08:21Z

Yes it is compressed just before upload.
You still need a lot of space in /var/lib/docker if you want to keep all layers, docker will store layers for each different build tag...
See docker system df to check how big it is. When finished there are several ways to clean it, just try docker builder prune -a I went with the more brutal docker system prune ...

Now the magic with remote cache is that you could run system prune without the need to rebuild tf !
BUT I noticed something: the slightest change in bazel env and config / tensorflow git tag ===> bazel may skip all of the cache (used for some downloads but none of the built files)...

remicres · 2021-01-12T21:38:38Z

Well this is a great contribution. Least I can say!
I would be dumb not to merge that in otbtf asap. You are still working on the MR, I am waiting your go. Also, I think we can remove the old docker files, completely obsolete now... Many thanks Vincent

vidlb · 2021-01-13T08:22:31Z

Okay, I think I will push some last comments, and remove the dockerfiles dir !

I wasn't sure about that but I can confirm now, they (the TF dev team) did not build and test cuda11.1 yet... So 11.0 should be more stable for now.

I know it because I tried to build with TensorRT and faced this issue. I built with compute capabilities 8.6 but we can't be sure that TF is working with RTX 3000... Also it might explain why XLA isn't working. The CNN test I tried was OK though.

May be I should downgrade default version in the Dockerfile and multibuild script ?
But there was no ubuntu 20 docker image with both cuda11.0 and cudnn8.
So the other options are 11.0-cudnn8-devel-ubuntu18.04 and 10.2-cudnn7-devel-ubuntu18.04 for older branches.
There is a 11.0-devel-ubuntu20.04 but it would require to manually install cudnn with apt.

vidlb · 2021-01-13T19:59:28Z

Now you can merge it ! I downgraded default versions, and was able to remove some unnecessary steps :

duplicated headers copy (now just one file)
the ./tensorflow/lite/tools/make/download_dependencies.sh and ./tensorflow/lite/tools/make/build_lib.sh lines which are probably related to tensorflow-lite ?
the protobuf build : I guess it is built by bazel, we don't need to take care of it.

With those changes, images are now smaller (~12gb uncompressed for gpu-dev, 5gb compressed) and also because cuda 11.0 img is lighter.

I tried with XLA but it causes the same error than with CUDA11.1.
Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms).

So it may be useless to build with XLA enabled. Or may be we're missing something in conf.

Also I made a little commit to your executable python files (added a shebang) in order to link it in PATH in the docker images, so we can call it directly from any directory (without the explicit python call and file path).

Always happy to read this :
[32,728 / 33,004] Linking tensorflow/python/_pywrap_tensorflow_internal.so [for host]
INFO: Elapsed time: 107.384s, Critical Path: 15.97s
INFO: 22299 processes: 22299 remote cache hit.

And
r2.1-cpu 869.88 MiB

remicres · 2021-01-13T22:58:55Z

I believe that there is an issue with WSL (windows) when XLA is not enabled (see #24).
Have checked how images work on Windows?

vidlb · 2021-01-13T23:12:58Z

Nope, you want to check before merge ?

vidlb · 2021-01-13T23:21:34Z

Ok you can wait before merge, there may be some other things to investigate, also for the cpu build. I'll let you know if it works !

remicres · 2021-01-15T11:49:55Z

tools/docker/build-env-tf.sh

+export TF_NEED_CUDA=0
+export CUDA_TOOLKIT_PATH=$(find /usr/local -maxdepth 1 -type d -name 'cuda-*')
+if  [ ! -z $CUDA_TOOLKIT_PATH ] ; then
+    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_TOOLKIT_PATH/lib64:$CUDA_TOOLKIT_PATH/lib64/stubs"


This makes me think that you have kept the fix provided in #24 (see @Pratyush1991 comment here) so it's great new!

remicres · 2021-01-15T11:51:09Z

tools/docker/build-env-tf.sh

+#export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
+export PYTHON_BIN_PATH=$(which python)
+export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
+export TF_ENABLE_XLA=1


Great. Indeed XLA is required (see here)

daspk04 · 2021-01-18T12:49:00Z

Hello @vidlb ,

Thank you for this awesome work. I tried it on my WSL2 Windows. I built the docker with cuda 11 and XLA enabled (by default). The command that I used was as per the readme section.
docker build --network='host' -t otbtf:gpu-dev --build-arg BASE_IMG=nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 --build-arg KEEP_SRC_OTB=true .
I would like to confirm that it works perfectly fine on WSL2. I'm not sure why you got the error with Cuda 11 and XLA enabled during model serve and I could generate output as well.

vidlb · 2021-01-18T12:59:48Z

Hi @Pratyush1991
The error I had was only on Linux. I'm not even if sure XLA is supposed to work if running on UNIX.
It's possible the TF team hasn't finished to test and patch tf2.4 with XLA... may be we should try with 2.3.2 just to be sure.
But everything else works as usual, except if I try to enable XLA auto JIT.

If it runs on WSL2 this is great !

@remicres I will soon push my last commits with a few more improvements:

regarding the KEEP_SRC_... arguments + now always keep OTBTF sources in /src/otbtf .
removed bazel (now use bazelisk which is really handful : will get the right bazel version for us, depending on the tf release).
switched OTB and TF args to git tags by default in order to avoid bazel rebuild if someone committed (on the branch r2.4 in my case), and added git clone --single-branch to reduce repo size.

vidlb · 2021-01-18T15:28:31Z

@remicres I just realized bazel produced 3 differents libs : libtensorflow_cc.so + libtensorflow_cc.so.2 + libtensorflow_cc.so.2.4.0 . Same thing for libtensorflow_framework.
Those are not symbolic links, I checked with sha256sum and they are exactly the same files, this is awkward.
For the GPU build one lib is +400 MB so again it seems we could save about 800MB here !

EDIT: it was just a bad cp command (copy files instead of links), we need cp -P

daspk04 · 2021-01-18T16:05:32Z

@vidlb The issue that I have mentioned in #28 still open as I checked on the new build as well. I see someone has posted a fix but I'm not sure I can understand how to implement it. I know it's not related to this may be but do you think anything that can be done for now to fix it.
Related issues 1, 2 , 3, 4

remicres · 2021-01-19T15:02:53Z

@Pratyush1991 Thanks for your feedback! It is great news that this works fine on WSL2.

Indeed, #28 it is still open. I concede that me too, I didn't understand the fix they used in deepmind/reverb...

I did reproduce the issue using @vidlb docker images, but I think we should address this issue in a different MR. This issue isn't related to docker: we can reproduce it in any debian-like environment for instance.

vidlb · 2021-01-22T17:43:26Z

Finally !
@remicres you can merge these 42 commits =)

To resume last changes :

TF tag 2.4.1 + CUDA 11.0 with ubuntu20.04 (found it !)
Added arg NUMPY_SPEC="~=1.19"
BZL_CONFIG ===> BZL_CONFIGS
Removed BAZEL, use bazelisk
Use libtensorflow_framework.so.2 installed with the wheel + added cp -P for links copy
Added cuda compute capability 7.0 for Tesla V100 / TITAN V (cc 8.6 / RTX 3000 should drop with TF2.5)
Re-added -march=native, images may not run with old CPUs (need AVX2) - default build without MKL
Link python executables in /opt/otbtf/bin
Copy repo in /src/otbtf
More python import tests

Also I tried to keep bazel cached when KEEP_SRC_TF but impossible to to recycle it from inside the container. It builds again from scratch. So right now this argument will only save tensorflow repo (in my list called dev-all).
I tested on linux and everything should be good.

You could just pull, tag and push those images to dockerhub if you don't want to rebuild !

Single configurable Dockerfile with multi-stage build + external bazel cache

vidlb added 5 commits December 28, 2020 21:20

FIX: std::string --> tensorflow::tstring

daa8057

ENH: configurable multi-stage Dockerfile

3abcc18

ENH: Comments and typos

47c6dcd

FIX: Added missing newline

75370b2

FIX: typos in build command example

11ba09f

vidlb changed the title ~~Single configurable Dockerfile with multi-stage build and external bazel cache~~ Single configurable Dockerfile with multi-stage build + external bazel cache Dec 29, 2020

vidlb added 16 commits December 29, 2020 21:46

FIX: use BZL_CONFIG to enable MKL support

ce97289

ENH: typo in comment

9f3eb23

DOC: removed more typos

3cfbc80

DOC: cleaning comments

fc5e681

DOC: cleaning comments

5a748d0

FIX: replace gitlab by mdl4eo docker repository

7b0ce95

Merge branch 'develop' of github.com:LaTeleScop/otbtf into develop

7766148

ENH: removed verbose cp + added verbose rm src

d24a113

ENH: Removed verbose rm src

b1429e2

DOC: more comments about docker containers

9729ca8

DOC: docker bazel --config flags

a9db659

FIX: docker bazel --config flags

e8c3990

FIX: docker set env globally

cb4d85b

DOC: typo docker bazel empty config example

4ee790a

DOC: about monteverdi in docker

f66f9c2

DOC: comments - XLA seems broken (cublas 11)

59eca5d

vidlb added 2 commits January 12, 2021 16:27

ENH: use another dir for the bazel remote cache

eaa2c0b

FIX: Docker - added missing newline char

a5bb0ef

FIX: quotes problem in multibuild.sh

871567a

vidlb added 3 commits January 13, 2021 21:51

FIX: Docker - allow CUDA<10

5e1ba99

DOC: Docker example 'tf.test.is_gpu_available()'

ef7145c

DOC: typos in docker readme

51821e6

ENH: docker prebuild OTB

3c2e023

vidlb added 4 commits January 14, 2021 12:04

FIX: Docker - symlink external libraries

a7d120f

FIX: '-march=native' was removed from default conf

40713c0

FIX: Docker - let bazel deal with optimizations

ba45a5a

FIX: default build CPU without MKL (it is so slow)

15e3096

remicres approved these changes Jan 15, 2021

View reviewed changes

ENH: Docker - cuda cc_7.0 for Tesla V100 / TITAN V

715e9b4

ENH: Docker - git tags + KEEP_SRC + bazel outputs

b3dd139

remicres added docker Questions related to the use of otbtf docker images enhancement New feature or request labels Jan 19, 2021

vidlb added 3 commits January 21, 2021 18:24

ENH: Docker - add argument NUMPY_SPEC (version)

2092df1

FIX: Docker - duplicated libtensorflow_framework

5767b06

ENH: Docker - last comments + version bumps

c2731bc

remicres merged commit 702dfe2 into remicres:develop Jan 25, 2021

remicres mentioned this pull request Jan 25, 2021

Lighter docker images #36

Closed

remicres added a commit that referenced this pull request Nov 19, 2021

Merge pull request #41 from LaTeleScop/develop

aba017f

Single configurable Dockerfile with multi-stage build + external bazel cache

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Single configurable Dockerfile with multi-stage build + external bazel cache #41

Single configurable Dockerfile with multi-stage build + external bazel cache #41

vidlb commented Dec 29, 2020 •

edited

Loading

remicres commented Jan 11, 2021

vidlb commented Jan 12, 2021 •

edited

Loading

remicres commented Jan 12, 2021

vidlb commented Jan 12, 2021 •

edited

Loading

remicres commented Jan 12, 2021

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

remicres commented Jan 13, 2021

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

remicres Jan 15, 2021

remicres Jan 15, 2021

daspk04 commented Jan 18, 2021

vidlb commented Jan 18, 2021 •

edited

Loading

vidlb commented Jan 18, 2021 •

edited

Loading

daspk04 commented Jan 18, 2021

remicres commented Jan 19, 2021

vidlb commented Jan 22, 2021 •

edited

Loading

Single configurable Dockerfile with multi-stage build + external bazel cache #41

Single configurable Dockerfile with multi-stage build + external bazel cache #41

Conversation

vidlb commented Dec 29, 2020 • edited Loading

remicres commented Jan 11, 2021

vidlb commented Jan 12, 2021 • edited Loading

remicres commented Jan 12, 2021

vidlb commented Jan 12, 2021 • edited Loading

remicres commented Jan 12, 2021

vidlb commented Jan 13, 2021 • edited Loading

vidlb commented Jan 13, 2021 • edited Loading

remicres commented Jan 13, 2021

vidlb commented Jan 13, 2021 • edited Loading

vidlb commented Jan 13, 2021 • edited Loading

remicres Jan 15, 2021

Choose a reason for hiding this comment

remicres Jan 15, 2021

Choose a reason for hiding this comment

daspk04 commented Jan 18, 2021

vidlb commented Jan 18, 2021 • edited Loading

vidlb commented Jan 18, 2021 • edited Loading

daspk04 commented Jan 18, 2021

remicres commented Jan 19, 2021

vidlb commented Jan 22, 2021 • edited Loading

vidlb commented Dec 29, 2020 •

edited

Loading

vidlb commented Jan 12, 2021 •

edited

Loading

vidlb commented Jan 12, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 13, 2021 •

edited

Loading

vidlb commented Jan 18, 2021 •

edited

Loading

vidlb commented Jan 18, 2021 •

edited

Loading

vidlb commented Jan 22, 2021 •

edited

Loading