Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single configurable Dockerfile with multi-stage build + external bazel cache #41

Merged
merged 42 commits into from
Jan 25, 2021
Merged

Conversation

vidlb
Copy link
Contributor

@vidlb vidlb commented Dec 29, 2020

In response to issue #36 , this PR should considerably ease the docker build process, mostly thanks to build arguments and bazel remote cache. You can now build different docker images with custom base image, versions / branches using build arguments, the Dockerfile should be compatible with any branch. You can control env / flags and apt dependencies in external dedicated files.

Thanks to bazel remote cache you can build many docker images without having to rebuild TF (I believe you're going to LOVE it, bazel is such a nightmare !). Now you could just stop and restart a build without loosing progress !
You can choose either CPU or GPU with the only mandatory argument (BASE_IMG), cuda support is enabled if it is found in /usr/local/cuda-*.

I also added some little extras, like passwordless sudo, GUI-related packages option, CPU_RATIO arg to control number of jobs, args to keep TF and/or OTB sources, RTX 3000 support (CUDA compute capability 8.6) and more useful options - see the multibuild.sh script and README.md for available arguments and examples.
All files are now installed in /opt/otbtf (pip wheel as well, in /opt/otbtf/lib/python3.8).

Several changes were made in order to build with latest TF 2.4 and CUDA 11 drivers (ubuntu20.04 with python3.8, no more --noincompatible_do_not_split_linking_cmdline option, bazel and protobuf versions bumps), and since TF 2.3 (I believe) model path is tensorflow::tstring instead of std::string , so I made the fix in order to build, and ran some tests with my old CNN workflow, but I'm not sure this is enough, you may need to apply that fix elsewhere in the repo ? check this commit.
You can pull from our gitlab repo if you want to check it out !

Mission accomplished, it significantly reduced images size (in this example without GUI and OTB sources only for -dev) :
r1.7 is with the other, more complicated 'bruteforce' method.
Div by 2 for GPU, by 3 for CPU !
Screenshot_2020-12-30 Container Registry · La TeleScop Docker OTBTF

P.S.
Sorry for the mess with all those small commits ... I just push too easily and there were so many typos ^^

@vidlb vidlb changed the title Single configurable Dockerfile with multi-stage build and external bazel cache Single configurable Dockerfile with multi-stage build + external bazel cache Dec 29, 2020
@remicres
Copy link
Owner

Hi @vidlb I just saw you MR. It is just amazing ! Can't wait to test that tomorrow !

@vidlb
Copy link
Contributor Author

vidlb commented Jan 12, 2021

@remicres haaa finally =D
Hope it builds ! Please tell me how it goes... I didn't try with v2.2 or 2.3 but it should also work, with the right bazel version !

What I found really confusing is the protobuf version ! With v2.4, pip installed python protoc v3.14, but it seems I needed to build v3.9.2 in order to run TF... WTF ?

Also I hope the numpy version problem is not really one. This is the only thing that could require to modify the dockerfile, in order to set a maximum version when installing it with pip, if you want to build older TF branches.

But it won't change the fact that GDAL is installed with ubuntu's numpy (which is outdated), I guess it is a quite a mess if you try to import gdal, tf and otbApplication at the same time.... But it seems ok right now, or at least it worked for the last docker RUN statement.

@remicres
Copy link
Owner

GPU image built,uncompressed 15Gb instead of 22Gb, well done!
I am giving a try with the CPU image.
I guess when you push the image in the registry, it is compressed again?

@vidlb
Copy link
Contributor Author

vidlb commented Jan 12, 2021

Yes it is compressed just before upload.
You still need a lot of space in /var/lib/docker if you want to keep all layers, docker will store layers for each different build tag...
See docker system df to check how big it is. When finished there are several ways to clean it, just try docker builder prune -a I went with the more brutal docker system prune ...

Now the magic with remote cache is that you could run system prune without the need to rebuild tf !
BUT I noticed something: the slightest change in bazel env and config / tensorflow git tag ===> bazel may skip all of the cache (used for some downloads but none of the built files)...

@remicres
Copy link
Owner

Well this is a great contribution. Least I can say!
I would be dumb not to merge that in otbtf asap. You are still working on the MR, I am waiting your go. Also, I think we can remove the old docker files, completely obsolete now... Many thanks Vincent

@vidlb
Copy link
Contributor Author

vidlb commented Jan 13, 2021

Okay, I think I will push some last comments, and remove the dockerfiles dir !

I wasn't sure about that but I can confirm now, they (the TF dev team) did not build and test cuda11.1 yet... So 11.0 should be more stable for now.

I know it because I tried to build with TensorRT and faced this issue. I built with compute capabilities 8.6 but we can't be sure that TF is working with RTX 3000... Also it might explain why XLA isn't working. The CNN test I tried was OK though.

May be I should downgrade default version in the Dockerfile and multibuild script ?
But there was no ubuntu 20 docker image with both cuda11.0 and cudnn8.
So the other options are 11.0-cudnn8-devel-ubuntu18.04 and 10.2-cudnn7-devel-ubuntu18.04 for older branches.
There is a 11.0-devel-ubuntu20.04 but it would require to manually install cudnn with apt.

@vidlb
Copy link
Contributor Author

vidlb commented Jan 13, 2021

Now you can merge it ! I downgraded default versions, and was able to remove some unnecessary steps :

  • duplicated headers copy (now just one file)
  • the ./tensorflow/lite/tools/make/download_dependencies.sh and ./tensorflow/lite/tools/make/build_lib.sh lines which are probably related to tensorflow-lite ?
  • the protobuf build : I guess it is built by bazel, we don't need to take care of it.

With those changes, images are now smaller (~12gb uncompressed for gpu-dev, 5gb compressed) and also because cuda 11.0 img is lighter.

I tried with XLA but it causes the same error than with CUDA11.1.
Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms).

So it may be useless to build with XLA enabled. Or may be we're missing something in conf.

Also I made a little commit to your executable python files (added a shebang) in order to link it in PATH in the docker images, so we can call it directly from any directory (without the explicit python call and file path).

Always happy to read this :
[32,728 / 33,004] Linking tensorflow/python/_pywrap_tensorflow_internal.so [for host]
INFO: Elapsed time: 107.384s, Critical Path: 15.97s
INFO: 22299 processes: 22299 remote cache hit.

And
r2.1-cpu 869.88 MiB

@remicres
Copy link
Owner

I believe that there is an issue with WSL (windows) when XLA is not enabled (see #24).
Have checked how images work on Windows?

@vidlb
Copy link
Contributor Author

vidlb commented Jan 13, 2021

Nope, you want to check before merge ?

@vidlb
Copy link
Contributor Author

vidlb commented Jan 13, 2021

Ok you can wait before merge, there may be some other things to investigate, also for the cpu build. I'll let you know if it works !

export TF_NEED_CUDA=0
export CUDA_TOOLKIT_PATH=$(find /usr/local -maxdepth 1 -type d -name 'cuda-*')
if [ ! -z $CUDA_TOOLKIT_PATH ] ; then
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_TOOLKIT_PATH/lib64:$CUDA_TOOLKIT_PATH/lib64/stubs"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think that you have kept the fix provided in #24 (see @Pratyush1991 comment here) so it's great new!

#export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
export PYTHON_BIN_PATH=$(which python)
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export TF_ENABLE_XLA=1
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Indeed XLA is required (see here)

@daspk04
Copy link
Contributor

daspk04 commented Jan 18, 2021

Hello @vidlb ,

Thank you for this awesome work. I tried it on my WSL2 Windows. I built the docker with cuda 11 and XLA enabled (by default). The command that I used was as per the readme section.
docker build --network='host' -t otbtf:gpu-dev --build-arg BASE_IMG=nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 --build-arg KEEP_SRC_OTB=true .
I would like to confirm that it works perfectly fine on WSL2. I'm not sure why you got the error with Cuda 11 and XLA enabled during model serve and I could generate output as well.

@vidlb
Copy link
Contributor Author

vidlb commented Jan 18, 2021

Hi @Pratyush1991
The error I had was only on Linux. I'm not even if sure XLA is supposed to work if running on UNIX.
It's possible the TF team hasn't finished to test and patch tf2.4 with XLA... may be we should try with 2.3.2 just to be sure.
But everything else works as usual, except if I try to enable XLA auto JIT.

If it runs on WSL2 this is great !

@remicres I will soon push my last commits with a few more improvements:

  • regarding the KEEP_SRC_... arguments + now always keep OTBTF sources in /src/otbtf .
  • removed bazel (now use bazelisk which is really handful : will get the right bazel version for us, depending on the tf release).
  • switched OTB and TF args to git tags by default in order to avoid bazel rebuild if someone committed (on the branch r2.4 in my case), and added git clone --single-branch to reduce repo size.

@vidlb
Copy link
Contributor Author

vidlb commented Jan 18, 2021

@remicres I just realized bazel produced 3 differents libs : libtensorflow_cc.so + libtensorflow_cc.so.2 + libtensorflow_cc.so.2.4.0 . Same thing for libtensorflow_framework.
Those are not symbolic links, I checked with sha256sum and they are exactly the same files, this is awkward.
For the GPU build one lib is +400 MB so again it seems we could save about 800MB here !

EDIT: it was just a bad cp command (copy files instead of links), we need cp -P

@daspk04
Copy link
Contributor

daspk04 commented Jan 18, 2021

@vidlb The issue that I have mentioned in #28 still open as I checked on the new build as well. I see someone has posted a fix but I'm not sure I can understand how to implement it. I know it's not related to this may be but do you think anything that can be done for now to fix it.
Related issues 1, 2 , 3, 4

@remicres
Copy link
Owner

@Pratyush1991 Thanks for your feedback! It is great news that this works fine on WSL2.

Indeed, #28 it is still open. I concede that me too, I didn't understand the fix they used in deepmind/reverb...

I did reproduce the issue using @vidlb docker images, but I think we should address this issue in a different MR. This issue isn't related to docker: we can reproduce it in any debian-like environment for instance.

@remicres remicres added docker Questions related to the use of otbtf docker images enhancement New feature or request labels Jan 19, 2021
@vidlb
Copy link
Contributor Author

vidlb commented Jan 22, 2021

Finally !
@remicres you can merge these 42 commits =)

To resume last changes :

  • TF tag 2.4.1 + CUDA 11.0 with ubuntu20.04 (found it !)
  • Added arg NUMPY_SPEC="~=1.19"
  • BZL_CONFIG ===> BZL_CONFIGS
  • Removed BAZEL, use bazelisk
  • Use libtensorflow_framework.so.2 installed with the wheel + added cp -P for links copy
  • Added cuda compute capability 7.0 for Tesla V100 / TITAN V (cc 8.6 / RTX 3000 should drop with TF2.5)
  • Re-added -march=native, images may not run with old CPUs (need AVX2) - default build without MKL
  • Link python executables in /opt/otbtf/bin
  • Copy repo in /src/otbtf
  • More python import tests

Also I tried to keep bazel cached when KEEP_SRC_TF but impossible to to recycle it from inside the container. It builds again from scratch. So right now this argument will only save tensorflow repo (in my list called dev-all).
I tested on linux and everything should be good.

You could just pull, tag and push those images to dockerhub if you don't want to rebuild !

Screenshot_20210122_181412

@remicres remicres merged commit 702dfe2 into remicres:develop Jan 25, 2021
@remicres remicres mentioned this pull request Jan 25, 2021
remicres added a commit that referenced this pull request Nov 19, 2021
Single configurable Dockerfile with multi-stage build + external bazel cache
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docker Questions related to the use of otbtf docker images enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants