Single configurable Dockerfile with multi-stage build + external bazel cache #41
Conversation
Hi @vidlb I just saw your MR. It is just amazing! Can't wait to test it tomorrow!
@remicres haaa finally =D What I found really confusing is the protobuf version! With v2.4, pip installed python protoc v3.14, but it seems I needed to build v3.9.2 in order to run TF... WTF? Also I hope the numpy version problem is not really one. This is the only thing that could require modifying the dockerfile, in order to set a maximum version when installing it with pip, if you want to build older TF branches. But it won't change the fact that GDAL is installed with ubuntu's numpy (which is outdated); I guess it is quite a mess if you try to import gdal, tf and otbApplication at the same time... But it seems OK right now, or at least it worked for the last docker RUN statement.
GPU image built, uncompressed 15Gb instead of 22Gb, well done!
Yes, it is compressed just before upload. Now the magic with the remote cache is that you could run `docker system prune` without needing to rebuild TF (see the sketch below)!
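A hypothetical illustration of that workflow (the image tag and build context are placeholders, not the actual commands from this PR):

```bash
# Wipe local docker build cache and dangling images;
# the external bazel cache lives outside docker and survives this.
docker system prune -a
# On the next build, bazel fetches the compiled TF objects from the
# remote cache instead of recompiling everything from scratch.
docker build -t otbtf:gpu .
```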
Well, this is a great contribution. The least I can say!
Okay, I think I will push some last comments, and remove the dockerfiles dir! I wasn't sure about that but I can confirm now: they (the TF dev team) did not build and test cuda 11.1 yet... So 11.0 should be more stable for now. I know it because I tried to build with TensorRT and faced this issue. I built with compute capabilities 8.6 but we can't be sure that TF is working with RTX 3000... Also it might explain why XLA isn't working. The CNN test I tried was OK though. Maybe I should downgrade the default version in the Dockerfile and multibuild script?
Now you can merge it! I downgraded the default versions, and was able to remove some unnecessary steps:
With those changes, images are now smaller (~12gb uncompressed for gpu-dev, 5gb compressed), also because the cuda 11.0 image is lighter. I tried with XLA but it causes the same error as with CUDA 11.1, so it may be useless to build with XLA enabled. Or maybe we're missing something in the conf. Also I made a little commit to your executable python files (added a shebang) in order to link them in PATH in the docker images, so we can call them directly from any directory, without the explicit python call and file path (see the sketch below).
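A rough sketch of what the shebang change enables; the script name and paths here are illustrative, not the actual files touched by the commit:

```bash
# First line added to the executable python files:
#   #!/usr/bin/env python3
# Then, inside the image, a script only needs to be executable
# and reachable from PATH (e.g. via a symlink):
chmod +x /opt/otbtf/bin/some_script.py                       # illustrative path
ln -s /opt/otbtf/bin/some_script.py /usr/local/bin/some_script.py
some_script.py --help    # instead of: python /opt/otbtf/bin/some_script.py --help
```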
I believe that there is an issue with WSL (windows) when XLA is not enabled (see #24).
Nope. Do you want to check before merging?
Ok, you can wait before merging; there may be some other things to investigate, also for the cpu build. I'll let you know if it works!
```bash
export TF_NEED_CUDA=0
# Enable cuda support only if a toolkit is found in /usr/local
export CUDA_TOOLKIT_PATH=$(find /usr/local -maxdepth 1 -type d -name 'cuda-*')
if [ ! -z "$CUDA_TOOLKIT_PATH" ] ; then
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_TOOLKIT_PATH/lib64:$CUDA_TOOLKIT_PATH/lib64/stubs"
```
```bash
#export CC_OPT_FLAGS="-march=native -Wno-sign-compare"
export PYTHON_BIN_PATH=$(which python)
export PYTHON_LIB_PATH="$($PYTHON_BIN_PATH -c 'import site; print(site.getsitepackages()[0])')"
export TF_ENABLE_XLA=1
```
Great. Indeed, XLA is required (see here).
Hello @vidlb, thank you for this awesome work. I tried it on my WSL2 Windows. I built the docker image with cuda 11 and XLA enabled (by default). The command that I used was as per the readme section.
Hi @Pratyush1991, if it runs on WSL2 this is great! @remicres I will soon push my last commits with a few more improvements:
@remicres I just realized bazel produced 3 different libs: libtensorflow_cc.so + libtensorflow_cc.so.2 + libtensorflow_cc.so.2.4.0. Same thing for libtensorflow_framework. EDIT: it was just a bad
@vidlb The issue that I mentioned in #28 is still open, as I checked on the new build as well. I see someone has posted a fix but I'm not sure I understand how to implement it. I know it may not be related to this, but do you think anything can be done for now to fix it?
@Pratyush1991 Thanks for your feedback! It is great news that this works fine on WSL2. Indeed, #28 is still open. I concede that I didn't understand the fix they used in deepmind/reverb either... I did reproduce the issue using @vidlb's docker images, but I think we should address it in a different MR. This issue isn't related to docker: we can reproduce it in any debian-like environment, for instance.
Finally! To summarize the last changes:
Also I tried to keep the bazel cache when KEEP_SRC_TF is enabled, but it is impossible to recycle it from inside the container: it builds again from scratch. So right now this argument will only save the tensorflow repo (in my list called dev-all). You could just pull, tag and push those images to dockerhub if you don't want to rebuild (see the sketch below)!
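A sketch of that pull/tag/push flow; both registry paths are placeholders:

```bash
# Pull the already-built image from the source registry (hypothetical path),
# retag it for the destination hub, and push it there without rebuilding.
docker pull registry.example.com/otbtf:gpu-dev
docker tag registry.example.com/otbtf:gpu-dev yourhub/otbtf:gpu-dev
docker push yourhub/otbtf:gpu-dev
```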
Single configurable Dockerfile with multi-stage build + external bazel cache
In response to issue #36, this PR should considerably ease the docker build process, mostly thanks to build arguments and the bazel remote cache. You can now build different docker images with a custom base image and versions/branches using build arguments; the Dockerfile should be compatible with any branch. You can control env/flags and apt dependencies in external dedicated files.
Thanks to the bazel remote cache you can build many docker images without having to rebuild TF (I believe you're going to LOVE it, bazel is such a nightmare!). You can now just stop and restart a build without losing progress! A minimal sketch of how such a cache can be wired follows.
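For reference, these are standard bazel flags for pointing the cache at an external location, though the PR's exact wiring may differ:

```bash
# HTTP remote cache (e.g. a cache server reachable during docker build):
echo 'build --remote_cache=http://localhost:9090' >> /etc/bazel.bazelrc
# Or a shared directory used as a disk cache:
echo 'build --disk_cache=/opt/bazel-cache' >> /etc/bazel.bazelrc
```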
You can choose either CPU or GPU with the only mandatory argument (BASE_IMG); cuda support is enabled if a toolkit is found in /usr/local/cuda-*.
I also added some little extras, like passwordless sudo, a GUI-related packages option, a CPU_RATIO arg to control the number of jobs, args to keep TF and/or OTB sources, RTX 3000 support (CUDA compute capability 8.6) and more useful options; see the multibuild.sh script and README.md for available arguments, and the example invocations below.
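Illustrative build invocations: BASE_IMG, CPU_RATIO and KEEP_SRC_TF come from this PR, but the tags, values and context path are example assumptions; check multibuild.sh and README.md for the real usage:

```bash
# CPU-only image: any base without cuda in /usr/local
docker build --build-arg BASE_IMG=ubuntu:20.04 -t otbtf:cpu .

# GPU image: cuda support is auto-detected from the base image
docker build \
    --build-arg BASE_IMG=nvidia/cuda:11.0-cudnn8-devel-ubuntu20.04 \
    --build-arg CPU_RATIO=0.5 \
    --build-arg KEEP_SRC_TF=true \
    -t otbtf:gpu .
```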
All files are now installed in /opt/otbtf (the pip wheel as well, in /opt/otbtf/lib/python3.8).
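Assuming the wheel really lands under /opt/otbtf/lib/python3.8 as stated above, making it importable outside the default site-packages layout would look something like this (the exact subdirectory is an assumption):

```bash
# Illustrative; adjust to the actual layout inside the image.
export PYTHONPATH=/opt/otbtf/lib/python3.8/site-packages:$PYTHONPATH
python3 -c 'import tensorflow; print(tensorflow.__version__)'
```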
Several changes were made in order to build with the latest TF 2.4 and CUDA 11 drivers (ubuntu 20.04 with python 3.8, no more --noincompatible_do_not_split_linking_cmdline option, bazel and protobuf version bumps). Also, since TF 2.3 (I believe) the model path is a tensorflow::tstring instead of std::string, so I made the fix in order to build, and ran some tests with my old CNN workflow, but I'm not sure this is enough; you may need to apply that fix elsewhere in the repo? Check this commit.
You can pull from our gitlab repo if you want to check it out!
Mission accomplished: it significantly reduced image sizes (in this example without GUI, and with OTB sources only for -dev):
![Screenshot_2020-12-30 Container Registry · La TeleScop Docker OTBTF](https://user-images.githubusercontent.com/34066246/103374917-078d6e00-4ad9-11eb-9ddc-6a2bed0f3d33.png)
r1.7 was built with the other, more complicated 'bruteforce' method.
Sizes divided by 2 for GPU, by 3 for CPU!
P.S.
Sorry for the mess with all those small commits... I just push too easily, and there were so many typos ^^