TensorFlow with CUDA or Python might rebuild more than necessary instead of re-using bazel cache #16585

lissyx · 2018-01-30T14:16:49Z

Context: for DeepSpeech, we build TensorFlow and keep the cache in a tar archive that captures the whole home directory of the build user. We then untar it, and the DeepSpeech build, run through bazel build, picks up the cached items so that it does not rebuild anything.

Recently, build times on CUDA-enabled builds increased by about 2.5x. Debugging with Bazel showed that it was rebuilding because the actionKey computed for stream_executor_impl was different. After instrumenting Bazel to get more information, I traced the different actionKey to its cause: the ordering of the CUDA includes was different. The list itself had exactly the same content, just in a different order.
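To illustrate why ordering alone is enough to change a key, here is a minimal Python sketch. It is not Bazel's actual actionKey computation: `toy_action_key`, the command string, and the header paths are all hypothetical. The only assumption carried over from the report is that the key is a digest over the inputs in the order they are given.

```python
import hashlib


def toy_action_key(command, includes):
    """Digest a command and its include list, preserving list order.

    Hypothetical stand-in for an action key; not Bazel's real algorithm.
    """
    h = hashlib.sha256()
    h.update(command.encode("utf-8"))
    for path in includes:
        # A separator keeps ["ab", "c"] distinct from ["a", "bc"].
        h.update(b"\0")
        h.update(path.encode("utf-8"))
    return h.hexdigest()


# Same set of headers, listed in two different orders (made-up paths).
before_tar = ["cuda/include/cuda.h", "cuda/include/cublas.h"]
after_untar = ["cuda/include/cublas.h", "cuda/include/cuda.h"]

command = "compile stream_executor_impl"
key_a = toy_action_key(command, before_tar)
key_b = toy_action_key(command, after_untar)

# The keys differ although the content is identical...
keys_differ = key_a != key_b
# ...and agree again once both lists are sorted first.
keys_match_when_sorted = (
    toy_action_key(command, sorted(before_tar))
    == toy_action_key(command, sorted(after_untar))
)
```

Under these assumptions, any order-sensitive digest behaves the same way, which is consistent with the cache miss described above.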

Those includes are symlinks, and they are generated from a genrule. This is all taken care of by

tensorflow/third_party/gpus/cuda_configure.bzl

Lines 915 to 1035 in ba64f53

    
           def _create_local_cuda_repository(repository_ctx): 
        
             """Creates the repository containing files set up to build with CUDA.""" 
        
             cuda_config = _get_cuda_config(repository_ctx) 
        
             cudnn_header_dir = _find_cudnn_header_dir(repository_ctx, 
        
                                                       cuda_config.cudnn_install_basedir) 
        
             # Set up symbolic links for the cuda toolkit by creating genrules to do 
        
             # symlinking. We create one genrule for each directory we want to track under 
        
             # cuda_toolkit_path 
        
             cuda_toolkit_path = cuda_config.cuda_toolkit_path 
        
             cuda_include_path = cuda_toolkit_path + "/include" 
        
             genrules = [symlink_genrule_for_dir(repository_ctx, 
        
                 cuda_include_path, "cuda/include", "cuda-include")] 
        
             genrules.append(symlink_genrule_for_dir(repository_ctx, 
        
                 cuda_toolkit_path + "/nvvm", "cuda/nvvm", "cuda-nvvm")) 
        
             genrules.append(symlink_genrule_for_dir(repository_ctx, 
        
                 cuda_toolkit_path + "/extras/CUPTI/include", 
        
                 "cuda/extras/CUPTI/include", "cuda-extras")) 
        
             cuda_libs = _find_libs(repository_ctx, cuda_config) 
        
             cuda_lib_src = [] 
        
             cuda_lib_dest = [] 
        
             for lib in cuda_libs.values(): 
        
               cuda_lib_src.append(lib.path) 
        
               cuda_lib_dest.append("cuda/lib/" + lib.file_name) 
        
             genrules.append(symlink_genrule_for_dir(repository_ctx, None, "", "cuda-lib", 
        
                                                     cuda_lib_src, cuda_lib_dest)) 
        
             # Set up the symbolic links for cudnn if cndnn was not installed to 
        
             # CUDA_TOOLKIT_PATH. 
        
             included_files = _read_dir(repository_ctx, cuda_include_path).replace( 
        
                 cuda_include_path, '').splitlines() 
        
             if '/cudnn.h' not in included_files: 
        
               genrules.append(symlink_genrule_for_dir(repository_ctx, None, 
        
                   "cuda/include/", "cudnn-include", [cudnn_header_dir + "/cudnn.h"], 
        
                   ["cudnn.h"])) 
        
             else: 
        
               genrules.append( 
        
                       'filegroup(\n' + 
        
                       '    name = "cudnn-include",\n' + 
        
                       '    srcs = [],\n' + 
        
                       ')\n' 
        
                   ) 
        
             # Set up BUILD file for cuda/ 
        
             _tpl(repository_ctx, "cuda:build_defs.bzl", 
        
                  { 
        
                      "%{cuda_is_configured}": "True", 
        
                      "%{cuda_extra_copts}": _compute_cuda_extra_copts( 
        
                          repository_ctx, cuda_config.compute_capabilities), 
        
                  }) 
        
             _tpl(repository_ctx, "cuda:BUILD", 
        
                  { 
        
                      "%{cuda_driver_lib}": cuda_libs["cuda"].file_name, 
        
                      "%{cudart_static_lib}": cuda_libs["cudart_static"].file_name, 
        
                      "%{cudart_static_linkopt}": _cudart_static_linkopt( 
        
                          cuda_config.cpu_value), 
        
                      "%{cudart_lib}": cuda_libs["cudart"].file_name, 
        
                      "%{cublas_lib}": cuda_libs["cublas"].file_name, 
        
                      "%{cusolver_lib}": cuda_libs["cusolver"].file_name, 
        
                      "%{cudnn_lib}": cuda_libs["cudnn"].file_name, 
        
                      "%{cufft_lib}": cuda_libs["cufft"].file_name, 
        
                      "%{curand_lib}": cuda_libs["curand"].file_name, 
        
                      "%{cupti_lib}": cuda_libs["cupti"].file_name, 
        
                      "%{cuda_include_genrules}": "\n".join(genrules), 
        
                      "%{cuda_headers}": ('":cuda-include",\n' + 
        
                                          '        ":cudnn-include",') 
        
                  }) 
        
             is_cuda_clang = _use_cuda_clang(repository_ctx) 
        
             should_download_clang = is_cuda_clang and _flag_enabled( 
        
                 repository_ctx, _TF_DOWNLOAD_CLANG) 
        
             if should_download_clang: 
        
               download_clang(repository_ctx, "crosstool/extra_tools") 
        
             # Set up crosstool/ 
        
             cc = find_cc(repository_ctx) 
        
             cc_fullpath = cc if not should_download_clang else "crosstool/" + cc 
        
             host_compiler_includes = _host_compiler_includes(repository_ctx, cc_fullpath) 
        
             cuda_defines = { 
        
                      "%{cuda_include_path}": _cuda_include_path(repository_ctx, 
        
                                                                 cuda_config), 
        
                      "%{host_compiler_includes}": host_compiler_includes, 
        
                  } 
        
             if is_cuda_clang: 
        
               cuda_defines["%{clang_path}"] = cc 
        
               _tpl(repository_ctx, "crosstool:BUILD", {"%{linker_files}": ":empty"}) 
        
               _tpl(repository_ctx, "crosstool:CROSSTOOL_clang", cuda_defines, out="crosstool/CROSSTOOL") 
        
               repository_ctx.file("crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc", "") 
        
             else: 
        
               nvcc_path = str(repository_ctx.path("%s/bin/nvcc%s" % 
        
                   (cuda_config.cuda_toolkit_path, 
        
                   ".exe" if cuda_config.cpu_value == "Windows" else ""))) 
        
               _tpl(repository_ctx, "crosstool:BUILD", 
        
                    {"%{linker_files}": ":crosstool_wrapper_driver_is_not_gcc"}) 
        
               _tpl(repository_ctx, "crosstool:CROSSTOOL_nvcc", cuda_defines, out="crosstool/CROSSTOOL") 
        
               _tpl(repository_ctx, 
        
                    "crosstool:clang/bin/crosstool_wrapper_driver_is_not_gcc", 
        
                    { 
        
                        "%{cpu_compiler}": str(cc), 
        
                        "%{cuda_version}": cuda_config.cuda_version, 
        
                        "%{nvcc_path}": nvcc_path, 
        
                        "%{gcc_host_compiler_path}": str(cc), 
        
                        "%{cuda_compute_capabilities}": ", ".join( 
        
                            ["\"%s\"" % c for c in cuda_config.compute_capabilities]), 
        
                    }) 
        
             # Set up cuda_config.h, which is used by 
        
             # tensorflow/stream_executor/dso_loader.cc. 
        
             _tpl(repository_ctx, "cuda:cuda_config.h", 
        
                  { 
        
                      "%{cuda_version}": cuda_config.cuda_version, 
        
                      "%{cudnn_version}": cuda_config.cudnn_version, 
        
                      "%{cuda_compute_capabilities}": ",".join( 
        
                          ["CudaVersion(\"%s\")" % c 
        
                           for c in cuda_config.compute_capabilities]), 
        
                          "%{cuda_toolkit_path}": cuda_config.cuda_toolkit_path, 
        
                  }, "cuda/cuda/cuda_config.h")

which generates the shell scripts for the genrules that actually create the symlinks. Comparing those shell scripts showed exactly the same content in a different order.

Looking more closely, one can see that the headers are discovered by the _read_dir function:
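A quick way to confirm that two generated scripts differ only in ordering is to compare them both as sequences and as sorted sequences. The sketch below is a hypothetical helper, not part of TensorFlow or Bazel; the sample script text and paths are made up, and it assumes one command per line.

```python
def differs_only_by_order(script_a, script_b):
    """True when both scripts hold the same lines but in a different order."""
    lines_a = script_a.splitlines()
    lines_b = script_b.splitlines()
    return lines_a != lines_b and sorted(lines_a) == sorted(lines_b)


# Made-up examples of what two generated symlink scripts might look like.
script_before = "\n".join([
    'ln -s "/usr/local/cuda/include/cuda.h" "cuda/include/cuda.h"',
    'ln -s "/usr/local/cuda/include/cublas.h" "cuda/include/cublas.h"',
])
script_after = "\n".join([
    'ln -s "/usr/local/cuda/include/cublas.h" "cuda/include/cublas.h"',
    'ln -s "/usr/local/cuda/include/cuda.h" "cuda/include/cuda.h"',
])

only_order_changed = differs_only_by_order(script_before, script_after)
```

If the real scripts put several commands on one line, the same idea applies after splitting on the separator actually used.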

tensorflow/third_party/gpus/cuda_configure.bzl

Lines 891 to 894 in ba64f53

    
           find_result = _execute( 
        
               repository_ctx, ["find", src_dir, "-follow", "-type", "f"], 
        
               empty_stdout_fine=True) 
        
           result = find_result.stdout

It takes the output of find directly, so the result depends on the ordering returned by the readdir syscall.

In our case, the ordering on the filesystem before making the tar archive and after untarring it was different.

One simple fix is to sort the list of headers: this way the order is always the same and we no longer depend on whatever readdir returns.

In the past, Bazel would force the ordering of the elements considered when computing the actionKey. This was removed in 0.3.0, but it may have kept the issue hidden until then: bazelbuild/bazel@9dc3211
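As a rough Python analogue of that idea (the real change would live in the Starlark `_read_dir` helper, and this is not the patch itself), the sketch below collects regular files under a directory and returns them in sorted order. `read_dir_sorted` and the file names are hypothetical.

```python
import os
import tempfile


def read_dir_sorted(src_dir):
    """Return every regular file under src_dir, one path per line, sorted.

    Sorting removes any dependence on the order in which the filesystem
    happens to return directory entries.
    """
    found = []
    for root, _dirs, files in os.walk(src_dir, followlinks=True):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                found.append(path)
    return "\n".join(sorted(found))


# Create a few empty files in an arbitrary order and list them.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["cudnn.h", "cuda.h", "cublas.h"]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("")
    listing = read_dir_sorted(tmp).splitlines()
    names = [os.path.basename(p) for p in listing]
```

Whatever order the files were created in, or however the directory was restored from an archive, `names` comes back in the same lexicographic order.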

If one does try to re-use Bazel cache of a TensorFlow CUDA-enabled build, then it might happen that readdir() syscall behind the use of find in _read_dir() will generate a different ordering of the very same list of headers. This will make new genrules for symlinking the CUDA headers and in the end it will result in different actionKey computed by Bazel, hence invalidating the action cache. Fixes tensorflow#16585

reedwm · 2018-01-30T18:31:46Z

Marking as contributions welcome since there is a PR.

If one does try to re-use Bazel cache of a TensorFlow CUDA-enabled or Python-enabled build, then it might happen that readdir() syscall behind the use of find in _read_dir() will generate a different ordering of the very same list of headers. This will make new genrules for symlinking the CUDA headers and in the end it will result in different actionKey computed by Bazel, hence invalidating the action cache. Fixes tensorflow#16585

lissyx · 2018-02-01T10:24:22Z

Updating since I spotted similar code-path in third_party/py/python_configure.bzl.

…16586) If one does try to re-use Bazel cache of a TensorFlow CUDA-enabled or Python-enabled build, then it might happen that readdir() syscall behind the use of find in _read_dir() will generate a different ordering of the very same list of headers. This will make new genrules for symlinking the CUDA headers and in the end it will result in different actionKey computed by Bazel, hence invalidating the action cache. Fixes #16585

…ensorflow#16586) If one does try to re-use Bazel cache of a TensorFlow CUDA-enabled or Python-enabled build, then it might happen that readdir() syscall behind the use of find in _read_dir() will generate a different ordering of the very same list of headers. This will make new genrules for symlinking the CUDA headers and in the end it will result in different actionKey computed by Bazel, hence invalidating the action cache. Fixes tensorflow#16585

lissyx mentioned this issue Jan 30, 2018

Force sorting of CUDA/Python headers to avoid spurious rebuilds #16586

Merged

reedwm added the stat:contribution welcome Status - Contributions welcome label Jan 30, 2018

lissyx changed the title ~~TensorFlow CUDA-enabled might rebuilds more than necessary instead of re-using bazel cache~~ TensorFlow with CUDA or Python might rebuilds more than necessary instead of re-using bazel cache Feb 1, 2018

andrewharp closed this as completed in #16586 Feb 2, 2018

copybara-service bot mentioned this issue Sep 13, 2024

[TSL] Bump ml_dtypes. Add float8_e4m3, float8_e3m4 #75735

Merged

copybara-service bot mentioned this issue Sep 30, 2024

PR #16585: Add support for float8_e4m3 and float8_e3m4 types #76821

Merged
