Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

change windows build system #16980

Merged
merged 6 commits into from
Jan 2, 2020
Merged

change windows build system #16980

merged 6 commits into from
Jan 2, 2020

Conversation

yajiedesign
Copy link
Contributor

Description

now, the huge image size causes many troubles.etc large image can't laod. some compile error.
now, the PIP build use compressed fatbin by hack nvcc,but it will cause a lot of memory,and general users have difficulty compiling on their own.
this pr change build system,it spilt different cuda arch to different dll.etc mxnet_30.mxnet_50.Then use a warpdll to automatically identify the gpu sm version. Load the corresponding dll. And forward all calls.
this pr also contains tools to automatically generate warp dll and will be executed automatically during compilation.

@yajiedesign yajiedesign requested a review from szha as a code owner December 5, 2019 08:03
@yajiedesign yajiedesign added Build CMake CMake related bugs/issues/improvements labels Dec 5, 2019
tools/gen_warp.cpp Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
@yajiedesign
Copy link
Contributor Author

every body,ci test is ok .

CMakeLists.txt Outdated Show resolved Hide resolved
@@ -115,6 +117,7 @@ class BuildFlavour(Enum):
'-DCUDA_ARCH_NAME=Manual '
'-DCUDA_ARCH_BIN=52 '
'-DCUDA_ARCH_PTX=52 '
'-DCUDA_ARCH_LIST=5.2 '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add CUDA_ARCH_LIST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first class cuda use CUDA_ARCH_LIST choose or set arch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it works currently. I got this warning when testing it:

Manually-specified variables were not used by the project:

    CUDA_ARCH_LIST

It is fixed as part of #17031

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently works fine on ci.The warning may be caused by other modifications.
image

Copy link
Contributor

@leezu leezu Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried it again. It only works when using the Makefile generator (or apparently the windows generator as well). But doesn't work when using the ninja generator.
It's quite strange.

Where do you expect the variable to be used? Only at https://github.com/apache/incubator-mxnet/blob/8645b9a4a220940bdccba3aeee577eb645a86b34/cmake/FirstClassLangCuda.cmake#L109 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yajiedesign yajiedesign added the pr-awaiting-review PR is waiting for code review label Dec 18, 2019
Copy link
Contributor

@leezu leezu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate on the motivation on this change?

now, the huge image size causes many troubles.etc large image can't laod. some compile error.

What do you mean?

now, the PIP build use compressed fatbin by hack nvcc

How would this change address the compressed fatbin? The compression was used to make the library size smaller. However, this PR will split the dll into multiple dlls. The total size that needs to be distributed on Pypi doesn't change or does it?

,but it will cause a lot of memory,and general users have difficulty compiling on their own.

What do you mean?

Thank you!

@yajiedesign
Copy link
Contributor Author

@leezu now imageSize is too large. Windows cannot load too large DLL.So we must try to reduce the imagesize size.
previously I reduced the size by modifying nvcc.But this method takes up too much memory when starting multiple processes.So I designed a new scheme

@yajiedesign
Copy link
Contributor Author

@leezu The pip size will be solved by other methods.

@leezu
Copy link
Contributor

leezu commented Dec 18, 2019

@yajiedesign do you mean on Windows 32 bit? Do we have any need to support 32 bit build?
I found the following reference https://software.intel.com/en-us/articles/memory-limits-applications-windows
If you have any other reference in mind, please mention it.

Unfortunately Microsoft doesn't seem to publish statistics about the number of users running Windows 32bit. But Steam does https://store.steampowered.com/hwsurvey/ and as of November 2019 less than 1% of users run Windows 7 32bit, Windows 10 32bit.

So I think building for Windows 64 bit may be an easier solution?

Regarding the pip size: Yes, I agree your PR doesn't address it. That's why I wonder why you cite pip size as motivation for this PR. You can edit the motivation and remove it, given that this PR is not related.

@leezu
Copy link
Contributor

leezu commented Dec 18, 2019

Regarding the memory size, in my understanding if you use multiple processes, that all load the mxnet dll, the memory among them will be shared.
Or is there some problem that prevents the memory reuse?

@yajiedesign
Copy link
Contributor Author

@leezu the previously reduced size method broken dll shared.

@yajiedesign
Copy link
Contributor Author

@leezu not 32 bit. 64bit dll are also image size limit.This is determined by PE file structure.Reference https://stackoverflow.com/questions/6976693/what-is-the-maximum-size-of-a-pe-file-on-64-bit-windows

@yajiedesign
Copy link
Contributor Author

yajiedesign commented Dec 18, 2019

@leezu Pip is mentioned here because the old method is use to PIP build,and solved pip pkg size problem.

@leezu
Copy link
Contributor

leezu commented Dec 19, 2019

@leezu not 32 bit. 64bit dll are also image size limit.This is determined by PE file structure.Reference https://stackoverflow.com/questions/6976693/what-is-the-maximum-size-of-a-pe-file-on-64-bit-windows

Thanks @yajiedesign. It seems one of the root causes is that Windows does not support position independent code and PE format must always use 32bit relative addressing.

The reason I ask you so many questions is that a) I'm concerned that not many people are familiar with Windows, so we need to document the reasoning that went into this patch b) fundamentally the problem you attempt to solve here also affects unix systems and it would be good to have a cross-platform solution. (Thanks to ELF superiority over Windows PE, I understand that the problem is more pressing on Windows and we may need to have a Windows specific intermediate solution at first)

What do you think?

Maybe @samskalicky's dynamic loading of operators effort can solve this in a cross-platform way?

@yajiedesign
Copy link
Contributor Author

Yes, it is urgent under windows. The current scheme is private and has many problems
This solution may be portable to Linux. But I don't know it is need under Linux.I'm not familiar with linux.
I think merge it first to make windows available, and then consider cross platform issues.

@leezu
Copy link
Contributor

leezu commented Dec 19, 2019

@yajiedesign can you provide a link for the current scheme? Or is the code not available at all? Where does it run?

Which scheme do you mean? Do you mean the code used to build the windows binary at https://pypi.org/project/mxnet-cu101/1.6.0b20190926/#files

How large is the binary without the current private scheme? What happens without the private scheme?

@yajiedesign
Copy link
Contributor Author

@leezu not current scheme link.it only on my build machine. It's complicated and dirty.
yes current scheme used to build pip pkg.
without current scheme .image size about 1.8gb.it can't load.

@leezu
Copy link
Contributor

leezu commented Dec 19, 2019

Thanks for providing more details.

I think it would be good to move the files introduced here from tools to tools/windows or tools/windowsbuild and add a README.md there which documents the motivation (ie. the information you provided during the discussion). Then it should be fine to merge.

You can also consider making the warping feature optional in the Windows build, as it adds overhead and is not needed apart from the pip build that needs to include all cuda arches.

@yajiedesign yajiedesign force-pushed the build_system branch 2 times, most recently from d7611b3 to ab1e279 Compare December 20, 2019 11:25
CMakeLists.txt Outdated Show resolved Hide resolved
@yajiedesign
Copy link
Contributor Author

ok.

@yajiedesign yajiedesign force-pushed the build_system branch 2 times, most recently from 93b4c16 to 5a31b19 Compare December 25, 2019 11:49
@yajiedesign
Copy link
Contributor Author

ci test ok


int find_version()
{
int known_sm[] = { 30,35,37,50,52,60,61,70,75 };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make that dynamic so we don't have to update it in the future?

@yajiedesign
Copy link
Contributor Author

@marcoabreu done

@yajiedesign
Copy link
Contributor Author

can we merge?

add gen_warp cpp version
add add_custom_command to run warp_gen
add download cmake
add option
change option
add dynamic read mxnet dll
Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, thanks!

@yajiedesign
Copy link
Contributor Author

ci test ok.

CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
@yajiedesign
Copy link
Contributor Author

ping

@leezu leezu merged commit ffeacfa into apache:master Jan 2, 2020
@ptrendx
Copy link
Member

ptrendx commented Jan 3, 2020

@yajiedesign Could you make a PR with backported changes needed for Windows build to 1.6 branch as you mentioned in the mail to dev@?

@yajiedesign
Copy link
Contributor Author

@ptrendx ok

yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request Jan 4, 2020
Introduce unified MXNET_CUDA_ARCH option to specify cuda architectures.
Previously cuda architecture setting was partially broken and different options
were applied to different parts of the build (CUDA_ARCH_NAME CUDA_ARCH_BIN
CUDA_ARCH_PTX and CUDA_ARCH_LIST).
Include FindCUDAToolkit from CMake 3.17, which replaces the deprecated FindCUDA
functionality for finding the cuda toolkit include directories and libraries.

Workaround for DLL size limitation on Windows (apache#16980)
* change windows build system.
add gen_warp cpp version
add add_custom_command to run warp_gen
add download cmake
add option
change option
add dynamic read mxnet dll
yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request Jan 5, 2020
Switch to modern CMake CUDA handling (apache#17031)

Introduce unified MXNET_CUDA_ARCH option to specify cuda architectures.
Previously cuda architecture setting was partially broken and different options
were applied to different parts of the build (CUDA_ARCH_NAME CUDA_ARCH_BIN
CUDA_ARCH_PTX and CUDA_ARCH_LIST).

Include FindCUDAToolkit from CMake 3.17, which replaces the deprecated FindCUDA
functionality for finding the cuda toolkit include directories and libraries.

Workaround for DLL size limitation on Windows (apache#16980)

* change windows build system.
add gen_warp cpp version
add add_custom_command to run warp_gen
add download cmake
add option
change option
add dynamic read mxnet dll
yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request Jan 5, 2020
…ranch

Fix CUDNN detection for CMake build (apache#17019)

Replace mxnet_option macro with standard CMAKE_DEPENDENT_OPTION (apache#17018)

Switch to modern CMake CUDA handling (apache#17031)

Introduce unified MXNET_CUDA_ARCH option to specify cuda architectures.
Previously cuda architecture setting was partially broken and different options
were applied to different parts of the build (CUDA_ARCH_NAME CUDA_ARCH_BIN
CUDA_ARCH_PTX and CUDA_ARCH_LIST).

Include FindCUDAToolkit from CMake 3.17, which replaces the deprecated FindCUDA
functionality for finding the cuda toolkit include directories and libraries.

Workaround for DLL size limitation on Windows (apache#16980)

* change windows build system.
add gen_warp cpp version
add add_custom_command to run warp_gen
add download cmake
add option
change option
add dynamic read mxnet dll
ptrendx pushed a commit that referenced this pull request Jan 5, 2020
Fix CUDNN detection for CMake build (#17019)

Replace mxnet_option macro with standard CMAKE_DEPENDENT_OPTION (#17018)

Switch to modern CMake CUDA handling (#17031)

Introduce unified MXNET_CUDA_ARCH option to specify cuda architectures.
Previously cuda architecture setting was partially broken and different options
were applied to different parts of the build (CUDA_ARCH_NAME CUDA_ARCH_BIN
CUDA_ARCH_PTX and CUDA_ARCH_LIST).

Include FindCUDAToolkit from CMake 3.17, which replaces the deprecated FindCUDA
functionality for finding the cuda toolkit include directories and libraries.

Workaround for DLL size limitation on Windows (#16980)

* change windows build system.
add gen_warp cpp version
add add_custom_command to run warp_gen
add download cmake
add option
change option
add dynamic read mxnet dll

Co-authored-by: Leonard Lausen <[email protected]>
@lordofgod
Copy link
Contributor

@yajiedesign

这个issue的修改可能会导致cpp_package的OpWrapperGenerator无法正常生成op.h文件, 错误码127。原因是warp_dll.cpp文件的119行HMODULE hm = LoadLibraryW(dll_name);使用了相对路径。调用OpWrapperGenerator时的dll搜索路径不是libmxnet.dll的所在目录,导致这段代码找不到依赖的dll库。这里是不是使用绝对路径更好一些。

This pr will cause the "OpwrapperGenerator.py" failed to generate "op.h" with error code 127, while building with cpp_package. The reason is use of relative dll path at "HMODULE hm = LoadLibraryW(dll_name);" in the new added file warp_dll.cpp, Line 119. "OpWrapperGenerator.py" 's path is different from libmxnet.dll, thus may cause the 127 error while calling the generator script, which infers that sub-dll's load failure in libmxnet.dll. If it's better to use absolute path here, or other strategy for solving this.

dll search path reference: https://docs.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Build CMake CMake related bugs/issues/improvements pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants