Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build OpenBLAS with MAX_STACK_ALLOC=2048 #10780

Closed
hiccup7 opened this issue Apr 9, 2015 · 40 comments
Closed

Build OpenBLAS with MAX_STACK_ALLOC=2048 #10780

hiccup7 opened this issue Apr 9, 2015 · 40 comments
Labels
building Build system, or building Julia or its dependencies upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@hiccup7
Copy link

hiccup7 commented Apr 9, 2015

Building OpenBLAS with MAX_STACK_ALLOC=2048 seems necessary to get proper multi-threading performance from gemv() and ger() functions. See OpenMathLib/OpenBLAS#532

Options I see:
a) Negotiate with OpenBLAS project to default to MAX_STACK_ALLOC=2048 if MAX_STACK_ALLOC is unspecified in the make options.
b) Update the Julia build environment to include MAX_STACK_ALLOC=2048 for OpenBLAS.

@ViralBShah
Copy link
Member

We can easily do that. @xianyi Is it possible to make this the default in OpenBLAS? In the meanwhile, is there any downside to use this in the Julia build.

If we end up doing this, we may also want to backport it.

@ViralBShah
Copy link
Member

@hiccup7 A possible workaround for you may be to just roll out your own matrix-vector multiplication implementation in pure julia, writing the doubly nested for loop. In some cases, that could get you the performance you need on small problems - until we sort this out.

@xianyi
Copy link

xianyi commented Apr 9, 2015

@ViralBShah , I can enable this flag as default.

@hiccup7
Copy link
Author

hiccup7 commented Apr 9, 2015

Are these next steps?

  1. OpenBLAS changes the default in the develop branch.
  2. OpenBLAS does a release
  3. Julia commits the new OpenBLAS to the master branch.
  4. The change is in the next Julia nightly build for v0.4-pre

@ViralBShah
Copy link
Member

Until then, we don't mind having this flag in the julia build.

ViralBShah added a commit that referenced this issue Apr 9, 2015
@ViralBShah ViralBShah added the building Build system, or building Julia or its dependencies label Apr 10, 2015
ViralBShah added a commit that referenced this issue Apr 10, 2015
@xianyi
Copy link

xianyi commented Apr 10, 2015

I found that this flag would cause a segfult with dgemv_t. I am working on
it.

Zhang Xianyi
2015年4月9日 下午2:38于 "Viral B. Shah" [email protected]写道:

Until then, we don't mind having this flag in the julia build.


Reply to this email directly or view it on GitHub
#10780 (comment).

@tkelman
Copy link
Contributor

tkelman commented Apr 10, 2015

Good to know, thanks for the warning. We should be able to upgrade to the latest tagged v0.2.14, though it looks like DESTDIR support may not work correctly with mingw prior to OpenMathLib/OpenBLAS@0ac787e? We recently switched to using DESTDIR for more dependencies with the "more atomic dependency install" change, but we might be handling this specially in deps/Makefile. Let's see.

@ViralBShah
Copy link
Member

@xianyi Can we have a 0.2.15 with this bugfix? Thanks.

@hiccup7
Copy link
Author

hiccup7 commented Apr 11, 2015

I see here that Julia v0.4 is estimated to be 2.3 months away: https://github.com/JuliaLang/julia/milestones
Will there be a Julia v0.3.8 release with this bugfix on the monthly bugfix schedule?

@tkelman
Copy link
Contributor

tkelman commented Apr 11, 2015

Potentially. We don't always backport dependency version bumps, but the blas/lapack bindings haven't changed too dramatically on the Julia side between 0.3 and 0.4. So if things get tested on master for at least a few days, preferably a week or so without anyone seeing any problems, then it should be okay to backport an openblas upgrade.

@ViralBShah
Copy link
Member

Note that max stack alloc cannot be used until the next release of OpenBLAS as discussed. After that, a few days of testing on master, and then 0.3.8.

@carlkl
Copy link

carlkl commented Apr 13, 2015

I recently hit bugs (segfaults) on haswell windows 64bit with MAX_STACK_ALLOC=2048 Openblas0-2-14 with inofficial numpy/scipy openblas builds. Increasing the stacksize didn't help.

@xianyi
Copy link

xianyi commented Apr 13, 2015

@carlkl , please try the develop branch. I just push the codes to fix this segfault bug.

@carlkl
Copy link

carlkl commented Apr 15, 2015

@xianyi , the development branch works now with MAX_STACK_ALLOC=2048 and the latest patches on haswell, thanks.

@ViralBShah
Copy link
Member

@carlkl Do all julia tests pass with this version? That would be useful input for the next openblas release.

@carlkl
Copy link

carlkl commented Apr 15, 2015

I accidently commented on this repo. Usually I test numpy, scipy, but I can
put the DLL's online if needed.

2015-04-15 12:38 GMT+02:00 Viral B. Shah [email protected]:

@carlkl https://github.com/carlkl Do all julia tests pass with this
version? That would be useful input for the next openblas release.


Reply to this email directly or view it on GitHub
#10780 (comment).

@ViralBShah ViralBShah added the upstream The issue is with an upstream dependency, e.g. LLVM label Apr 15, 2015
@hiccup7
Copy link
Author

hiccup7 commented Apr 15, 2015

@carlkl , I volunteer to do some Julia testing if someone kindly puts libopenblas.dll online. It would need to be a drop-in replacement for my current libopenblas.dll in Julia 0.3.7, where versioninfo() shows libopenblas as "USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell".

@carlkl
Copy link

carlkl commented Apr 16, 2015

latest openblas develop compiled with USE64BITINT:
https://bitbucket.org/carlkl/mingw-w64-for-python/downloads/openblas-0.2.14_USE64BITINT.zip

2015-04-15 22:17 GMT+02:00 Eric [email protected]:

@carlkl https://github.com/carlkl , I volunteer to do some Julia
testing if someone kindly puts libopenblas.dll online. It would need to be
a drop-in replacement for my current libopenblas.dll in Julia 0.3.7, where
versioninfo() shows libopenblas as "USE64BITINT DYNAMIC_ARCH NO_AFFINITY
Haswell".


Reply to this email directly or view it on GitHub
#10780 (comment).

@tkelman
Copy link
Contributor

tkelman commented Apr 16, 2015

Thanks Carl, though I don't think that'll be a drop-in replacement for Julia since we also rename the symbols when using 64-bit ints via SYMBOLSUFFIX. I can make a temporary test branch that uses openblas develop and build test binaries on our buildbot, that's not much trouble to do at all.

@tkelman
Copy link
Contributor

tkelman commented Apr 16, 2015

@hiccup7 (and anyone else who's got some spare disk space and cpu cycles) give http://julianightlies.s3.amazonaws.com/bin/winnt/x64/0.4/julia-0.4.0-9e5eda975c-win64.exe a try, checking Base.runtests() would be useful

@ViralBShah
Copy link
Member

Off-topic If there is an easy way to get a Windows VM and do stuff through vagrant, I don't mind becoming a slightly more regular windows tester.

@tkelman
Copy link
Contributor

tkelman commented Apr 16, 2015

You can build one through packer without too much hassle (been a while since I last did it), see staticfloat/julia-vagrant#2 (comment) for instructions and a Vagrantfile with provisioning script

@ihnorton
Copy link
Member

There are also images available at http://modern.ie. Only 32-bit, but maybe
that is a good thing since very few people are using 32-bit regularly.

On Thu, Apr 16, 2015 at 12:29 PM, Tony Kelman [email protected]
wrote:

You can build one through packer without too much trouble (been a while
since I last did it), see staticfloat/julia-vagrant#2 (comment)
staticfloat/julia-vagrant#2 (comment)
for instructions and a Vagrantfile with provisioning script


Reply to this email directly or view it on GitHub
#10780 (comment).

@pao
Copy link
Member

pao commented Apr 16, 2015

checking Base.runtests() would be useful

I eventually ran into #7942 but assuming all the linalg tests run as a block, WFM.

@hiccup7
Copy link
Author

hiccup7 commented Apr 16, 2015

@tkelman , are you sure you used the develop branch of OpenBLAS for your build? I extracted the libopenblas.dll file from exe archive file you provided and used it with Julia 0.3.7 (to ensure no other changes are tested). Using the small-vector test code from OpenMathLib/OpenBLAS#532, Julia runs about 4 times slower than before and allocates 2GB of memory. The test from OpenMathLib/OpenBLAS#530 runs about 15% faster and allocates 0 bytes. I expected a bigger improvement. versioninfo() shows the correct libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell).

@tkelman
Copy link
Contributor

tkelman commented Apr 17, 2015

9e5eda975c

@hiccup7
Copy link
Author

hiccup7 commented Apr 17, 2015

To clarify my last post, my test today with Julia v0.3.7 and the new libopenblas.dll file is about 4 times slower than with the released Julia v0.3.7. Good thing we tested before OpenBLAS did a release!
cc: @xianyi

@tkelman
Copy link
Contributor

tkelman commented Apr 17, 2015

Would also be good to compare to openblas from a standard 0.4-dev nightly which as of a few days ago will be openblas 0.2.14, to see whether the difference is between 0.2.13 vs 0.2.14, or 0.2.14 vs develop.

@hiccup7
Copy link
Author

hiccup7 commented Apr 18, 2015

Today, I downloaded a standard Julia v0.4-dev nightly, extracted libopenblas.dll v0.2.14, and used it with Julia v0.3.7 on Windows. It performs the same as the develop branch of OpenBLAS on the test from OpenMathLib/OpenBLAS#530, which is 15% faster than OpenBLAS v0.2.13 contained in original Julia v0.3.7. OpenBLAS v0.2.14 also performed the same as OpenBLAS v0.2.13 on the tests from OpenMathLib/OpenBLAS#532, which is 4 times faster than the develop branch of OpenBLAS.

The first conclusion I get from these tests is that OpenBLAS v0.2.14 on the Julia master branch is the best-working so far. I recommend keeping MAX_STACK_ALLOC=2048 in the OpenBLAS build options for now.

Using the small-vector test code from OpenMathLib/OpenBLAS#532, Julia allocates 2GB of memory with both v0.2.14 and the develop branch of OpenBLAS. Therefore, there is still a serious problem with GEMV. Also, the test in OpenMathLib/OpenBLAS#530 shows that develop branch of OpenBLAS is still 3 times slower than MKL BLAS, so it didn't fix the issue.

@xianyi , we need some further diagnosis of the problems with the develop branch of OpenBLAS.

@tkelman
Copy link
Contributor

tkelman commented Apr 19, 2015

We are not currently setting MAX_STACK_ALLOC=2048 anywhere. See #10783, which is not merged yet.

@hiccup7
Copy link
Author

hiccup7 commented Apr 19, 2015

See OpenMathLib/OpenBLAS@6c3a0b5
MAX_STACK_ALLOC=2048 should not be necessary with OpenBLAS v0.2.15 or the develop branch. It is necessary for OpenBLAS v0.2.14, which is currently used in the Julia master branch.

To unravel the multiple OpenBLAS problems, I suggest merging #10783 now. This will resolve some of the GEMV problems. Once in the standard Julia v0.4-dev nightly, I can extract the libopenblas.dll and retest with Julia v0.3.7. My results will help the OpenBLAS team understand exactly which problems are in the develop branch (after the v0.2.14 release).

@hiccup7
Copy link
Author

hiccup7 commented Apr 20, 2015

Dot products and GEMV() dominate my CPU usage. Being 3 and 6 times slower than MKL BLAS, respectively, these OpenBLAS bugs dominate my speed experience of Julia compared to Python using MKL BLAS. Note that MKL BLAS is available in at least 3 Python binary distributions -- for free. I predict that solving these OpenBLAS bugs will cause wider adoption of Julia.

@tkelman
Copy link
Contributor

tkelman commented Apr 20, 2015

Solving openblas bugs and improving it is a good thing in and of itself. "wider adoption of Julia" and comparisons to MKL are another matter entirely, and beyond the scope of the issue tracker.

Do you have evidence that merging #10783 and building openblas 0.2.14 with MAX_STACK_ALLOC=2048 will actually improve the performance of openblas? If not, we can build additional test binaries in various configurations to collect more data.

@hiccup7
Copy link
Author

hiccup7 commented Apr 20, 2015

The evidence:
OpenBLAS v0.2.14 release notes: https://github.com/xianyi/OpenBLAS/blob/develop/Changelog.txt
OpenBLAS v0.2.14 includes this commit: OpenMathLib/OpenBLAS#482
which creates the requirement for MAX_STACK_ALLOC=2048 to improve the performance.

@tkelman
Copy link
Contributor

tkelman commented Apr 20, 2015

By evidence, I meant measured performance improvements. Data, numbers.

Looking at the commit log, I see multiple changes that indicate MAX_STACK_ALLOC is likely to have bugs in the openblas code base as of 0.2.14. There's one way to find out, and we can always revert if this proves to be the case, but this really calls for more thorough evaluation IMO.

@hiccup7
Copy link
Author

hiccup7 commented Apr 20, 2015

If the Julia team provides a build, I volunteer to measure the performance (with the tests I did before).

ViralBShah added a commit that referenced this issue Apr 20, 2015
@tkelman
Copy link
Contributor

tkelman commented Apr 20, 2015

Okay, the commit a48cada is a rebased version of #10783 (so it builds openblas 0.2.14). Wait an hour or two for http://buildbot.e.ip.saba.us:8010/builders/package_win8.1-x64/builds/589 to finish, then check http://s3.amazonaws.com/julianightlies for a julia-0.4.0-a48cada4b0-win64.exe when it's done.

@carlkl
Copy link

carlkl commented Apr 21, 2015

see the latest comments on OpenMathLib/OpenBLAS#478
and OpenMathLib/OpenBLAS#482 and the latest openblas
develop: Enable stack alloc for s/dgemv_t.(revert 9798491)
OpenMathLib/OpenBLAS@847e19c

2015-04-20 19:44 GMT+02:00 Tony Kelman [email protected]:

Okay, the commit a48cada
a48cada
is a rebased version of #10783
#10783 (so it builds openblas
0.2.14). Wait an hour or two for
http://buildbot.e.ip.saba.us:8010/builders/package_win8.1-x64/builds/589
to finish, then check http://s3.amazonaws.com/julianightlies for a
julia-0.4.0-a48cada4b0-win64.exe when it's done.


Reply to this email directly or view it on GitHub
#10780 (comment).

@hiccup7
Copy link
Author

hiccup7 commented Apr 21, 2015

Today, I downloaded julia-0.4.0-a48cada4b0-win64.exe, extracted libopenblas.dll v0.2.14, and used it with Julia v0.3.7 on Windows. The tests at OpenMathLib/OpenBLAS#532 performed the same as v0.2.14 built without MAX_STACK_ALLOC=2048. The test at OpenMathLib/OpenBLAS#530 performed 15% slower than v0.2.14 built without MAX_STACK_ALLOC=2048. Therefore, it seems best to avoid building OpenBLAS v0.2.14 with MAX_STACK_ALLOC=2048.

Since one change has been committed and there is another change waiting to be committed to the OpenBLAS develop branch in the last day, how about if we wait for @xianyi to let us know when we should test the OpenBLAS develop branch next?

@tkelman
Copy link
Contributor

tkelman commented Apr 23, 2015

Yes I think that would be the most appropriate thing to do here. I'd like to tag Julia 0.3.8 within the next few days, after just a few more small backports. This can wait, but we appreciate the benchmarking of openblas while they're working on their next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
building Build system, or building Julia or its dependencies upstream The issue is with an upstream dependency, e.g. LLVM
Projects
None yet
Development

No branches or pull requests

7 participants