Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suprisingly slow dsyrk on Mac #730

Closed
jakirkham opened this issue Jan 2, 2016 · 60 comments
Closed

Suprisingly slow dsyrk on Mac #730

jakirkham opened this issue Jan 2, 2016 · 60 comments

Comments

@jakirkham
Copy link
Contributor

Here is an example on Mac OS 10.9 on a Late 2013 MacBook Pro 15" with a 2.3 GHz Intel Core i7-4850HQ using OpenBLAS 0.2.15, Python 2.7.11, NumPy 1.10.2, SciPy 0.16.1. I have verified that NumPy and SciPy are properly linked to the version of OpenBLAS specified. OpenBLAS has built with the following flags make DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 NUM_THREADS=1 no other options or modifications were made.

>>> import numpy
>>> from scipy.linalg import blas
>>> a = numpy.ones((100,101), dtype=numpy.float32)
>>> %timeit blas.ssyrk(1, a)
10000 loops, best of 3: 37.8 µs per loop
>>> a = numpy.ones((100,101), dtype=numpy.float64)
>>> %timeit blas.dsyrk(1, a)
1000 loops, best of 3: 520 µs per loop

I wouldn't be surprised to see it takes twice as a long due to the fact that it is double versus single precision; however, taking over an order of magnitude longer seems to be a bit much.

Following the same build procedure on a Linux VM on the same machine (uses VirtualBox 5.0.12), I get a much more reasonable time for dsyrk (around double ssyrk). I have no idea whether this carries over to Windows or not. If someone is able to reproduce a similar example using C or Fortran, please share your steps.

After further discussion, we found this was array size dependent. Below is a graph showing this dependence. More details about how the graph was made can be found in this comment. ( #730 (comment) )

performance_of_openblas_syrk_subroutines

For comparison, one can look at the time taken by sgemm and dgemm, but one will not see this behavior.

performance_of_openblas_gemm_subroutines

@brada4
Copy link
Contributor

brada4 commented Jan 7, 2016

Are you sure result is from OpenBLAS and not Accelerate Framework running on (float64-impaired, HD-Iris) GPU?
Single line versions of test cases:

python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float64)' 'blas.dsyrk(1, a)'

I get dsyrk 1.5x slower on normal CPU (explainable by super small input matrix) and 3x slower on atom (explainable by using integer ALU for double precision)
What is your CPU?

@jakirkham
Copy link
Contributor Author

Yes, as previously stated, I am positive NumPy and SciPy are using OpenBLAS. Here are specs from SciPy if you don't believe me.

>>> import scipy
>>> scipy.show_config()
    lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE
>>> scipy.show_numpy_config()
lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_lapack_info:
    libraries = ['openblas']
    library_dirs = ['/zopt/conda/envs/nanshenv/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_mkl_info:
  NOT AVAILABLE

Though some people aren't satisfied with this information alone (I am sometimes one of those people).
If you look at the source code in scipy/linalg/blas.py, you will see that it imports and exposes Python bindings of all Fortran symbols found in scipy/linalg/_fblas.so [1], which is where these functions are coming from. If I check the linkage on scipy/linalg/_fblas.so, you can see the output below. In short, one generic system library that gets linked everywhere, OpenBLAS, and other compiler libraries. However, there is no Accelerate. Here @rpath is set to <my_prefix>/lib, which contains all of these.

$ otool -L <my_prefix>/lib/python2.7/site-packages/scipy/linalg/_fblas.so
<my_prefix>/lib/python2.7/site-packages/scipy/linalg/_fblas.so:
    @rpath/./libopenblas-r0.2.15.dylib (compatibility version 0.0.0, current version 0.0.0)
    @rpath/./libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
    @rpath/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
    @rpath/./libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)

@jakirkham
Copy link
Contributor Author

What is your CPU?

I realize that I didn't include a model number, which I have done in the description. So, it is 2.3 GHz Intel Core i7-4850HQ. In other words, a Haswell processor. When I built OpenBLAS, I built it for dynamic architecture though if that is any help.

@brada4
Copy link
Contributor

brada4 commented Jan 8, 2016

Must be something about macintosh. In Linux with i7-4790 I get same results as on sandy bridge server mentioned before.
Can you LD_PRELOAD openblas library and run tests again?

@jakirkham
Copy link
Contributor Author

I agree that there is something weird on Mac that doesn't carry over to Linux. Sure, here they are.

$ LD_PRELOAD="<my_prefix>/lib/libopenblas-r0.2.15.dylib" python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10000 loops, best of 3: 37.7 usec per loop
$ LD_PRELOAD="<my_prefix>/lib/libopenblas-r0.2.15.dylib" python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
1000 loops, best of 3: 516 usec per loop

In short, small differences (though they fluctuate back up to the values I listed before on reruns). Still over an order of magnitude difference.

@njsmith
Copy link

njsmith commented Jan 8, 2016

To further narrow down the issue you might want to check how consistent it is over a variety of array sizes, and try with and without threading (to disable threads set envvar OMP_NUM_THREADS=1). One thing I know is that OpenBLAS can show really weird behavior on "small" matrices (e.g. #731), and I'm not sure what counts as small.

@jakirkham
Copy link
Contributor Author

Threading is currently disabled, but I can play with enabling it. What are you thinking will happen when threads enter the picture?

As for size, the following all use square-ish configured matrices. I list the orders of magnitude of elements each matrix and how they do below.

Starting with smaller arrays (~10s to ~100s) appear to do about the same in terms of time for ssyrk and dsyrk. Arrays on the order of a ~1,000s of elements do even worse than previously seen (between 1 and 2 orders of magnitude slower). See an example below.

$ LD_PRELOAD="<my_prefix>/lib/libopenblas-r0.2.15.dylib" python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((35,36), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
100000 loops, best of 3: 5.8 usec per loop
$ LD_PRELOAD="<my_prefix>/lib/libopenblas-r0.2.15.dylib" python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((35,36), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10000 loops, best of 3: 113 usec per loop

Arrays of the range previously tried (~10,000s of elements) are over 1 order of magnitude slower as was shown before. By ~100,000s of elements, it is under an order of magnitude slower. Finally, when reaching ~1,000,000s of elements it is around the expected to 2x slower.


**TL;DR:** Behavior is really bad for matrices with between ~100s - ~1,000,000s of elements. It is most notable around ~1,000s of elements.

@njsmith
Copy link

njsmith commented Jan 8, 2016

If threads are disabled then it's not the same pathological corner case as in #731. I suspect that pattern of slowing down at a particular range of sizes may mean something to someone who knows openblas internals, but that's not me :-).

@jakirkham
Copy link
Contributor Author

Ah, ok, yeah I don't think I have that case, but that does look nasty.

I am almost inclined to think that this current case has something to do with how the cache is working. That being said, I used the exact same machine (with VirtualBox) for the Linux profiling and don't see the same problem.

@brada4
Copy link
Contributor

brada4 commented Jan 9, 2016

There is some small but measurable effect in your case from sub-optimal threading configuration on very small data, but it does not affect dsyrk 20x more than ssyrk.
May I ask you to re-try with OpenBLAS that you build yourself - 0.2.15 or trunk?
You can use OSX-s subversion if you dont have git at hand.

Measurements (Linux, AMD A10-5750M)
100

$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10000 loops, best of 3: 84.5 usec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10000 loops, best of 3: 70.8 usec per loop
$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10000 loops, best of 3: 115 usec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100,101), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10000 loops, best of 3: 103 usec per loop

10

$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((10,11), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
100000 loops, best of 3: 2.63 usec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((10,11), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
100000 loops, best of 3: 2.34 usec per loop
$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((10,11), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
100000 loops, best of 3: 2.99 usec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((10,11), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
100000 loops, best of 3: 2.7 usec per loop

1000

$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10 loops, best of 3: 27 msec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10 loops, best of 3: 38.1 msec per loop
$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10 loops, best of 3: 43.1 msec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10 loops, best of 3: 64.7 msec per loop

@jakirkham
Copy link
Contributor Author

The copy I used is one I built myself using 0.2.15 with the following flags DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 NUM_THREADS=1.

I am a little confused as to the statement that a suboptimal threading configuration is causing me problems as I am keeping threading off if at all possible. In fact, I ran the case for 1000, which seemed to have the biggest difference for you (in terms of threading) and saw little to no effect. See below.

$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
100 loops, best of 3: 17.7 msec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
100 loops, best of 3: 17.8 msec per loop
$ python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10 loops, best of 3: 41.2 msec per loop
$ OPENBLAS_NUM_THREADS=1 python -m timeit -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((1000,1001), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10 loops, best of 3: 40.7 msec per loop

If I'm doing terribly wrong with this build, please let me know.

@brada4
Copy link
Contributor

brada4 commented Jan 12, 2016

Obviously single threaded version runs single threaded no matter what variables you set.
For single machine with always same CPU there is no need for DYNAMIC_ARCH or NUM_THREADS, they are automatically set for your CPU. You can switch to 1 CPU core with _NUM_THREADS variables if/when needed (at least you have option to compare).
The whole thing with threading variable is just to confirm that this issue is not related to #731

Your build is perfectly fine. Please ask ones who equipped you with initial binary openblas to build it well just like you did moments ago.

@jakirkham
Copy link
Contributor Author

For single machine with always same CPU there is no need for DYNAMIC_ARCH or NUM_THREADS, they are automatically set for your CPU.

The reason I use dynamic architecture is I do support a variety of laptop and desktop machines with different architecture. Also, I use a cluster that has some older and some newer architectures depending on the node. This ends up being the easiest solution as opposed to pinning the build to the lowest common denominator as I get the right balance between support and performance. Maybe I am misunderstanding you. Is this a default build option?

As for setting the number of threads, often parallelism is dealt with at a higher level in my code. So, I normally want the BLAS to be single threaded. However, I may re-explore building it to run multithreaded for different applications. Is there some way to make it default to 1 thread if some _NUM_THREADS parameter is not set?

Your build is perfectly fine. Please ask ones who equipped you with initial binary openblas to build it well just like you did moments ago.

I always hand build OpenBLAS. Good to hear I am not way off base :)

@jakirkham
Copy link
Contributor Author

The only other thing I can think of is I am using clang on Mac. Specs below.

Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix

@brada4
Copy link
Contributor

brada4 commented Jan 12, 2016

In your initial posting dsyrk was 50x slower than ssyrk. now it is 2x slower, i.e perfectly fine.
Problem is gone, you fixed it and nobody knows how.

4 results in 100-group with openblas built with linux CC=clang 3.5 ie approx xcode6
77.6 usec 59.2 usec 112 usec 97.2 usec per loop

@jakirkham
Copy link
Contributor Author

In your initial posting dsyrk was 50x slower than ssyrk. now it is 2x slower, i.e perfectly fine.
Problem is gone, you fixed it and nobody knows how.

@brada4, I think you may be confused. This is ok as this problem isn't a clear "everything is slow" problem, but there seems to be a regime where it gets very slow and transitions at the edges where normal behavior is resumed.

The last example is at the upper end of the regime where it becomes normal again. In other words, I used arrays that had these dimensions ~103 * ~103 so this many elements ~106. This is what I said before.

Finally, when reaching ~1,000,000s of elements it is around the expected to 2x slower.

In short, this does not deviate from what I said before. However, it is surprising as the problem seems to be size based. This is why I was wondering if the problem was cache size dependent. However, the fact that I don't experience this on a Linux VM that is not using hardware visualization suggests this is probably not the cause (though I could be wrong).

The worst parts of the range is arrays with ~1,000 elements. See an example below.

$ python -m timeit -n 10000 -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((33, 34), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10000 loops, best of 3: 5.07 usec per loop
$ python -m timeit -n 10000 -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((33, 34), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10000 loops, best of 3: 103 usec per loop

We have now demonstrated the dramatic effect that I saw before.

Arrays on the order of a ~1,000s of elements do even worse than previously seen (between 1 and 2 orders of magnitude slower).

Another bad point in this range is ~10,000 element arrays. See below.

$ python -m timeit -n 10000 -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100, 101), dtype=numpy.float32)' 'blas.ssyrk(1, a)'
10000 loops, best of 3: 37.7 usec per loop
$ python -m timeit -n 10000 -s 'import numpy;from scipy.linalg import blas;a = numpy.ones((100, 101), dtype=numpy.float64)' 'blas.dsyrk(1, a)'
10000 loops, best of 3: 524 usec per loop

This matches again with what we have seen previously.

Arrays of the range previously tried (~10,000s of elements) are over 1 order of magnitude slower

So, no, this is not magically fixed. We merely did a second run at the transition back to normal behavior. Unsurprisingly, the same behavior that was seen before repeated.

I think I will try to come up with a figure so that this behavior is a little clearer. It should also elucidate any other weird peaks and valleys to help us identify what the possible causes might be.

@jakirkham
Copy link
Contributor Author

Below is a graph comparing dsyrk vs ssyrk. The x-axis is the number of elements in matrices of the form N x (N+1) that where N ranged from 10 to 1000 by units of 1. The y-axis is the time spent computing dsyrk divided by the time computing ssyrk so optimal performance is 2. Time was measured using time.time from the Python standard library. Every point is the average of 100 runs for both dsyrk and ssyrk.

A log-log graph was used here to make the differences more clear. Note that while performance has improved for matrices with 106 elements it is still not fully recovered. Small arrays seem to do about the same with either function, but it is hard to say as the signal to noise is lower there. This hopefully does a better job of elucidating the problematic region.

performance_of_openblas_syrk_subroutines

Here is a gist with the Jupyter notebook used. ( https://gist.github.com/jakirkham/17cc1481672a49cadb33 )

@brada4
Copy link
Contributor

brada4 commented Jan 14, 2016

It just says you have very old compiler that does not play well with 20x patched threading library.

@jakirkham
Copy link
Contributor Author

Though, this is built to be single threaded only. Please help me to understand why the threading library matters at all here.

@jakirkham
Copy link
Contributor Author

Also, here is the compiler information.

$ clang --version
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix

@brada4
Copy link
Contributor

brada4 commented Jan 14, 2016

Xcode release history at https://trac.macports.org/wiki/XcodeVersionInfo indicates that Xcode 5.0.0 is not really for OSX 10.9
I would speculate that it takes some legacy compatibility code path in system libraries detecting a program built with old compiler.
Also if you dont have time to fix your compiler I suggest reading "man 7 accelerate" for quick way around compiler issues that prevent openblas from working normally.

@jakirkham
Copy link
Contributor Author

Sorry, I should have given you the XCode version information.

Version 5.0.2 (5A3005)

This is the second version of XCode that was for 10.9.

@brada4
Copy link
Contributor

brada4 commented Jan 14, 2016

Can you try AVX(1) openblas with prepending OPENBLAS_CORETYPE=Sandybridge to command lines? And Prescott? Another thing to play around is malloc debug options. And what about virtual machine with newer compilers?
Do you have a backtrack as to when problem appeared? Was it there with 0.2.14/13? With OSX 10.8 etc?

@njsmith
Copy link

njsmith commented Jan 14, 2016

if you have the capability to get a quick profile then that sometimes will
immediately make things obvious ("it's spending 90% of the time doing what?
it should be spending 0% of the time doing that"). Not always, but
sometimes :-)

On Thu, Jan 14, 2016 at 2:51 PM, brada4 [email protected] wrote:

Can you try AVX(1) openblas with prepending OPENBLAS_CORETYPE=Sandybridge
to command lines? And Prescott? Another thing to play around is malloc
debug options. And what about virtual machine with newer compilers?
Do you have a backtrack as to when problem appeared? Was it there with
0.2.14/13? With OSX 10.8 etc?


Reply to this email directly or view it on GitHub
#730 (comment).

Nathaniel J. Smith -- http://vorpus.org

@brada4
Copy link
Contributor

brada4 commented Jan 17, 2016

Whats in the makefile.rule ans whats in the end report of OpenBLAS build?
I canot hit anything related to #731 with genuine linux clang build.
Graph looks very similar to other issue.
Alternatively you may try blindly applying patch which adds n<4096 in one place.

@jakirkham
Copy link
Contributor Author

This is simply a straight download of the v0.2.15 tarball on GitHub. Nothing is patched. Only the aforementioned flags are provided.

@jakirkham
Copy link
Contributor Author

Also, I don't think I provided the fortran compiler information, but that is gfortran 4.8.2.

@brada4
Copy link
Contributor

brada4 commented Jan 17, 2016

Can you provide compile flags and compile report for regression case? I dont find them anywhere in the thread.

@jakirkham
Copy link
Contributor Author

Thanks @njsmith.

I went and ran the same code used to generate the graph from before ( #730 (comment) ) with OPENBLAS_CORETYPE=SANDYBRIDGE and got the same curve as before. It showed the same dscal_kernel_8_zero function when running through instrumentation. Before, I could see references to dscal_k_HASWELL that was calling dscal_kernel_8_zero. Now, I see references to dscal_k_SANDYBRIDGE calling dscal_kernel_8_zero. So, it seems like it is recognizing this environment variable and following the correct path.

@brada4
Copy link
Contributor

brada4 commented Jan 17, 2016

Virtualbox is AVX, not AVX2, thats why it exhibits little problem
3 steps above you see disabling multithread dsyrk for small samples - does it improve your graphs already.

Here to change haswell kernel to sandybridge (if single thread is of no help)

kernel/x86_64/dscal.c
#elif defined(SANDYBRIDGE)
#include "dscal_microk_sandy-2.c"
#elif defined(HASWELL)
-- #include "dscal_microk_haswell-2.c"
++ // #include "dscal_microk_haswell-2.c"
++ #include "dscal_microk_sandy-2.c"
#endif

@jakirkham
Copy link
Contributor Author

I tried setting OPENBLAS_CORETYPE=NEHALEM with the exact same build as before and ran the same comparison ( #730 (comment) ). I also tried running the simple program for profiling above ( #730 (comment) ) and verified that this is using the Nehalem variant. As one can see below, the problem is basically gone.

openblas_nehalem

It appears to peak at 2x slower as opposed to overshooting for some range and slowly returning. Given that Nehalem did not support AVX instructions, it seems like an interesting test case to rebuild with different forms of AVX instructions disabled and see if the problem persists or not.

@jakirkham
Copy link
Contributor Author

Related information on AVX and AVX2 instructions on VirtualBox. ( https://www.virtualbox.org/ticket/14262 ) At present, it appears AVX2 is still disabled on all hosts, but AVX is allowed on host/guest combinations that both have 64-bit. The latter is my case. As I have default settings for the VirtualBox VM, it appears AVX instructions should be used in my Linux test case, but not AVX2 instructions. As the Sandy Bridge case on Mac saw the same behavior and Sandy Bridge only supported AVX, this suggests the AVX instruction on Mac, in particular, run into some problems, but we still can't confirm that AVX2 instructions don't cause the problem.

@martin-frbg
Copy link
Collaborator

It occurs to me that you could try leaving CORETYPE as Haswell, but changing the
#elif defined(HASWELL)
at the top of kernel/x86_64/dscal.c to
#elif defined(HASWELL) && !defined(OS_DARWIN)
so that it uses the C fallback code for dscal_kernel_8 and dscal_kernel_8_zero on Mac, which would seem to isolate any remaining performance problem to dscal_kernel_inc_8 in that file.
(But then I am not an OpenBLAS developer either).
Lastly, the easy workaround to get Haswell performance overall, but use the Nehalem codepath (likely plain netlib) for dscal, would appear to be to comment or remove the line reading DSCALKERNEL = dscal.c at the top of kernel/x86_64/KERNEL.HASWELL

@jakirkham
Copy link
Contributor Author

So, I went ahead and tried rebuilding OpenBLAS with all the same flags as before and one additional flag NO_AVX=1. However, the AVX2 instruction set may still be in use (unless NO_AVX=1 implies NO_AVX2=1). When running the profiler, I can see that NEHALEM shows up even though it is not being specified by the environment. So, this may have the same effect as specifying the architecture. In any event, this appears to also remove the problem (with the exception of some weird asymptotic behavior at one point).

performance_of_openblas_syrk_subroutines_no_avx

I am not sure if this is related, but I saw this posting on SO ( http://stackoverflow.com/q/7839925 ). If SSE and AVX instructions are being mixed here, it might explain why not using AVX instructions has a positive effect (or it may be due to simply using the Nehalem variant underneath).

@jakirkham
Copy link
Contributor Author

Also, based on a few people's suggestions, I tried patching the dscal_kernel_8_zero function. It was patched using a C version of dscal_kernel_8_zero. I reran the dsyrk/syrk comparison and got the following result. In short, this appears to be mostly fixed, but there still is some more time spent in dsyrk in a few places than probably should be the case (albeit not much).

performance_of_openblas_syrk_subroutines_dscal_kernel_8_zero_patch

Here is the patch I applied against v0.2.15.

diff --git kernel/x86_64/dscal_microk_haswell-2.c kernel/x86_64/dscal_microk_haswell-2.c
index 07a9c80..e7df0fc 100644
--- kernel/x86_64/dscal_microk_haswell-2.c
+++ kernel/x86_64/dscal_microk_haswell-2.c
@@ -140,67 +140,23 @@ static void dscal_kernel_8( BLASLONG n, FLOAT *alpha, FLOAT *x)

 static void dscal_kernel_8_zero( BLASLONG n, FLOAT *alpha, FLOAT *x) __attribute__ ((noinline));

-static void dscal_kernel_8_zero( BLASLONG n, FLOAT *alpha, FLOAT *x)
+static void dscal_kernel_8_zero( BLASLONG n, FLOAT *alpha , FLOAT *x )
 {

-
-   BLASLONG n1 = n >> 4 ;
-   BLASLONG n2 = n & 8 ;
-
-   __asm__  __volatile__
-   (
-   "vxorpd     %%xmm0, %%xmm0 , %%xmm0         \n\t"  
-
-   "addq   $128, %1                    \n\t"
-
-   "cmpq   $0, %0                      \n\t"
-   "je 2f                      \n\t" 
-
-   ".align 16                          \n\t"
-   "1:                                 \n\t"
-
-   "vmovups    %%xmm0  ,-128(%1)           \n\t"
-   "vmovups    %%xmm0  ,-112(%1)           \n\t"
-   "vmovups    %%xmm0  , -96(%1)           \n\t"
-   "vmovups    %%xmm0  , -80(%1)           \n\t"
-
-   "vmovups    %%xmm0  , -64(%1)           \n\t"
-   "vmovups    %%xmm0  , -48(%1)           \n\t"
-   "vmovups    %%xmm0  , -32(%1)           \n\t"
-   "vmovups    %%xmm0  , -16(%1)           \n\t"
-
-   "addq       $128, %1                \n\t"
-   "subq           $1 , %0                     \n\t"       
-   "jnz        1b                          \n\t"
-
-   "2:                                 \n\t"
-
-   "cmpq   $8  ,%3                     \n\t"
-   "jne    4f                      \n\t"
-
-   "vmovups    %%xmm0  ,-128(%1)           \n\t"
-   "vmovups    %%xmm0  ,-112(%1)           \n\t"
-   "vmovups    %%xmm0  , -96(%1)           \n\t"
-   "vmovups    %%xmm0  , -80(%1)           \n\t"
-
-   "4:                         \n\t"
-
-   "vzeroupper                     \n\t"
-
-   :
-        : 
-     "r" (n1),     // 0
-          "r" (x),      // 1
-          "r" (alpha),  // 2
-     "r" (n2)      // 3
-   : "cc", 
-     "%xmm0", "%xmm1", "%xmm2", "%xmm3", 
-     "%xmm4", "%xmm5", "%xmm6", "%xmm7", 
-     "%xmm8", "%xmm9", "%xmm10", "%xmm11", 
-     "%xmm12", "%xmm13", "%xmm14", "%xmm15",
-     "memory"
-   );
-
-} 
+   BLASLONG i;
+   for( i=0; i<n; i+=8 )
+   {
+       x[0] = 0.0; 
+       x[1] = 0.0; 
+       x[2] = 0.0; 
+       x[3] = 0.0; 
+       x[4] = 0.0; 
+       x[5] = 0.0; 
+       x[6] = 0.0; 
+       x[7] = 0.0; 
+       x+=8;
+   }
+
+}

@martin-frbg
Copy link
Collaborator

Might be interesting to see the assembly that clang creates from that C loop to find out why it beats
wernsaar's hand-optimized code (which I am sure he will have benchmarked before committing) on your specific platform. At the very least that would probably tell if AVX as such is the problem.

@jakirkham
Copy link
Contributor Author

I looked briefly at the final OpenBLAS dynamic library (built with the C code patch above) using a trial copy of a disassembler and there were no AVX instructions. Though I don't know if you would need to coerce clang to vectorize this. That being said, I don't think AVX alone is the problem. After all, I am able to get reasonable performance out of ssyrk, which is going through these same instructions, if I am not mistaken. Also, I believe sgemm and dgemm use AVX instructions as well, but both have reasonable performance.

@jakirkham
Copy link
Contributor Author

After some brief testing on a Linux cluster (as opposed to a VM due to AVX pass through issues), it doesn't appear to suffer too badly. I tried a few brief tests on Haswell and Sandybridge architecture nodes that were available. However, I am not clear on the exact make of the chips they are using, but this could also be responsible for some differences. Below is a graph generated from a Haswell core ( 2.3GHz Intel Haswell E5-2698 ) running CentOS 6.5. The build has the same parameters listed in the description with no other modifications.

performance_of_openblas_syrk_subroutines_centos_6_5

@jakirkham
Copy link
Contributor Author

Also added a dgemm vs. sgemm graph to the description for comparison. This was profiled in essentially the same way as dsyrk vs. ssyrk except it used *gemm instructions. Code for this profiling is shown in the gist ( https://gist.github.com/jakirkham/17cc1481672a49cadb33 ).

@jakirkham
Copy link
Contributor Author

Finally, given current master ( 53e849f ) and develop ( 692d9c8 ), I can see no difference in kernel/x86_64/dscal_microk_haswell-2.c.

@jakirkham
Copy link
Contributor Author

@wernsaar, it looks like this is your code (at least git blame thinks so). Do you have any thoughts on what is going on here? I have tried to give you as much useful information as I can. Any help you can provide would be much appreciated.

@martin-frbg
Copy link
Collaborator

While actual OpenBLAS developers are busy, it occured to me that it might help if you could clarify if this effect was seen on just that one MacBook, or also confirmed on other OSX haswell systems ? (I.e., could it be a hardware fluke like cpu throttling in response to local overheating, or just the scaledown of cpu frequency during avx unit activity that seems to be common to Haswell processors
but might be more pronounced on a "mobile" cpu ?)

wernsaar added a commit that referenced this issue Jan 21, 2016
Ref #730: added performance updates for syrk and syr2k
@brada4
Copy link
Contributor

brada4 commented Jan 21, 2016

Is appnap disabled for good?
Some apple laptops draw power from battery and charger to reach top processing power, so there is time limit on heavy computation.

@jakirkham
Copy link
Contributor Author

Any news on this?

@martin-frbg
Copy link
Collaborator

One possibility that surfaced in the meantime is that there may be alignment problems in the haswell assembler code - see #901 and links therein.

@martin-frbg
Copy link
Collaborator

@jakirkham if you are still interested in this issue (and if I understand the thread correctly at all on re-reading it now), I wonder if you could try again with the stock dscal_microk_haswell-2.c and just drop the ".align 16" from the inline assembly in the dscal_kernel_8_zero() function that I understand you had to replace to fix the OSX perfomance issue. (This is what I meant to allude to in my earlier comment, my apologies if that was too unclear). In any case I suspect we lack a developer with an OSX Haswell box.

@martin-frbg
Copy link
Collaborator

Should be fixed by #1471

@jakirkham
Copy link
Contributor Author

FWIW here's the behavior with OpenBLAS 0.3.3. This seems to be doing roughly the right thing.


screen shot 2019-01-21 at 22 51 46

@martin-frbg
Copy link
Collaborator

Thanks. There may still be a problem with the transition point for switching to multithreading on at least some processors, e.g. #1115, but at least the OSX-specific bug should be gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants