Surprisingly slow dsyrk on Mac
#730
Comments
Are you sure the result is from OpenBLAS and not the Accelerate Framework running on the (float64-impaired, HD-Iris) GPU?
I get dsyrk 1.5x slower on a normal CPU (explainable by the very small input matrix) and 3x slower on Atom (explainable by its use of the integer ALU for double precision). |
Yes, as previously stated, I am positive NumPy and SciPy are using OpenBLAS. Here are specs from SciPy if you don't believe me.
Though some people aren't satisfied with this information alone (I am sometimes one of those people).
|
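For anyone wanting to repeat the linkage check, here is a minimal sketch of collecting what NumPy and SciPy report about their BLAS/LAPACK build. The helper name `blas_build_report` is made up for illustration. Note that `show_config()` describes build-time settings only, which is one reason this information alone may not satisfy everyone.

```python
import contextlib
import io


def blas_build_report():
    """Return the text NumPy/SciPy print about the BLAS/LAPACK they were built with.

    Hypothetical helper: it only wraps the show_config() function both packages expose.
    """
    report = []
    for name in ("numpy", "scipy"):
        try:
            module = __import__(name)
        except ImportError:
            report.append(f"{name}: not installed")
            continue
        buffer = io.StringIO()
        # show_config() prints to stdout, so capture it instead of returning a value.
        with contextlib.redirect_stdout(buffer):
            module.show_config()
        report.append(f"{name} {module.__version__}\n{buffer.getvalue()}")
    return "\n".join(report)
```

Calling `print(blas_build_report())` gives the build information; checking which shared library is actually loaded at runtime needs a separate tool such as `otool -L` on OS X or `ldd` on Linux.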
I realize that I didn't include a model number, so I have now added it to the description. It is a 2.3 GHz Intel Core i7-4850HQ, in other words a Haswell processor. When I built OpenBLAS, I built it for dynamic architecture, if that is any help. |
Must be something about Macintosh. On Linux with an i7-4790 I get the same results as on the Sandy Bridge server mentioned before. |
I agree that there is something weird on Mac that doesn't carry over to Linux. Sure, here they are.
In short, small differences (though they fluctuate back up to the values I listed before on reruns). Still over an order of magnitude difference. |
To further narrow down the issue you might want to check how consistent it is over a variety of array sizes, and try with and without threading (to disable threads set envvar OMP_NUM_THREADS=1). One thing I know is that OpenBLAS can show really weird behavior on "small" matrices (e.g. #731), and I'm not sure what counts as small. |
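A minimal sketch of the "with and without threading" comparison suggested above. It assumes the benchmark is launched in a fresh interpreter, since the variable has to be in the environment before the BLAS library is loaded. `run_with_blas_env` and its `extra_env` parameter are hypothetical names, not part of any library.

```python
import os
import subprocess
import sys


def run_with_blas_env(code, num_threads=1, extra_env=None):
    """Run a Python snippet in a new interpreter with OMP_NUM_THREADS set.

    Hypothetical helper; extra_env lets the caller add further variables.
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the snippet itself fails
    )
    return result.stdout
```

Running the same timing snippet once with `num_threads=1` and once with a larger value would show whether threading changes the picture. A library built with `NUM_THREADS=1` stays single threaded either way.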
Threading is currently disabled, but I can play with enabling it. What do you think will happen when threads enter the picture? As for size, the following all use roughly square matrices. Below I list the order of magnitude of the number of elements in each matrix and how each does. Smaller arrays (~10s to ~100s of elements) appear to take about the same time for
Arrays in the range previously tried (~10,000s of elements) are over 1 order of magnitude slower, as was shown before. By ~100,000s of elements, it is under an order of magnitude slower. Finally, at ~1,000,000s of elements it is around the expected 2x slower. **TL;DR:** Behavior is really bad for matrices with between ~100s and ~1,000,000s of elements. It is most notable around ~1,000s of elements. |
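The size sweep described above can be sketched roughly as follows. This is not the notebook used in the thread. The helper names are invented, the sizes are arbitrary, and it assumes SciPy's `scipy.linalg.blas.ssyrk` and `dsyrk` wrappers called as `routine(alpha, a)`.

```python
import timeit


def best_time(func, repeat=5, number=10):
    """Best per-call wall time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def slowdown_by_size(make_single, make_double, sizes, **timing):
    """Map n -> (double-precision time / single-precision time) for n x n inputs."""
    ratios = {}
    for n in sizes:
        single = best_time(make_single(n), **timing)
        double = best_time(make_double(n), **timing)
        ratios[n] = double / single
    return ratios


def syrk_factories():
    """Build ssyrk/dsyrk benchmark factories; requires NumPy and SciPy."""
    import numpy as np
    from scipy.linalg import blas

    def factory(routine, dtype):
        def make(n):
            a = np.random.random((n, n)).astype(dtype)
            return lambda: routine(1.0, a)  # computes alpha * a * a^T
        return make

    return factory(blas.ssyrk, np.float32), factory(blas.dsyrk, np.float64)


def report_syrk_slowdown(sizes=(3, 10, 30, 100, 300, 1000)):
    """Print the dsyrk/ssyrk time ratio for each matrix size."""
    make_s, make_d = syrk_factories()
    for n, ratio in slowdown_by_size(make_s, make_d, sizes).items():
        print(f"n={n:5d}  elements={n * n:8d}  dsyrk/ssyrk={ratio:.1f}x")
```

Calling `report_syrk_slowdown()` prints one ratio per size. A ratio near 2x would be the expected cost of double precision, while the regime reported above would show up as ratios well over 10x.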
If threads are disabled then it's not the same pathological corner case as in #731. I suspect that pattern of slowing down at a particular range of sizes may mean something to someone who knows openblas internals, but that's not me :-). |
Ah, ok, yeah I don't think I have that case, but that does look nasty. I am almost inclined to think that this current case has something to do with how the cache is working. That being said, I used the exact same machine (with VirtualBox) for the Linux profiling and don't see the same problem. |
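To put the cache idea in numbers, here is a rough sketch that maps an element count to the smallest cache level the input matrix alone would fit in. The cache capacities are assumptions for the i7-4850HQ and should be checked against the CPU's spec sheet. The output matrix and any packing buffers OpenBLAS uses are ignored.

```python
# Assumed capacities in bytes (per-core L1d and L2, shared L3 and eDRAM).
# These figures are an assumption, not something stated in the thread.
ASSUMED_CACHES = [
    ("L1d", 32 * 1024),
    ("L2", 256 * 1024),
    ("L3", 6 * 1024 ** 2),
    ("eDRAM L4", 128 * 1024 ** 2),
]


def smallest_fitting_cache(n_elements, itemsize, caches=ASSUMED_CACHES):
    """Name of the first cache level that can hold n_elements values of itemsize bytes."""
    needed = n_elements * itemsize
    for name, capacity in caches:
        if needed <= capacity:
            return name
    return "main memory"


for exponent in range(1, 7):
    n = 10 ** exponent
    print(
        f"~10^{exponent} elements: float32 -> {smallest_fitting_cache(n, 4)}, "
        f"float64 -> {smallest_fitting_cache(n, 8)}"
    )
```

Under these assumed sizes, single and double precision inputs land in the same cache level for most of the range tested, so a table like this is only a starting point for the cache hypothesis, not evidence for it.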
There is some small but measurable effect in your case from sub-optimal threading configuration on very small data, but it does not affect dsyrk 20x more than ssyrk. Measurements (Linux, AMD A10-5750M)
10
1000
|
The copy I used is one I built myself using 0.2.15 with the following flags I am a little confused as to the statement that a suboptimal threading configuration is causing me problems as I am keeping threading off if at all possible. In fact, I ran the case for
If I'm doing terribly wrong with this build, please let me know. |
Obviously a single-threaded build runs single threaded no matter what variables you set. Your build is perfectly fine. Please ask whoever supplied your initial OpenBLAS binary to build it properly, just as you did moments ago. |
The reason I use dynamic architecture is I do support a variety of laptop and desktop machines with different architecture. Also, I use a cluster that has some older and some newer architectures depending on the node. This ends up being the easiest solution as opposed to pinning the build to the lowest common denominator as I get the right balance between support and performance. Maybe I am misunderstanding you. Is this a default build option? As for setting the number of threads, often parallelism is dealt with at a higher level in my code. So, I normally want the BLAS to be single threaded. However, I may re-explore building it to run multithreaded for different applications. Is there some way to make it default to 1 thread if some
I always hand build OpenBLAS. Good to hear I am not way off base :) |
The only other thing I can think of is I am using
|
In your initial posting dsyrk was 50x slower than ssyrk. Now it is 2x slower, i.e. perfectly fine. 4 results in the 100 group with OpenBLAS built on Linux with CC=clang 3.5, i.e. approximately Xcode 6. |
@brada4, I think you may be confused. That is understandable, as this isn't a clear "everything is slow" problem; rather, there seems to be a regime where it gets very slow, with transitions at the edges where normal behavior resumes. The last example is at the upper end of the regime, where it becomes normal again. In other words, I used arrays with dimensions ~10^3 * ~10^3, so ~10^6 elements. This is what I said before.
In short, this does not deviate from what I said before. However, it is surprising, as the problem seems to be size based. This is why I was wondering if the problem was cache size dependent. However, the fact that I don't experience this on a Linux VM that is not using hardware virtualization suggests this is probably not the cause (though I could be wrong). The worst part of the range is arrays with ~1,000 elements. See an example below.
We have now demonstrated the dramatic effect that I saw before.
Another bad point in this range is ~10,000 element arrays. See below.
This matches again with what we have seen previously.
So, no, this is not magically fixed. We merely did a second run at the transition back to normal behavior. Unsurprisingly, the same behavior that was seen before repeated. I think I will try to come up with a figure so that this behavior is a little clearer. It should also elucidate any other weird peaks and valleys to help us identify what the possible causes might be. |
Below is a graph comparing A log-log graph was used here to make the differences more clear. Note that while performance has improved for matrices with 10^6 elements it is still not fully recovered. Small arrays seem to do about the same with either function, but it is hard to say as the signal-to-noise ratio is lower there. This hopefully does a better job of elucidating the problematic region. Here is a gist with the Jupyter notebook used. ( https://gist.github.com/jakirkham/17cc1481672a49cadb33 ) |
It just says you have a very old compiler that does not play well with a 20x patched threading library. |
Though, this is built to be single threaded only. Please help me to understand why the threading library matters at all here. |
Also, here is the compiler information.
|
Xcode release history at https://trac.macports.org/wiki/XcodeVersionInfo indicates that Xcode 5.0.0 is not really for OSX 10.9 |
Sorry, I should have given you the XCode version information.
This is the second version of XCode that was for 10.9. |
Can you try the AVX(1) OpenBLAS kernels by prepending OPENBLAS_CORETYPE=Sandybridge to your command lines? And Prescott? Another thing to play around with is the malloc debug options. And what about a virtual machine with newer compilers? |
if you have the capability to get a quick profile then that sometimes will
Nathaniel J. Smith -- http://vorpus.org |
What's in Makefile.rule, and what's in the end report of the OpenBLAS build? |
This is simply a straight download of the v0.2.15 tarball on GitHub. Nothing is patched. Only the aforementioned flags are provided. |
Also, I don't think I provided the fortran compiler information, but that is |
Can you provide the compile flags and compile report for the regression case? I can't find them anywhere in the thread. |
Thanks @njsmith. I went and ran the same code used to generate the graph from before ( #730 (comment) ) with |
VirtualBox exposes AVX, not AVX2; that's why it exhibits little of the problem. Here is where to change the Haswell kernel to Sandy Bridge (if single threading is of no help): kernel/x86_64/dscal.c |
I tried setting It appears to peak at 2x slower as opposed to overshooting for some range and slowly returning. Given that Nehalem did not support AVX instructions, it seems like an interesting test case to rebuild with different forms of AVX instructions disabled and see if the problem persists or not. |
Related information on AVX and AVX2 instructions on VirtualBox. ( https://www.virtualbox.org/ticket/14262 ) At present, it appears AVX2 is still disabled on all hosts, but AVX is allowed when host and guest are both 64-bit, which is my case. As I have default settings for the VirtualBox VM, it appears AVX instructions should be used in my Linux test case, but not AVX2 instructions. As the Sandy Bridge case on Mac saw the same behavior and Sandy Bridge only supports AVX, this suggests that AVX instructions on Mac, in particular, run into some problems, but we still can't confirm that AVX2 instructions don't cause the problem. |
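One way to confirm what a given host or guest actually exposes is to read the CPU feature flags. The sketch below is a best-effort check. It assumes `/proc/cpuinfo` on Linux and the `machdep.cpu.features` and `machdep.cpu.leaf7_features` sysctl keys on Intel Macs. The function names are made up for illustration.

```python
import platform
import subprocess


def avx_support(flag_text):
    """Report which AVX generations appear in a whitespace-separated flag string."""
    flags = {token.lower() for token in flag_text.split()}
    return {
        # Linux spells it "avx"; OS X sysctl reports "AVX1.0".
        "avx": "avx" in flags or "avx1.0" in flags,
        "avx2": "avx2" in flags,
    }


def read_cpu_flags():
    """Best-effort read of the CPU feature flags; returns '' if unavailable."""
    system = platform.system()
    try:
        if system == "Linux":
            with open("/proc/cpuinfo") as handle:
                for line in handle:
                    if line.startswith("flags"):
                        return line.split(":", 1)[1]
        elif system == "Darwin":
            keys = ["machdep.cpu.features", "machdep.cpu.leaf7_features"]
            out = subprocess.run(
                ["sysctl", "-n"] + keys, capture_output=True, text=True
            )
            return out.stdout
    except OSError:
        pass
    return ""
```

Running `avx_support(read_cpu_flags())` inside the VirtualBox guest and on the host would show whether the guest really sees AVX but not AVX2.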
It occurs to me that you could try leaving CORETYPE as Haswell, but changing the |
So, I went ahead and tried rebuilding OpenBLAS with all the same flags as before and one additional flag I am not sure if this is related, but I saw this posting on SO ( http://stackoverflow.com/q/7839925 ). If SSE and AVX instructions are being mixed here, it might explain why not using AVX instructions has a positive effect (or it may be due to simply using the Nehalem variant underneath). |
Also, based on a few people's suggestions, I tried patching the Here is the patch I applied against
|
Might be interesting to see the assembly that clang creates from that C loop to find out why it beats |
I looked briefly at the final OpenBLAS dynamic library (built with the C code patch above) using a trial copy of a disassembler and there were no AVX instructions. Though I don't know if you would need to coerce |
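For repeating this kind of check without a commercial disassembler, here is a rough sketch that scans textual disassembly (for example from `otool -tv` on OS X or `objdump -d` on Linux) for vector register names. It is only a heuristic. ymm registers imply 256-bit AVX code, but xmm usage alone cannot distinguish SSE from VEX-encoded 128-bit instructions. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches xmm0..xmm15, ymm0..ymm15, zmm0..zmm31 with or without a leading '%'.
_REGISTER = re.compile(r"\b([xyz]mm)\d+\b")


def vector_register_usage(disassembly):
    """Count disassembly lines that touch xmm, ymm or zmm registers."""
    counts = Counter()
    for line in disassembly.splitlines():
        # Count each register family at most once per line.
        for kind in set(_REGISTER.findall(line.lower())):
            counts[kind] += 1
    return counts
```

Feeding it the disassembly of the dscal routine and checking `counts["ymm"]` gives a quick indication of whether any 256-bit AVX code made it into the build.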
After some brief testing on a Linux cluster (as opposed to a VM due to AVX pass through issues), it doesn't appear to suffer too badly. I tried a few brief tests on Haswell and Sandybridge architecture nodes that were available. However, I am not clear on the exact make of the chips they are using, but this could also be responsible for some differences. Below is a graph generated from a Haswell core ( 2.3GHz Intel Haswell E5-2698 ) running CentOS 6.5. The build has the same parameters listed in the description with no other modifications. |
Also added a |
@wernsaar, it looks like this is your code (at least |
While the actual OpenBLAS developers are busy, it occurred to me that it might help if you could clarify whether this effect was seen on just that one MacBook, or also confirmed on other OSX Haswell systems (i.e., could it be a hardware fluke like CPU throttling in response to local overheating, or just the scaledown of CPU frequency during AVX unit activity that seems to be common to Haswell processors)?
Ref #730: added performance updates for syrk and syr2k
Is appnap disabled for good? |
Any news on this? |
One possibility that surfaced in the meantime is that there may be alignment problems in the haswell assembler code - see #901 and links therein. |
@jakirkham if you are still interested in this issue (and if I understand the thread correctly at all on re-reading it now), I wonder if you could try again with the stock dscal_microk_haswell-2.c and just drop the ".align 16" from the inline assembly in the dscal_kernel_8_zero() function that I understand you had to replace to fix the OSX performance issue. (This is what I meant to allude to in my earlier comment, my apologies if that was too unclear). In any case I suspect we lack a developer with an OSX Haswell box. |
Should be fixed by #1471 |
Thanks. There may still be a problem with the transition point for switching to multithreading on at least some processors, e.g. #1115, but at least the OSX-specific bug should be gone. |
Here is an example on Mac OS 10.9 on a Late 2013 MacBook Pro 15" with a 2.3 GHz Intel Core i7-4850HQ using OpenBLAS 0.2.15, Python 2.7.11, NumPy 1.10.2, SciPy 0.16.1. I have verified that NumPy and SciPy are properly linked to the version of OpenBLAS specified. OpenBLAS was built with the following flags
make DYNAMIC_ARCH=1 BINARY=64 NO_LAPACK=0 NO_AFFINITY=1 NUM_THREADS=1
No other options or modifications were made.
I wouldn't be surprised to see it take twice as long because it is double versus single precision; however, taking over an order of magnitude longer seems to be a bit much.
Following the same build procedure on a Linux VM on the same machine (uses VirtualBox 5.0.12), I get a much more reasonable time for dsyrk (around double ssyrk). I have no idea whether this carries over to Windows or not. If someone is able to reproduce a similar example using C or Fortran, please share your steps.
After further discussion, we found this was array size dependent. Below is a graph showing this dependence. More details about how the graph was made can be found in this comment. ( #730 (comment) )
For comparison, one can look at the time taken by sgemm and dgemm, but one will not see this behavior.