AVX2 optimization #20
Off the top of my head, I know that a lot of time is spent in the SAD/SATD functions. I think the SAD functions are already as good as they can get, and AVX2 won't help them. Maybe you can find something to improve in the SATD functions. Otherwise, profile it and see what other functions are most used. It is possible to use intrinsics. They are already used in Degrain. However, it wouldn't hurt to learn how to write assembly with x86inc.asm. I don't know if it's worth the effort, because I don't know what AVX2 really is. Have fun. |
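For readers who have not seen one, here is a minimal sketch of a SAD written with intrinsics, assuming 8-bit blocks that are 16 pixels wide. The names `sad16_scalar` and `sad16_sse2` are hypothetical and are not MVTools functions. `_mm_sad_epu8` is the SSE2 intrinsic for PSADBW; AVX2 offers `_mm256_sad_epu8`, which processes 32 bytes per instruction, so it can only pay off for wider blocks or when two rows are handled at once.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Plain C++ reference: sum of absolute differences over a 16 x height block.
unsigned sad16_scalar(const uint8_t *src, int src_pitch,
                      const uint8_t *ref, int ref_pitch, int height) {
    assert(height > 0);
    unsigned sum = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++)
            sum += std::abs(src[x] - ref[x]);
        src += src_pitch;
        ref += ref_pitch;
    }
    return sum;
}

// Hypothetical intrinsics version: one PSADBW per row of 16 pixels.
// Falls back to the scalar code when SSE2 is not available at compile time.
unsigned sad16_sse2(const uint8_t *src, int src_pitch,
                    const uint8_t *ref, int ref_pitch, int height) {
#if HAVE_SSE2
    assert(height > 0);
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < height; y++) {
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
        __m128i r = _mm_loadu_si128(reinterpret_cast<const __m128i *>(ref));
        // PSADBW yields two partial sums, one per 64-bit half of the register.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(s, r));
        src += src_pitch;
        ref += ref_pitch;
    }
    // Add the upper half to the lower half and extract the total.
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return static_cast<unsigned>(_mm_cvtsi128_si32(acc));
#else
    return sad16_scalar(src, src_pitch, ref, ref_pitch, height);
#endif
}
```

The inner loop is already a single instruction per row, which is one way to see why a wider register may not buy much for small blocks.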
I've started converting some functions to AVX2; you can find them in my repo https://github.com/MonoS/vapoursynth-mvtools . While you're right that it wouldn't hurt to learn to write assembly, I prefer to use intrinsics for now: I already know how to write them and I don't have much time to give to this project. Also, I'll optimize mostly degrain-related functions. |
I forgot about this, sorry. Are your changes done? |
Not yet. I've only had time to write some POC code without testing it [last time I tried, I had quite some problems compiling mvtools with msys2]; I've just compiled it to make sure it has no syntax errors. |
I've implemented ToPixels in both AVX2 and SSE2. I've tested it for all values of pSrc [0 to 2^32 - 1] and got the same results as the C version [performance-wise, you can find information in the commits]. Next I'll implement 4 different versions of Overlaps [2x2, 2x4, 4xX and 8->32xX], implement LimitChanges, and rewrite Degrains for 4xX [and probably its little brother 2x2 in SSE2]. If you think there are other places I should give attention to [I'd prefer degrain-related functions], let me know. Also, I'd like you to take a look at MonoS@46f3481 and tell me if you like this syntax for selecting a function with more than one version. |
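The exhaustive check described in this comment can be sketched roughly as follows. This is not the actual ToPixels code: `to_pixel_ref` and `to_pixel_fast` are hypothetical stand-ins (a rounded shift by 5 bits followed by a clamp to 8 bits, written in two different ways), and `count_mismatches` is an assumed helper name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference: scale a 32-bit accumulator down by 5 bits with
// rounding, then clamp to the 8-bit range. Widened to 64 bits so that
// adding the rounding constant cannot wrap around.
uint8_t to_pixel_ref(uint32_t v) {
    uint64_t a = (static_cast<uint64_t>(v) + 16) >> 5;
    return a > 255 ? 255 : static_cast<uint8_t>(a);
}

// Hypothetical "optimised" variant: same result, branchless, 32-bit only.
uint8_t to_pixel_fast(uint32_t v) {
    uint32_t a = (v >> 5) + ((v >> 4) & 1);               // rounded shift, no overflow
    uint32_t over = 0u - static_cast<uint32_t>(a > 255);  // all ones when a > 255
    return static_cast<uint8_t>((a | over) & 255);
}

// Number of inputs in [first, last] for which the two versions disagree.
uint64_t count_mismatches(uint32_t first, uint32_t last) {
    assert(first <= last);
    uint64_t bad = 0;
    // 64-bit loop counter so that last == UINT32_MAX still terminates.
    for (uint64_t v = first; v <= last; v++) {
        uint32_t x = static_cast<uint32_t>(v);
        bad += to_pixel_ref(x) != to_pixel_fast(x);
    }
    return bad;
}
```

Calling `count_mismatches(0, UINT32_MAX)` walks the whole 32-bit input range, which is feasible for a function of a single scalar input; a result of 0 means the two versions agree everywhere.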
Funny thing: an optimised function that is clearly several times faster than the equivalent C version may still have no effect on the speed of MVTools, if the function doesn't take up much CPU time to begin with. You should test your optimised functions in the context where they will be used, i.e. with a VapourSynth script. Also, if the AVX2 version is only 1.13 times faster than the SSE2 version, I'll take just the SSE2. Are these the functions that profiling revealed to take a lot of CPU time? The syntax is fine, I suppose. |
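The point about local versus overall speed is Amdahl's law. A small sketch, where `overall_speedup` is a hypothetical helper and the 2% share of run time used below is a made-up figure, not a measurement of MVTools:

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `fraction` (0..1) of the
// total run time is made `local_speedup` times faster.
double overall_speedup(double fraction, double local_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

If a function accounted for 2% of the run time, making it 1.13 times faster would give `overall_speedup(0.02, 1.13)`, about 1.002, and even making it 4 times faster would give only about 1.015. That is why the speed reported for the whole script is the number that matters.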
I'm doing this not only because I use mvtools quite extensively, but also for learning purposes, so I'm not doing any profiling yet. In the case of ToPixels, it takes really no CPU time [unrolled 32x it takes only 56 cycles for one 32x iteration, the same in AVX takes 10], but to me it was interesting to optimize it [as well as the other functions]. In my opinion a 13% speedup is quite nice to have, maybe not so much for such small functions, but it's better not to waste cycles when possible. I'm choosing the functions to optimize based on the code I see: if I think that a function is called a lot of times [like all the block-related functions... I suppose], even a 10% speedup starts accumulating. Whether to integrate these changes is up to you. AVX2 has been available for 2 years [Haswell], but if I find some function easy to translate I'll also make an SSE2 version for less recent processors. |
All right. Just keep in mind that I won't integrate any functions that have no effect on the speed reported by vspipe. |
I've also converted Overlaps [excluding 2xX] and tested it for all possible values of pWin and pSrc. I think I'll start compiling it and let you know if I see any speedup [I hope so]. |
As expected, I lost the last three hours trying to compile it, without success. Jonathan Blow is right. |
Hi dubhater, sorry for asking, but I may need a hand compiling it on Windows. I also have a Linux VM if that can help. |
I don't know anything about compiling it in Windows. In Linux you add You'll probably have a GCC with shared libstdc++ and libwinpthread, so the resulting libmvtools.dll will need those DLLs to function. Use depends.exe to find out for sure. Alternatively, maybe @HolyWu can tell you how to compile it with Visual Studio. |
I had strange problems during linking and during compilation of the asm files; in fact, at some point I started rewriting the script from scratch using GCC, but to no avail. Many months have passed since then and I don't remember the exact problems I encountered. I compile everything statically with GCC; shared executables are just a pain (as is compilation per se). I've also asked a friend to help me compile it (he successfully compiled ffmpeg), but he complained about the lack of a vapoursynth installation, and compiling vapoursynth was not so simple for him. If @HolyWu really can help me, even with VS, I'll be very grateful to both of you. PS: if everything goes well I'll try to make SSE2 versions of the AVX2 stuff. |
No problem. What error did you get while compiling in Visual Studio? Did you have vsyasm installed to compile the .asm files? |
I've never compiled anything with VS; my attempts were with GCC in an msys2 environment. I'll install the VS C/C++ compiler and let you know asap. EDIT: OK, @HolyWu, I've installed VC++ and also vsyasm. Now I should import the project to compile it; how can I do that? |
Using msys I successfully compiled all the files, and now I'm at the linking stage; when it tries to use the .asm files it raises a lot of "undefined reference to" errors. |
I'll need to see at least a few of those error messages. |
test2.txt |
The multiple definitions are probably due to a missing |
I'm not sure what's happening with the others. Maybe you should post all the output from |
stdout |
It does look like you're using a 64 bit compiler and linker, but for some reason the build system thinks you're on a 32 bit system. What does |
i686-pc-mingw32 |
Does it help to add |
Perfect, I guess. Now I only get some fftw3-related undefined references. |
In test3.txt I see that |
I set the env variables FFTW_LIBS/CFLAGS to the location of fftw3f, where I extracted all the files from the Windows binaries available on the fftw3 site (http://www.fftw.org/install/windows.html). I've also set the make flag as you said in several different ways: absolute and relative directory, with and without the dll file; the same for the -l option, with and without lib. I continue to get this error
Something makes me think that I need the .a version of the library. |
Oh, I see. The file is called libfftw3f-3.dll, so maybe you need |
EDIT: if @HolyWu is still there, could you please take a look at my batch file and see why it gives me undefined reference errors for the .asm files? |
It's |
Oh my, my apologies. Now I have a dll; I'll let you know asap. EDIT: dll compiled, but there is a bug in my degrain function; I'll need to fix it before anything else. |
After a lot of trouble and fixing, I seem to have a shippable version. I'll make some commits on my repo in the near future and will try to integrate some of your fixes into my repo. Thank you both for the help. |
How is the speed? |
I'm still not completely confident about my readings, but my Degrain2 function [you can find it here https://github.com/MonoS/MonoS-VS-Func/blob/master/MFunc.py#L32 ] with blksize=16 and overlap=8 seems to be 2.4 times faster, including all the analysis stuff. I've tested it on only one source, without access to the machine encoding it, so I need to run some more tests. Also, since the analysis part is not optimized, I'd like to exclude it from my testing; I'll run such a test ASAP. PS: I've compiled the whole solution with
so maybe the autovectorization of GCC kicked in in the C versions. |
Suggestion: compile just one DLL and temporarily add a parameter to the filter to select the degrain function. For example, "avx2:int:opt;". If True, you use your new function, otherwise the existing SSE2 function. That way you know you're not comparing the speed of -O2 versus -O3. |
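A minimal sketch of that suggestion, assuming the filter has already read the optional `avx2` argument into a flag. All names here (`BlendFn`, `blend_existing`, `blend_new`, `select_blend`) are hypothetical, and the two bodies are simple stand-ins (a rounded average written in two ways), not the real degrain code:

```cpp
#include <cassert>
#include <cstdint>

// Signature shared by both variants (hypothetical, for illustration only).
using BlendFn = void (*)(uint8_t *dst, const uint8_t *src,
                         const uint8_t *ref, int n);

// Stand-in for the existing code path: rounded average of two pixels.
void blend_existing(uint8_t *dst, const uint8_t *src,
                    const uint8_t *ref, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = static_cast<uint8_t>((src[i] + ref[i] + 1) >> 1);
}

// Stand-in for the new code path: the same rounded average, computed
// without a wider intermediate sum. It must match the existing output.
void blend_new(uint8_t *dst, const uint8_t *src,
               const uint8_t *ref, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = static_cast<uint8_t>((src[i] | ref[i]) -
                                      ((src[i] ^ ref[i]) >> 1));
}

// Pick the variant once, when the filter is created, from the flag.
BlendFn select_blend(bool use_new) {
    return use_new ? blend_new : blend_existing;
}
```

Because both variants live in the same DLL and are built with the same compiler flags, timing the script once with the flag set and once without compares only the two functions.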
I already have the extra parameters. I'll let you know. Anyway, the code is on my repo; if you have a machine with AVX2 (Haswell or later) you can try it. IIRC @HolyWu should have one, if they are willing to try. |
The new fftw 3.3.5 now has AVX2 optimizations. |
I've got some Xeons with AVX2 and I'd like to optimize MVTools to support this instruction set. May I have some pointers on how and where to put my smelly hands?
I've already taken a look at the code and saw quite a lot of assembly. I'd prefer to use intrinsics instead; is that possible?
Do you think it is worth the effort?
Thanks for the attention