Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX2 optimization #20

Open
MonoS opened this issue Nov 4, 2015 · 37 comments
Open

AVX2 optimization #20

MonoS opened this issue Nov 4, 2015 · 37 comments

Comments

@MonoS
Copy link

MonoS commented Nov 4, 2015

I've got some Xeon with AVX2 and i'd like to optimize MVTools to support this instruction set, may i have some pointers on how and where to put my smelly hands??

I've already took a look at the code and saw quite a lot of assembly, i'd prefer to use intrinsics instead, is it possible??

Do you think is it worth the effort??

Thanks for the attention

@dubhater
Copy link
Owner

dubhater commented Nov 4, 2015

Off the top of my head, I know that a lot of time is spent in the SAD/SATD functions. I think the SAD functions are already as good as they can get, and AVX2 won't help them. Maybe you can find something to improve in the SATD functions. Otherwise, profile it and see what other functions are most used.

It is possible to use intrinsics. They are already used in Degrain. However, it wouldn't hurt to learn how to write assembly with x86inc.asm.

I don't know if it's worth the effort, because I don't know what AVX2 really is.

Have fun.

@MonoS
Copy link
Author

MonoS commented Nov 14, 2015

I've started converting some function to AVX2, you can find them in my repo https://github.com/MonoS/vapoursynth-mvtools .

While you're right saying that don't hurt to learn writing assembly, i prefer to use intrinsics for now, i've already know how to write them and i have not so much time to give to this project.

Also i'll optimize mostly degrain related functions.

@dubhater
Copy link
Owner

dubhater commented Dec 9, 2015

I forgot about this, sorry.

Are your changes done?

@MonoS
Copy link
Author

MonoS commented Dec 9, 2015

Not yet, i've only had time to write some POC code without testing it [last time i've tried i had quite some problem compiling mvtools with msys2], i've just compiled it to ensure it was without syntactic errors.

@MonoS
Copy link
Author

MonoS commented Dec 13, 2015

I've implemented ToPixels both AVX2 and SSE2, i've tested it for all values of pSrc [0 to 2^32 - 1] and had the same result as the C version [performance wise you can find information in the commits].

Now i'll go to implement 4 different version of Overlaps [2x2, 2x4, 4xX and 8->32xX], implement LimitChanges, rewrite Degrains for 4xX [and probably the littler brother 2x2 in SSE2].

If you think there are other places i can give attention [i'd prefer degrain related function] let me know.

Also i'd like you to take a look at MonoS@46f3481 and tell me if you like this syntax for selecting function with more than one version.

@dubhater
Copy link
Owner

Funny thing: an optimised function that is clearly several times faster than the equivalent C version may still have no effect on the speed of MVTools, if the function doesn't take up much CPU time to begin with. You should test your optimised functions in the context where they will be used, i.e. with a VapourSynth script. Also, if the AVX2 version is only 1.13 times faster than the SSE2 version, I'll take just the SSE2.

Are these the functions that profiling revealed to take a lot of CPU time?

The syntax is fine, I suppose.

@MonoS
Copy link
Author

MonoS commented Dec 13, 2015

I'm doing this not only because i use mvtools quite extensively, but also for learning purposes, so i'm not doing any profiling yet.

In the case of toPixels it take really no cpu time [unroled 32x it take only 56 cycles for one 32x interation, the same in avx 10], but to me was interesting to optimize it [as well as the other functions].
Now it is possible that in the O3 version the compiler autovectorized the function ["Vectorization is enabled by the flag -ftree-vectorize and by default at -O3." from https://gcc.gnu.org/projects/tree-ssa/vectorization.html], remember that i compiled it with march avx2.

In my opinion a 13% speedup is quite nice to have, maybe not so much for so little functions, but it's better to not waste cycles when possible.

I'm choosing the function to optimize based on the code i see, if i think that a function it's called a lot of times [like all the block related functions... i suppose] even a 10% speedup start accumulating.

Regarding integrating this changes it's up to you, avx2 is available since 2 years ago [haswell], but if i find some function easy to translate i'll make also an sse2 version for less recent processor.

@dubhater
Copy link
Owner

All right. Just keep in mind that I won't integrate any functions that have no effect on the speed reported by vspipe.

@MonoS
Copy link
Author

MonoS commented Dec 13, 2015

I've converted also overlaps [excluding 2xX] and tested it for all possible values of pWin and pSrc.

I think i'll start to compile it and let you know if i see any speedup [i hope so]

@MonoS
Copy link
Author

MonoS commented Dec 13, 2015

As expected lost the last three hours trying to compile it without success, Jonathan Blow is right.

@MonoS
Copy link
Author

MonoS commented Jun 3, 2016

HI dubhater, sorry for asking, but i may need an hand in compiling it on windows, i also have a linux vm if it can help.

@dubhater
Copy link
Owner

dubhater commented Jun 3, 2016

I don't know anything about compiling it in Windows. In Linux you add --host=x86_64-w64-mingw32 (or something like that) to the configure command. Since you're using intrinsics (right?) you'll also need to add CFLAGS='-mavx2' CXXFLAGS='-mavx2'. If you added any new .c or .cpp files they need to be listed in Makefile.am with the others. If you're using yasm, you just need to add any new .asm files to Makefile.am.

You'll probably have a GCC with shared libstdc++ and libwinpthread, so the resulting libmvtools.dll will need those DLLs to function. Use depends.exe to find out for sure.

Alternatively, maybe @HolyWu can tell you how to compile it with Visual Studio.

@MonoS
Copy link
Author

MonoS commented Jun 3, 2016

I had strange problem during linking and during compilation of asm files, in fact at some point i started rewriting the script from scratch using GCC but without avail, now a lot of months have passed and i don't remember the exact problems i encountered.

I compile all statically with GCC, shared executable are just a pain (as is compilation per se).

I've also asked a friend to help me compile it (he successfully compiled ffmpeg), but he complained about lack of vapoursynth installation, but compiling vapoursynth was not so simple for him.

If @HolyWu really can help me, even with VS i'll be very grateful to both of you.

PS: if everything goes well i'll try to make some SSE2 version of the avx2 stuffs

@HolyWu
Copy link
Contributor

HolyWu commented Jun 4, 2016

No problem. What error did you get while compiling in Visual Studio? Did you have vsyasm installed to compile the .asm files?

@MonoS
Copy link
Author

MonoS commented Jun 5, 2016

i've never compiled anything with VS, the attempts i tried were on GCC on a msys2 evirorment. I'll instal the VS C/C++ compiler and let you know asap.

EDIT: Ok, @HolyWu , i've installed vc++ and also vsyasm, now i should import the project for compiling, how can i do that?

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

Using msys i successfully compiled all the file, and now i'm at the linking stage, when it tries to use the .asm files it raise "undefined reference to" a lot of file.
i'm compiling running
./autogen.sh
CXXFLAGS="-march=core-avx2" ./configure
make

@dubhater
Copy link
Owner

I'll need to see at least a few of those error messages.

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

test2.txt
Here's a log of all the messages, the first two are due to my incompetence for sure

@dubhater
Copy link
Owner

The multiple definitions are probably due to a missing static keyword. void ToPixels_AVX2_16bit(...) -> static void ToPixels_AVX2_16bit(...).

@dubhater
Copy link
Owner

I'm not sure what's happening with the others. Maybe you should post all the output from make V=1 (run make clean first).

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

stdout
stderr
This error
c:/mingw-x64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/5.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: i386 architecture of input filesrc/asm/.libs/const-a.o' is incompatible with i386:x86-64 output`
i know how to solve it using a custom bat i was making
compila.bat
Also i get the same error when executing my bat
errors from my bat

@dubhater
Copy link
Owner

It does look like you're using a 64 bit compiler and linker, but for some reason the build system thinks you're on a 32 bit system. What does ./config.guess print?

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

i686-pc-mingw32

@dubhater
Copy link
Owner

Does it help to add --host=x86_64-w64-mingw32 to the configure command?

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

perfect i guess, now i only get some fftw3 related undefined reference.
Do i need to compile fftw3 on my environment?

@dubhater
Copy link
Owner

In test3.txt I see that -lfftw3f is missing from the link command. make LIBS='-L<location of libfftw3f.dll> -lfftw3f' may work. The build system should have done this already, or configure should have failed if it couldn't find fftw. I don't know what happened.

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

i set the env variable FFTW_LIBS/CFLAGS with the location of fftw3f in which i've extracted all the file of the windows binaries available on the fftw3 site (http://www.fftw.org/install/windows.html)

i've also set the make flag as you said in some different ways: absolute and relative directory, with and without the ddl file; the same for the -l option with and without lib; i continue to have this error

*** Warning: linker path does not have real file for library -lfftw3f.
*** I have the capability to make that library automatically link in when
*** you link to this library.  But I can only do this if you have a
*** shared version of the library, which you do not appear to have
*** because I did check the linker path looking for a file starting
*** with libfftw3f but no candidates were found. (...for file magic test)
*** The inter-library dependencies that have been dropped here will be
*** automatically added whenever a program is linked with this library
*** or is declared to -dlopen it.

*** Since this library must not contain undefined symbols,
*** because either the platform does not support them or
*** it was explicitly requested with -no-undefined,
*** libtool will only create a static version of it.
copying selected object files to avoid basename conflicts...

something make me think that i need the .a version of the library

@dubhater
Copy link
Owner

Oh, I see. The file is called libfftw3f-3.dll, so maybe you need -lfftw3f-3.

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

error
Still nothing :(

EDIT: if @HolyWu is still there, could you please take a look at my batch file and why it give me undefined reference error for the .asm files?

@dubhater
Copy link
Owner

It's -L not L.

@MonoS
Copy link
Author

MonoS commented Jun 18, 2016

oh my, my apologies, now i have a dll, i'll let you know asap.

EDIT: dll compiled, but there is a bug in my degrain function, i'll need to fix it before anything else

@MonoS
Copy link
Author

MonoS commented Jun 22, 2016

After a lot of trouble and fixing i seems to have a shippable version, i'll make some commit on my repo in the near future and will try to integrate some of your fixes in my repo.

Thank you to both of us for the help.

@dubhater
Copy link
Owner

How is the speed?

@MonoS
Copy link
Author

MonoS commented Jun 23, 2016

I'm still not completely confident about my reading, but my Degrain2 function [you can find it here https://github.com/MonoS/MonoS-VS-Func/blob/master/MFunc.py#L32 ] with blksize=16 and overlap=8 seems to be 2.4 faster with all the analysis stuff.

I've tested it only on one source without access to the machine encoding it, so i need to make some more tests, also the analysis part not being optimized i'd like to exclude it from my testing, i'll make one ASAP.

PS: i've compile the whole solution with

CXXFLAGS="-march=core-avx2 -O3" ./configure --host=x86_64-w64-mingw32
make LIBS='-L./src/ -lfftw3f-3'

so maybe the autovectorization of GCC kicked in in the C versions.

@dubhater
Copy link
Owner

Suggestion: compile just one DLL and temporarily add a parameter to the filter to select the degrain function. For example, "avx2:int:opt;". If True, you use your new function, otherwise the existing SSE2 function. That way you know you're not comparing the speed of -O2 versus -O3.

@MonoS
Copy link
Author

MonoS commented Jun 23, 2016

I already have the extra parameters.
The O3 version use autovectorization by the compiler, i need to recompile it with just O2 to compare against the c version completely or i'll compare against an avx2 version generated by the compiler.
Also i need to remove any other SSE2 version i created during the process.

I'll let you know, anyway the code is on my repo, if you have a machine with avx2 (Hashwell or later) you can try it, IIRC @HolyWu should have one if they is willing to do.

@fuchanghao
Copy link

now new fftw 3.3.5 have AVX2 optimization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants