AVX2 optimization #20

MonoS · 2015-11-04T16:42:55Z

I've got some Xeon with AVX2 and i'd like to optimize MVTools to support this instruction set, may i have some pointers on how and where to put my smelly hands??

I've already took a look at the code and saw quite a lot of assembly, i'd prefer to use intrinsics instead, is it possible??

Do you think is it worth the effort??

Thanks for the attention

dubhater · 2015-11-04T19:31:43Z

Off the top of my head, I know that a lot of time is spent in the SAD/SATD functions. I think the SAD functions are already as good as they can get, and AVX2 won't help them. Maybe you can find something to improve in the SATD functions. Otherwise, profile it and see what other functions are most used.

It is possible to use intrinsics. They are already used in Degrain. However, it wouldn't hurt to learn how to write assembly with x86inc.asm.

I don't know if it's worth the effort, because I don't know what AVX2 really is.

Have fun.

MonoS · 2015-11-14T18:05:21Z

I've started converting some function to AVX2, you can find them in my repo https://github.com/MonoS/vapoursynth-mvtools .

While you're right saying that don't hurt to learn writing assembly, i prefer to use intrinsics for now, i've already know how to write them and i have not so much time to give to this project.

Also i'll optimize mostly degrain related functions.

dubhater · 2015-12-09T12:56:40Z

I forgot about this, sorry.

Are your changes done?

MonoS · 2015-12-09T18:05:23Z

Not yet, i've only had time to write some POC code without testing it [last time i've tried i had quite some problem compiling mvtools with msys2], i've just compiled it to ensure it was without syntactic errors.

MonoS · 2015-12-13T00:14:32Z

I've implemented ToPixels both AVX2 and SSE2, i've tested it for all values of pSrc [0 to 2^32 - 1] and had the same result as the C version [performance wise you can find information in the commits].

Now i'll go to implement 4 different version of Overlaps [2x2, 2x4, 4xX and 8->32xX], implement LimitChanges, rewrite Degrains for 4xX [and probably the littler brother 2x2 in SSE2].

If you think there are other places i can give attention [i'd prefer degrain related function] let me know.

Also i'd like you to take a look at MonoS@46f3481 and tell me if you like this syntax for selecting function with more than one version.

dubhater · 2015-12-13T11:58:13Z

Funny thing: an optimised function that is clearly several times faster than the equivalent C version may still have no effect on the speed of MVTools, if the function doesn't take up much CPU time to begin with. You should test your optimised functions in the context where they will be used, i.e. with a VapourSynth script. Also, if the AVX2 version is only 1.13 times faster than the SSE2 version, I'll take just the SSE2.

Are these the functions that profiling revealed to take a lot of CPU time?

The syntax is fine, I suppose.

MonoS · 2015-12-13T12:41:47Z

I'm doing this not only because i use mvtools quite extensively, but also for learning purposes, so i'm not doing any profiling yet.

In the case of toPixels it take really no cpu time [unroled 32x it take only 56 cycles for one 32x interation, the same in avx 10], but to me was interesting to optimize it [as well as the other functions].
Now it is possible that in the O3 version the compiler autovectorized the function ["Vectorization is enabled by the flag -ftree-vectorize and by default at -O3." from https://gcc.gnu.org/projects/tree-ssa/vectorization.html], remember that i compiled it with march avx2.

In my opinion a 13% speedup is quite nice to have, maybe not so much for so little functions, but it's better to not waste cycles when possible.

I'm choosing the function to optimize based on the code i see, if i think that a function it's called a lot of times [like all the block related functions... i suppose] even a 10% speedup start accumulating.

Regarding integrating this changes it's up to you, avx2 is available since 2 years ago [haswell], but if i find some function easy to translate i'll make also an sse2 version for less recent processor.

dubhater · 2015-12-13T16:30:03Z

All right. Just keep in mind that I won't integrate any functions that have no effect on the speed reported by vspipe.

MonoS · 2015-12-13T19:59:42Z

I've converted also overlaps [excluding 2xX] and tested it for all possible values of pWin and pSrc.

I think i'll start to compile it and let you know if i see any speedup [i hope so]

MonoS · 2015-12-13T22:59:42Z

As expected lost the last three hours trying to compile it without success, Jonathan Blow is right.

MonoS · 2016-06-03T19:40:25Z

HI dubhater, sorry for asking, but i may need an hand in compiling it on windows, i also have a linux vm if it can help.

dubhater · 2016-06-03T20:17:12Z

I don't know anything about compiling it in Windows. In Linux you add --host=x86_64-w64-mingw32 (or something like that) to the configure command. Since you're using intrinsics (right?) you'll also need to add CFLAGS='-mavx2' CXXFLAGS='-mavx2'. If you added any new .c or .cpp files they need to be listed in Makefile.am with the others. If you're using yasm, you just need to add any new .asm files to Makefile.am.

You'll probably have a GCC with shared libstdc++ and libwinpthread, so the resulting libmvtools.dll will need those DLLs to function. Use depends.exe to find out for sure.

Alternatively, maybe @HolyWu can tell you how to compile it with Visual Studio.

MonoS · 2016-06-03T20:25:12Z

I had strange problem during linking and during compilation of asm files, in fact at some point i started rewriting the script from scratch using GCC but without avail, now a lot of months have passed and i don't remember the exact problems i encountered.

I compile all statically with GCC, shared executable are just a pain (as is compilation per se).

I've also asked a friend to help me compile it (he successfully compiled ffmpeg), but he complained about lack of vapoursynth installation, but compiling vapoursynth was not so simple for him.

If @HolyWu really can help me, even with VS i'll be very grateful to both of you.

PS: if everything goes well i'll try to make some SSE2 version of the avx2 stuffs

HolyWu · 2016-06-04T01:46:15Z

No problem. What error did you get while compiling in Visual Studio? Did you have vsyasm installed to compile the .asm files?

MonoS · 2016-06-05T07:58:57Z

i've never compiled anything with VS, the attempts i tried were on GCC on a msys2 evirorment. I'll instal the VS C/C++ compiler and let you know asap.

EDIT: Ok, @HolyWu , i've installed vc++ and also vsyasm, now i should import the project for compiling, how can i do that?

MonoS · 2016-06-18T10:19:28Z

Using msys i successfully compiled all the file, and now i'm at the linking stage, when it tries to use the .asm files it raise "undefined reference to" a lot of file.
i'm compiling running
./autogen.sh
CXXFLAGS="-march=core-avx2" ./configure
make

dubhater · 2016-06-18T14:11:15Z

I'll need to see at least a few of those error messages.

MonoS · 2016-06-18T14:31:16Z

test2.txt
Here's a log of all the messages, the first two are due to my incompetence for sure

dubhater · 2016-06-18T15:01:32Z

The multiple definitions are probably due to a missing static keyword. void ToPixels_AVX2_16bit(...) -> static void ToPixels_AVX2_16bit(...).

dubhater · 2016-06-18T15:03:51Z

I'm not sure what's happening with the others. Maybe you should post all the output from make V=1 (run make clean first).

MonoS · 2016-06-18T15:09:10Z

stdout
stderr
This error
c:/mingw-x64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/5.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: i386 architecture of input filesrc/asm/.libs/const-a.o' is incompatible with i386:x86-64 output`
i know how to solve it using a custom bat i was making
compila.bat
Also i get the same error when executing my bat
errors from my bat

dubhater · 2016-06-18T15:16:57Z

It does look like you're using a 64 bit compiler and linker, but for some reason the build system thinks you're on a 32 bit system. What does ./config.guess print?

MonoS · 2016-06-18T15:19:39Z

i686-pc-mingw32

dubhater · 2016-06-18T15:27:33Z

Does it help to add --host=x86_64-w64-mingw32 to the configure command?

MonoS · 2016-06-18T15:34:35Z

perfect i guess, now i only get some fftw3 related undefined reference.
Do i need to compile fftw3 on my environment?

dubhater · 2016-06-18T16:13:57Z

In test3.txt I see that -lfftw3f is missing from the link command. make LIBS='-L<location of libfftw3f.dll> -lfftw3f' may work. The build system should have done this already, or configure should have failed if it couldn't find fftw. I don't know what happened.

MonoS · 2016-06-18T17:09:20Z

i set the env variable FFTW_LIBS/CFLAGS with the location of fftw3f in which i've extracted all the file of the windows binaries available on the fftw3 site (http://www.fftw.org/install/windows.html)

i've also set the make flag as you said in some different ways: absolute and relative directory, with and without the ddl file; the same for the -l option with and without lib; i continue to have this error

*** Warning: linker path does not have real file for library -lfftw3f.
*** I have the capability to make that library automatically link in when
*** you link to this library.  But I can only do this if you have a
*** shared version of the library, which you do not appear to have
*** because I did check the linker path looking for a file starting
*** with libfftw3f but no candidates were found. (...for file magic test)
*** The inter-library dependencies that have been dropped here will be
*** automatically added whenever a program is linked with this library
*** or is declared to -dlopen it.

*** Since this library must not contain undefined symbols,
*** because either the platform does not support them or
*** it was explicitly requested with -no-undefined,
*** libtool will only create a static version of it.
copying selected object files to avoid basename conflicts...

something make me think that i need the .a version of the library

dubhater · 2016-06-18T17:26:32Z

Oh, I see. The file is called libfftw3f-3.dll, so maybe you need -lfftw3f-3.

MonoS · 2016-06-18T17:36:01Z

Still nothing :(

EDIT: if @HolyWu is still there, could you please take a look at my batch file and why it give me undefined reference error for the .asm files?

dubhater · 2016-06-18T17:41:03Z

It's -L not L.

MonoS · 2016-06-18T17:51:47Z

oh my, my apologies, now i have a dll, i'll let you know asap.

EDIT: dll compiled, but there is a bug in my degrain function, i'll need to fix it before anything else

MonoS · 2016-06-22T21:55:12Z

After a lot of trouble and fixing i seems to have a shippable version, i'll make some commit on my repo in the near future and will try to integrate some of your fixes in my repo.

Thank you to both of us for the help.

dubhater · 2016-06-23T07:35:45Z

How is the speed?

MonoS · 2016-06-23T18:34:05Z

I'm still not completely confident about my reading, but my Degrain2 function [you can find it here https://github.com/MonoS/MonoS-VS-Func/blob/master/MFunc.py#L32 ] with blksize=16 and overlap=8 seems to be 2.4 faster with all the analysis stuff.

I've tested it only on one source without access to the machine encoding it, so i need to make some more tests, also the analysis part not being optimized i'd like to exclude it from my testing, i'll make one ASAP.

PS: i've compile the whole solution with

CXXFLAGS="-march=core-avx2 -O3" ./configure --host=x86_64-w64-mingw32
make LIBS='-L./src/ -lfftw3f-3'

so maybe the autovectorization of GCC kicked in in the C versions.

dubhater · 2016-06-23T21:35:45Z

Suggestion: compile just one DLL and temporarily add a parameter to the filter to select the degrain function. For example, "avx2:int:opt;". If True, you use your new function, otherwise the existing SSE2 function. That way you know you're not comparing the speed of -O2 versus -O3.

MonoS · 2016-06-23T21:46:25Z

I already have the extra parameters.
The O3 version use autovectorization by the compiler, i need to recompile it with just O2 to compare against the c version completely or i'll compare against an avx2 version generated by the compiler.
Also i need to remove any other SSE2 version i created during the process.

I'll let you know, anyway the code is on my repo, if you have a machine with avx2 (Hashwell or later) you can try it, IIRC @HolyWu should have one if they is willing to do.

fuchanghao · 2016-08-11T07:53:25Z

now new fftw 3.3.5 have AVX2 optimization.

hydra3333 mentioned this issue Aug 11, 2016

query rdp/ffmpeg-windows-build-helpers#154

Closed

AVX2 optimization #20

AVX2 optimization #20

Comments

MonoS commented Nov 4, 2015

dubhater commented Nov 4, 2015

MonoS commented Nov 14, 2015

dubhater commented Dec 9, 2015

MonoS commented Dec 9, 2015

MonoS commented Dec 13, 2015

dubhater commented Dec 13, 2015

MonoS commented Dec 13, 2015

dubhater commented Dec 13, 2015

MonoS commented Dec 13, 2015

MonoS commented Dec 13, 2015

MonoS commented Jun 3, 2016

dubhater commented Jun 3, 2016

MonoS commented Jun 3, 2016

HolyWu commented Jun 4, 2016

MonoS commented Jun 5, 2016 • edited Loading

MonoS commented Jun 18, 2016 • edited Loading

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016

dubhater commented Jun 18, 2016

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016 • edited Loading

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016 • edited Loading

dubhater commented Jun 18, 2016

MonoS commented Jun 18, 2016 • edited Loading

MonoS commented Jun 22, 2016

dubhater commented Jun 23, 2016

MonoS commented Jun 23, 2016

dubhater commented Jun 23, 2016

MonoS commented Jun 23, 2016

fuchanghao commented Aug 11, 2016

MonoS commented Jun 5, 2016 •

edited

Loading

MonoS commented Jun 18, 2016 •

edited

Loading

MonoS commented Jun 18, 2016 •

edited

Loading

MonoS commented Jun 18, 2016 •

edited

Loading

MonoS commented Jun 18, 2016 •

edited

Loading