optimization on matrix-matrix multiplication #172

Fritzkefit · 2015-12-02T22:08:03Z

Introduces a BLIS-like approach: a macro-kernel iterates over blocks of the matrices being multiplied (block sizes depend on cache sizes and the number of available threads), and a micro-kernel executes the actual multiplication. Two micro-kernels are included: one using the AVX instruction set, the other using the SSE instruction set.
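For readers unfamiliar with the macro-kernel/micro-kernel split, here is a minimal sketch of the loop structure. It is not the code from this PR: the block sizes `MC`, `KC`, `NC`, `MR`, `NR` and the function names are hypothetical, packing of A and B into contiguous buffers is omitted, and a scalar loop stands in for the AVX/SSE micro-kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical block sizes; the PR derives its own from cache sizes and thread count.
constexpr std::size_t MC = 64, KC = 64, NC = 128; // macro-kernel blocks
constexpr std::size_t MR = 4, NR = 4;             // micro-kernel tile

// Scalar stand-in for the AVX/SSE micro-kernels: C_tile += A_panel * B_panel.
// All operands are row-major with leading dimensions lda, ldb, ldc.
inline void micro_kernel(const double* A, const double* B, double* C,
                         std::size_t mr, std::size_t nr, std::size_t kc,
                         std::size_t lda, std::size_t ldb, std::size_t ldc)
{
  double acc[MR][NR] = {}; // accumulators a SIMD kernel would keep in registers
  for (std::size_t p = 0; p < kc; ++p)
    for (std::size_t i = 0; i < mr; ++i)
      for (std::size_t j = 0; j < nr; ++j)
        acc[i][j] += A[i * lda + p] * B[p * ldb + j];
  for (std::size_t i = 0; i < mr; ++i)
    for (std::size_t j = 0; j < nr; ++j)
      C[i * ldc + j] += acc[i][j];
}

// Macro-kernel: C (m x n) += A (m x k) * B (k x n), all row-major.
void blocked_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C,
                  std::size_t m, std::size_t n, std::size_t k)
{
  for (std::size_t jc = 0; jc < n; jc += NC) {       // column blocks of B and C
    const std::size_t nc = std::min(NC, n - jc);
    for (std::size_t pc = 0; pc < k; pc += KC) {     // blocks along the inner dimension
      const std::size_t kc = std::min(KC, k - pc);
      for (std::size_t ic = 0; ic < m; ic += MC) {   // row blocks of A and C
        const std::size_t mc = std::min(MC, m - ic);
        for (std::size_t jr = 0; jr < nc; jr += NR)
          for (std::size_t ir = 0; ir < mc; ir += MR)
            micro_kernel(&A[(ic + ir) * k + pc],
                         &B[pc * n + jc + jr],
                         &C[(ic + ir) * n + jc + jr],
                         std::min(MR, mc - ir), std::min(NR, nc - jr), kc,
                         k, n, n);
      }
    }
  }
}
```

Calling `blocked_gemm(A, B, C, m, n, k)` on a zero-initialized `C` yields the plain product. Edge tiles smaller than `MR` x `NR` are handled here by passing the reduced tile size to the micro-kernel.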

…loads/stores, further bug fixes. Renamed "get_cache_sizes.hpp" to "get_block_sizes.hpp". get_block_sizes() is called every time prod() is invoked, since mr/nr have to be assigned dynamically: they depend on whether float or double entries are processed. The AVX micro-kernels work for both double and float; the approach taken for float entries differs from that for doubles due to limitations of the AVX instructions.
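To illustrate why mr/nr depend on the element type: a 256-bit AVX register holds 8 floats but only 4 doubles. The sketch below expresses that dependency as a compile-time trait. This is an assumption for illustration only; the PR assigns the values at runtime in get_block_sizes(), and the names `avx_lanes` and `register_block` as well as the value of `mr` are hypothetical.

```cpp
#include <cstddef>

// Number of elements of type T that fit in one 256-bit (32-byte) AVX register.
template <typename T>
constexpr std::size_t avx_lanes() { return 32 / sizeof(T); }

// Hypothetical register-block shape: nr spans one vector register,
// mr is a fixed number of rows. The PR's actual values may differ.
template <typename T>
struct register_block
{
  static constexpr std::size_t nr = avx_lanes<T>();
  static constexpr std::size_t mr = 4;
};
```

With this trait, `register_block<float>::nr` and `register_block<double>::nr` differ by a factor of two, which is the reason a single hard-coded tile size cannot serve both types.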

The inline assembler in get_cache_sizes() gets a pointer to an array, which is expected to be in %rdi. This was the only way I could get it to work properly, as specifying input/output operands would yield segfaults. Therefore, the inline assembler is in a separate function and relies on the standard register for the first function argument (i.e. %rdi). I do not know if this could cause problems on other systems => needs to be tested.
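For comparison, here is a sketch of a CPUID wrapper that uses GCC/Clang extended-asm operands instead of relying on %rdi. The cause of the segfaults mentioned above is not known from this PR; one plausible cause (an assumption) is that CPUID overwrites EAX, EBX, ECX and EDX, and the compiler must be told about all four. The function name `query_cpuid` is hypothetical.

```cpp
#include <cstdint>

// Executes CPUID for the given leaf/subleaf and stores EAX, EBX, ECX, EDX
// in regs[0..3]. Returns false where the instruction is not available.
inline bool query_cpuid(std::uint32_t leaf, std::uint32_t subleaf,
                        std::uint32_t regs[4])
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  // All four registers are declared as outputs, so the compiler knows
  // they are overwritten. "0" and "2" tie the inputs to EAX and ECX.
  __asm__ volatile("cpuid"
                   : "=a"(regs[0]), "=b"(regs[1]), "=c"(regs[2]), "=d"(regs[3])
                   : "0"(leaf), "2"(subleaf));
  return true;
#else
  // Non-x86 targets or other compilers: report failure with zeroed output.
  (void)leaf;
  (void)subleaf;
  regs[0] = regs[1] = regs[2] = regs[3] = 0;
  return false;
#endif
}
```

Because the register assignment is expressed through constraints, this version does not depend on the calling convention and can be inlined.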

…thoroughly tested. CPUID info can be obtained through cpuid-leaf2 or cpuid-leaf4 on Intel CPUs; which leaf to use depends on the CPU. Both have been implemented, and leaf2 works correctly on a Core 2 Quad Q9400. Further thorough testing and double checking of the huge switch-case for leaf2 has NOT been done.
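As a reference for the leaf4 path, the following sketch decodes one subleaf of CPUID leaf 4 from raw register values, following the field layout documented for that leaf (cache size = ways x partitions x line size x sets, each stored minus one). The struct and function names are hypothetical and not taken from this PR.

```cpp
#include <cstdint>

struct cache_info
{
  unsigned type;       // 0 = no more caches, 1 = data, 2 = instruction, 3 = unified
  unsigned level;      // cache level (1, 2, 3, ...)
  std::uint64_t bytes; // total cache size in bytes
};

// Decodes one subleaf of CPUID leaf 4 from its raw EAX, EBX, ECX values.
inline cache_info decode_leaf4(std::uint32_t eax, std::uint32_t ebx,
                               std::uint32_t ecx)
{
  cache_info c;
  c.type  = eax & 0x1Fu;        // EAX[4:0]
  c.level = (eax >> 5) & 0x7u;  // EAX[7:5]
  const std::uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1; // EBX[31:22]
  const std::uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1; // EBX[21:12]
  const std::uint64_t line_size  = (ebx & 0xFFFu) + 1;         // EBX[11:0]
  const std::uint64_t sets       = static_cast<std::uint64_t>(ecx) + 1;
  c.bytes = (c.type == 0) ? 0 : ways * partitions * line_size * sets;
  return c;
}
```

A caller would query leaf 4 with subleaf 0, 1, 2, ... and stop at the first entry whose `type` is 0. Keeping the decoding separate from the CPUID call makes it testable on machines without the instruction.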

karlrupp · 2015-12-07T17:07:56Z

Thanks, @Fritzkefit ! For documentation purposes: In a face-to-face discussion we agreed that I'll take care of resolving the merge conflicts.

Fritzkefit and others added 30 commits September 4, 2015 22:50

added product test in ./tests viennacl/, fixed some packaging bugs, mat-mat-mul seems to work now

ab756df

got rid of the wrappers for A and B, switched offsets for transposed case in packing.hpp, mat-mat-mul seems to work now

a8789e6

Altered packing and blocking to be ready for BLIS-micro-kernels. Also added a framework for reading cache sizes, used to calculate block sizes (not yet implemented).

2a85254

Fixed a lot of bugs, still some remaining, "avx_prod_test 1 513 513" fails

e304016

transferred aligned-buffer-functions to their own file (aligned_buffer.hpp), made sure standard-microkernel is working

7ca4779

get_block_sizes() now tries to read the cpuid and set the cache sizes accordingly, unfortunately fails due to segfaults on amd-systems, intel-systems untested

61e080d

added #ifdef to include correct micro-kernel, reading cache on intel cpus fails

b306a14

fixed switch case in set_cache_intel()

fcbfa30

updated benchmarks and file structure in tests\ viennacl

8ab70e5

updated microkernel

0d71ac1

added sse kernel

ad7e436

extended MR_D block size for avx_micro_kernel<double>()

f401acd

Merge branch 'gemmopt-avx' of https://github.com/Fritzkefit/viennacl-dev into gemmopt-avx

6886625

fixed get_cache_intel_leaf4()

ed01ae4

adjusted how block sizes are calculated

7c989c2

quick tests did not show any performance impacts

nothing functional changed, switching systems that's why commit/push

17c7469

extended the benchmarks

b2dd9fb

parallel for around first loop in macro-kernel, nc is at least number of available threads

e1d658d

deleted comments containing debug-code and added minimal doxygen descriptions

5f97487

fixed inline assembler nonsense, swapped 'get_aligned_buffer()' for memory_create() etc., fixed underflows when calculating num_of_blocks.. and num_residue_slivers..

7915cc3

removed DEBUG comments and moved memory_create() of buffer_C to the other memory_create()s (buffer_A/B)

60009d8

forgot commented free()s after macro-kernel

aa6fd76

deleted test files, align_buffer.hpp (not needed anymore) and its include in matrix_operations.hpp

656e842

added option to use posix_memalign() instead of aligned_alloc(), defined L1/2/3_AVX/SSE_DENOMs to quickly change what fraction of cache should be filled with the blocks

4d8ee48

few cleanups on comments

36e5d70

CMake tests now use AVX/SSE

8ac081c

add min matrix-size for omp loops

949106a
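Commits 4d8ee48 and 7915cc3 deal with sizing blocks as a fraction of a cache and with counting blocks without unsigned underflow. The sketch below shows one way both could look. It is an assumption, not the PR's code: `L1_DENOM`, `choose_kc`, `num_blocks` and `split_dimension` are hypothetical names, and the choice of fitting a kc x nr micro-panel into L1 follows the general BLIS scheme rather than anything confirmed here.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical denominator: fill at most 1/L1_DENOM of the L1 cache.
constexpr std::size_t L1_DENOM = 2;

// Largest kc such that a kc x nr micro-panel of T stays within
// l1_bytes / L1_DENOM. Always returns at least 1.
template <typename T>
std::size_t choose_kc(std::size_t l1_bytes, std::size_t nr)
{
  return std::max<std::size_t>(1, l1_bytes / L1_DENOM / (nr * sizeof(T)));
}

// Ceiling division that never subtracts, so it cannot wrap around for n == 0
// (unlike the common (n - 1) / block + 1 form with unsigned operands).
inline std::size_t num_blocks(std::size_t n, std::size_t block)
{
  return n / block + (n % block != 0 ? 1 : 0);
}

// Full blocks plus the size of the leftover sliver along one dimension.
struct dimension_split
{
  std::size_t full_blocks;
  std::size_t residue;
};

inline dimension_split split_dimension(std::size_t n, std::size_t block)
{
  dimension_split s;
  s.full_blocks = n / block;
  s.residue     = n % block;
  return s;
}
```

Changing `L1_DENOM` rescales the block size in one place, which matches the stated purpose of the DENOM defines. For the 513-sized test case mentioned in commit e304016, a block size of 64 gives 8 full blocks and a residue of 1.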
