
Question about the proper usage in combination with multi-threaded BLAS libraries. #8

grisuthedragon opened this issue Mar 2, 2023 · 8 comments
Labels
question Further information is requested

Comments

@grisuthedragon

I am currently running some experiments on how to integrate StarPU into my algorithms and accelerate my code with it. In the process, "old" code gets mixed with the StarPU-enabled algorithms. The old code uses a multi-threaded BLAS (OpenBLAS with OpenMP support, threaded MKL, or threaded ESSL). But once the StarPU-enabled algorithms start, the threaded BLAS causes huge performance problems for that part of the code. Is there a proper way to handle the case where both the surrounding code and the StarPU algorithms rely on BLAS?

@sthibaul
Collaborator

sthibaul commented Mar 2, 2023

You would indeed need to tell the BLAS library at runtime to switch between its parallel and single-threaded implementations. Unfortunately, the details depend on the BLAS library (perhaps openblas_set_num_threads for OpenBLAS; I don't know whether it can be called in the middle of the execution).

@sthibaul
Collaborator

sthibaul commented Mar 2, 2023

Also, you'll probably want to use starpu_pause() and starpu_resume() to separate the StarPU and non-StarPU parts.
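A minimal sketch of how these two suggestions could fit together, assuming an OpenBLAS build that exposes openblas_get_num_threads()/openblas_set_num_threads(); legacy_blas_code() and submit_starpu_tasks() are hypothetical placeholders for the old multi-threaded code and the StarPU part:

```c
#include <starpu.h>
#include <cblas.h>   /* OpenBLAS header; declares openblas_*_num_threads() */

extern void legacy_blas_code(void);     /* hypothetical: old BLAS-heavy code */
extern void submit_starpu_tasks(void);  /* hypothetical: the StarPU section */

void run_mixed(void)
{
    int full = openblas_get_num_threads();  /* remember the parallel setting */

    /* StarPU section: each task should see a single-threaded BLAS */
    openblas_set_num_threads(1);
    submit_starpu_tasks();
    starpu_task_wait_for_all();

    /* Non-StarPU section: park StarPU workers, restore parallel BLAS */
    starpu_pause();
    openblas_set_num_threads(full);
    legacy_blas_code();
    starpu_resume();
}
```

Note that whether the openblas_set_num_threads(1) call made in the main thread actually reaches the worker threads depends on the OpenBLAS threading backend.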

@nfurmento added the question (Further information is requested) label Mar 2, 2023
@grisuthedragon
Author

grisuthedragon commented Mar 3, 2023

I ran some experiments with the example/mult code and OpenBLAS-OpenMP, but I did not arrive at a working solution.
I added:

   int oldth = omp_get_max_threads();
   omp_set_num_threads(1); 

before the starpu code starts, and changed the check to

if (check) {
        starpu_pause();
        omp_set_num_threads(oldth);
        start = starpu_timing_now();
        ret = check_output();
        end = starpu_timing_now();
        timing = end - start;
        PRINTF("%u\t%u\t%u\t%.0f\t%.1f\n", xdim, ydim, zdim, timing/1000.0, flops/1000.0);
        starpu_resume();
}

I did the same with openblas_get/set_num_threads as well.

Here are the results of my experiments (32 physical cores, IBM Power 9, gcc 8.x, OpenBLAS-current)

$ STARPU_NCUDA=0 OMP_NUM_THREADS=1 ./dgemm -size 4096 -check -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	456	301.6
Results are OK
4096	4096	4096	5984	23.0

which looks reasonable for one OpenMP thread. With two threads it already gets strange:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=2 ./dgemm -size 4096 -check -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	7590	18.1
Results are OK
4096	4096	4096	3032	45.3

but the performance of the check, which calls BLAS directly, seems to scale. With 4 threads we get:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=4 ./dgemm -size 4096 -check -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	21280	6.5
Results are OK
4096	4096	4096	1523	90.3

Going to 32 threads, we have:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=32 ./dgemm -size 4096 -check -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	222405	0.6
Results are OK
4096	4096	4096	609	225.7

It seems that the approach of setting the number of OpenBLAS/OpenMP threads does not work.

Edit:
Some more experiments. On a 6-Core x86_64 with OpenBLAS-OpenMP we get:

$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	931	147.6
Results are OK
4096	4096	4096	2511	54.7

$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	2963	46.4
Results are OK
4096	4096	4096	1292	106.3

$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	3796	36.2
Results are OK
4096	4096	4096	729	188.6

On the same machine with Intel oneAPI-mkl (OpenMP Threading) and mkl_set/get_num_threads :

$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	905	151.9
Results are OK
4096	4096	4096	2382	57.7

$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	874	157.3
Results are OK
4096	4096	4096	1302	105.6

$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x	y	z	ms	GFlops
4096	4096	4096	942	145.8
Results are OK
4096	4096	4096	694	198.1

With OpenBLAS built against pthreads and openblas_get/set_num_threads, it works.

@sthibaul
Collaborator

sthibaul commented Mar 3, 2023

With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop):

$ OPENBLAS_NUM_THREADS=1 ./sgemm -size 8192 -check -iter 1
# x	y	z	ms	GFlop/s
8192	8192	8192	4839	227.2
Results are OK
8192	8192	8192	10392	105.8
$ OPENBLAS_NUM_THREADS=4 ./sgemm -size 8192 -check -iter 1   
# x	y	z	ms	GFlop/s
8192	8192	8192	4688	234.6
Results are OK
8192	8192	8192	4703	233.8

(with openblas, omp_set_num_threads doesn't seem effective)

I can indeed see in top that the CPU% usage during the check matches the openblas_set_num_threads call made just before it.

@grisuthedragon
Author

grisuthedragon commented Mar 3, 2023

With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop):

Which threading backend does your OpenBLAS library use? It seems it uses pthreads. Using the pthreads variant is not an option for me, since some of the old algorithms use OpenMP and thus clash with the pthread threading.

@sthibaul
Collaborator

sthibaul commented Mar 3, 2023

ah, yes it was the pthreads variant

@sthibaul
Collaborator

sthibaul commented Mar 3, 2023

With the OpenMP variant I indeed get the same kind of erroneous behavior. Apparently it is due to the OpenMP implementation keeping separate num_threads ICVs (internal control variables) in the different StarPU threads. I.e., the openblas_set_num_threads(1) call needs to be made in each StarPU worker thread. That can be done in the tasks themselves before the gemm call, or once for all with starpu_execute_on_each_worker.
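A sketch of the starpu_execute_on_each_worker approach, assuming the OpenBLAS prototype shown below (this is an illustration, not a definitive implementation):

```c
#include <starpu.h>

extern void openblas_set_num_threads(int);  /* assumed OpenBLAS prototype */

/* Runs inside each worker thread, so it sets that thread's own
 * OpenMP num_threads ICV rather than the main thread's copy. */
static void set_blas_sequential(void *arg)
{
    (void)arg;
    openblas_set_num_threads(1);
}

static void make_worker_blas_sequential(void)
{
    /* Once, after starpu_init() and before submitting tasks:
     * run the callback on every CPU worker. */
    starpu_execute_on_each_worker(set_blas_sequential, NULL, STARPU_CPU);
}
```

Since the ICV is per-thread, doing this once after initialization covers all subsequent tasks, as long as nothing inside the tasks raises the thread count again.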

@grisuthedragon
Author

grisuthedragon commented Mar 3, 2023

That is a workaround... but it helps. I figured out that OpenBLAS, when using OpenMP, resets the number of threads to the maximum given by omp_get_max_threads: the internal variables are set correctly by openblas_set_num_threads, but as soon as a threaded BLAS routine is called, the value resets. I will file a bug report there. For StarPU, it would be nice to have a section in the documentation about StarPU algorithms and multi-threaded BLAS libraries.

See: OpenMathLib/OpenBLAS#3933
