Question about the proper usage in combination with multi-threaded BLAS libraries. #8
You would indeed need to dynamically tell the BLAS library to switch between parallel and single-threaded implementations. The details unfortunately depend on the BLAS library (perhaps openblas_set_num_threads for OpenBLAS; I don't know whether that can be used in the middle of the execution).
Also, you'll probably want to use starpu_pause() and starpu_resume() to separate the StarPU and non-StarPU parts.
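A minimal sketch of that pattern, assuming OpenBLAS; other BLAS libraries expose their own call (e.g. mkl_set_num_threads for MKL), and the legacy/StarPU parts are placeholders:

```c
/* Sketch only: alternate between a multi-threaded BLAS region and a StarPU
 * region, assuming OpenBLAS.  Error handling is omitted. */
#include <starpu.h>
#include <cblas.h>   /* OpenBLAS cblas.h also declares openblas_set_num_threads() */

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;

    /* Old, non-StarPU code: park the StarPU workers and let OpenBLAS
     * parallelize internally. */
    starpu_pause();
    openblas_set_num_threads(starpu_cpu_worker_get_count());
    /* ... legacy multi-threaded cblas_dgemm() calls ... */

    /* StarPU part: one BLAS thread per worker to avoid oversubscription. */
    openblas_set_num_threads(1);
    starpu_resume();
    /* ... submit StarPU tasks whose kernels call sequential BLAS ... */
    starpu_task_wait_for_all();

    starpu_shutdown();
    return 0;
}
```

As the rest of the thread shows, with the OpenMP build of OpenBLAS a single openblas_set_num_threads(1) call in the main thread is not sufficient on its own.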
I did some experiments with setting the number of OpenBLAS threads before the StarPU code starts and changing it back afterwards (I did the same with the OpenMP thread count). Here are the results of my experiments (32 physical cores, IBM Power 9, gcc 8.x, OpenBLAS-current): with one OpenMP thread the timings look reasonable, but using two threads it already gets strange, even though the performance of the check, which calls BLAS directly, seems to scale. The picture is the same with 4 threads and with 32 threads. It seems that the approach of setting the number of OpenBLAS/OpenMP threads does not work.
Edit: On the same machine with Intel oneAPI MKL (OpenMP threading) and … Using OpenBLAS with pthreads and openblas_set/get_num_threads it works.
With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop). (With openblas, omp_set_num_threads doesn't seem effective.) I indeed see in top that the CPU% usage during the check matches the openblas_set_num_threads call just before it.
Which threading model does your OpenBLAS library use? It seems that it uses pthreads. Using the pthreads version is not possible for me, since some of the old algorithms use OpenMP and thus get in trouble with the pthread threading.
Ah, yes, it was the pthreads variant.
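For reference, the threading model of the linked OpenBLAS can be checked at runtime; this is a sketch assuming the openblas_get_parallel() query exported by OpenBLAS (0 = sequential, 1 = pthreads, 2 = OpenMP):

```c
/* Sketch: report which threading model the linked OpenBLAS was built with. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    switch (openblas_get_parallel()) {
    case 0:  puts("OpenBLAS: sequential build"); break;
    case 1:  puts("OpenBLAS: pthreads build");   break;
    case 2:  puts("OpenBLAS: OpenMP build");     break;
    default: puts("OpenBLAS: unknown threading model"); break;
    }
    return 0;
}
```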
With the openmp variant I indeed get the same kind of erroneous behavior. Apparently it's due to the OpenMP implementation using separate num-threads ICVs in the different StarPU threads, i.e. the openblas_set_num_threads(1) call should be done in each StarPU worker thread. That can be done in the tasks themselves before the gemm call, or just once for all workers (see the sketch below).
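A sketch of the once-for-all-workers variant; using starpu_execute_on_each_worker() for it is my reading of the suggestion, not necessarily the exact call the comment had in mind:

```c
/* Sketch: set the BLAS thread count once in every StarPU worker thread, so
 * the per-thread OpenMP num-threads ICV is correct everywhere, not only in
 * the main thread.  Assumes the OpenMP build of OpenBLAS. */
#include <starpu.h>
#include <cblas.h>

static void set_blas_sequential(void *arg)
{
    (void)arg;
    openblas_set_num_threads(1);   /* only affects the calling thread's ICV */
}

void init_blas_in_workers(void)
{
    /* Call once after starpu_init(), before submitting tasks. */
    starpu_execute_on_each_worker(set_blas_sequential, NULL, STARPU_CPU);
}
```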
That is a workaround... but it helps. I figured out that OpenBLAS, when using OpenMP, resets the number of threads to the maximum number of threads given by …
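For robustness against such resets, the other option mentioned above, setting the thread count in the task itself right before the gemm call, can be used. A sketch of such a StarPU CPU kernel; the three-buffer, square-matrix dgemm layout is only illustrative:

```c
/* Sketch: force sequential BLAS right before the gemm, so a later reset of
 * the thread count by OpenBLAS/OpenMP does not matter.  Assumes square
 * matrices registered through the StarPU matrix interface. */
#include <starpu.h>
#include <cblas.h>

static void gemm_cpu_func(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const double *a = (const double *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const double *b = (const double *)STARPU_MATRIX_GET_PTR(buffers[1]);
    double *c       = (double *)STARPU_MATRIX_GET_PTR(buffers[2]);
    int n   = (int)STARPU_MATRIX_GET_NX(buffers[0]);
    int lda = (int)STARPU_MATRIX_GET_LD(buffers[0]);
    int ldb = (int)STARPU_MATRIX_GET_LD(buffers[1]);
    int ldc = (int)STARPU_MATRIX_GET_LD(buffers[2]);

    openblas_set_num_threads(1);   /* keep BLAS sequential inside this task */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, lda, b, ldb, 0.0, c, ldc);
}
```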
Original question:
I am currently performing some experiments on how to integrate StarPU into my algorithms and accelerate my code with it. In the process, "old" code gets mixed with the StarPU-enabled algorithms. The old code uses a multi-threaded BLAS (OpenBLAS with OpenMP support, threaded MKL, or threaded ESSL), but when the StarPU-enabled algorithms start, the threaded BLAS causes huge performance issues for this part of the code. Is there a proper way to handle the case where the surrounding code as well as the StarPU algorithms rely on BLAS?