threads vs elements when using the OpenMP 4.0 backend #156
Comments
Note: the printout when using cupla is confusing.
Thanks for reporting, @fwyzard. I will provide a PR right away; hopefully it resolves the problem.
This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
…_KERNEL_OPTI This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@fwyzard What test case did you use for the performance measurement?
Hello @fwyzard. I think my fix in #157 is actually wrong and should not be merged; it just happened to provide a workaround for your case. Since your kernels explicitly use the alpaka element level, it probably makes sense to call them from cupla using
Will be fixed with #159.
OK, so what you are saying is that the launch parameters need to be optimised specifically for the OpenMP 4 backend, in terms of blocks, threads and elements.
@fwyzard no, sorry for the confusing message. While your statement may well be true, it is not what I wanted to express. I meant merely the technical choice between The underperformance of OpenMP 4 seems to be another issue, which we need to look into.
Mhm, now I'm slightly confused. I understood from the porting guide that using the element level is already a requirement in order to use
I can use I do understand that |
You are right, this is the idea behind The switching vs. non-switching is currently based on whether or not a backend works only with 1 thread per block. This is why yesterday's PR #157 enabled the switching for the OpenMP 4 backend, but this is probably not a desired change, since the backend actually supports multiple threads per block as well. Thus, in my opinion,
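The rule described in this comment (swap only for backends that run a single thread per block) can be sketched in plain C++ as below. This is an illustration only, not cupla's or alpaka's actual code: `LaunchConfig`, `GpuLikeBackend`, `SerialBlockBackend`, `singleThreadPerBlock` and `adaptLaunch` are all hypothetical names made up for the example.

```cpp
#include <cstddef>

// Hypothetical launch description (NOT an alpaka/cupla type).
struct LaunchConfig {
    std::size_t blocks;
    std::size_t threads;   // threads per block
    std::size_t elements;  // elements per thread
};

// Hypothetical backend tags. The flag models "this backend works only
// with 1 thread per block", the criterion mentioned in the comment above.
struct GpuLikeBackend     { static constexpr bool singleThreadPerBlock = false; };
struct SerialBlockBackend { static constexpr bool singleThreadPerBlock = true;  };

// Swap threads and elements only when the backend is limited to one
// thread per block; otherwise pass the configuration through unchanged.
template <typename Backend>
constexpr LaunchConfig adaptLaunch(LaunchConfig c) {
    return Backend::singleThreadPerBlock
               ? LaunchConfig{c.blocks, c.elements, c.threads}
               : c;
}
```

Under this rule, a backend that also supports multiple threads per block (as the comment says of OpenMP 4) would fall into the pass-through branch, which is why the caller would have to request the swap explicitly.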
Ah, I understand the reasoning now - thanks.
It looks like cupla does not swap the number of threads and elements when using the OpenMP 4.0 backend.
Using alpaka directly, with the swap explicitly in place:
Using cupla:
The much larger time observed with the OpenMP 4.0 backend is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.
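For readers unfamiliar with the swap being discussed, here is a minimal sketch in plain C++. It assumes a CUDA-style launch of N threads per block with 1 element per thread, which the swap turns into 1 thread per block with N elements per thread. `WorkDiv`, `swapThreadsAndElements` and `workPerBlock` are hypothetical names used only for illustration; they are not part of the alpaka or cupla API.

```cpp
#include <cstddef>

// Hypothetical work division (NOT the alpaka or cupla type).
struct WorkDiv {
    std::size_t blocks;
    std::size_t threads;   // threads per block
    std::size_t elements;  // elements per thread
};

// Exchange the thread and element counts, leaving the block count alone.
constexpr WorkDiv swapThreadsAndElements(WorkDiv w) {
    return WorkDiv{w.blocks, w.elements, w.threads};
}

// Amount of work handled by one block; unchanged by the swap.
constexpr std::size_t workPerBlock(WorkDiv w) {
    return w.threads * w.elements;
}
```

The work per block is the same before and after the swap; only its split between parallel threads and the per-thread element loop changes.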