threads vs elements when using the OpenMP 4.0 backend #156

Open
fwyzard opened this issue Feb 18, 2020 · 12 comments

@fwyzard (Contributor) commented Feb 18, 2020

It looks like cupla does not swap the number of threads and elements when using the OpenMP 4.0 backend.

Using alpaka directly, with the swap explicitly in place:

Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 532.66 us

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 283.06 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 211.79 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 632.7 us

Using cupla:

Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us

The much larger time observed with the OpenMP 4.0 backend is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.
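For reference, here is a minimal sketch (not the actual benchmark code) of what the explicit swap looks like when driving alpaka directly; the dimension and index types, the `oneThreadPerBlock` flag, and the literal sizes are assumptions for illustration, written against the alpaka 0.4-era namespaces.

```cpp
// Minimal sketch of the explicit threads/elements swap (assumed, not the
// code used for the measurements above); alpaka 0.4-style namespaces.
#include <alpaka/alpaka.hpp>
#include <cstdint>

int main()
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::uint32_t;
    using Vec = alpaka::vec::Vec<Dim, Idx>;

    // On CPU backends that execute one thread per block (serial, TBB,
    // OpenMP 2 blocks), the requested "threads" are mapped to elements.
    constexpr bool oneThreadPerBlock = true;  // derived from the accelerator in real code

    Vec const blocksPerGrid{Idx{71}};
    Vec const threadsPerBlock{oneThreadPerBlock ? Idx{1} : Idx{512}};
    Vec const elementsPerThread{oneThreadPerBlock ? Idx{512} : Idx{1}};

    auto const workDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>{
        blocksPerGrid, threadsPerBlock, elementsPerThread};

    // workDiv would then be passed to the kernel task creation.
    static_cast<void>(workDiv);
    return 0;
}
```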

@fwyzard (Contributor, Author) commented Feb 18, 2020

Note: the printout when using cupla is misleading.
Since the swap between threads and elements is hidden from the caller, what is printed are the values passed by the caller, not the ones actually used by the backend.

@sbastrakov (Member) commented

Thanks for reporting, @fwyzard.
I believe this is indeed a cupla bug, not an alpaka one. It is probably caused by `using AccThreadSeq = ::alpaka::acc::AccCpuOmp4;` not being defined for this configuration here. For comparison, the OpenMP 2 blocks accelerator has different logic.

I will open a PR right away; hopefully it resolves the problem.
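For context, a hedged sketch of the kind of alias being referred to; the exact header, the surrounding configuration code, and the `cupla::KernelDim` / `cupla::IdxType` template parameters are assumptions here, not verbatim cupla source.

```cpp
// Hypothetical sketch of the alias discussed above, NOT actual cupla code;
// the template parameters are placeholders for cupla's dimension/index types.
namespace cupla
{
    // Accelerator used for the "sequential threads" code path, i.e. when the
    // caller's threads-per-block value is remapped to elements per thread.
    using AccThreadSeq = ::alpaka::acc::AccCpuOmp4<
        cupla::KernelDim,
        cupla::IdxType
    >;
} // namespace cupla
```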

sbastrakov added a commit to sbastrakov/cupla that referenced this issue Feb 19, 2020
This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@sbastrakov (Member) commented Feb 19, 2020

PR #157 submitted; hopefully this helps, @fwyzard.

sbastrakov added a commit to sbastrakov/cupla that referenced this issue Feb 19, 2020
…_KERNEL_OPTI

This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@psychocoderHPC (Member) commented

@fwyzard What is the test case you used for the performance measurement?

@fwyzard (Contributor, Author) commented Feb 19, 2020 via email

@sbastrakov (Member) commented Feb 20, 2020

Hello @fwyzard. I actually think my fix in #157 is wrong and should not be merged; it just happened to provide a workaround for your case.

Since your kernels explicitly use the alpaka element level, it probably makes sense to call them from cupla with CUPLA_KERNEL_ELEM rather than CUPLA_KERNEL_OPTI. Then you have explicit control over the whole blocks-threads-elements configuration, and cupla will not change it under the hood. It is briefly described here. Edit: in the code, the comment on the class called from inside this macro looks outdated; we will fix it. The macro should work as described in the docs.
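For illustration, a hedged sketch of such an explicit launch (assuming the launch-parameter order given in the cupla porting guide: grid size, block size, element size, shared memory bytes, stream); the kernel name, its arguments, and the stream are placeholders, not taken from the issue.

```cpp
// Hypothetical explicit element-level launch; MyKernel, stream and the
// kernel arguments are placeholders. The parameter order assumes the
// cupla porting guide: (grid, block, elements, shared mem bytes, stream).
dim3 const blocksPerGrid(71);
dim3 const threadsPerBlock(1);      // one thread per block on the CPU backends
dim3 const elementsPerThread(512);  // work items handled by each thread

CUPLA_KERNEL_ELEM(MyKernel)(
    blocksPerGrid,
    threadsPerBlock,
    elementsPerThread,
    0,        // dynamic shared memory in bytes
    stream    // cuplaStream_t
)(devInput, devOutput, numModules);
```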

@psychocoderHPC (Member) commented

> Edit: in the code, the comment on the class called from inside this macro looks outdated; we will fix it. The macro should work as described in the docs.

This will be fixed by #159.

@fwyzard (Contributor, Author) commented Feb 20, 2020

> Since your kernels explicitly use the alpaka element level, it probably makes sense to call them from cupla with CUPLA_KERNEL_ELEM rather than CUPLA_KERNEL_OPTI. Then you have explicit control over the whole blocks-threads-elements configuration, and cupla will not change it under the hood.

OK, so what you are saying is that the launch parameters need to be optimised specifically for the OpenMP 4 backend, in terms of blocks, threads and elements.

@sbastrakov (Member) commented Feb 20, 2020

@fwyzard no, sorry for the confusing message. While your statement may well be true, it is not what I wanted to express.

I meant merely the technical choice between CUPLA_KERNEL_OPTI and CUPLA_KERNEL_ELEM. For your case, with the kernel explicitly using the element level (which seems perfectly reasonable to me), CUPLA_KERNEL_ELEM is the better fit. Then you can explicitly set a launch configuration with 1 thread per block and 512 (or however many you want) elements per thread, and you do not have to rely on cupla switching threads per block and elements per thread (or know for which accelerators it performs the switch and for which it does not). CUPLA_KERNEL_OPTI is somewhat of a workaround, and I am not sure why your application has to use it.

The underperformance of OpenMP 4 seems to be a separate issue, which we need to look into.

@fwyzard (Contributor, Author) commented Feb 20, 2020

Mhm, now I'm slightly confused.

I understood from the porting guide that using the element level is already a requirement in order to use CUPLA_KERNEL_OPTI.
Then CUPLA_KERNEL_OPTI(blocks, threads) would simplify the launch configuration and hide the extra level, mapping it to (blocks, actual_threads, actual_elements) as:

  • (blocks, threads, 1) for GPU-like backends
  • (blocks, 1, threads) for CPU-like backends

I can use CUPLA_KERNEL_ELEM directly and take care of the threads-vs-elements choice in the application (just as I do when using Alpaka directly), but doesn't that defeat the purpose of having CUPLA_KERNEL_OPTI in the first place?

I do understand that CUPLA_KERNEL_OPTI is just a simplification - but it's a nice and useful one :-)
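To make the two bullets above concrete, here is a hedged sketch of the same launch expressed with CUPLA_KERNEL_OPTI; as before, the kernel name, its arguments, and the stream are placeholders.

```cpp
// Hypothetical CUPLA_KERNEL_OPTI launch; MyKernel, stream and the arguments
// are placeholders. Only (blocks, threads) is specified by the caller:
//   GPU-like backends run it as (blocks, threads, 1 element per thread),
//   CPU-like backends as (blocks, 1 thread, `threads` elements per thread).
CUPLA_KERNEL_OPTI(MyKernel)(
    dim3(71),   // blocks per grid
    dim3(512),  // requested threads per block (may be remapped to elements)
    0,          // dynamic shared memory in bytes
    stream      // cuplaStream_t
)(devInput, devOutput, numModules);
```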

@sbastrakov (Member) commented Feb 20, 2020

You are right, this is the idea behind CUPLA_KERNEL_OPTI. However, the actual criterion for whether it switches or not is somewhat arbitrary, and therefore not intuitive from the user side (it was not clear even to me). You also initially ran into it with Omp4: since the switching was done under the hood, and not explicitly on your application side, it was not even immediately clear whether the switching did not work, or whether it did work but some other issue made Omp4 extremely slow (it is slow anyhow, as your runs show, but on another scale).

The switching is currently based on whether a backend supports only 1 thread per block or not. This is why yesterday's PR #157 enabled the switching for the OpenMP 4 backend, but that is probably not a desirable change, since the backend actually supports multiple threads per block as well. So, in my opinion, CUPLA_KERNEL_OPTI is more of a workaround than a simplification and is not very intuitive, but that is of course subjective and I understand your point of view. In any case, I think we need to extend the documentation to explain this clearly.

@fwyzard (Contributor, Author) commented Feb 20, 2020 via email
