threads vs elements when using the OpenMP 4.0 backend #156

Open
fwyzard opened this issue Feb 18, 2020 · 12 comments

@fwyzard (Contributor) commented Feb 18, 2020

It looks like cupla does not swap the number of threads and elements when using the OpenMP 4.0 backend.

Using alpaka directly, with the swap explicitly in place:

Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 532.66 us

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 283.06 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 211.79 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 632.7 us

Using cupla:

Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us

The much larger time observed with the OpenMP 4.0 backend is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.
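For reference, here is a minimal sketch (not the actual benchmark code) of what the explicit swap looks like when driving alpaka directly; the dimension and index types, the `oneThreadPerBlock` flag, and the literal sizes are assumptions for illustration, written against the alpaka 0.4-era namespaces.

```cpp
// Minimal sketch of the explicit threads/elements swap (assumed, not the
// code used for the measurements above); alpaka 0.4-style namespaces.
#include <alpaka/alpaka.hpp>
#include <cstdint>

int main()
{
    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::uint32_t;
    using Vec = alpaka::vec::Vec<Dim, Idx>;

    // On CPU backends that execute one thread per block (serial, TBB,
    // OpenMP 2 blocks), the requested "threads" are mapped to elements.
    constexpr bool oneThreadPerBlock = true;  // derived from the accelerator in real code

    Vec const blocksPerGrid{Idx{71}};
    Vec const threadsPerBlock{oneThreadPerBlock ? Idx{1} : Idx{512}};
    Vec const elementsPerThread{oneThreadPerBlock ? Idx{512} : Idx{1}};

    auto const workDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>{
        blocksPerGrid, threadsPerBlock, elementsPerThread};

    // workDiv would then be passed to the kernel task creation.
    static_cast<void>(workDiv);
    return 0;
}
```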

@fwyzard (Contributor, Author) commented Feb 18, 2020

Note: the printout when using cupla is misleading.
Since the swap between threads and elements is hidden from the caller, what is printed are the values passed by the caller, not the ones actually used by the backend.

@sbastrakov (Member) commented

Thanks for reporting, @fwyzard.
I believe this is indeed a cupla bug, not an alpaka one. It is probably caused by `using AccThreadSeq = ::alpaka::acc::AccCpuOmp4;` not being defined for this configuration here. For comparison, the OpenMP 2 blocks accelerator has different logic.

I will open a PR right away; hopefully it resolves the problem.
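For context, a hedged sketch of the kind of alias being referred to; the exact header, the surrounding configuration code, and the `cupla::KernelDim` / `cupla::IdxType` template parameters are assumptions here, not verbatim cupla source.

```cpp
// Hypothetical sketch of the alias discussed above, NOT actual cupla code;
// the template parameters are placeholders for cupla's dimension/index types.
namespace cupla
{
    // Accelerator used for the "sequential threads" code path, i.e. when the
    // caller's threads-per-block value is remapped to elements per thread.
    using AccThreadSeq = ::alpaka::acc::AccCpuOmp4<
        cupla::KernelDim,
        cupla::IdxType
    >;
} // namespace cupla
```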

sbastrakov added a commit to sbastrakov/cupla that referenced this issue Feb 19, 2020
This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@sbastrakov (Member) commented Feb 19, 2020

PR #157 submitted; hopefully this helps, @fwyzard.

sbastrakov added a commit to sbastrakov/cupla that referenced this issue Feb 19, 2020
…_KERNEL_OPTI

This is now consistent with the Omp2Blocks backend, fixes alpaka-group#156
@psychocoderHPC (Member) commented

@fwyzard What is the test case you used for the performance measurement?

@fwyzard (Contributor, Author) commented Feb 19, 2020 via email

@sbastrakov (Member) commented Feb 20, 2020

Hello @fwyzard. I actually think my fix in #157 is wrong and should not be merged; it just happened to provide a workaround for your case.

Since your kernels explicitly use the alpaka element level, it probably makes sense to call them from cupla with CUPLA_KERNEL_ELEM rather than CUPLA_KERNEL_OPTI. Then you have explicit control over the whole blocks-threads-elements configuration, and cupla will not change it under the hood. It is briefly described here. Edit: in the code, the comment on the class called from inside this macro looks outdated; we will fix it. The macro should work as described in the docs.
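For illustration, a hedged sketch of such an explicit launch (assuming the launch-parameter order given in the cupla porting guide: grid size, block size, element size, shared memory bytes, stream); the kernel name, its arguments, and the stream are placeholders, not taken from the issue.

```cpp
// Hypothetical explicit element-level launch; MyKernel, stream and the
// kernel arguments are placeholders. The parameter order assumes the
// cupla porting guide: (grid, block, elements, shared mem bytes, stream).
dim3 const blocksPerGrid(71);
dim3 const threadsPerBlock(1);      // one thread per block on the CPU backends
dim3 const elementsPerThread(512);  // work items handled by each thread

CUPLA_KERNEL_ELEM(MyKernel)(
    blocksPerGrid,
    threadsPerBlock,
    elementsPerThread,
    0,        // dynamic shared memory in bytes
    stream    // cuplaStream_t
)(devInput, devOutput, numModules);
```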

@psychocoderHPC (Member) commented

> Edit: in the code, the comment on the class called from inside this macro looks outdated; we will fix it. The macro should work as described in the docs.

This will be fixed by #159.

@fwyzard (Contributor, Author) commented Feb 20, 2020

> Since your kernels explicitly use the alpaka element level, it probably makes sense to call them from cupla with CUPLA_KERNEL_ELEM rather than CUPLA_KERNEL_OPTI. Then you have explicit control over the whole blocks-threads-elements configuration, and cupla will not change it under the hood.

OK, so what you are saying is that the launch parameters need to be optimised specifically for the OpenMP 4 backend, in terms of blocks, threads and elements.

@sbastrakov (Member) commented Feb 20, 2020

@fwyzard no, sorry for the confusing message. While your statement may well be true, it is not what I wanted to express.

I meant merely the technical choice between CUPLA_KERNEL_OPTI and CUPLA_KERNEL_ELEM. For your case, with the kernel explicitly using the element level (which seems perfectly reasonable to me), CUPLA_KERNEL_ELEM is the better fit. Then you can explicitly set a launch configuration with 1 thread per block and 512 (or however many you want) elements per thread, and you do not have to rely on cupla switching threads per block and elements per thread (or know for which accelerators it performs the switch and for which it does not). CUPLA_KERNEL_OPTI is somewhat of a workaround, and I am not sure why your application has to use it.

The underperformance of OpenMP 4 seems to be a separate issue, which we need to look into.

@fwyzard (Contributor, Author) commented Feb 20, 2020

Mhm, now I'm slightly confused.

I understood from the porting guide that using the element level is already a requirement in order to use CUPLA_KERNEL_OPTI.
Then CUPLA_KERNEL_OPTI(blocks, threads) would simplify the launch configuration and hide the extra level, mapping it to (blocks, actual_threads, actual_elements) as:

  • (blocks, threads, 1) for GPU-like backends
  • (blocks, 1, threads) for CPU-like backends

I can use CUPLA_KERNEL_ELEM directly and take care of the threads-vs-elements choice in the application (just as I do when using Alpaka directly), but doesn't that defeat the purpose of having CUPLA_KERNEL_OPTI in the first place?

I do understand that CUPLA_KERNEL_OPTI is just a simplification - but it's a nice and useful one :-)
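To make the two bullets above concrete, here is a hedged sketch of the same launch expressed with CUPLA_KERNEL_OPTI; as before, the kernel name, its arguments, and the stream are placeholders.

```cpp
// Hypothetical CUPLA_KERNEL_OPTI launch; MyKernel, stream and the arguments
// are placeholders. Only (blocks, threads) is specified by the caller:
//   GPU-like backends run it as (blocks, threads, 1 element per thread),
//   CPU-like backends as (blocks, 1 thread, `threads` elements per thread).
CUPLA_KERNEL_OPTI(MyKernel)(
    dim3(71),   // blocks per grid
    dim3(512),  // requested threads per block (may be remapped to elements)
    0,          // dynamic shared memory in bytes
    stream      // cuplaStream_t
)(devInput, devOutput, numModules);
```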

@sbastrakov (Member) commented Feb 20, 2020

You are right, this is the idea behind CUPLA_KERNEL_OPTI. However, the actual criterion for whether it switches or not is somewhat arbitrary, and therefore not intuitive from the user side (it was not clear even to me). You also initially ran into it with Omp4: since the switching was done under the hood, and not explicitly on your application side, it was not even immediately clear whether the switching did not work, or whether it did work but some other issue made Omp4 extremely slow (it is slow anyhow, as your runs show, but on another scale).

The switching is currently based on whether a backend supports only 1 thread per block or not. This is why yesterday's PR #157 enabled the switching for the OpenMP 4 backend, but that is probably not a desirable change, since the backend actually supports multiple threads per block as well. So, in my opinion, CUPLA_KERNEL_OPTI is more of a workaround than a simplification and is not very intuitive, but that is of course subjective and I understand your point of view. In any case, I think we need to extend the documentation to explain this clearly.

@fwyzard (Contributor, Author) commented Feb 20, 2020 via email
