We should check and exclude data transfers from timing otherwise tests report unreliable values #635

DiamonDinoia · 2025-02-18T20:50:03Z

Just a tiny text fix. I checked a CMake build on my local RTX A6000.

But it shows a weird timing inconsistency.
E.g., running it by hand:

(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1dspreadinterponly_test 1e7 1e8 1e-3 1e-2 f 0.0
spread-only test 1d:
	100000000 NU pts spread to 10000000 grid in 0.046 s 	2.17e+09 NU pts/s
	rel mass err 0.0002
interp-only test 1d:
	100000000 NU pts interp from 10000000 grid in 0.341 s 	2.93e+08 NU pts/s
	rel sup err 0.000497

It is nice to see 2e9 NU pts/s for spread, which matches the exec-only throughput of the full 1d1 NUFFT. But interp is about 7x slower (0.341 s vs 0.046 s)? Could you look into why that is?

I notice that 1d1 and 1d2 run at about the same speed (though setNUpts takes about 10x longer than exec). Here I use precision d in order to test accuracy, but the speed is the same for f:

(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1d_test 1 1 1e7 1e8 1e-3 1e-2 d 0.0
[time  ] dummy warmup call to CUFFT	 0.00474 s
[time  ] cufinufft plan:		 0.0229 s
[time  ] cufinufft setNUpts:		 0.48 s
[time  ] cufinufft exec:		 0.0525 s
[time  ] cufinufft destroy:		 0.000309 s
[Method 1] 10000000 U pts to 100000000 NU pts in 0.556 s:      1.8e+08 NU pts/s
					(exec-only thoughput: 1.9e+09 NU pts/s)
[gpu   ] one mode: rel err in F[3700000] is 2.69e-05
(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1d_test 1 2 1e7 1e8 1e-3 1e-2 d 0.0
[time  ] dummy warmup call to CUFFT	 0.00385 s
[time  ] cufinufft plan:		 0.0225 s
[time  ] cufinufft setNUpts:		 0.48 s
[time  ] cufinufft exec:		 0.048 s
[time  ] cufinufft destroy:		 0.000335 s
[Method 1] 10000000 U pts to 100000000 NU pts in 0.551 s:      1.81e+08 NU pts/s
					(exec-only thoughput: 2.08e+09 NU pts/s)
[gpu   ] one targ: rel err in c[50000000] is 0.000339

These are all using method=1 as in your new tester.

It's very peculiar that setNUpts doesn't show up in the spread-only timing.
Could it be that more CUDA event synchronizations are needed?
Investigations welcome! Thanks, Alex

Originally posted by @ahbarnett in #631 (review)
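
One way to probe the synchronization question is to compare a host timer that stops right after an asynchronous kernel launch with one that synchronizes the device first. The following is a minimal sketch, not the cufinufft test code: `busy_kernel` and `time_launch` are hypothetical names invented for illustration, and the kernel is only a stand-in for an expensive GPU stage. If the unsynchronized time comes out much smaller, the missing work is being charged to whichever later call happens to block.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for an expensive GPU stage (not a cufinufft kernel).
__global__ void busy_kernel(float *x, int n, int iters) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0000001f + 1e-7f;
    x[i] = v;
  }
}

// Host wall-clock time in seconds around one kernel launch.
// With sync == false the timer stops as soon as the launch call returns,
// which may be long before the kernel has finished.
double time_launch(float *d_x, int n, int iters, bool sync) {
  auto t0 = std::chrono::steady_clock::now();
  busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n, iters);
  if (sync) cudaDeviceSynchronize(); // wait for the kernel before stopping the timer
  auto t1 = std::chrono::steady_clock::now();
  cudaDeviceSynchronize(); // drain the device so the next measurement starts idle
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const int n = 1 << 22, iters = 2000;
  float *d_x = nullptr;
  if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }
  cudaMemset(d_x, 0, n * sizeof(float));
  time_launch(d_x, n, iters, true); // warm-up launch, result discarded
  double t_nosync = time_launch(d_x, n, iters, false);
  double t_sync = time_launch(d_x, n, iters, true);
  std::printf("host timer without sync: %.3g s\n", t_nosync);
  std::printf("host timer with sync:    %.3g s\n", t_sync);
  cudaFree(d_x);
  return 0;
}
```

The same check could be applied stage by stage (plan, setNUpts, exec) to see where the time is actually spent.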


ahbarnett · 2025-02-25T20:56:40Z

The issue here is whether host-to-device and device-to-host (H<->D) transfer times are included in the timings reported by test/cufinufft{1,2,3}d_test.
They should not be.

It could be because we switched from synchronous to asynchronous transfers, to allow users to increase throughput. This may mess up the timings...
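
A possible way to keep transfers out of a measurement, even when they are asynchronous, is to bracket only the kernel with CUDA events recorded on the same stream as the copies. This is a hedged sketch under stated assumptions, not the actual tester: `scale_kernel` and `kernel_only_ms` are hypothetical names, and the kernel is a placeholder for the stage being timed. Because work in one stream executes in order, the start event fires only after the H->D copy has completed, and the stop event fires before the D->H copy begins.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder for the GPU stage being timed.
__global__ void scale_kernel(float *x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// Returns the kernel-only time in milliseconds, or -1 on allocation failure.
// Both copies are enqueued outside the start/stop event pair.
float kernel_only_ms(int n) {
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);
  // Pageable host memory: the copies are still stream-ordered, but pinned
  // memory (cudaMallocHost) would be needed for them to be truly asynchronous.
  std::vector<float> h(n, 1.0f);
  float *d = nullptr;
  if (cudaMalloc(&d, bytes) != cudaSuccess) return -1.0f;

  cudaStream_t s;
  cudaEvent_t start, stop;
  cudaStreamCreate(&s);
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMemcpyAsync(d, h.data(), bytes, cudaMemcpyHostToDevice, s); // not timed
  cudaEventRecord(start, s); // ordered after the H->D copy in stream s
  scale_kernel<<<(n + 255) / 256, 256, 0, s>>>(d, n, 2.0f);
  cudaEventRecord(stop, s);  // ordered before the D->H copy in stream s
  cudaMemcpyAsync(h.data(), d, bytes, cudaMemcpyDeviceToHost, s); // not timed
  cudaStreamSynchronize(s);  // everything enqueued above has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaStreamDestroy(s);
  cudaFree(d);
  return ms;
}

int main() {
  float ms = kernel_only_ms(1 << 24);
  if (ms < 0.0f) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }
  std::printf("kernel-only time: %.3f ms\n", ms);
  return 0;
}
```

Event timing measures elapsed time on the device, so it avoids the host-side ambiguity about when an asynchronous call really finished.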

DiamonDinoia mentioned this issue Feb 18, 2025

Added GPU spread interp only test #631

Merged

ahbarnett assigned ahbarnett and DiamonDinoia Feb 25, 2025
