We should check for and exclude data transfers from timing; otherwise tests report unreliable values #635

Open
DiamonDinoia opened this issue Feb 18, 2025 · 1 comment
@DiamonDinoia
Collaborator

Just a tiny text fix. I checked a CMake build on my local RTX-A6000.

But it shows a weird timing inconsistency.
E.g., running it by hand:

(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1dspreadinterponly_test 1e7 1e8 1e-3 1e-2 f 0.0
spread-only test 1d:
	100000000 NU pts spread to 10000000 grid in 0.046 s 	2.17e+09 NU pts/s
	rel mass err 0.0002
interp-only test 1d:
	100000000 NU pts interp from 10000000 grid in 0.341 s 	2.93e+08 NU pts/s
	rel sup err 0.000497

It is nice to see about 2e9 NU pts/s for spread, which matches what I expect from the actual 1d1 NUFFT. But interp is roughly 7x slower (0.341 s vs 0.046 s). Could you look into why that is?
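As a quick check on the figures printed above, the reported rates are simply the number of NU points divided by the wall time. The sketch below recomputes them; the helper names `pts_per_sec` and `time_ratio` are made up for illustration and are not part of the finufft testers.

```cpp
// Hypothetical helpers (not from the finufft testers).

// Throughput in NU points per second.
double pts_per_sec(double n_pts, double seconds) {
    return n_pts / seconds;
}

// How many times longer the slow run takes than the fast run.
double time_ratio(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}
```

With the values from the log, `pts_per_sec(1e8, 0.046)` is about 2.17e9 and `pts_per_sec(1e8, 0.341)` is about 2.93e8, matching the printed rates, and `time_ratio(0.341, 0.046)` comes out near 7.4.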

I notice that 1d1 and 1d2 do not differ in speed (but the setNUpts time is about 10x longer than the exec time). Here I use precision d in order to test accuracy, but the speed is the same for f:

(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1d_test 1 1 1e7 1e8 1e-3 1e-2 d 0.0
[time  ] dummy warmup call to CUFFT	 0.00474 s
[time  ] cufinufft plan:		 0.0229 s
[time  ] cufinufft setNUpts:		 0.48 s
[time  ] cufinufft exec:		 0.0525 s
[time  ] cufinufft destroy:		 0.000309 s
[Method 1] 10000000 U pts to 100000000 NU pts in 0.556 s:      1.8e+08 NU pts/s
					(exec-only thoughput: 1.9e+09 NU pts/s)
[gpu   ] one mode: rel err in F[3700000] is 2.69e-05
(base) ahb@ccmlin019 /mnt/home/ahb/numerics/finufft/build> test/cuda/cufinufft1d_test 1 2 1e7 1e8 1e-3 1e-2 d 0.0
[time  ] dummy warmup call to CUFFT	 0.00385 s
[time  ] cufinufft plan:		 0.0225 s
[time  ] cufinufft setNUpts:		 0.48 s
[time  ] cufinufft exec:		 0.048 s
[time  ] cufinufft destroy:		 0.000335 s
[Method 1] 10000000 U pts to 100000000 NU pts in 0.551 s:      1.81e+08 NU pts/s
					(exec-only thoughput: 2.08e+09 NU pts/s)
[gpu   ] one targ: rel err in c[50000000] is 0.000339

These are all using method=1 as in your new tester.
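The stage times in the two logs above can be added up to see where the total goes. The sketch below is only arithmetic on the printed numbers; the struct and function names are hypothetical, and the observation that the total equals plan + setNUpts + exec + destroy (excluding the warmup) is inferred from the numbers rather than from the tester source.

```cpp
// Hypothetical container for the per-stage times printed by the tester, in seconds.
struct StageTimes {
    double plan;
    double setpts;
    double exec;
    double destroy;
};

// Sum of the four printed stages (the CUFFT warmup call is left out).
double total_seconds(const StageTimes& t) {
    return t.plan + t.setpts + t.exec + t.destroy;
}

// Fraction of that sum spent in setNUpts.
double setpts_fraction(const StageTimes& t) {
    return t.setpts / total_seconds(t);
}
```

For the type 1 run, `StageTimes{0.0229, 0.48, 0.0525, 0.000309}` sums to about 0.556 s, the same as the printed total, and setNUpts accounts for roughly 86% of it. The type 2 run, `StageTimes{0.0225, 0.48, 0.048, 0.000335}`, sums to about 0.551 s in the same way.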

It's very peculiar that the setNUpts time doesn't show up in the spread-only timing.
Could it be that more CUDA event synchronizations are needed?
Investigations welcome! Thanks, Alex

Originally posted by @ahbarnett in #631 (review)

@ahbarnett
Collaborator

ahbarnett commented Feb 25, 2025

The issue here is whether H<->D transfer times are included in the timings reported by test/cufinufft{1,2,3}d_test.
They should not be.

It could be because we switched from synchronous to asynchronous transfers, to allow users to increase throughput. This may distort the timings...
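To illustrate how an asynchronous call can distort a host-side timer, here is a host-only analogy in plain C++. It uses no CUDA at all: `fake_async_copy` is a hypothetical stand-in for an asynchronous transfer, built on `std::async`, and the sketch makes no claim about what the cufinufft testers actually do. It only shows that a timer read before waiting for completion under-reports the work.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Seconds elapsed since t0 on a monotonic host clock.
double seconds_since(Clock::time_point t0) {
    return std::chrono::duration<double>(Clock::now() - t0).count();
}

// Hypothetical stand-in for an asynchronous H<->D copy: the call returns
// right away and the "transfer" completes in the background after `ms` ms.
std::future<void> fake_async_copy(int ms) {
    return std::async(std::launch::async, [ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

struct Timings {
    double before_sync;  // timer read right after the asynchronous call returns
    double after_sync;   // timer read after waiting for completion
};

Timings time_fake_copy(int ms) {
    const Clock::time_point t0 = Clock::now();
    std::future<void> pending = fake_async_copy(ms);
    const double before_sync = seconds_since(t0);
    pending.wait();  // plays the role of a stream or event synchronization
    const double after_sync = seconds_since(t0);
    return {before_sync, after_sync};
}
```

Calling `time_fake_copy(100)` typically gives a `before_sync` far below 0.1 s, while `after_sync` is about 0.1 s or more. In a CUDA tester the wait would be a call such as `cudaStreamSynchronize` or `cudaEventSynchronize`. Where to place it so that transfers fall outside the timed region is a question for the tester code itself.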
