Slowdown of GPU Data Transfers in Python Threads #75

Open

insertinterestingnamehere opened this issue May 11, 2021 · 2 comments
Labels: bug (Something isn't working), performance (Runtime performance of Parla or Parla programs)

Comments

@insertinterestingnamehere
Member

Creating this as a placeholder to track progress while we figure out where to even submit this upstream.

Currently, GPU transfers in Python threads exhibit unexplained, erratic slowdowns. We originally thought these overheads were caused by VECs; however, @dialecticDolt did some additional investigation and found that they are caused entirely by the use of cudaMemcpy from within threads created by Python. He has verified that the issue does not affect OpenMP's thread pool. We haven't yet checked whether it affects threads created through the pthreads interface or C++'s std::thread, so it is possible that OpenMP is doing something special rather than Python conflicting with CUDA.
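
For reference, the pattern in question looks roughly like the sketch below. This is an illustration only, not the benchmark code; it assumes numpy/cupy and a node with NUM_GPUS CUDA devices. Each Python thread selects one GPU and copies a large pageable host buffer to it, which is the cudaMemcpy HtoD that shows up in the traces later in this issue.

```python
import threading
import time

import numpy as np
import cupy as cp

NUM_GPUS = 4
N = 10**9  # 1e9 float64 values is ~7.45 GiB, roughly the transfer size in the traces below


def copy_to_gpu(device_id, host_buf, elapsed):
    """Copy a pageable host array to `device_id` and record the wall-clock time."""
    with cp.cuda.Device(device_id):
        t0 = time.perf_counter()
        dev_buf = cp.asarray(host_buf)   # triggers a cudaMemcpy HtoD on this device
        cp.cuda.Device().synchronize()   # make sure the copy has finished before stopping the timer
        elapsed[device_id] = time.perf_counter() - t0
        del dev_buf


host_buf = np.ones(N, dtype=np.float64)  # pageable (non-pinned) host memory
elapsed = [None] * NUM_GPUS
threads = [threading.Thread(target=copy_to_gpu, args=(i, host_buf, elapsed))
           for i in range(NUM_GPUS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(elapsed)
```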

@insertinterestingnamehere
Member Author

@dialecticDolt please feel free to add more info here. Where do we have example code to reproduce this?

insertinterestingnamehere added the performance and bug labels on May 11, 2021
@wlruys
Contributor

wlruys commented May 14, 2021

I've added the examples to reproduce this with/without VECs to https://github.com/ut-parla/Parla.py/tree/master/benchmarks/gpu_threading, as well as the MPI and C++ OpenMP comparisons.
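
For context, the MPI comparison has roughly the shape sketched below. This is an illustrative sketch only (the actual scripts live in the directory linked above); it assumes mpi4py and cupy, and that each rank is mapped to its own GPU.

```python
import time

import numpy as np
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 10**9  # ~7.45 GiB of float64, roughly the transfer size in the traces below

# Map each rank to a device; the real scripts may pin ranks to GPUs differently.
with cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()):
    host_buf = np.ones(N, dtype=np.float64)  # pageable host buffer, as in the threaded case
    comm.Barrier()                           # launch all ranks' copies at roughly the same time
    t0 = time.perf_counter()
    dev_buf = cp.asarray(host_buf)           # cudaMemcpy HtoD, one per process
    cp.cuda.Device().synchronize()
    print(f"rank {rank}: {time.perf_counter() - t0:.3f} s")
```

The interesting comparison is that the same per-device copy issued from separate MPI processes (or from a C++ OpenMP thread pool) runs noticeably faster than when issued from Python threads, as the numbers below show.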

For the record, I'm also copying the performance numbers here (from the Slack discussion):

The reported times for the memcpy (timed with nvprof) are:

```text
Start Time, Duration, Size of Transfer, Transfer Speed, Device Details

On Zemaitis (OpenMP in C++, Allocations and Deallocations done ahead of time, only timing the memcpy)

Before warmup:
1.20125s 2.87918s - 7.4506GB 2.5877GB/s Pageable Device Tesla P100-SXM2 4 17 [CUDA memcpy HtoD]
1.47087s 2.79998s - 7.4506GB 2.6609GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
1.47157s 2.86431s - 7.4506GB 2.6012GB/s Pageable Device Tesla P100-SXM2 2 35 [CUDA memcpy HtoD]
1.47182s 2.75349s - 7.4506GB 2.7059GB/s Pageable Device Tesla P100-SXM2 3 28 [CUDA memcpy HtoD]

Warmed Up:

19.0741s 1.56790s - 7.4506GB 4.7520GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
19.0742s 1.50439s - 7.4506GB 4.9525GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
19.0747s 1.46896s - 7.4506GB 5.0720GB/s Pageable Device Tesla P100-SXM2 4 17 [CUDA memcpy HtoD]
19.0767s 1.52812s - 7.4506GB 4.8756GB/s Pageable Device Tesla P100-SXM2 2 37 [CUDA memcpy HtoD]

On Zemaitis (Multithreading in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

14.5530s 2.51998s - 7.4506GB 2.9566GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
14.5531s 1.89952s - 7.4506GB 3.9223GB/s Pageable Device Tesla P100-SXM2 2 17 [CUDA memcpy HtoD]
14.5535s 2.10205s - 7.4506GB 3.5444GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
14.5537s 1.79802s - 7.4506GB 4.1438GB/s Pageable Device Tesla P100-SXM2 4 37 [CUDA memcpy HtoD]

Just another sample/trial of the same (the first one ^ is on the low end of the variance):

33.5803s 2.30019s - 7.4506GB 3.2391GB/s Pageable Device Tesla P100-SXM2 2 17 [CUDA memcpy HtoD]
33.5807s 2.17479s - 7.4506GB 3.4259GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
33.5807s 2.39922s - 7.4506GB 3.1054GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
33.5808s 2.40678s - 7.4506GB 3.0957GB/s Pageable Device Tesla P100-SXM2 4 37 [CUDA memcpy HtoD]

On Zemaitis (MPI in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

3.69050s 1.42256s - 7.4506GB 5.2374GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.78333s 1.43109s - 7.4506GB 5.2062GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.83678s 1.44673s - 7.4506GB 5.1499GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.88229s 1.44153s - 7.4506GB 5.1685GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]

On Frontera (OpenMP in C++, Allocations and Deallocations done ahead of time, only timing the memcpy)

829.65ms 889.74ms - 7.4506GB 8.3739GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
882.30ms 1.13772s - 7.4506GB 6.5487GB/s Pageable Device Quadro RTX 5000 4 17 [CUDA memcpy HtoD]
997.03ms 1.20776s - 7.4506GB 6.1689GB/s Pageable Device Quadro RTX 5000 2 37 [CUDA memcpy HtoD]
997.41ms 1.21396s - 7.4506GB 6.1374GB/s Pageable Device Quadro RTX 5000 3 27 [CUDA memcpy HtoD]

On Frontera (Multiprocess with MPI in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

3.38638s 1.13888s - 7.4506GB 6.5420GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.58987s 1.13825s - 7.4506GB 6.5457GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.69940s 1.21641s - 7.4506GB 6.1251GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.73599s 1.21368s - 7.4506GB 6.1388GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]

On Frontera (Multithreading in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

11.5841s 1.40261s - 7.4506GB 5.3119GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
11.5843s 1.57895s - 7.4506GB 4.7187GB/s Pageable Device Quadro RTX 5000 2 18 [CUDA memcpy HtoD]
11.5845s 1.64461s - 7.4506GB 4.5303GB/s Pageable Device Quadro RTX 5000 3 29 [CUDA memcpy HtoD]
11.5846s 1.28430s - 7.4506GB 5.8013GB/s Pageable Device Quadro RTX 5000 4 40 [CUDA memcpy HtoD]
```
