Drastic performance degradation when switching from StarPU-1.3.11 to StarPU-1.4.4 on a GPU node #33

Muxas opened this issue Feb 7, 2024 · 69 comments

Muxas commented Feb 7, 2024

The Issue

When switching from StarPU 1.3.11 to the 1.4 series on a GPU node, we experience a strange performance drop. For our new software NNTile it results in a 10x slowdown: yes, it goes from 100% down to only 10%.

An attempt to switch to the master branch (commit 50cf74508 in the Inria GitLab repository) leads to different errors, related to data transfers between CPU and GPU. We tried some other commits from the master branch and realized that they only work on CPU; something strange happens with the memory manager once GPU nodes are involved. The DARTS scheduler always fails, while the DM and DMDA schedulers fail for some commits (e.g., 50cf74508) and work correctly for others (e.g., 2b8a91fe). I cannot present the output of the master branch experiments right now, as this issue is about the performance degradation of the 1.4 series of StarPU releases.

Although the 10x performance drop happens in our new software, I prepared a simple example that shows the performance of versions 1.2.10, 1.3.11 and 1.4.4. For this simple example, most of the performance drop happened when switching from version 1.2.10 to 1.3.11.

Steps to reproduce

I have implemented a simple test https://github.com/Muxas/starpu_gemm_redux to reproduce the issue. The repo simply implements several chains of matrix multiplications:

C[i] = A[i][0]*B[i][0] + A[i][1]*B[i][1] + ... + A[i][NB-1]*B[i][NB-1]

for i ranging from 0 to D-1,

which can be described by the following C code (the first task-submission order):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
    {
        for(int j = 0; j < D; ++j) // Number of output C matrices
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

or with the following C code (the other order of task submissions):

for(int r = 0; r < R; ++r) // Number of repeats
{
    for(int j = 0; j < D; ++j) // Number of output C matrices
    {
        for(int i = 0; i < NB; ++i) // Number of A and B matrices in each chain of matrix multiplications
        {
            starpu_task_insert(&gemm_cl, STARPU_R, A[i*D+j], STARPU_R, B[i*D+j],
                    C_mode, C[j], 0);
        }
    }
}

Matrices A are M-by-K, matrices B are K-by-N, and matrices C are M-by-N. No transpositions are used in the matrix multiplications.

Our results were produced on an HGX node with 8 (eight) Nvidia A100 80GB SXM GPUs. We compiled the code and ran two experimental setups (a sketch of how the access mode for the C matrices can be selected follows the list):

  1. M=N=K=1024, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
  2. M=256, N=K=1532, D=32, NB=100, R=50, with and without the STARPU_REDUX access mode for the matrices C.
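
For context, here is a minimal sketch of how the access mode for the C matrices can be switched between the two setups. The init_cl and redux_cl codelets are hypothetical placeholders; only starpu_data_set_reduction_methods() and the access-mode constants come from StarPU.

#include <starpu.h>

extern struct starpu_codelet gemm_cl;   /* C += A*B, as in the loops above */
extern struct starpu_codelet init_cl;   /* hypothetical: sets a C tile to zero */
extern struct starpu_codelet redux_cl;  /* hypothetical: C_dst += C_src */

static enum starpu_data_access_mode C_mode;

/* Pick the access mode for the C handles; with STARPU_REDUX, StarPU also
 * needs to know how to initialize and merge the per-worker contributions. */
static void setup_C_mode(starpu_data_handle_t C[], int D, int use_redux)
{
    int j;
    if (use_redux)
    {
        C_mode = STARPU_REDUX;
        for (j = 0; j < D; ++j)
            starpu_data_set_reduction_methods(C[j], &redux_cl, &init_cl);
    }
    else
        C_mode = STARPU_RW | STARPU_COMMUTE;
}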

StarPU-1.4.4 behavior

This section presents plots for the StarPU-1.4.4 version. The first plot shows the warmup time (done with the first task-submission order), the time for the first submission order, and the time for the other submission order, with the STARPU_RW|STARPU_COMMUTE access mode for the matrices C and M=N=K=1024:

1024_1024_1024_mode0

The second plot shows the same timings but for the STARPU_REDUX access mode for the matrices C:

1024_1024_1024_mode1

The third plot shows timings for M=256 and N=K=1532 with STARPU_RW|STARPU_COMMUTE access mode:

256_1536_1536_mode0

And the last plot in this section (for the STARPU_REDUX access mode):

256_1536_1536_mode1

We see that the dumbest scheduling algorithm, namely eager, outperforms the smarter ones.

StarPU-1.3.11 behavior

This section presents plots for StarPU of version 1.3.11 in the same order as above.

1 3 11-1024_1024_1024_mode0

1 3 11-1024_1024_1024_mode1

1 3 11-256_1536_1536_mode0

1 3 11-256_1536_1536_mode1

We see that the dumbest scheduling algorithm, namely eager, outperforms the smarter ones.

StarPU-1.2.10 behavior

This section presents plots for StarPU of version 1.2.10 in the same order as above.

1 2 10-1024_1024_1024_mode0

1 2 10-1024_1024_1024_mode1

1 2 10-256_1536_1536_mode0

1 2 10-256_1536_1536_mode1

Here we see that, with the STARPU_RW|STARPU_COMMUTE access mode, the smart schedulers DMDA and DMDAR perform nearly perfectly, just like EAGER. The problem with DMDA and DMDAR appears when switching to StarPU version 1.3.11 or 1.4.4.

Configuration

The configure line we used can be found in the config.log files in the section below.

Configuration result

This is a config file for StarPU-1.2.10:
config-1.2.10.log

This is a config file for StarPU-1.3.11:
config-1.3.11.log

This is a config file for StarPU-1.4.4:
config-1.4.4.log

Distribution

Inria Gitlab repository

Version of StarPU

We used starpu-1.3.11 and starpu-1.4.4 tags of Inria GitLab repository

Version of GPU drivers

We use CUDA 12.3, hwloc 2.9.3

Muxas commented Feb 8, 2024

Dear StarPU team,

I think I figured out the reason for the 10x performance drop of my application. I disabled the kernels and printed the bus stats for StarPU versions 1.3.11 and 1.4.4.

StarPU-1.3.11:

Total transfers: 29525.0176 GB
Real time including initialization: 4:58.20
Training performance: 1039.465649305438 Tflops/s

StarPU-1.4.4:

Total transfers: 44654.4844 GB
Real time: 17:05.05
Training performance: 58.44558141509535 Tflops/s

For some reason, DMDAR and the other DM* schedulers in StarPU-1.4.4 send nearly twice as much data. And if I look specifically at the slowest part, namely the PCI-express connection between the CPU and the GPUs, the new 1.4.4 version sends 65 times more data than the old 1.3.11 version.

Could you please advise whether there is a way in StarPU-1.4.4 to bring this data-transfer overhead back to where it was with version 1.3.11?

I believe there is something wrong with the Memory Manager.

P.S. Enabling CUDA memory map leads to the following error:
../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.

P.P.S. Using STARPU_REDUX leads to another, similar error. It seems the memory manager is buggy in StarPU 1.4.4.

sthibaul commented Feb 13, 2024

This is unexpected of course :)

Particularly since the 1.3 series introduces heuristics which are precisely meant to improve the overall flow of data.

AIUI, the involved matrices can completely fit in even just one GPU?

Could you also post results with starpu 1.3.0? To make sure whether it's the 1.2->1.3 development that introduced the first regression, or possibly some backports from the 1.4.x series to the 1.3.x series.

Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Ideally, if you could provide your testcase with an LGPL-2.1+ licence, we could integrate it in our testsuite, and with simulation support we could add non-regression check-up.

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :)

Muxas commented Feb 14, 2024

Output for the StarPU-1.3.0.

M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE
1 3 0-1024_1024_1024_mode0

M=N=K=1024, mode=STARPU_REDUX
1 3 0-1024_1024_1024_mode1

M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE
1 3 0-256_1536_1536_mode0

M=256, N=K=1532, mode=STARPU_REDUX
1 3 0-256_1536_1536_mode1

It seems that 1.3.0 performs similarly to 1.2.10.

Muxas commented Feb 14, 2024

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :)

As soon as we find out the source of the increased timing with the 1.3.11 version, I will prepare a minimal example where the CUDA memory map and STARPU_REDUX lead to internal StarPU assertion failures.

Muxas commented Feb 16, 2024

Comparison of 1.2.10 vs 1.3.0

I believe I misjudged the plots of StarPU-1.3.0. Take a look at the rightmost graphs.

StarPU-1.2.10:
1 2 10-256_1536_1536_mode0
StarPU-1.3.0:
1 3 0-256_1536_1536_mode0

There is a gap between Eager and the other, smarter schedulers. And then, for StarPU-1.3.5 (below), the gap becomes larger.

Another set of plots, this time for StarPU-1.3.5.

M=N=K=1024, mode=STARPU_RW|STARPU_COMMUTE
1 3 5-1024_1024_1024_mode0

M=N=K=1024, mode=STARPU_REDUX
1 3 5-1024_1024_1024_mode1

M=256, N=K=1532, mode=STARPU_RW|STARPU_COMMUTE
1 3 5-256_1536_1536_mode0

M=256, N=K=1532, mode=STARPU_REDUX
1 3 5-256_1536_1536_mode1

Muxas commented Feb 18, 2024

And another update

This time I tried the other application, NNTile.

I took a look at the data transfers and the total execution time (reported by the /usr/bin/time utility) for the different versions. The total amount of transferred data, sorted in descending order:

  1. StarPU-1.2.10: transmitted 64987 GB, execution time 55:32.68 minutes
  2. StarPU-1.4.4: transmitted 56783 GB, execution time 27:25.60 minutes
  3. StarPU-1.3.0: transmitted 29539 GB, execution time 5:36.09 minutes
  4. StarPU-1.3.11: transmitted 29313 GB, execution time 5:54.13 minutes

As one can see, 1.3.x indeed saves a lot of data transfers. However, the 1.4.4 version brings all those transfers back; it seems some old behavior was reintroduced by the 1.4.x release series. The main problem comes from CPU<->GPU transfers, as the 1.4.4 version transfers around 65 times more data through the slow PCI-e bus than the 1.3.11 version.

Files

Here are the more detailed transmission reports provided by STARPU_BUS_STATS=1 env variable.

transfers_starpu_1.2.10_dmdar.txt

transfers_starpu_1.3.0_dmdar.txt

transfers_starpu_1.3.11_dmdar.txt

transfers_starpu_1.4.4_dmdar.txt
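
As a side note, here is a minimal sketch of how the same bus statistics can be obtained programmatically instead of via the environment variables; this is an assumption of a typical setup on my side, not the exact code used in the runs above.

#include <starpu.h>
#include <starpu_profiling.h>

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;
    /* Equivalent to STARPU_PROFILING=1: enable profiling so that the bus
     * transfer statistics are collected. */
    starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

    /* ... register data and submit tasks here ... */
    starpu_task_wait_for_all();

    /* Print the per-link transfer summary, as STARPU_BUS_STATS=1 does at
     * shutdown. */
    starpu_profiling_bus_helper_display_summary();
    starpu_shutdown();
    return 0;
}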

sthibaul commented:

Using STARPU_REDUX leads to another but similar error

That's not precise enough for us to be able to act :)
As soon as we find out the source of increased timing with 1.3.11 version, I will prepare some minimal example where CUDA map and STARPU_REDUX lead to internal StarPU assertion failures.

I mean: please provide the error message. "similar error" doesn't allow us to have any idea what this is about.

Also, again:
Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Otherwise it's really not surprising that dm* etc. get everything wrong.

I don't have easy access to an 8-gpu machine, so I tried with simulation, and got results where 1.4.4 actually performs better than 1.3.11 and 1.2.10... So I really need details on how things are going on the machine where you can reproduce the issue.

Also, providing us with the .starpu/sampling/bus/ and codelet/45/ files corresponding to the machine would allow me to simulate the exact same architecture, rather than simulating some 8-gpu machine I happened to have access to at some point.

Muxas commented Feb 23, 2024

Also, again:
Could you also post the output of starpu_machine_display obtained with the different versions, to make sure dmda gets the correct understanding of available PCI bandwidths, gpu placement etc.

Here are the files:
starpu_machine_display_1.2.10.txt
starpu_machine_display_1.3.0.txt
starpu_machine_display_1.3.11.txt
starpu_machine_display_1.4.4.txt

P.S. How can I help you simulate my runs? I compiled StarPU without SimGrid support. The FxT traces weigh more than 1 GB. I do not know whether giving you the contents of codelets/45/ or codelets/44/ will help. However, here are the contents of the bus samplings.

bus_stats.tar.gz

sthibaul commented:

Here are the files:

Ok, so you have an NVSwitch, which wasn't the case on the machine I was simulating; that can explain why I wasn't seeing the problem.

How can I help you simulate my runs?

By providing the information I'm asking :)

I compiled StarPU without SimGrid support.

Simgrid is only needed for the replay part, not for the calibration part.

Traces by FXT weight more than 1 GB

We don't need traces :)

Do not know if giving you contents of codelets/45/ or codelets/44 will help

Yes, please, to be sure to have the same timings as on your machine.

sthibaul commented:

starpu_machine_display_1.4.4.txt

there's one odd thing here compared to the others: CUDA 0 has very low bandwidth, whatever the peer. Is this reproducible when you force bus re-calibration with STARPU_BUS_CALIBRATE=1?

Muxas commented Feb 23, 2024

starpu_machine_display_1.4.4.txt

I double-checked; it remains the same. CUDA 0 has an 11 GB/s connection to the CPU, the others have 13-15 GB/s. With StarPU-1.3.11 the speeds are around 25 GB/s.

StarPU-1.4.4 bandwidth

bandwidth (MB/s) and latency (us)...

from/to	NUMA 0	CUDA 0	CUDA 1	CUDA 2	CUDA 3	CUDA 4	CUDA 5	CUDA 6	CUDA 7	
NUMA 0	0	11625	14683	14671	14659	14661	14598	14587	14588	
CUDA 0	11744	0	14721	14612	14629	14620	14727	14711	14707	
CUDA 1	14661	13621	0	236193	241212	241259	241637	241287	241874	
CUDA 2	14661	13722	243595	0	241024	243733	244475	244209	243717	
CUDA 3	14684	13867	244122	243544	0	241130	243455	244064	244585	
CUDA 4	14607	13908	240379	241467	246133	0	243570	243885	243641	
CUDA 5	13484	15234	241671	241864	243550	247702	0	244909	244375	
CUDA 6	13229	15071	241887	242582	244249	245052	247115	0	244958	
CUDA 7	13528	15133	241470	241637	244368	244738	244376	247771	0

StarPU-1.3.11 bandwidth

bandwidth (MB/s) and latency (us)...

from/to	NUMA 0	CUDA 0	CUDA 1	CUDA 2	CUDA 3	CUDA 4	CUDA 5	CUDA 6	CUDA 7	
NUMA 0	0	25081	25169	25150	25160	25094	25097	25086	25091	
CUDA 0	23837	0	237628	245022	244849	243492	244425	244068	244064	
CUDA 1	23837	244489	0	244506	244464	244747	244650	244639	244372	
CUDA 2	23837	242112	248106	0	244695	243829	244250	245212	244295	
CUDA 3	23836	241816	243238	247892	0	244343	244691	244443	244240	
CUDA 4	23829	241359	243141	243036	247535	0	244676	244164	244173	
CUDA 5	23908	241918	241900	243932	244365	247550	0	243878	244382	
CUDA 6	23829	241531	241140	244337	244161	244080	246877	0	244022	
CUDA 7	23830	242094	241616	244295	244201	243430	243710	244042	0

Muxas commented Feb 23, 2024

starpu_machine_display_1.4.4.txt

Looking at latencies of StarPU-1.4.4:

NUMA 0	0	0	10	9	9	10	9	9	9	
CUDA 0	0	0	10	9	9	10	9	9	9	
CUDA 1	12	12	0	14	14	14	14	14	14	
CUDA 2	12	12	14	0	14	13	13	13	13	
CUDA 3	11	12	14	13	0	13	13	13	13	
CUDA 4	12	12	14	14	13	0	14	13	13	
CUDA 5	12	12	13	13	12	12	0	12	12	
CUDA 6	12	11	13	13	13	13	12	0	12	
CUDA 7	12	11	13	13	13	13	12	12	0

StarPU thinks that CUDA 0 uses the same memory as NUMA 0... Surprise!

sthibaul commented:

CUDA 0 has 11 GB/s connection to CPU, other have 13-15 GB/s

Not only that, but the GPU-GPU connections are also not getting the NVSwitch speed; that's really odd.

sthibaul commented Feb 23, 2024

StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!

The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.

sthibaul commented:

It might be useful to see the config.log output in the 1.4.4 case.

sthibaul commented:

StarPU thinks that CUDA 0 uses the same memory, as NUMA 0... Surprise!

The duplicates in the rows and columns and the 0 values in numa0/cuda0 are suspicious indeed.

and I can easily reproduce that here, good

sthibaul commented:

(will work on it later next week, though, but at least we have a clear culprit here)

Muxas commented Feb 23, 2024

It might be useful to see the config.log output in the 1.4.4 case.

config-StarPU-1.4.4.log

(will work on it later next week, though, but at least we have a clear culprit here)

Thank you! I will be on vacation next week, but after that I will prepare backtraces of the initially described failed assertions for StarPU-1.4.4:

P.S. Enabling CUDA memory map leads to the following error:
../../src/datawizard/copy_driver.c:312: _starpu_driver_copy_data_1_to_1: Assertion `0 && "src_replicate->memory_node == dst_replicate->mapped"' failed.

P.P.S. Using STARPU_REDUX leads to another, similar error. It seems the memory manager is buggy in StarPU 1.4.4.

By the way, StarPU-1.3.11 gave me a CUDA out-of-memory error with the STARPU_REDUX access mode. Setting STARPU_LIMIT_CUDA_MEM=60000 solved the issue. I will also try to find the situation in which it happened and create a backtrace.

Muxas commented Feb 24, 2024

Compiling StarPU-1.4.4 with the flag --enable-maxnumanodes=1 solves the issue with latencies and bandwidths, bringing the output of starpu_machine_display in line with that of version 1.3.11. However, the performance of the actual computations is the same as without the flag. The amount of data transferred is still large, as reported in one of the messages above.

sthibaul commented Mar 1, 2024

Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow.
Now that the scheduler will have the proper values, I'll investigate the large amounts.

nfurmento pushed a commit that referenced this issue Mar 1, 2024
We were previously mixing memory node index and raw memory index. The latter
includes all devices, including the disabled ones!

This was notably making machines with NUMA nodes and an NVSwitch take very
wrong bandwidth values for the first GPUs.

See github #33

(cherry picked from commit 5a72a632e186896a599c5c7e51857d0422837546)
Muxas commented Mar 2, 2024

Ok, I have pushed a fix for the bandwidth/latency management to gitlab, will appear on github by tomorrow.

Thank you! I tried the new commit. It fixes the output of starpu_machine_display, but only partially. The throughput between CPU and GPUs remains low: around 14 GB/s, as it was with StarPU-1.4.4, whereas StarPU-1.3.11 reaches 25 GB/s. The outputs of starpu_machine_display are attached for the starpu-1.4 and starpu-1.3.11 tags.

starpu_machine_display_1.4.txt

starpu_machine_display_1.3.11.txt

sthibaul commented Mar 2, 2024

It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.

Muxas commented Mar 3, 2024

Another update. This time the server hardware is different (4x Nvidia V100 SXM2). For some strange reason, CUDA workers require around 500 microseconds for any (even empty) task. Setting the environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task (which is still large in my opinion), but it improves the overall performance of my application by 2x. Attached is the starpu_machine_display output for this new server (StarPU-1.3.11). Since servers with PCI-express CUDA GPUs do not suffer from this problem, I believe the problem is within hwloc-2.9.3, around the Nvidia SXM bus.

starpu_machine_display_v100_starpu_1.3.11.txt

sthibaul commented Mar 5, 2024

For some strange reason CUDA workers require around 500 microseconds for any (even empty) task. Setting environment variable STARPU_WORKERS_NOBIND=1 brings this time down to around 5 microseconds for an empty task

It seems that thread binding got broken in the 1.3 series indeed. I backported some fixes from 1.4, which should fix it (by looking at the pci bus numbers in your v100 case the gpus should be driven from numa0, not 1)

5 microseconds for an empty task (which is still large in my opinion)

The CUDA cost itself is already that order of magnitude, unfortunately.

Since servers with PCI-express CUDA GPUs do not suffer such a problem

They probably have the same binding issue, just with much lower overhead.

sthibaul commented Mar 5, 2024

It looks like there is an interference between the numa memory pinning and the nvidia memory pinning. I indeed see a small difference on my testbox, that might be emphasized on your box.

Ok, there was a typo in starpu-1.3 which didn't pose a problem there, but ended up posing a problem in 1.4, which is why it went unnoticed. This should now be fixed by "Fix missing pinning memory when benchmarking bus with numa" in 1.4 (and not broken on 1.3), so the bandwidth numbers should now be fine; could you please check?

Then I'll check the scheduling part

Muxas commented Mar 6, 2024

I tried the new commit in the starpu-1.3 branch and it got even worse, just like in the starpu-1.4.4 case. Take a look at:
starpu_1.3.11_v100_machine_display.txt

sthibaul commented Mar 7, 2024

Ok, with the update it needed the same fix as in 1.4 ("Fix bus performance models selection"); this should now be fixed.

Muxas commented Mar 7, 2024

New output of starpu_machine_display:

StarPU has found :
32 STARPU_CPU_WORKER workers:
	CPU 0
	CPU 1
	CPU 2
	CPU 3
	CPU 4
	CPU 5
	CPU 6
	CPU 7
	CPU 8
	CPU 9
	CPU 10
	CPU 11
	CPU 12
	CPU 13
	CPU 14
	CPU 15
	CPU 16
	CPU 17
	CPU 18
	CPU 19
	CPU 20
	CPU 21
	CPU 22
	CPU 23
	CPU 24
	CPU 25
	CPU 26
	CPU 27
	CPU 28
	CPU 29
	CPU 30
	CPU 31
4 STARPU_CUDA_WORKER workers:
	CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)
	CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)
	CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)
	CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)
No STARPU_OPENCL_WORKER worker

topology ... (hwloc logical indexes)
numa  0	pack  0	core 0	PU 0	CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 1c:00.0)	
		core 1	PU 1	CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 1d:00.0)	
		core 2	PU 2	CUDA 3.0 (Tesla V100-SXM2-16GB 14.2 GiB 1e:00.0)	
		core 3	PU 3	CPU 0	
		core 4	PU 4	CPU 1	
		core 5	PU 5	CPU 2	
		core 6	PU 6	CPU 3	
		core 7	PU 7	CPU 4	
		core 8	PU 8	CPU 5	
		core 9	PU 9	CPU 6	
		core 10	PU 10	CPU 7	
		core 11	PU 11	CPU 8	
		core 12	PU 12	CPU 9	
		core 13	PU 13	CPU 10	
		core 14	PU 14	CPU 11	
		core 15	PU 15	CPU 12	
		core 16	PU 16	CPU 13	
		core 17	PU 17	CPU 14	
numa  1	pack  1	core 18	PU 18	CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 1a:00.0)	
		core 19	PU 19	CPU 15	
		core 20	PU 20	CPU 16	
		core 21	PU 21	CPU 17	
		core 22	PU 22	CPU 18	
		core 23	PU 23	CPU 19	
		core 24	PU 24	CPU 20	
		core 25	PU 25	CPU 21	
		core 26	PU 26	CPU 22	
		core 27	PU 27	CPU 23	
		core 28	PU 28	CPU 24	
		core 29	PU 29	CPU 25	
		core 30	PU 30	CPU 26	
		core 31	PU 31	CPU 27	
		core 32	PU 32	CPU 28	
		core 33	PU 33	CPU 29	
		core 34	PU 34	CPU 30	
		core 35	PU 35	CPU 31	

bandwidth (MB/s) and latency (us)...
from/to	NUMA 0	CUDA 0	CUDA 1	CUDA 2	CUDA 3	
NUMA 0	0	12309	12329	12334	12331	
CUDA 0	13092	0	47517	47695	47688	
CUDA 1	13101	47693	0	47699	47704	
CUDA 2	13102	47691	47689	0	47689	
CUDA 3	13101	47694	47698	47694	0	

NUMA 0	0	9	9	9	9	
CUDA 0	9	0	11	11	11	
CUDA 1	9	11	0	11	11	
CUDA 2	9	11	11	0	11	
CUDA 3	9	11	11	11	0	

GPU	NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0	 1	 0	
CUDA_1	 0	 1	
CUDA_2	 0	 1	
CUDA_3	 0	 1

sthibaul commented:

Looking at the details of the platform xml file, I see that the nvswitch is not detected; do you have libnvidia-ml detected? That shows up in the ./configure output as:

checking whether nvidia-ml should be used... yes

I however also need to add a small piece of code to make it known to the perfmodel. In the meantime, you can try making _starpu_cuda_direct_link always return 1. Otherwise starpu 1.4 thinks the transfers go through the PCI buses (starpu 1.3 doesn't care).

sthibaul commented Mar 12, 2024

The "CUDA: Also detect NVSwitch when checking the number of gpus sharing a bus" commit should be doing it.

sthibaul commented:

Comparison of data transfers for starpu-1.3 and starpu-1.4 for the same application with dm scheduler:

The fix mentioned above can also fix that case, because we use the performance prediction for selecting the source node for transfers in _starpu_select_src_node, not only in the scheduler for task placement

sthibaul commented:

It is still strange for me why CUDA 0 is attached to NUMA 1

Before starpu 1.4, we were just using the observed bandwidth to decide where to place the thread driving the gpu, so it might happen that with (mis-)luck, CUDA0 happens to get just a bit more bandwidth from NUMA1.

Starting from starpu 1.4 we use the hwloc information, which is much more stable :)

sthibaul commented:

Nvidia A100 data sheet states 600GB/s

Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)

sthibaul commented:

Nvidia A100 data sheet states 600GB/s

Do you know if there is a programmatic way to get this figure? (other than just measuring by starting transfers from all ends)

Ah, sorry, you meant the GPU bandwidth itself. I was thinking about the NVSwitch:

In a case of 8x Nvidia A100 we get less than 100GB/s instead of reported 250GB/s.

Do you mean that the total internal bandwidth of the NVSwitch doesn't allow a full 250GB/s for each GPU? Ideally that's the bandwidth I'd like to get access to. Possibly we'll just resort to measuring it.

Muxas commented Mar 12, 2024

Looking at the detail of the platform xml file, I see that the nvswitch is not detected, do you have libnvidia-ml detected? That shows up in the ./configure output as:

Turning off STARPU_SILENT showed me

[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA0, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA1, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA2, do you have the hwloc CUDA plugin installed?
[starpu][_starpu_init_cuda_config] Warning: could not find location of CUDA3, do you have the hwloc CUDA plugin installed?

And during configuration:

NVML found and can be compiled, but compiled application can not be run, you are probably on a machine without the CUDA driver
configure: WARNING: nvidia-ml could not be found. This will prevent from correct understanding of the machine topology.
checking whether nvidia-ml should be used... no

I can clearly see that the library is present in /usr/lib64, but somehow it is not used.

sthibaul commented:

Could you post the whole config.log?

Muxas commented Mar 12, 2024

Surely!
config.log

I am using a cluster with SLURM, so I configure and compile on an access node, which lacks CUDA devices. That is probably the reason why nvidia-ml is marked as not found: it is found at first, and it can even be used for compilation, but, according to config.log, no CUDA device is found and, therefore, libnvidia-ml is discarded.

Muxas commented Mar 12, 2024

I am using a cluster with SLURM

This explains why recompiling on the access node the same code that was previously compiled on a compute node gave totally different results (in one of the posts above).

Muxas commented Mar 12, 2024

It seems I have to compile all the prerequisites (FxT and hwloc) on the compute nodes to get it to work correctly.

Another issue is that I have the conda Python package manager installed, and the configure script finds hwloc-topo among the conda files, which is incorrect. I compiled hwloc with hwloc-calc, and somehow configure does not find it. Is there a way to point it to the correct hwloc-calc?

sthibaul commented:

configure just finds hwloc-calc from $PATH

Muxas commented Mar 13, 2024

Recompiling everything from source (except cuBLAS) on a compute node leads to very strange performance of cublasGemmEx on a server with A100 GPUs.
StarPU-1.4 compiled on a compute node:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	2.185368e+04   	4.096205e+03   	2.141661e+06   	4.844750e+10   	98
146137f2	234881024      	1.717987e+11   	1.011980e+04   	2.119861e+03   	1.042339e+07   	1.101112e+11   	1030
22528761	272629760      	2.147484e+11   	9.720931e+03   	2.044775e+03   	6.126130e+07   	6.218662e+11   	6302

Starpu-1.3 compiled on a host node:

# hash		size		flops		mean (us)	dev (us)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.667207e+04   	8.012294e+02   	4.396424e+07   	7.346677e+11   	2637
146137f2	234881024      	1.717987e+11   	1.350489e+03   	8.248709e+01   	2.934882e+07   	3.978311e+10   	21732
22528761	272629760      	2.147484e+11   	1.703173e+03   	8.857789e+01   	2.417620e+08   	4.128764e+11   	141948

It is the same device, but the performance of the 4096x5120 by 5120x5120 matrix multiplications (hash 22528761) differs by about 6x.

sthibaul commented:

It's hard to comment on this without seeing what is happening around, such as with a paje trace.
You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently.
Possibly you end up using cublas v1 vs cublas v2, or some such implementation difference.
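
For illustration, a rough sketch of what such a synchronous variant could look like; the codelet name, the use of cublasSgemm and the float data type are assumptions on my side, the relevant parts being the explicit cudaStreamSynchronize() and dropping STARPU_CUDA_ASYNC.

#include <starpu.h>
#include <starpu_cublas_v2.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Hypothetical CUDA implementation of the gemm codelet, made synchronous. */
static void gemm_cuda_sync(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const float *A = (const float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    const float *B = (const float *)STARPU_MATRIX_GET_PTR(buffers[1]);
    float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[2]);
    int M = STARPU_MATRIX_GET_NX(buffers[0]);
    int K = STARPU_MATRIX_GET_NY(buffers[0]);
    int N = STARPU_MATRIX_GET_NY(buffers[1]);
    int ldA = STARPU_MATRIX_GET_LD(buffers[0]);
    int ldB = STARPU_MATRIX_GET_LD(buffers[1]);
    int ldC = STARPU_MATRIX_GET_LD(buffers[2]);
    const float alpha = 1.0f, beta = 1.0f;

    /* Submit the GEMM on StarPU's per-worker stream ... */
    cublasSgemm(starpu_cublas_get_local_handle(), CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K, &alpha, A, ldA, B, ldB, &beta, C, ldC);
    /* ... and wait for it, so the measured time covers the kernel itself;
     * the codelet must then NOT set .cuda_flags[0] = STARPU_CUDA_ASYNC. */
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}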

Muxas commented Mar 13, 2024

It's hard to comment on this without seeing what is happening around, such as with a paje trace.

Here it is
paje.trace.tar.gz

Possibly you end up using cublasv1 vs cublasv2 or such difference in implementation.

I explicitly include <cublas_v2.h> and <starpu_cublas_v2.h>. And I am using CUDA 12, so using cublas_v1 and cublas_v2 simultaneously is impossible.

Muxas commented Mar 13, 2024

And, for reference, a paje.trace for the access-node-compiled StarPU-1.4 (libnvidia-ml disabled): host.paje.trace.tar.gz
The performance of the computations is around 3 times better.

sthibaul commented Mar 13, 2024

One thing I notice in the compute-node-built case is that there are a lot of 4µs "overhead" states here and there in the trace, on the lower part of the T3* bars (below the CUDA* bars), which represent the state of the thread driving the gpu. These don't show up in the access-node-compiled case. I guess that could be some CUDA operation, perhaps triggered by the presence of nvidia-ml, which for some reason takes a lot of time. Could you post the config.log files obtained for the compute-node build and the access-node build, so we know exactly what compilation difference there is?

Also, I notice that you have different-but-quite-close data sizes, and a lot of allocating/freeing states. You probably want to round up allocations to e.g. 10% of your data size, so that StarPU can reuse data allocations rather than freeing/allocating all the time; that will avoid a lot of synchronizations. I have just added a FAQ about it at https://gitlab.inria.fr/starpu/starpu/-/blob/master/doc/doxygen/chapters/starpu_faq/check_list_performance.doxy#L62

Muxas commented Mar 14, 2024

Also, I notice that you have different-but-quite-close data sizes

I have a pipeline of computations. The pipeline operates on tiles of shapes 1, 4096, 5120, 4096x5120, 5120x5120, and 4096x51200. The sizes never change. Shapes 4096x5120 and 5120x5120 are indeed close, but does data-allocation reuse require ALL tiles to be of the same shape? That would be strange.

Muxas commented Mar 14, 2024

Could you post the config.log obtained on compute-node-built and access-node-built so we make sure to know what compilation difference there is?

Host (login node):
config_host.log

Compute node:
config_compute.log

Muxas commented Mar 14, 2024

You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently.

Adding such a sync only to a single gemm kernel did not change the picture much:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.969021e+04   	1.245154e+03   	1.988711e+06   	3.931472e+10   	101
146137f2	234881024      	1.717987e+11   	5.597694e+03   	6.308121e+02   	4.466960e+06   	2.532222e+10   	798
22528761	272629760      	2.147484e+11   	5.917192e+03   	6.071566e+02   	2.884039e+07   	1.724509e+11   	4874

Without the sync:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	2.185368e+04   	4.096205e+03   	2.141661e+06   	4.844750e+10   	98
146137f2	234881024      	1.717987e+11   	1.011980e+04   	2.119861e+03   	1.042339e+07   	1.101112e+11   	1030
22528761	272629760      	2.147484e+11   	9.720931e+03   	2.044775e+03   	6.126130e+07   	6.218662e+11   	6302

Yes, the performance roughly doubled, but it is still far from the performance of StarPU-1.3 compiled on an access (login) node:

# hash		size		flops		mean (us)	dev (us)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.667207e+04   	8.012294e+02   	4.396424e+07   	7.346677e+11   	2637
146137f2	234881024      	1.717987e+11   	1.350489e+03   	8.248709e+01   	2.934882e+07   	3.978311e+10   	21732
22528761	272629760      	2.147484e+11   	1.703173e+03   	8.857789e+01   	2.417620e+08   	4.128764e+11   	141948

Muxas commented Mar 14, 2024

Actually, as you can see, there are only 3 different hashes of the gemm kernel. My tiles are really mostly 4096x5120 and 5120x5120.

sthibaul commented:

Could you post the config.log obtained on compute-node-built and access-node-built so we make sure to know what compilation difference there is?

Host (login node): config_host.log

Compute node: config_compute.log

Thanks! Do you have STARPU_PROFILING enabled? I realize that nvmlDeviceGetTotalEnergyConsumption is actually terribly expensive, and is a difference between the two configs. In master, 1.3 and 1.4 branches I have now disabled its use by default.

sthibaul commented:

You can also try to cudaStreamSynchronize() in your kernel and drop the STARPU_CUDA_ASYNC flag, to check whether the kernel itself behaves differently.

Adding such a sync only to a single gemm kernel did not change the picture much:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.969021e+04   	1.245154e+03   	1.988711e+06   	3.931472e+10   	101
146137f2	234881024      	1.717987e+11   	5.597694e+03   	6.308121e+02   	4.466960e+06   	2.532222e+10   	798
22528761	272629760      	2.147484e+11   	5.917192e+03   	6.071566e+02   	2.884039e+07   	1.724509e+11   	4874

Without the sync:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	2.185368e+04   	4.096205e+03   	2.141661e+06   	4.844750e+10   	98
146137f2	234881024      	1.717987e+11   	1.011980e+04   	2.119861e+03   	1.042339e+07   	1.101112e+11   	1030
22528761	272629760      	2.147484e+11   	9.720931e+03   	2.044775e+03   	6.126130e+07   	6.218662e+11   	6302

Yes, performance got up twice, but it is still far from performance of StarPU-1.3 compiled on an access (login) node:

# hash		size		flops		mean (us)	dev (us)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.667207e+04   	8.012294e+02   	4.396424e+07   	7.346677e+11   	2637
146137f2	234881024      	1.717987e+11   	1.350489e+03   	8.248709e+01   	2.934882e+07   	3.978311e+10   	21732
22528761	272629760      	2.147484e+11   	1.703173e+03   	8.857789e+01   	2.417620e+08   	4.128764e+11   	141948

I wasn't really expecting a performance increase, but rather a more stable measurement. The deviation is really large; the nvmlDeviceGetTotalEnergyConsumption calls could explain that.

sthibaul commented Mar 14, 2024

Also, I notice that you have different-but-quite-close data sizes

I have a pipeline of computations. Pipeline operates on tiles of shapes 1, 4096, 5120, 4096x5120, 5120x5120, and 4096x51200. Sizes never change. Shapes 4096x5120 and 5120x5120 are indeed close, but does data allocation reuse require ALL tiles to be of the same shape? That would be strange.

No, but one cannot directly reuse the allocation for a different tile size, so if the global ratios of the different data shapes vary along the workload, one has to free/allocate to cope with the new ratios. That can explain the amount of reallocation. You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute. You might end up with prefetching fighting with pre-eviction when there are a lot of ready tasks, but I'm interested in seeing the eventual trace.
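
As an aside, a minimal sketch of setting these two thresholds from the program itself rather than in the shell: a plain setenv() before starpu_init(), with the values suggested above (which I understand as percentages of each memory node).

#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Keep some device memory free at all times, and evict down to the
     * target once the minimum is reached, so room is available for tiles
     * of a different shape without last-minute eviction. */
    setenv("STARPU_MINIMUM_AVAILABLE_MEM", "5", 1);
    setenv("STARPU_TARGET_AVAILABLE_MEM", "10", 1);

    if (starpu_init(NULL) != 0)
        return 1;
    /* ... register data and submit tasks ... */
    starpu_shutdown();
    return 0;
}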

Muxas commented Mar 14, 2024

Do you have STARPU_PROFILING enabled?

Yes, it is nearly always on. STARPU_BUS_STATS=1 is ignored without overall profiling.

In master, 1.3 and 1.4 branches I have now disabled its use by default.

I tried the latest starpu-1.4 commit and can confirm the performance model is now in good shape:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
22528761	272629760      	2.147484e+11   	1.638281e+03   	9.637613e+01   	3.550647e+07   	5.837090e+10   	21673
146137f2	234881024      	1.717987e+11   	1.282526e+03   	1.013623e+02   	3.500014e+06   	4.516898e+09   	2729
0a5b7fae	1971322880     	2.147484e+12   	1.564672e+04   	1.066607e+03   	1.179762e+07   	1.854519e+11   	754

before it was:

# hash		size		flops		mean (us or J)	dev (us or J)	sum		sum2		n
0a5b7fae	1971322880     	2.147484e+12   	1.969021e+04   	1.245154e+03   	1.988711e+06   	3.931472e+10   	101
146137f2	234881024      	1.717987e+11   	5.597694e+03   	6.308121e+02   	4.466960e+06   	2.532222e+10   	798
22528761	272629760      	2.147484e+11   	5.917192e+03   	6.071566e+02   	2.884039e+07   	1.724509e+11   	4874

Now we are back to fighting the scheduler, which transfers more data than the StarPU-1.3 version does.

Muxas commented Mar 14, 2024

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Preliminary tests last week with these environment variables did not bring us any performance gain. I will give it another try.

sthibaul commented:

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Preliminary tests on a previous week with these environment variables did not bring us performance. I will give it another try.

At the beginning of the execution the prefetch probably fights with eviction, so that would lose time, but we'd want to fix that at some point. I'm interested in what happens later in the execution, when there are far fewer ready tasks, thus much less prefetching and no fight; there we could hope for much less last-minute write-back.

Muxas commented Mar 14, 2024

thus much less prefetching and then no fight, we could hope for much less last-minute write-back

That is why I wonder in #35 whether there is a way to tell StarPU that a given handle can be assumed "dirty" from now on, without reallocating the resource as starpu_data_invalidate_submit does. I, the programmer, guarantee that the next use of the handle will be in STARPU_W mode in the next iteration. I believe this hint would help StarPU reduce the amount of transferred data by a lot for my application of training neural networks.
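
For reference, a small sketch of the existing call mentioned above; what #35 asks for is a lighter-weight hint with a similar effect on transfers.

#include <starpu.h>

/* Discard the current contents of a handle whose next access will be in
 * STARPU_W mode anyway, so that StarPU does not write the stale data back
 * to main memory. */
static void discard_handle(starpu_data_handle_t h)
{
    starpu_data_invalidate_submit(h);
}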

Muxas commented Mar 14, 2024

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Setting these parameters led to a data race. The trace is attached:
datarace.paje.trace.tar.gz

Muxas commented Mar 15, 2024

You may want to try to set STARPU_MINIMUM_AVAILABLE_MEM=5 and STARPU_TARGET_AVAILABLE_MEM=10, so that memory is immediately available for different shapes, rather than evicting data at the last minute.

Besides sometimes triggering the watchdog, this change did not help increase performance with StarPU-1.4. As of now, the performance of my app with a single data-parallel track on a single GPU reaches 100 Tflops/s. When I switch to 4 independent data-parallel tracks on 4 GPUs, performance goes up to 360 Tflops/s with StarPU-1.3 but remains at 100 Tflops/s with StarPU-1.4. For some strange reason StarPU-1.4 transfers much more data through the slow CPU-GPU PCI-express bus instead of the fast SXM4 bus. I cannot believe this is only due to the scheduling technique. Maybe there is double prefetching of the same buffer? Since the issue with NUMA indexing is solved and the performance of StarPU-1.4 is still much lower than that of StarPU-1.3, I would like to continue the search. I could send traces, but they weigh around 1.8 GB each.

StarPU-1.4 (commit 159175aee64b7fa89f70b2ad6045d657fff1dc1a of gitlab):

Training performance: 106.04774427457744 Tflops/s
#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	3512.1792 GB	5940.0140 MB/s	(transfers : 58013 - avg 61.9942 MB)
	CUDA 0 -> NUMA 0	378.2384 GB	639.7000 MB/s	(transfers : 10657 - avg 36.3438 MB)
	NUMA 0 -> CUDA 1	4125.6567 GB	6977.5645 MB/s	(transfers : 52147 - avg 81.0147 MB)
	CUDA 1 -> NUMA 0	641.2321 GB	1084.4912 MB/s	(transfers : 9130 - avg 71.9191 MB)
	CUDA 0 -> CUDA 1	2449.2275 GB	4142.2842 MB/s	(transfers : 38293 - avg 65.4952 MB)
	CUDA 1 -> CUDA 0	2818.3569 GB	4766.5784 MB/s	(transfers : 40335 - avg 71.5507 MB)
	NUMA 0 -> CUDA 2	3782.2424 GB	6396.7606 MB/s	(transfers : 51875 - avg 74.6606 MB)
	CUDA 2 -> NUMA 0	278.6027 GB	471.1899 MB/s	(transfers : 8306 - avg 34.3474 MB)
	CUDA 0 -> CUDA 2	2652.0833 GB	4485.3659 MB/s	(transfers : 40200 - avg 67.5556 MB)
	CUDA 2 -> CUDA 0	3210.4692 GB	5429.7424 MB/s	(transfers : 44222 - avg 74.3413 MB)
	CUDA 1 -> CUDA 2	2169.9846 GB	3670.0110 MB/s	(transfers : 32642 - avg 68.0738 MB)
	CUDA 2 -> CUDA 1	2613.1143 GB	4419.4589 MB/s	(transfers : 37885 - avg 70.6303 MB)
	NUMA 0 -> CUDA 3	4245.4556 GB	7180.1744 MB/s	(transfers : 52158 - avg 83.3496 MB)
	CUDA 3 -> NUMA 0	574.7733 GB	972.0917 MB/s	(transfers : 8547 - avg 68.8625 MB)
	CUDA 0 -> CUDA 3	2453.2351 GB	4149.0613 MB/s	(transfers : 37210 - avg 67.5118 MB)
	CUDA 3 -> CUDA 0	2310.4956 GB	3907.6514 MB/s	(transfers : 34640 - avg 68.3010 MB)
	CUDA 1 -> CUDA 3	2358.8792 GB	3989.4806 MB/s	(transfers : 35530 - avg 67.9846 MB)
	CUDA 3 -> CUDA 1	1971.4827 GB	3334.2919 MB/s	(transfers : 31633 - avg 63.8194 MB)
	CUDA 2 -> CUDA 3	2139.7852 GB	3618.9353 MB/s	(transfers : 31820 - avg 68.8605 MB)
	CUDA 3 -> CUDA 2	2551.2749 GB	4314.8718 MB/s	(transfers : 38917 - avg 67.1302 MB)
Total transfers: 47236.7695 GB

Starpu-1.3 (commit 11699e22f3125723fb475e33797a6dcdaaecb7d7 of gitlab):

Training performance: 326.13827992644457 Tflops/s
#---------------------
Data transfer stats:
	NUMA 0 -> CUDA 0	10.4181 GB	41.3259 MB/s	(transfers : 980 - avg 10.8859 MB)
	CUDA 0 -> NUMA 0	3.0865 GB	12.2433 MB/s	(transfers : 59 - avg 53.5694 MB)
	NUMA 0 -> CUDA 1	11.4906 GB	45.5799 MB/s	(transfers : 434 - avg 27.1114 MB)
	CUDA 1 -> NUMA 0	5.2519 GB	20.8328 MB/s	(transfers : 145 - avg 37.0892 MB)
	CUDA 0 -> CUDA 1	1055.8624 GB	4188.3069 MB/s	(transfers : 19097 - avg 56.6164 MB)
	CUDA 1 -> CUDA 0	1514.0365 GB	6005.7532 MB/s	(transfers : 25104 - avg 61.7580 MB)
	NUMA 0 -> CUDA 2	14.4211 GB	57.2045 MB/s	(transfers : 341 - avg 43.3057 MB)
	CUDA 2 -> NUMA 0	6.2903 GB	24.9518 MB/s	(transfers : 99 - avg 65.0631 MB)
	CUDA 0 -> CUDA 2	1609.3960 GB	6384.0162 MB/s	(transfers : 27502 - avg 59.9237 MB)
	CUDA 2 -> CUDA 0	1697.3870 GB	6733.0511 MB/s	(transfers : 27228 - avg 63.8359 MB)
	CUDA 1 -> CUDA 2	1199.5175 GB	4758.1442 MB/s	(transfers : 21281 - avg 57.7184 MB)
	CUDA 2 -> CUDA 1	1876.4987 GB	7443.5356 MB/s	(transfers : 29418 - avg 65.3183 MB)
	NUMA 0 -> CUDA 3	12.0974 GB	47.9870 MB/s	(transfers : 358 - avg 34.6026 MB)
	CUDA 3 -> NUMA 0	6.4070 GB	25.4149 MB/s	(transfers : 76 - avg 86.3264 MB)
	CUDA 0 -> CUDA 3	1819.2949 GB	7216.6235 MB/s	(transfers : 28441 - avg 65.5025 MB)
	CUDA 3 -> CUDA 0	1214.7228 GB	4818.4582 MB/s	(transfers : 20123 - avg 61.8137 MB)
	CUDA 1 -> CUDA 3	1747.9844 GB	6933.7543 MB/s	(transfers : 28541 - avg 62.7146 MB)
	CUDA 3 -> CUDA 1	1395.9900 GB	5537.4929 MB/s	(transfers : 24528 - avg 58.2801 MB)
	CUDA 2 -> CUDA 3	1206.3433 GB	4785.2183 MB/s	(transfers : 20884 - avg 59.1503 MB)
	CUDA 3 -> CUDA 2	2197.1829 GB	8715.5950 MB/s	(transfers : 33908 - avg 66.3535 MB)
Total transfers: 18603.6797 GB
