Oversubscribed SMP Performance is Ludicrously Bad #25

Open
jszaday opened this issue Jan 8, 2022 · 0 comments

jszaday commented Jan 8, 2022

In particular, this affects the jacobi, cbench, and pingpong tests.

Cbench is the worst of the bunch, taking more than a minute to complete in CI:

```
8: Test command: /home/runner/work/charmlite/charmlite/charm/bin/charmrun "/home/runner/work/charmlite/charmlite/build/bin/pgm_cbench_benchmark" "+p2" "++ppn2"
8: Test timeout computed to be: 120
8: 
8: Running as 1 OS processes: /home/runner/work/charmlite/charmlite/build/bin/pgm_cbench_benchmark ++ppn2
8: charmrun> /usr/bin/setarch x86_64 -R mpirun -np 1 /home/runner/work/charmlite/charmlite/build/bin/pgm_cbench_benchmark ++ppn2
8: Charm++> Running in SMP mode: 1 processes, 2 worker threads (PEs) + 1 comm threads per process, 2 PEs total
8: Charm++> The comm. thread both sends and receives messages
8: Converse/Charm++ Commit ID: v7.1.0-devel-122-g064b48915
8: Charm++> Using STL-based msgQ:
8: Charm++> Message priorities have been turned off and will not be respected.
8: main> rep 1 of 16
8: main> rep 2 of 16
8: main> rep 3 of 16
8: main> rep 4 of 16
8: main> rep 5 of 16
8: main> rep 6 of 16
8: main> rep 7 of 16
8: main> rep 8 of 16
8: main> rep 9 of 16
8: main> rep 10 of 16
8: main> rep 11 of 16
8: main> rep 12 of 16
8: main> rep 13 of 16
8: main> rep 14 of 16
8: main> rep 15 of 16
8: main> rep 16 of 16
8: info> interleaved 129 broadcasts and reductions across 8 chares
8: info> average time per repetition: 4453.8 ms
8: info> average time per broadcast+reduction: 34525.6 ns
8: [Partition 0][Node 0] End of program
 8/10 Test  #8: pgm_cbench_benchmark_pe2 .........   Passed   72.52 sec
```

It's not uncommon to see these 34525.6 ns broadcast+reduction times on an over-subscribed PC either! We should try to determine what's going on here, and why the performance is so bad for these configurations.

What I've tried so far:

- Enabling or disabling +CmiSleepOnIdle.
- Enabling or disabling CPU topology/affinity.
- Using the lockless queue (--enable-lockless-queue).

Nothing seemed to improve the situation.

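For anyone trying to reproduce this outside of CI, here is a minimal sketch of the launch commands, assuming the same build layout as the CI log above (charm/bin/charmrun and build/bin/pgm_cbench_benchmark relative to the charmlite checkout; adjust the paths to your own tree). As the log shows, +p2 ++ppn2 runs both worker PEs plus the comm thread in a single process, which is what oversubscribes a small runner:

```sh
# Baseline oversubscribed configuration from the CI log:
# 1 process, 2 worker PEs + 1 comm thread.
./charm/bin/charmrun ./build/bin/pgm_cbench_benchmark +p2 ++ppn2

# Variant with the idle-sleep flag mentioned above (tried, no improvement):
./charm/bin/charmrun ./build/bin/pgm_cbench_benchmark +p2 ++ppn2 +CmiSleepOnIdle

# The lockless-queue variant requires rebuilding Charm++ with
# --enable-lockless-queue before rerunning the same commands.
```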