Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement polite waiting #685

Merged
merged 14 commits into from
Nov 16, 2024
Merged

implement polite waiting #685

merged 14 commits into from
Nov 16, 2024

Conversation

SchrodingerZhu
Copy link
Contributor

@SchrodingerZhu SchrodingerZhu commented Oct 29, 2024

./perf-startup-fast | head -n1 | awk -e '{print $2;}' 

Results:

diff -y futex nofutex
9021562                                                       | 7005614
7032388                                                       | 6074838
5520130                                                       | 8339826
7180558                                                       | 7456548
5003988                                                       | 8501328
5444736                                                       | 8336394
5910520                                                       | 7088290
6360882                                                       | 5502706
5328400                                                       | 8408862
5326002                                                       | 6961724
6036778                                                       | 7615190
5355394                                                       | 6286016
4684680                                                       | 5246340
5733816                                                       | 6361762
8369020                                                       | 5165358
4371268                                                       | 7018726
8047424                                                       | 7594356
6503838                                                       | 7517180
6124338                                                       | 8764514
6654956                                                       | 6460300
8835508                                                       | 8051054
6288260                                                       | 9369866
7399172                                                       | 7305166
5267988                                                       | 5622386
4134526                                                       | 5360190
5263302                                                       | 8671476
4420614                                                       | 6108542
8564402                                                       | 6404860

I think the performance is actually slightly better on the left?

@SchrodingerZhu
Copy link
Contributor Author

SchrodingerZhu commented Oct 29, 2024

Script:

import matplotlib.pyplot as plt
import subprocess

subprocess.run(["git", "checkout", "460c5552b74da2118802f53258b4ef37a2a85601"], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel"], check=True)
time = []
for i in range(1, 100):
    out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n"):
        if line.startswith("Taken: "):
            time.append(float(line.split()[1]))
            break

subprocess.run(["git", "checkout", "97b76756707371dda8ae74c95274266a21d217e3"], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel"], check=True)
time2 = []
for i in range(1, 100):
    out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n"):
        if line.startswith("Taken: "):
            time2.append(float(line.split()[1]))
            break

plt.boxplot([time, time2], labels=["futex", "nofutex"], vert=True, patch_artist=True, showmeans=True)
plt.show()

Figure_1

src/snmalloc/pal/pal_linux.h Outdated Show resolved Hide resolved
@mjp41
Copy link
Member

mjp41 commented Oct 29, 2024

Script:

import matplotlib.pyplot as plt
import subprocess

subprocess.run(["git", "checkout", "460c5552b74da2118802f53258b4ef37a2a85601"], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel"], check=True)
time = []
for i in range(1, 100):
    out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n"):
        if line.startswith("Taken: "):
            time.append(float(line.split()[1]))
            break

subprocess.run(["git", "checkout", "97b76756707371dda8ae74c95274266a21d217e3"], check=True)
subprocess.run(["cmake", "--build", ".", "--parallel"], check=True)
time2 = []
for i in range(1, 100):
    out = subprocess.run(["./perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n"):
        if line.startswith("Taken: "):
            time2.append(float(line.split()[1]))
            break

plt.boxplot([time, time2], labels=["futex", "nofutex"], vert=True, patch_artist=True, showmeans=True)
plt.show()

Figure_1

Results look great. How many cores on the test box?

@SchrodingerZhu
Copy link
Contributor Author

It is on a 7773X with 64c128t.

@SchrodingerZhu
Copy link
Contributor Author

SchrodingerZhu commented Oct 30, 2024

Some additional tests on Latest Surface Pro (X Elite) with Native ARM64 VS2022. (ignore the MSYS path, I simply use their git cuz I don't want to install 😄 ).

import matplotlib.pyplot as plt
import subprocess

subprocess.run(["C:\\msys64\\usr\\bin\\git", "checkout", "14c439afcd08a00c48d23df18fd17403a12cf4da"], check=True)
subprocess.run(["cmake", "--build", ".", "--config", "Release", "--parallel", "--target", "perf-startup-fast"], check=True)
time = []
for i in range(1, 1000):
    out = subprocess.run(["Release\\perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n\r"):
        if line.startswith("Taken: "):
            time.append(float(line.split()[1]))
            break

subprocess.run(["C:\\msys64\\usr\\bin\\git", "checkout", "97b76756707371dda8ae74c95274266a21d217e3"], check=True)
subprocess.run(["cmake", "--build", ".", "--config", "Release", "--parallel", "--target", "perf-startup-fast"], check=True)
time2 = []
for i in range(1, 1000):
    out = subprocess.run(["Release\\perf-startup-fast", str(i)], check=True, capture_output=True)
    out = out.stdout.decode("utf-8")
    for line in out.split("\n\r"):
        if line.startswith("Taken: "):
            time2.append(float(line.split()[1]))
            break

plt.boxplot([time, time2], labels=["futex", "nofutex"], vert=True, patch_artist=True, showmeans=True)
plt.show()

Figure_1

Copy link
Member

@mjp41 mjp41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Thanks for doing this.

src/snmalloc/ds/combininglock.h Outdated Show resolved Hide resolved
src/snmalloc/ds/combininglock.h Outdated Show resolved Hide resolved
@mjp41
Copy link
Member

mjp41 commented Oct 30, 2024

@SchrodingerZhu, so I ran it and got a performance slow down:

image

This is running on an Azure VM F72s class.
Linux: 6.5.0-1024-azure
Ubuntu: 22.04.1-Ubuntu SMP
CPU: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz

I am guessing this could be the virtualisation layer making the remote wake-up slower? Do you have any thoughts?

futex is the version from this PR (retry 100 times). futexN is with a retry of N times.

@SchrodingerZhu
Copy link
Contributor Author

Interesting.

I tried another time on surface pro's WSL just now and I indeed see a slow down.

Do you have the bandwidth to try this on more devices?

If the performance is unstable, I think we can make this a cmake option.

@SchrodingerZhu
Copy link
Contributor Author

SchrodingerZhu commented Oct 30, 2024

My observations:

  • I can stably reproduce futex's advantage over nofutex on 7773X (the chart changes of cuz but left side is always lower and more stable).
  • Windows/MacOS (I asked a friend of mine to test and I will attach the data later) all has a small performance gain when "futex" is used.

Given that WSL2/Azure VM does witness the degradation, maybe it is due to virtualization?

@SchrodingerZhu
Copy link
Contributor Author

Do you want me to make this an option?

@mjp41
Copy link
Member

mjp41 commented Nov 12, 2024

Do you want me to make this an option?

If that is okay, it would be great. Could you add a CI pipeline to cover the non-default option.

I'm thinking polite by default, and disable as an option?

@mjp41
Copy link
Member

mjp41 commented Nov 13, 2024

Looks like Mac OS 12 doesn't support this. Perhaps we need to disable on that version.

@SchrodingerZhu
Copy link
Contributor Author

@mjp41 addressed all issues. PTAL.

Copy link
Member

@mjp41 mjp41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for packaging this so nicely.

@SchrodingerZhu
Copy link
Contributor Author

fixed formatting

@mjp41 mjp41 merged commit f7fe702 into microsoft:main Nov 16, 2024
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants