# std.math: change gcd's implementation to use Stein's algorithm instead of Euclid's #21077

## Conversation
It might be worth testing this on a CPU that doesn't have a hardware `ctz` instruction.
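(For context: on such targets, `@ctz` is lowered to a software routine. A naive illustration of what that costs, purely a sketch and not the actual compiler-rt lowering:)

```zig
// Naive software count-trailing-zeros: a loop of shifts and tests instead of
// a single instruction. Real lowerings are branchless and faster, but still
// cost several instructions per call.
fn softCtz(x: u64) u7 {
    if (x == 0) return 64;
    var v = x;
    var n: u7 = 0;
    while ((v & 1) == 0) : (v >>= 1) n += 1;
    return n;
}
```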
Can you share your testing methodology?
Since you're trading integer division for bit-twiddling, I imagine Stein's would be faster. Here's an example of both implementations compiled for a …
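For readers following along, here is a minimal sketch of the two algorithms being compared (illustrative only; this is not the exact code from the PR or from `std.math`):

```zig
const std = @import("std");

// Euclid's algorithm: one integer division (remainder) per iteration.
fn gcdEuclid(a: u64, b: u64) u64 {
    var x = a;
    var y = b;
    while (y != 0) {
        const r = x % y;
        x = y;
        y = r;
    }
    return x;
}

// Stein's (binary) algorithm: only shifts, subtraction, and @ctz.
fn gcdStein(a: u64, b: u64) u64 {
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    // factor out the common power of two; it is restored at the end
    const shift = @min(@ctz(x), @ctz(y));
    x >>= @intCast(@ctz(x));
    while (y != 0) {
        y >>= @intCast(@ctz(y)); // y is now odd
        if (x > y) std.mem.swap(u64, &x, &y);
        y -= x; // difference of two odd numbers is even
    }
    return x << @intCast(shift);
}
```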
Sure, it's not anything sophisticated: I'm running through 10,000,000 random pairs of numbers, computing their gcd, once with each algorithm (also once just XORing the values, to make sure the RNG doesn't account for much time on its own...). The three programs are compiled with …; the program for Euclid's is given below, the only difference between it and the others being the gcd call.

```zig
const std = @import("std");

const N = u64;

pub fn main() void {
    // init with a runtime-known seed
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: N = 0;
    for (0..10_000_000) |_| {
        res +%= @truncate(std.math.gcd(rand.random().int(N), rand.random().int(N)));
    }
    // do something with the result... can't let LLVM be too smart...
    std.debug.print("{}\n", .{res});
}
```
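The comment doesn't say how the runs were timed; one self-contained option (an assumption, not the author's setup) is to wrap the loop in `std.time.Timer`:

```zig
const std = @import("std");

pub fn main() !void {
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: u64 = 0;
    // measure only the gcd loop, not process startup
    var timer = try std.time.Timer.start();
    for (0..10_000_000) |_| {
        res +%= std.math.gcd(rand.random().int(u64), rand.random().int(u64));
    }
    const elapsed_ns = timer.read();
    std.debug.print("res={} time={}ms\n", .{ res, elapsed_ns / std.time.ns_per_ms });
}
```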
I copied @Fri3dNstuff's test and ran it on the machines I have access to. TL;DR: Stein wins every time, except for some cases: …

Some devices are too constrained to run with perf, or have it disabled in their kernels. For these I could only ever get …

@andrewrk Is this a known issue?
https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach; incidentally, it is also what libc++ has switched to.
I recommend looking at the code for gcd in Algorithmica. It claims to have a faster implementation than what this PR is suggesting, due to data dependencies between instructions. See my implementation in Zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig
Translating the C++ code to Zig and comparing with the Stein implementation on randomly generated values of different lengths yields nearly identical running times (at least on my machine). Below is the translated code; maybe I missed something?

```zig
var x: N = a;
var y: N = b;
if (x < y) {
    const tmp = y;
    y = x;
    x = tmp;
}
if (y == 0) return x;
x %= y;
if (x == 0) return y;
const i = @ctz(x);
const j = @ctz(y);
const shift = @min(i, j);
x >>= @intCast(i);
y >>= @intCast(j);
while (true) {
    // wrapping underflow is legal: @ctz(x -% y) == @ctz(y - x)
    const diff = x -% y;
    if (x > y) {
        x = y;
        y = diff;
    } else {
        y -= x;
    }
    // the shift amount must be smaller than the bit size, hence the guard
    if (diff != 0) y >>= @intCast(@ctz(diff));
    if (y == 0) return x << @intCast(shift);
}
```
I have tested it against the Stein implementation in the pull request, with random … The algorithm, however, uses signed integers (expecting the caller to only pass in non-negatives). Do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers? Calling an …
I don't think there is. And I think that, in the name of performance, it is more than reasonable to change the function signature from … Making the function clever and automatically casting …
@Fri3dNstuff That's only because … Btw, I've updated my benchmark with your translation, and it's better on all targets except my …
Big rationals say hi. :) Of course, you can make the implementation use a different algorithm based on the type of the arguments, to be able to use the faster path when possible.
On second thought, you're right.

My worry is that u64 is common and generally regarded as a fast data type, yet using it here gives you the slow implementation. Good design would let users fall into the pit of success and use the i64 version; only when they really need a larger int type, and know that there is a cost, should they be given the general implementation. Here that could look like having two functions: … (one possible shape is sketched below).
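To make the suggestion concrete, the two-function split might look roughly like this. All names here are hypothetical and the bodies are placeholders, not the optimised algorithms discussed in this thread:

```zig
const std = @import("std");

// Fast path: the signed algorithm wants i64; non-negative inputs asserted.
pub fn gcd(a: i64, b: i64) i64 {
    std.debug.assert(a >= 0 and b >= 0);
    // placeholder: in practice this would be the optimised signed/binary code
    return @intCast(gcdGeneric(u64, @intCast(a), @intCast(b)));
}

// General path: works for any unsigned integer width, at some cost.
pub fn gcdGeneric(comptime T: type, a: T, b: T) T {
    var x = a;
    var y = b;
    while (y != 0) {
        const r = x % y;
        x = y;
        y = r;
    }
    return x;
}
```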
I have played around with @ProkopRandacek's algorithm, attempting to make it use unsigned numbers. Surprisingly, this new version is 18% faster running on my machine. Can anyone please confirm that this version is indeed faster, and not just a fluke of my computer? If it is, I'll update the pull request with a generic version of this. Here's the algorithm; I ran both it and @ProkopRandacek's through 10,000,000 pairs of random `u64`s.

```zig
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);
    var x = a;
    var y = b;
    if (x == 0) return y;
    if (y == 0) return x;
    var xz = @ctz(x);
    const yz = @ctz(y);
    // the common power of two is restored by the final shift
    const shift = @min(xz, yz);
    y >>= @intCast(yz);
    while (true) {
        x >>= @intCast(xz);
        var diff = y -% x;
        if (diff == 0) return y << @intCast(shift);
        xz = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff;
    }
}
```
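One quick way to gain confidence in the unsigned adaptation is a property test against the current implementation (a sketch, assuming the function above is in scope as `gcd` and that the existing `std.math.gcd` is the reference):

```zig
const std = @import("std");

test "unsigned binary gcd agrees with std.math.gcd" {
    var prng = std.Random.DefaultPrng.init(0);
    for (0..1_000) |_| {
        const a = prng.random().int(u64) | 1; // keep at least one input nonzero
        const b = prng.random().int(u64);
        try std.testing.expectEqual(std.math.gcd(a, b), gcd(a, b));
    }
}
```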
I have refined the algorithm a bit further, and managed to squeeze a few more percent of performance out of it. I believe these improvements are applicable generally (and aren't just LLVM liking the new version better, specifically in the case of my machine), but without tests on other machines I can't be sure... I changed the loop a bit and shuffled some declarations around; this seems to help with register allocation. Also, I think I understand why this version is faster than @ProkopRandacek's signed one: the negation of …

```zig
fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);
    var x = a;
    var y = b;
    if (x == 0) return y;
    if (y == 0) return x;
    const xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    x >>= @intCast(xz);
    y >>= @intCast(yz);
    var diff = y -% x;
    while (diff != 0) {
        const zeros = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff >> @intCast(zeros);
        diff = y -% x;
    }
    return y << @intCast(shift);
}
```

For the generic case, I think it would be best to just use the … Does anyone know of an architecture where the algorithm works better for some integer length shorter than …
@Fri3dNstuff Seems to work fine for me. I altered the benchmark to generate ints that are 1 bit less than … I also added a build option to change the bit width of the random integers. They're extended to …
I changed the implementation to use the optimised version. @The-King-of-Toasters, thank you so much for the perf tests! There is still the question of expanding the width of the integers used in the calculation, in case the user wants to use some exotic-length ints (currently the algorithm performs quite badly with them). LLVM generates a bunch of … We may have to make a table of the fast integer sizes for each architecture, and do some comptime switches for the implementation based on that (on my machine it seems to be fastest to always convert to …). A widening wrapper along those lines is sketched below.
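Such a wrapper might look roughly like this (a sketch under the assumption that converting to `u64` is the right choice; the name `gcdWiden` is hypothetical):

```zig
const std = @import("std");

// Promote an exotic unsigned width (e.g. u37) to u64, run the fast path,
// and narrow the result, which necessarily fits back into T.
fn gcdWiden(comptime T: type, a: T, b: T) T {
    comptime std.debug.assert(@bitSizeOf(T) <= 64);
    const wide = std.math.gcd(@as(u64, a), @as(u64, b));
    return @intCast(wide);
}
```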
I appreciate this algorithm and your implementation!
According to my tests (on an x86_64 machine, which has a hardware `@ctz` instruction), Stein's algorithm is on par with Euclid's on small (≤ 100) random inputs, and roughly 40% faster on large random inputs (random `u64`s).