
std.math: change gcd's implementation to use Stein's algorithm instead of Euclid's #21077

Merged: 4 commits into ziglang:master from Fri3dNstuff:stein-gcd, Sep 24, 2024

Conversation

@Fri3dNstuff (Contributor)

according to my tests (on an x86_64 machine, which has a @ctz instruction) Stein's algorithm is on par with Euclid's on small (≤ 100) random inputs, and roughly 40% faster on large random inputs (random u64s).
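
For context, a minimal sketch of the binary (Stein) algorithm in Zig, assuming u64 operands; the PR's actual implementation may differ in its details:

```zig
const std = @import("std");

// minimal binary (Stein) gcd sketch; assumes at least one operand is nonzero
fn gcdStein(a: u64, b: u64) u64 {
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    // factor out the common power of two
    const shift: u6 = @intCast(@min(@ctz(x), @ctz(y)));
    x >>= @intCast(@ctz(x)); // make x odd
    while (true) {
        y >>= @intCast(@ctz(y)); // make y odd
        if (x > y) std.mem.swap(u64, &x, &y); // keep x <= y
        y -= x; // odd - odd, so y is now even
        if (y == 0) return x << shift;
    }
}
```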

@SilasLock

It might be worth testing this on a CPU that doesn't have a hardware ctz instruction or equivalent, just to make sure Stein's algorithm is still a performance improvement there, too! Wikipedia informs me that some modern CPUs are like this and gives the ARM-Mx series as an example.
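
On such targets @ctz lowers to a software routine; a naive sketch of what that costs (not the actual compiler-rt code, which uses cleverer tricks):

```zig
// naive software count-trailing-zeros: one loop iteration per trailing zero
// bit, versus a single instruction on hardware that has ctz
fn softCtz(x: u64) u7 {
    if (x == 0) return 64;
    var n: u7 = 0;
    var v = x;
    while (v & 1 == 0) : (v >>= 1) n += 1;
    return n;
}
```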

@andrewrk (Member)

Can you share your testing methodology?

@The-King-of-Toasters (Contributor)

Since you're trading integer division for bit-twiddling, I imagine Stein's would be faster. Here's an example of both implementations compiled for a v4t core (the same as the GBA uses). Note that gcdStd calls __aeabi_udivmod, which ends up calling div_u32, so you're trading one form of bit-twiddling for another.
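
For reference, the Euclidean version being replaced works by repeated remainder; a minimal sketch (not the exact std.math.gcd source):

```zig
// classic Euclidean gcd; each iteration costs one integer division,
// which is what __aeabi_udivmod ends up implementing in software on v4t
fn gcdEuclid(a: u64, b: u64) u64 {
    var x = a;
    var y = b;
    while (y != 0) {
        const r = x % y;
        x = y;
        y = r;
    }
    return x;
}
```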

@Fri3dNstuff (Contributor, Author)

> Can you share your testing methodology?

@andrewrk sure, it's not anything sophisticated - I'm running through 10,000,000 random pairs of numbers, computing their gcd once with each algorithm (also once just XORing the values, to make sure the RNG doesn't account for much time on its own...).

the three programs are compiled with -OReleaseFast and performance is compared with poop.

the program for Euclid's is given below; the only difference between it and the others is the call to std.math.gcd:

```zig
const std = @import("std");
const N = u64;

pub fn main() void {
    // init with a runtime-known seed
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: N = 0;
    for (0..10_000_000) |_| {
        res +%= @truncate(std.math.gcd(rand.random().int(N), rand.random().int(N)));
    }
    // do something with the result... can't let LLVM be too smart...
    std.debug.print("{}\n", .{res});
}
```

@The-King-of-Toasters (Contributor)

I copied @Fri3dNstuff's test and ran it on the machines I have access to. TL;DR: Stein wins every time, except in a couple of cases:

  • riscv64 without the Zbb extension specified. This uses a software ctz, and is slightly slower.
  • The POWER10 cfarm server, for some reason?

Some devices are too constrained to run perf, or have it disabled in their kernels; for those, time was used instead.

@andrewrk I could only ever get poop running on x86_64. Every other target gave errors along these lines:

```
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/posix.zig:7195:23: 0x106c59f in perf_event_open (poop)
/private/tmp/poop/src/main.zig:201:54: 0x106719b in main (poop)
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/start.zig:614:37: 0x105ba57 in posixCallMainAndExit (poop)
???:?:?: 0x0 in ??? (???)
```

Is this a known issue?

@scheibo (Contributor) · Aug 25, 2024

https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and is incidentally also what libc++ has switched to.

@ProkopRandacek (Contributor)

I recommend looking at the code for gcd in algorithmica. It claims an implementation faster than the one this PR suggests, thanks to fewer data dependencies between instructions.

See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.
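
The linked approach works on signed integers so that the absolute value and minimum fall out branchlessly; a Zig paraphrase of the idea (a sketch assuming non-negative i64 inputs, not @ProkopRandacek's code verbatim):

```zig
const std = @import("std");

// sketch of the signed, algorithmica-style binary gcd;
// assumes non-negative inputs, at least one nonzero
fn gcdSigned(a: i64, b: i64) i64 {
    std.debug.assert(a >= 0 and b >= 0);
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift: u6 = @intCast(@min(xz, yz));
    y >>= @intCast(yz);
    while (x != 0) {
        x >>= @intCast(xz); // make x odd
        const diff = y - x;
        xz = @ctz(diff); // 64 when diff == 0, but then x becomes 0 and we exit
        y = @min(x, y);
        x = if (diff < 0) -diff else diff; // |diff|, cmov-friendly on signed
    }
    return y << shift;
}
```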

@Fri3dNstuff (Contributor, Author)

@scheibo

> https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and is incidentally also what libc++ has switched to

translating the C++ code to Zig and comparing it with the Stein implementation on randomly generated values of different lengths yields nearly identical running times (at least on my machine). below is the translated code - maybe I missed something?

```zig
// wrapped in a function for completeness; N is u64 as in the benchmark above
fn gcd(a: N, b: N) N {
    var x: N = a;
    var y: N = b;
    if (x < y) {
        const tmp = y;
        y = x;
        x = tmp;
    }
    if (y == 0) return x;
    x %= y;
    if (x == 0) return y;

    const i = @ctz(x);
    const j = @ctz(y);
    const shift = @min(i, j);
    x >>= @intCast(i);
    y >>= @intCast(j);

    while (true) {
        // underflow is legal
        const diff = x -% y;
        if (x > y) {
            x = y;
            y = diff;
        } else {
            y -= x;
        }

        // shift must be by a value < bit size
        if (diff != 0) y >>= @intCast(@ctz(diff));

        if (y == 0) return x << @intCast(shift);
    }
}
```

@ProkopRandacek

> See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.

I have tested it against the Stein implementation in the pull request, with random u63 values for both (cast to u64s in the pull-request implementation), and indeed your implementation runs 15 to 20 percent faster (!)

the algorithm, however, uses signed integers (expecting the caller to only pass in non-negatives) - do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers? calling an i65 version of the algorithm for full-range u64s and casting back the result seems very expensive...

@ProkopRandacek (Contributor)

> do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers

I don't think there is. And I think that, in the name of performance, it is more than reasonable to change the function signature from (a: anytype, b: anytype) to (a: i64, b: i64). You probably rarely need to calculate the gcd of numbers that don't fit in this range.

Making the function clever and automatically casting u64 to i65 is probably bad because, as you said, it makes the code a lot slower.

@The-King-of-Toasters (Contributor)

@Fri3dNstuff That's only because std::gcd can accept signed ints. Note that in the libc++ PR there is a static assert that checks whether the type of a or b is unsigned. We don't have to care about that.

Btw, I've updated my benchmark with your translation and it's better on all targets except my mipsel router for some reason.

@scheibo (Contributor) · Aug 26, 2024

> You probably rarely need to calculate gcd of numbers that don't fit in this range.

Big rationals say hi. :)

Of course, you can make the implementation use a different algorithm based on the type of the arguments, to be able to use the faster path when possible.

@ProkopRandacek (Contributor)

> > You probably rarely need to calculate gcd of numbers that don't fit in this range.
>
> Big rationals say hi. :)

On second thought, you're right.

> Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible

My worry is that u64 is common and generally regarded as a fast data type, yet using it here gives you the slow implementation. Good design would let users fall into the pit of success and use the i64 version; only when they really need a larger int type, and know that there is a cost, should they reach for the general implementation.

Here that could look like having 2 functions: gcd(i64, i64) and gcdSlow(anytype, anytype).

@Fri3dNstuff (Contributor, Author)

I have played around with @ProkopRandacek's algorithm, attempting to make it use unsigned numbers. surprisingly, this new version is 18% faster on my machine. can anyone confirm that this version is indeed faster, and not just a fluke of my computer? if it is, I'll update the pull request with a generic version of this.

here's the algorithm. I ran both it and @ProkopRandacek's through 10,000,000 pairs of random u63s, wrap-summing their results (as described in a comment of mine above). both algorithms produce the same sum: 223029890 - the signed algorithm obviously produces an i64 as output, but the result is in the positive range, so they agree; we "got lucky" with that...

```zig
const std = @import("std");

fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    y >>= @intCast(yz);

    while (true) {
        x >>= @intCast(xz);
        var diff = y -% x;
        if (diff == 0) return y << @intCast(shift);
        xz = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff;
    }
}
```
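
A quick way to gain confidence beyond the wrap-sum check is an exhaustive comparison against a Euclidean reference on small inputs. A sketch, assuming this comment's gcd and a gcdEuclid like the one sketched earlier in the thread are in scope:

```zig
test "binary gcd matches Euclid on small values" {
    var x: u64 = 0;
    while (x < 200) : (x += 1) {
        var y: u64 = 0;
        while (y < 200) : (y += 1) {
            if (x == 0 and y == 0) continue; // gcd(0, 0) is asserted against
            try std.testing.expectEqual(gcdEuclid(x, y), gcd(x, y));
        }
    }
}
```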

@Fri3dNstuff (Contributor, Author)

I have refined the algorithm a bit further, and managed to squeeze a few more percent of performance out of it. I believe these improvements are applicable generally (and aren't just LLVM liking the new version better, specifically in the case of my machine) - but without tests on other machines I can't be sure...

changed the loop a bit and shuffled some declarations around - this seems to help with register allocation. also, I think I understand why this version is faster than @ProkopRandacek's signed one: the negation of diff and the modification of y happen right beside each other; there is only a need to cmp x, y and cmov based on the result! below is the refined algorithm:

```zig
const std = @import("std");

fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    const xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    x >>= @intCast(xz);
    y >>= @intCast(yz);
    var diff = y -% x;

    while (diff != 0) {
        const zeros = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff >> @intCast(zeros);
        diff = y -% x;
    }
    return y << @intCast(shift);
}
```

for the generic case, I think it would be best to just use the usize version for integer lengths not larger than usize, and cast the result - x86 for example does not have a ctz version for 8 bits (only 16, 32, and 64) - further, using exotic-length integers will make LLVM sprinkle &s everywhere, for the wrapping subtractions.

does anyone know of an architecture where the algorithm works better for some integer length shorter than usize? I know nearly nothing beyond x86 :/

@The-King-of-Toasters (Contributor)

@Fri3dNstuff seems to work fine for me. I altered the benchmark to generate ints that are 1 bit less than usize before extending to usize again (i.e. u31 -> u32, u63 -> u64). This new version still comes out on top.

I also added a build option to change the bit width of the random integers. They're extended to usize, but you can hack it to behave how you want.

@Fri3dNstuff (Contributor, Author)

I changed the implementation to use the optimised version. @The-King-of-Toasters, thank you so much for the perf tests!

there is still the question of expanding the width of the integers used in the calculation, in case the user wants to use some exotic-length ints (currently the algorithm performs quite badly with them). LLVM generates a bunch of n & std.math.maxInt(@TypeOf(n))s to zero out parts of the number whenever -% is used...

we may have to make a table of the fast integer sizes for each architecture, and do some comptime switches for the implementation based on that (on my machine it seems to be fastest to always convert to u64s before the calculation, and @intCast back at the end)
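
a sketch of what that widening could look like for unsigned types up to 64 bits, with gcdFast standing in for the optimised u64 loop above (hypothetical name, not the merged code):

```zig
const std = @import("std");

// dispatch sketch: widen small unsigned types to u64, run the fast loop,
// then narrow the result back; wider or exotic types need their own path
fn gcdAny(a: anytype, b: anytype) @TypeOf(a, b) {
    const T = @TypeOf(a, b);
    comptime std.debug.assert(@bitSizeOf(T) <= 64); // assumption: fits in u64
    return @intCast(gcdFast(a, b)); // unsigned arguments coerce up to u64
}
```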

@chrboesch (Contributor) left a review comment:

I appreciate this algorithm and your implementation!

(Two review threads on lib/std/math/gcd.zig, both resolved.)
@andrewrk merged commit b2c53eb into ziglang:master on Sep 24, 2024. 10 checks passed.

@andrewrk added the labels standard library, release notes, enhancement, and optimization on Sep 24, 2024.

@Fri3dNstuff deleted the stein-gcd branch on September 24, 2024 at 09:46.
DivergentClouds pushed a commit to DivergentClouds/zig that referenced this pull request Sep 24, 2024
richerfu pushed a commit to richerfu/zig that referenced this pull request Oct 28, 2024