
std.math: change gcd's implementation to use Stein's algorithm instead of Euclid's #21077

Merged: 4 commits into ziglang:master from Fri3dNstuff:stein-gcd, Sep 24, 2024

Conversation

@Fri3dNstuff (Contributor)

according to my tests (on an x86_64 machine, which has a @ctz instruction) Stein's algorithm is on par with Euclid's on small (≤ 100) random inputs, and roughly 40% faster on large random inputs (random u64s).
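
For context, a minimal sketch of the binary (Stein) algorithm in Zig, assuming u64 operands; the PR's actual implementation may differ in its details:

```zig
const std = @import("std");

// minimal binary (Stein) gcd sketch; assumes at least one operand is nonzero
fn gcdStein(a: u64, b: u64) u64 {
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    // factor out the common power of two
    const shift: u6 = @intCast(@min(@ctz(x), @ctz(y)));
    x >>= @intCast(@ctz(x)); // make x odd
    while (true) {
        y >>= @intCast(@ctz(y)); // make y odd
        if (x > y) std.mem.swap(u64, &x, &y); // keep x <= y
        y -= x; // odd - odd, so y is now even
        if (y == 0) return x << shift;
    }
}
```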

@SilasLock

It might be worth testing this on a CPU that doesn't have a hardware ctz instruction or equivalent, just to make sure Stein's algorithm is still a performance improvement there, too! Wikipedia informs me that some modern CPUs are like this and gives the ARM-Mx series as an example.
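
On such targets @ctz lowers to a software routine; a naive sketch of what that costs (not the actual compiler-rt code, which uses cleverer tricks):

```zig
// naive software count-trailing-zeros: one loop iteration per trailing zero
// bit, versus a single instruction on hardware that has ctz
fn softCtz(x: u64) u7 {
    if (x == 0) return 64;
    var n: u7 = 0;
    var v = x;
    while (v & 1 == 0) : (v >>= 1) n += 1;
    return n;
}
```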

@andrewrk (Member)

Can you share your testing methodology?

@The-King-of-Toasters (Contributor)

Since you're trading integer division for bit-twiddling, I imagine Stein's would be faster. Here's an example of both implementations compiled for a v4t core (the same as the GBA uses). Note that gcdStd calls __aeabi_udivmod, which ends up calling div_u32, so you're trading one form of bit-twiddling for another.
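
For reference, the Euclidean version being replaced works by repeated remainder; a minimal sketch (not the exact std.math.gcd source):

```zig
// classic Euclidean gcd; each iteration costs one integer division,
// which is what __aeabi_udivmod ends up implementing in software on v4t
fn gcdEuclid(a: u64, b: u64) u64 {
    var x = a;
    var y = b;
    while (y != 0) {
        const r = x % y;
        x = y;
        y = r;
    }
    return x;
}
```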

@Fri3dNstuff (Contributor, Author)

> Can you share your testing methodology?

@andrewrk sure, it's not anything sophisticated - I'm running through 10,000,000 random pairs of numbers, computing their gcd once with each algorithm (also once just XORing the values, to make sure the RNG doesn't account for much time on its own...).

the three programs are compiled with -OReleaseFast and performance is compared with poop.

the program for Euclid's is given below; the only difference between it and the others is the call to std.math.gcd:

```zig
const std = @import("std");
const N = u64;

pub fn main() void {
    // init with a runtime-known seed
    var rand = std.Random.Xoroshiro128.init(std.os.argv.len);
    var res: N = 0;
    for (0..10_000_000) |_| {
        res +%= @truncate(std.math.gcd(rand.random().int(N), rand.random().int(N)));
    }
    // do something with the result... can't let LLVM be too smart...
    std.debug.print("{}\n", .{res});
}
```

@The-King-of-Toasters (Contributor)

I copied @Fri3dNstuff's test and ran it on the machines I have access to. TL;DR: Stein wins every time, except in a couple of cases:

  • riscv64 without the Zbb extension specified. This uses a software ctz, and is slightly slower.
  • The POWER10 cfarm server, for some reason?

Some devices are too constrained to run perf, or have it disabled in their kernels; for those, time was used instead.

@andrewrk I could only ever get poop running on x86_64. Every other target gave errors along these lines:

```
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/posix.zig:7195:23: 0x106c59f in perf_event_open (poop)
/private/tmp/poop/src/main.zig:201:54: 0x106719b in main (poop)
/nix/store/nmq6p6jcmqdpj2bdccb0pna3k4fzpn0f-zig-0.14.0-dev.1261+c26206112/lib/std/start.zig:614:37: 0x105ba57 in posixCallMainAndExit (poop)
???:?:?: 0x0 in ??? (???)
```

Is this a known issue?

@scheibo (Contributor) · Aug 25, 2024

https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and is incidentally also what libc++ has switched to.

@ProkopRandacek (Contributor)

I recommend looking at the code for gcd in algorithmica. It claims an implementation faster than the one this PR suggests, thanks to fewer data dependencies between instructions.

See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.
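
The linked approach works on signed integers so that the absolute value and minimum fall out branchlessly; a Zig paraphrase of the idea (a sketch assuming non-negative i64 inputs, not @ProkopRandacek's code verbatim):

```zig
const std = @import("std");

// sketch of the signed, algorithmica-style binary gcd;
// assumes non-negative inputs, at least one nonzero
fn gcdSigned(a: i64, b: i64) i64 {
    std.debug.assert(a >= 0 and b >= 0);
    if (a == 0) return b;
    if (b == 0) return a;
    var x = a;
    var y = b;
    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift: u6 = @intCast(@min(xz, yz));
    y >>= @intCast(yz);
    while (x != 0) {
        x >>= @intCast(xz); // make x odd
        const diff = y - x;
        xz = @ctz(diff); // 64 when diff == 0, but then x becomes 0 and we exit
        y = @min(x, y);
        x = if (diff < 0) -diff else diff; // |diff|, cmov-friendly on signed
    }
    return y << shift;
}
```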

@Fri3dNstuff (Contributor, Author)

@scheibo

> https://lemire.me/blog/2024/04/13/greatest-common-divisor-the-extended-euclidean-algorithm-and-speed/ suggests a hybrid approach is likely superior to the plain binary approach, and is incidentally also what libc++ has switched to

translating the C++ code to Zig and comparing it with the Stein implementation on randomly generated values of different lengths yields nearly identical running times (at least on my machine). below is the translated code - maybe I missed something?

```zig
// wrapped in a function for completeness; N is u64 as in the benchmark above
fn gcd(a: N, b: N) N {
    var x: N = a;
    var y: N = b;
    if (x < y) {
        const tmp = y;
        y = x;
        x = tmp;
    }
    if (y == 0) return x;
    x %= y;
    if (x == 0) return y;

    const i = @ctz(x);
    const j = @ctz(y);
    const shift = @min(i, j);
    x >>= @intCast(i);
    y >>= @intCast(j);

    while (true) {
        // underflow is legal
        const diff = x -% y;
        if (x > y) {
            x = y;
            y = diff;
        } else {
            y -= x;
        }

        // shift must be by a value < bit size
        if (diff != 0) y >>= @intCast(@ctz(diff));

        if (y == 0) return x << @intCast(shift);
    }
}
```

@ProkopRandacek

> See my implementation in zig here: https://github.com/ProkopRandacek/zig/blob/better-gcd/lib/std/math/gcd.zig.

I have tested it against the Stein implementation in the pull request, with random u63 values for both (cast to u64s in the pull-request implementation), and indeed your implementation runs 15 to 20 percent faster (!)

the algorithm, however, uses signed integers (expecting the caller to only pass in non-negatives) - do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers? calling an i65 version of the algorithm for full-range u64s and casting back the result seems very expensive...

@ProkopRandacek (Contributor)

> do you know if there's a simple way to adapt the algorithm to accept and return unsigned numbers

I don't think there is. And I think that, in the name of performance, it is more than reasonable to change the function signature from (a: anytype, b: anytype) to (a: i64, b: i64). You probably rarely need to calculate the gcd of numbers that don't fit in this range.

Making the function clever and automatically casting u64 to i65 is probably bad because, as you said, it makes the code a lot slower.

@The-King-of-Toasters (Contributor)

@Fri3dNstuff That's only because std::gcd can accept signed ints. Note that in the libc++ PR there is a static assert that checks whether the type of a or b is unsigned. We don't have to care about that.

Btw, I've updated my benchmark with your translation and it's better on all targets except my mipsel router for some reason.

@scheibo (Contributor) · Aug 26, 2024

> You probably rarely need to calculate gcd of numbers that don't fit in this range.

Big rationals say hi. :)

Of course, you can make the implementation use a different algorithm based on the type of the arguments, to be able to use the faster path when possible.

@ProkopRandacek (Contributor)

> > You probably rarely need to calculate gcd of numbers that don't fit in this range.
>
> Big rationals say hi. :)

On second thought, you're right.

> Of course, you can make the implementation use a different algorithm based on the type of the arguments to be able to use the faster path when possible

My worry is that u64 is common and generally regarded as a fast data type, yet using it here gives you the slow implementation. Good design would let users fall into the pit of success and use the i64 version; only when they really need a larger int type, and know that there is a cost, should they reach for the general implementation.

Here that could look like having 2 functions: gcd(i64, i64) and gcdSlow(anytype, anytype).

@Fri3dNstuff (Contributor, Author)

I have played around with @ProkopRandacek's algorithm, attempting to make it use unsigned numbers. surprisingly, this new version is 18% faster on my machine. can anyone confirm that this version is indeed faster, and not just a fluke of my computer? if it is, I'll update the pull request with a generic version of this.

here's the algorithm. I ran both it and @ProkopRandacek's through 10,000,000 pairs of random u63s, wrap-summing their results (as described in a comment of mine above). both algorithms produce the same sum: 223029890 - the signed algorithm obviously produces an i64 as output, but the result is in the positive range, so they agree; we "got lucky" with that...

```zig
const std = @import("std");

fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    var xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    y >>= @intCast(yz);

    while (true) {
        x >>= @intCast(xz);
        var diff = y -% x;
        if (diff == 0) return y << @intCast(shift);
        xz = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff;
    }
}
```
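
A quick way to gain confidence beyond the wrap-sum check is an exhaustive comparison against a Euclidean reference on small inputs. A sketch, assuming this comment's gcd and a gcdEuclid like the one sketched earlier in the thread are in scope:

```zig
test "binary gcd matches Euclid on small values" {
    var x: u64 = 0;
    while (x < 200) : (x += 1) {
        var y: u64 = 0;
        while (y < 200) : (y += 1) {
            if (x == 0 and y == 0) continue; // gcd(0, 0) is asserted against
            try std.testing.expectEqual(gcdEuclid(x, y), gcd(x, y));
        }
    }
}
```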

@Fri3dNstuff (Contributor, Author)

I have refined the algorithm a bit further, and managed to squeeze a few more percent of performance out of it. I believe these improvements are applicable generally (and aren't just LLVM liking the new version better, specifically in the case of my machine) - but without tests on other machines I can't be sure...

changed the loop a bit and shuffled some declarations around - this seems to help with register allocation. also, I think I understand why this version is faster than @ProkopRandacek's signed one: the negation of diff and the modification of y happen right beside each other; there is only a need to cmp x, y and cmov based on the result! below is the refined algorithm:

```zig
const std = @import("std");

fn gcd(a: u64, b: u64) u64 {
    std.debug.assert(a != 0 or b != 0);

    var x = a;
    var y = b;

    if (x == 0) return y;
    if (y == 0) return x;

    const xz = @ctz(x);
    const yz = @ctz(y);
    const shift = @min(xz, yz);
    x >>= @intCast(xz);
    y >>= @intCast(yz);
    var diff = y -% x;

    while (diff != 0) {
        const zeros = @ctz(diff);
        if (x > y) diff = -%diff;
        y = @min(x, y);
        x = diff >> @intCast(zeros);
        diff = y -% x;
    }
    return y << @intCast(shift);
}
```

for the generic case, I think it would be best to just use the usize version for integer lengths not larger than usize, and cast the result - x86 for example does not have a ctz version for 8 bits (only 16, 32, and 64) - further, using exotic-length integers will make LLVM sprinkle &s everywhere, for the wrapping subtractions.

does anyone know of an architecture where the algorithm works better for some integer length shorter than usize? I know nearly nothing beyond x86 :/

@The-King-of-Toasters (Contributor)

@Fri3dNstuff seems to work fine for me. I altered the benchmark to generate ints that are 1 bit less than usize before extending to usize again (i.e. u31 -> u32, u63 -> u64). This new version still comes out on top.

I also added a build option to change the bit width of the random integers. They're extended to usize, but you can hack it to behave how you want.

@Fri3dNstuff (Contributor, Author)

I changed the implementation to use the optimised version. @The-King-of-Toasters, thank you so much for the perf tests!

there is still the question of expanding the width of the integers used in the calculation, in case the user wants to use some exotic-length ints (currently the algorithm performs quite badly with them). LLVM generates a bunch of n & std.math.maxInt(@TypeOf(n))s to zero out parts of the number whenever -% is used...

we may have to make a table of the fast integer sizes for each architecture, and do some comptime switches for the implementation based on that (on my machine it seems to be fastest to always convert to u64s before the calculation, and @intCast back at the end)
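
a sketch of what that widening could look like for unsigned types up to 64 bits, with gcdFast standing in for the optimised u64 loop above (hypothetical name, not the merged code):

```zig
const std = @import("std");

// dispatch sketch: widen small unsigned types to u64, run the fast loop,
// then narrow the result back; wider or exotic types need their own path
fn gcdAny(a: anytype, b: anytype) @TypeOf(a, b) {
    const T = @TypeOf(a, b);
    comptime std.debug.assert(@bitSizeOf(T) <= 64); // assumption: fits in u64
    return @intCast(gcdFast(a, b)); // unsigned arguments coerce up to u64
}
```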

@chrboesch (Contributor) left a review comment:

I appreciate this algorithm and your implementation!

(Two review threads on lib/std/math/gcd.zig, both resolved.)
@andrewrk merged commit b2c53eb into ziglang:master on Sep 24, 2024. 10 checks passed.

@andrewrk added the labels standard library, release notes, enhancement, and optimization on Sep 24, 2024.

@Fri3dNstuff deleted the stein-gcd branch on September 24, 2024 at 09:46.
DivergentClouds pushed a commit to DivergentClouds/zig that referenced this pull request Sep 24, 2024
richerfu pushed a commit to richerfu/zig that referenced this pull request Oct 28, 2024