
Unfiltering is slower on Windows than on Linux (especially with std::simd) #567

anforowicz opened this issue Jan 15, 2025 · 12 comments

@anforowicz
Contributor

Problem description

TL;DR:

  • rustc auto-vectorization produces different binary code on Windows than on Linux
  • Turning off the unstable feature of the png crate (i.e. avoiding std::simd) helps

I have been running Chromium-based PNG benchmarks (see https://crrev.com/c/4860210/25). In the past I have been using my 2023 corpus of 1655 PNG images from the top 500 websites (see here for more details). This week I tried running these microbenchmarks against the 100 biggest PNG images from this corpus on Windows and on Linux, and discovered that the ratio of total-decoding-runtime-when-using-Rust / total-decoding-runtime-when-using-C++ differs considerably across these OS platforms: 79% on Linux (i.e. Rust is ~20% faster) vs 101% on Windows (i.e. Rust is comparable).

I investigated further by looking at the 1 image with the biggest ratio (color type = 6 / RGBA; bits = 8; width x height = 740 x 3640; all 740 rows use the Paeth filter). For this image, unfiltering turned out to take the most time, so I experimented with turning the unstable / std::simd feature of the png crate on and off:

|                             | Windows     | Linux     |
|-----------------------------|-------------|-----------|
| png features = ['unstable'] | 147% - 151% | 93% - 94% |
| png features = []           | 114% - 118% | 94%       |

The profiling notes below show how often png::decoder::unfiltering_buffer::UnfilteringBuffer::unfilter_curr_row is present in ETW profiling samples on Windows, or in perf record -e cpu-clock samples on Linux (decoding the single file). I am also including fdeflate::decompress::Decompressor::read for comparison. The percentages are "inclusive", although this doesn't really matter for unfilter_curr_row, which IIUC doesn't call any other functions:

|                       | unfilter_curr_row | Decompressor::read |
|-----------------------|-------------------|--------------------|
| Windows with unstable | 56.3%             | 5.7%               |
| Windows w/o unstable  | 43.2%             | 7.5%               |
| Linux w/o unstable    | 54.4%             | 11.0%              |
| Linux with unstable   | 54.4%             | 10.9%              |

FWIW profiling C++ on Linux shows png_read_filter_row_paeth4_sse2.cfi at 55.1% and Cr_z_inflate_fast_chunk_ at 11.3%. This suggests to me that Rust's auto-vectorization gives approximately the same speed as direct usage of SIMD intrinsics in the C/C++ code. I am not sure why Windows numbers look different.

Discussion

AFAIK this is the 2nd time we have found that using std::simd negatively affects performance - the last time we discovered this, we limited std::simd usage to x86 (see #539 (comment)).

In the last couple of months I tried measuring the impact of the unstable feature of the png crate; the overall impact was modest but consistently negative - I recorded a 1.6% regression in row 1.9 from "2024-11-27 - 2024-12-02" and a 0.13% regression in row 1.1.1 from "2024-12-12 - 2024-12-17" in a document here. Back then I didn't act on this data, because the 2 results were different, the deltas were relatively modest, and I don't fully trust my benchmarking methodology (I lean toward attributing small differences to noise).

Based on the results above, I plan to turn off the unstable feature of the png crate in Chromium (while still using crc32fast/nightly).

Open questions:

  • We may want to consider removing the std::simd-based code from the png crate
  • I assume that the Windows-vs-Linux difference is due to how auto-vectorization works. I am surprised that auto-vectorization in this OS-agnostic/CPU-focused code behaves differently on Windows vs Linux, so maybe there is an opportunity to identify and fix a bug in rustc or LLVM here. OTOH 1) I am not sure how to efficiently narrow down the problem into something that others can debug further, and 2) I am not sure if rustc / LLVM wants to (or can) give any guarantees about when auto-vectorization occurs. It is also possible that the difference is not because Rust ends up slower on Windows, but because C++ ends up faster on Windows.
  • I am not sure if we want to revise our high-level approach to SIMD
    • It seems that relying on auto-vectorization produces results comparable to direct usage of SIMD intrinsics (at least on Linux, when the unstable feature of the png crate is off). From this perspective we can probably say that auto-vectorization works well.
    • OTOH, in theory relying on auto-vectorization should mean being able to work with a single, simple, platform-agnostic implementation. IMO the png unfiltering code is quite complex, and different code paths are used on different platforms. Ideally this would be accompanied by automated tests that can catch functional and/or performance regressions on all special-cased platforms, but I am not sure if GitHub supports this.

/cc @calebzulawski from https://rust-lang.github.io/rfcs/2977-stdsimd.html (see also the related rust-lang/rfcs#2948)
/cc @veluca93 who has kindly helped me narrow down microbenchmarks-vs-field-trial differences I've observed into the Windows-vs-Linux investigation above

@workingjubilee

weird. there shouldn't be a huge difference.

@workingjubilee

the big questions to ask are

  • does any of this interact with repr(C) types? those have different platform-dependent layouts
  • does any of this go through memcpy, float functions, or other compiler-rt-ish functions? those can wind up being different code due to trying to use platform functions instead of compiler-builtins

@workingjubilee

Also, it probably shouldn't matter, but the Rust Windows target enables these features:

    base.features = "+cx16,+sse3,+sahf".into();

Does Linux with the same features bench differently?

@anforowicz
Contributor Author

I have rebuilt the benchmark on Linux, after adding rustflags = ["-Ctarget-feature=+sse3"] to Chromium's //build/config/compiler/BUILD.gn. Full command-line can be found at the end (*). This didn't change the results on Linux (i.e. it didn't result in a performance degradation similar to what we've seen above on Windows) - the Rust/C++ ratio was 95% (roughly similar to what I've seen earlier, without -Ctarget-feature=+sse3).

(*) rustc command-line:

build start: Ready 5618 Pending 35422[0/19537] 5.17s "python3" "../../build/rust/rustc_wrapper.py" --rustc=../../third_party/rust-toolchain/bin/rustc --depfile=obj/third_party/rust/png/v0_17/lib/libpng_lib.rlib.d --rsp=obj/third_party/rust/png/v0_17/lib/libpng_lib.rlib.rsp -- -Clinker="../../third_party/llvm-build/Release+Asserts/bin/clang++" --crate-name png ../../third_party/rust/chromium_crates_io/vendor/png-0.17.16/src/lib.rs --crate-type rlib --cap-lints=allow --cfg=feature=\"unstable\" -Cmetadata=png-0.17 --edition=2018 -Ctarget-feature=+sse3 -Cforce-unwind-tables=no -Crelocation-model=pic -Zsplit-lto-unit -Clinker-plugin-lto=yes -Cdefault-linker-libraries -Zdep-info-omit-d-target -Zmacro-backtrace -Zremap-cwd-prefix=. -Zexternal-clangrt --color=always --target=x86_64-unknown-linux-gnu -Ccodegen-units=1 -Cpanic=abort -Zpanic_abort_tests --cfg cr_rustc_revision=\"ad211ced81509462cdfe4c29ed10f97279a0acae-1-llvmorg-20-init-17108-g29ed6000\" -Copt-level=3 -Zdwarf-version=4 -g -Coverflow-checks=on -Clinker-plugin-lto=yes -Zsplit-lto-unit -Clinker-plugin-lto=yes -Zdefault-visibility=hidden -Dunsafe_op_in_unsafe_fn -Dwarnings --sysroot=local_rustc_sysroot --emit=dep-info=obj/third_party/rust/png/v0_17/lib/libpng_lib.rlib.d,link -o obj/third_party/rust/png/v0_17/lib/libpng_lib.rlib LDFLAGS RUSTENV OUT_DIR=../../../../../../out/bench/gen/third_party/rust/png/v0_17/lib CARGO_PKG_AUTHORS=The\ image-rs\ Developers CARGO_PKG_VERSION=0.17.16 CARGO_PKG_NAME=png CARGO_PKG_DESCRIPTION=PNG\ decoding\ and\ encoding\ library\ in\ pure\ Rust CARGO_MANIFEST_DIR=../../third_party/rust/png/v0_17/lib/crate

@anforowicz
Contributor Author

FWIW, here is rustc --version output from my build system:

$ third_party/rust-toolchain/bin/rustc --version
rustc 1.86.0-dev (ad211ced81509462cdfe4c29ed10f97279a0acae-1-llvmorg-20-init-17108-g29ed6000 chromium)

@anforowicz
Contributor Author

the big questions to ask are

  • does any of this interact with repr(C) types? those have different platform-dependent layouts

No - there are no repr(C) types in png's src/filter.rs. FWIW, std::simd's Simd type is marked as #[repr(simd)] in the standard library.

  • does any of this go through memcpy, float functions, or other compiler-rt-ish functions? those can wind up being different code due to trying to use platform functions instead of compiler-builtins

I don't think so. Paeth unfiltering operates on 8-bit or 16-bit integers. There is no memcpy.

@calebzulawski

I diffed the asm for unfilter_paeth3, unfilter_paeth_u8, and unfilter_paeth6 for x86_64-unknown-linux-gnu, x86_64-pc-windows-gnu, and x86_64-pc-windows-msvc and there are zero differences other than a couple extra calling-convention related moves on windows at the beginning and end of the functions.

Considering both linux implementations and the windows std::simd implementation of the unfiltering functions all show the same performance vs C++, I'm suspicious of the windows autovec benchmark--why would that be faster than all linux implementations? Can you replicate those benchmark results?

@anforowicz
Contributor Author

I diffed the asm for unfilter_paeth3, unfilter_paeth_u8, and unfilter_paeth6 for x86_64-unknown-linux-gnu, x86_64-pc-windows-gnu, and x86_64-pc-windows-msvc and there are zero differences other than a couple extra calling-convention related moves on windows at the beginning and end of the functions.

Let me quickly clarify that the image tested here always uses FilterType::Paeth and BytesPerPixel::Four. This means that unfilter_paeth3 and unfilter_paeth6 won’t be used.

I also note that it is possible that my problem is somehow caused by a quirk of Chromium’s build system - e.g. using a particular version of rustc nightly, or passing specific command-line options to rustc. To help verify that, let me try to put together a smaller, self-contained repro.

I'm suspicious of the windows autovec benchmark--why would that be faster than all linux implementations?

That’s fair - the results are indeed weird.

FWIW (to rule out one explanation) I've verified that my Windows and Linux machines are using the same Chromium toolchain: rustc 1.86.0-dev (ad211ced81509462cdfe4c29ed10f97279a0acae-1-llvmorg-20-init-17108-g29ed6000 chromium).

Can you replicate those benchmark results?

Today I replicated the following results:

  • Turning the unstable feature on/off and rebuilding and running the end-to-end Chromium benchmark:
    • Doesn’t impact results on Linux
    • Does significantly impact results on Windows

OTOH, my results from a smaller, self-contained, Rust-only benchmark show an effect both on Linux and on Windows. I am not sure what may be causing the difference. One reasonable hypothesis is that the bigger repro is somehow more sensitive to something OS-specific (e.g. maybe "bigger" means more opportunities for different LTO results? just a wild guess...).

FWIW, I got the following numbers today, when working on replicating my results:

  • Linux, unstable feature enabled: 17.679µs, 17.68µs, 17.68µs, 17.68µs, 17.68µs (using the Chromium benchmark: 96.1%, 95.3%, 93.1%, 93.1%, 95.5%)
  • Linux, unstable feature disabled: 10.754µs, 10.758µs, 10.774µs, 10.781µs, avg = 10.76µs (using the Chromium benchmark: 94.7%, 97.7%, 94.3%, 95.5%, 94.2%)
  • Windows, unstable feature enabled: 17.034µs, 17.038µs, 17.027µs, 17.017µs, 17.04µs (using the Chromium benchmark: 147.6%, 148.2%, 149.0%, 148.7%, 147.4%)
  • Windows, unstable feature disabled: 10.656µs, 10.652µs, 10.652µs, 10.651µs, 10.653µs (using the Chromium benchmark: 117.37%, 117.35%, 116.9%, 118.2%, 112.7%)

So to summarize:

  • The main finding reported in this issue seems to hold: in its current shape, std::simd usage in the png crate results in worse performance than no std::simd (for Paeth/bpp=4, with the rustc nightly used in Chromium, both on Linux and Windows). As a follow-up I should probably put together a PR to stop using std::simd in png in this scenario. I should probably also investigate the impact of std::simd on bpp=3 and bpp=6 (and experiment with removing std::simd but keeping load3, etc. for those other bpp settings).
  • The difference in Windows-vs-Linux sensitivity to std::simd is weird and unexplained (I am just toggling whether the unstable feature is turned on here).
  • In general, the difference I am observing between field-trial results and microbenchmarks seems unexplained (although maybe avoiding std::simd for Paeth/bpp=4 will help). See a separate doc with additional notes about comparing microbenchmarks and field trials.

For the standalone repro, I used:

`main.rs`
#![feature(portable_simd)]
#[cfg(all(feature = "unstable", target_arch = "x86_64"))]
mod simd {
    use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

    #[derive(Default)]
    struct PaethState<T, const N: usize>
    where
        T: SimdElement,
        LaneCount<N>: SupportedLaneCount,
    {
        /// Previous pixel in the previous row.
        c: Simd<T, N>,

        /// Previous pixel in the current row.
        a: Simd<T, N>,
    }

    fn paeth_predictor_u8<const N: usize>(
        a: Simd<u8, N>,
        b: Simd<u8, N>,
        c: Simd<u8, N>,
    ) -> Simd<u8, N>
    where
        LaneCount<N>: SupportedLaneCount,
    {
        let mut out = [0; N];
        for i in 0..N {
            out[i] = super::filter_paeth_stbi(a[i].into(), b[i].into(), c[i].into());
        }
        out.into()
    }

    fn paeth_step_u8<const N: usize>(
        state: &mut PaethState<u8, N>,
        b: Simd<u8, N>,
        x: &mut Simd<u8, N>,
    ) where
        LaneCount<N>: SupportedLaneCount,
    {
        // Calculating the new value of the current pixel.
        *x += paeth_predictor_u8(state.a, b, state.c);

        // Preparing for the next step.
        state.c = b;
        state.a = *x;
    }

    pub fn unfilter_paeth_u8<const N: usize>(prev_row: &[u8], curr_row: &mut [u8])
    where
        LaneCount<N>: SupportedLaneCount,
    {
        debug_assert_eq!(prev_row.len(), curr_row.len());
        debug_assert_eq!(prev_row.len() % N, 0);
        assert!(matches!(N, 4 | 8));

        let mut state = PaethState::<u8, N>::default();
        for (prev_row, curr_row) in prev_row.chunks_exact(N).zip(curr_row.chunks_exact_mut(N)) {
            let b = Simd::from_slice(prev_row);
            let mut x = Simd::from_slice(curr_row);

            paeth_step_u8(&mut state, b, &mut x);

            curr_row[..N].copy_from_slice(&x.to_array()[..N]);
        }
    }
}

fn filter_paeth_stbi(a: u8, b: u8, c: u8) -> u8 {
    // Decoding optimizes better with this algorithm than with `filter_paeth`
    //
    // This formulation looks very different from the reference in the PNG spec, but is
    // actually equivalent and has favorable data dependencies and admits straightforward
    // generation of branch-free code, which helps performance significantly.
    //
    // Adapted from public domain PNG implementation:
    // https://github.com/nothings/stb/blob/5c205738c191bcb0abc65c4febfa9bd25ff35234/stb_image.h#L4657-L4668
    let thresh = i16::from(c) * 3 - (i16::from(a) + i16::from(b));
    let lo = a.min(b);
    let hi = a.max(b);
    let t0 = if hi as i16 <= thresh { lo } else { c };
    let t1 = if thresh <= lo as i16 { hi } else { t0 };
    return t1;
}

fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // On ARM this algorithm performs much better than the one above adapted from stb,
    // and this is the better-studied algorithm we've always used here,
    // so we default to it on all non-x86 platforms.
    let pa = (i16::from(b) - i16::from(c)).abs();
    let pb = (i16::from(a) - i16::from(c)).abs();
    let pc = ((i16::from(a) - i16::from(c)) + (i16::from(b) - i16::from(c))).abs();

    let mut out = a;
    let mut min = pa;

    if pb < min {
        min = pb;
        out = b;
    }
    if pc < min {
        out = c;
    }

    out
}

#[allow(unreachable_code)]
pub(crate) fn unfilter(
    previous: &[u8],
    current: &mut [u8],
) {
    #[cfg(all(feature = "unstable", target_arch = "x86_64"))]
    {
        simd::unfilter_paeth_u8::<4>(previous, current);
        return;
    }

    // Select the fastest Paeth filter implementation based on the target architecture.
    let filter_paeth_decode = if cfg!(target_arch = "x86_64") {
        filter_paeth_stbi
    } else {
        filter_paeth
    };

    let mut a_bpp = [0; 4];
    let mut c_bpp = [0; 4];
    for (chunk, b_bpp) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4))
    {
        let new_chunk = [
            chunk[0]
                .wrapping_add(filter_paeth_decode(a_bpp[0], b_bpp[0], c_bpp[0])),
            chunk[1]
                .wrapping_add(filter_paeth_decode(a_bpp[1], b_bpp[1], c_bpp[1])),
            chunk[2]
                .wrapping_add(filter_paeth_decode(a_bpp[2], b_bpp[2], c_bpp[2])),
            chunk[3]
                .wrapping_add(filter_paeth_decode(a_bpp[3], b_bpp[3], c_bpp[3])),
        ];
        *TryInto::<&mut [u8; 4]>::try_into(chunk).unwrap() = new_chunk;
        a_bpp = new_chunk;
        c_bpp = b_bpp.try_into().unwrap();
    }
}

fn main() {
    const ROW_SIZE: usize = 8192;  // Need something that fits in L1 cache.
    const REPETITIONS: usize = 100_000;

    let (prev_row, curr_row) = {
        fn get_random_bytes(n: usize) -> Vec<u8> {
            use std::collections::hash_map::RandomState;
            use std::hash::{BuildHasher, Hasher};
            (0..n)
                .map(|_| RandomState::new().build_hasher().finish() as u8)
                .collect::<Vec<u8>>()
        }
        let prev_row = get_random_bytes(ROW_SIZE);
        let curr_row = get_random_bytes(ROW_SIZE);
        (prev_row, curr_row)
    };

    let mut results = Vec::with_capacity(REPETITIONS);
    for _ in 0..REPETITIONS {
        let mut curr_row = curr_row.clone();
        let start = std::time::Instant::now();
        let _ = std::hint::black_box({
            unfilter(prev_row.as_slice(), curr_row.as_mut_slice())
        });
        results.push(start.elapsed());
    }

    // Discard 20% outliers from the top and bottom.
    results.sort();
    let results = {
        let low = REPETITIONS * 20 / 100;
        let high = REPETITIONS * 80 / 100;
        results[low..high].iter().collect::<Vec<_>>()
    };

    let len = results.len();
    let sum = results.into_iter().sum::<std::time::Duration>();
    let avg = sum / len.try_into().unwrap();
    dbg!(avg);
}
Corresponding Chromium `BUILD.gn` (just for the record)
import("//build/rust/cargo_crate.gni")

cargo_crate("png_unfilter_adhoc_test") {
  crate_type = "bin"
  crate_root = "main.rs"
  sources = [ "main.rs" ]

  # Mostly mimicking `third_party/rust/png/v0_17/BUILD.gn` below:
  #features = [ "unstable" ]
  features = []
  inputs = []
  build_native_rust_unit_tests = false
  library_configs -= [ "//build/config/compiler:chromium_code" ]
  library_configs += [ "//build/config/compiler:no_chromium_code" ]
  executable_configs -= [ "//build/config/compiler:chromium_code" ]
  executable_configs += [ "//build/config/compiler:no_chromium_code" ]           
  proc_macro_configs -= [ "//build/config/compiler:chromium_code" ]              
  proc_macro_configs += [ "//build/config/compiler:no_chromium_code" ]           
  library_configs -= [ "//build/config/compiler:disallow_unstable_features" ]
  executable_configs -= [ "//build/config/compiler:disallow_unstable_features" ]
  proc_macro_configs -= [ "//build/config/compiler:disallow_unstable_features" ] 
}                                                                                
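As an aside (not part of the repro): the code comment above claims the stbi formulation is equivalent to the reference Paeth predictor from the PNG spec. A self-contained sketch that spot-checks this claim - `filter_paeth_reference` is my own naming for the spec's predictor, not anything in the png crate:

```rust
// The Paeth predictor as defined in the PNG specification, plus the
// branch-free stbi formulation copied from the repro above; `main` sweeps
// the input space and asserts that the two formulations agree.
fn filter_paeth_reference(a: u8, b: u8, c: u8) -> u8 {
    let p = i16::from(a) + i16::from(b) - i16::from(c);
    let pa = (p - i16::from(a)).abs();
    let pb = (p - i16::from(b)).abs();
    let pc = (p - i16::from(c)).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

fn filter_paeth_stbi(a: u8, b: u8, c: u8) -> u8 {
    let thresh = i16::from(c) * 3 - (i16::from(a) + i16::from(b));
    let lo = a.min(b);
    let hi = a.max(b);
    let t0 = if hi as i16 <= thresh { lo } else { c };
    if thresh <= lo as i16 { hi } else { t0 }
}

fn main() {
    // Strided sweep over (a, b, c); 255 is divisible by 3, so the edge
    // value 255 is included. (The full 256^3 sweep also passes, just slower.)
    for a in (0..=255u8).step_by(3) {
        for b in (0..=255u8).step_by(3) {
            for c in (0..=255u8).step_by(3) {
                assert_eq!(filter_paeth_reference(a, b, c), filter_paeth_stbi(a, b, c));
            }
        }
    }
    println!("formulations agree");
}
```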

@kornelski
Contributor

Could these builds have a different PGO implementation?

e.g. BOLT is Unix-only.

@calebzulawski

calebzulawski commented Jan 17, 2025

Not to be flippant, but that example only uses a single SIMD arithmetic operation:

*x += paeth_predictor_u8(state.a, b, state.c);

The actual work is still scalar:

let mut out = [0; N];
for i in 0..N {
    out[i] = super::filter_paeth_stbi(a[i].into(), b[i].into(), c[i].into());
}

The rest of the code around it simply uses Simd as a 4-element container type. I think this is a misuse of SIMD since the value of SIMD is the parallel arithmetic ops (of which you only have one) and the price you pay is this otherwise unnecessary memory layout restriction.

I did play around with parallelizing paeth_predictor_u8:

fn paeth_predictor_u8<const N: usize>(
    a: Simd<u8, N>,
    b: Simd<u8, N>,
    c: Simd<u8, N>,
) -> Simd<u8, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    let a = a.cast::<i16>();
    let b = b.cast::<i16>();
    let c = c.cast::<i16>();
    let thresh = c * Simd::splat(3) - (a + b);
    let lo = a.min(b);
    let hi = a.max(b);
    let t0 = hi.simd_le(thresh).select(lo, c);
    let t1 = thresh.simd_le(lo).select(hi, t0);
    t1.cast()
}

This is a pretty bad candidate for parallelization because the casts are such a significant portion of the function. In the scalar variation of the function, the compiler is able to merge some of the moves and casts into unpack instructions. It's probably possible to mimic this with explicit SIMD, but it's not going to gain you anything.

@calebzulawski

Also, when examining the asm the SIMD version is still only 1 instruction longer, but some of the instructions move around. Since it's so similar I'm surprised it's noticeably slower. If you compare the two with perf report at the instruction level I bet you'll see a stall on some high throughput instruction that is no longer adjacent to similar instructions.

@fintelia
Contributor

fintelia commented Jan 17, 2025

Ok, I think what happened is that we added the rgba8 and rgba16 SIMD versions in #492 and then optimized the scalar versions of all paeth filter sizes in #539. During that second PR, we noticed that certain sizes no longer benefited from SIMD but they didn't seem to regress performance either, so there was no hurry to get rid of them.

My understanding is that the most important cases for SIMD are the 3-byte and 6-byte filters, because we've been unable to find a way to get the purely scalar version to use overlapping 4-byte / 8-byte loads and stores, which leads to a significant performance difference.
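To illustrate what "overlapping loads and stores" means here, a hedged sketch (the function name and shape are my own, not the png crate's actual implementation): with bpp = 3, each pixel can be moved with a single 4-byte load and store, and the extra byte is simply overwritten by the next iteration.

```rust
// Illustrative only: move a 3-byte-per-pixel row using overlapping 4-byte
// loads/stores. Byte `off + 3` belongs to the next pixel and is rewritten
// on the next iteration; the final pixel falls back to an exact 3-byte copy.
fn copy_row_bpp3_overlapping(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert_eq!(src.len() % 3, 0);
    let mut off = 0;
    while off + 4 <= src.len() {
        // One 4-byte load + one 4-byte store per pixel.
        let four: [u8; 4] = src[off..off + 4].try_into().unwrap();
        dst[off..off + 4].copy_from_slice(&four);
        off += 3;
    }
    // Tail: whatever remains past the last full 4-byte window.
    dst[off..].copy_from_slice(&src[off..]);
}

fn main() {
    let src: Vec<u8> = (0..15).collect(); // five 3-byte pixels
    let mut dst = vec![0u8; 15];
    copy_row_bpp3_overlapping(&src, &mut dst);
    assert_eq!(src, dst);
    println!("row copied");
}
```

A purely scalar per-byte loop processes 3 bytes per pixel individually; the point above is that the optimizer has not been coaxed into producing this wider access pattern on its own.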

This is all made trickier by the different CPU models everyone has, which may not have matching performance characteristics. We know that ARM and x86 are very different, but I somewhat suspect that there may be meaningful differences between Intel and AMD (and perhaps even between processor generations) as well.
