Unfiltering is slower on Windows than on Linux (especially with `std::simd`) #567
Comments
Weird. There shouldn't be a huge difference.

---

The big questions to ask are […]

---

Also, it probably shouldn't matter, but the Rust Windows target enables these features: `base.features = "+cx16,+sse3,+sahf".into();`. Does Linux with the same features bench differently?
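(For reference, the same features can be enabled on a Linux build via `RUSTFLAGS="-C target-feature=+cx16,+sse3,+sahf"` when compiling the benchmark, which would let someone test this hypothesis directly.)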

---

I have rebuilt the benchmark on Linux, after adding (*) […]

---

FWIW, here is […]

---

No - there are no […]
I don't think so. Paeth unfiltering operates on 8-bit or 16-bit integers. There is no […]

---

I diffed the asm for […]

Considering that both Linux implementations and the Windows `std::simd` implementation of the unfiltering functions all show the same performance vs C++, I'm suspicious of the Windows autovec benchmark: why would that be faster than all Linux implementations? Can you replicate those benchmark results?

---

Let me quickly clarify that the image tested here always uses the Paeth filter. I also note that it is possible that my problem is somehow caused by a quirk of Chromium's build system, e.g. using a particular version of […]

---

That's fair - the results are indeed weird. FWIW (to rule out one explanation), I've verified that my Windows and Linux machines are using the same Chromium toolchain: […]

Today I replicated the following results: […]

OTOH, my results from a smaller, self-contained, Rust-only benchmark show the effect on both Linux and Windows. I am not sure what may be causing the difference. One reasonable hypothesis seems to be that the bigger repro is somehow more sensitive to something OS-specific (e.g. maybe "bigger" means more opportunities for different LTO results? just a wild guess...). FWIW, I got the following numbers today, when working on replicating my results: […]

So to summarize: […]
For the standalone repro, I used:

`main.rs`:

```rust
#![feature(portable_simd)]

#[cfg(all(feature = "unstable", target_arch = "x86_64"))]
mod simd {
    use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

    #[derive(Default)]
    struct PaethState<T, const N: usize>
    where
        T: SimdElement,
        LaneCount<N>: SupportedLaneCount,
    {
        /// Previous pixel in the previous row.
        c: Simd<T, N>,
        /// Previous pixel in the current row.
        a: Simd<T, N>,
    }
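
    // Note: the per-lane work below is a loop of scalar calls into
    // `filter_paeth_stbi`; whether it actually vectorizes is left entirely
    // to the compiler (a point raised later in this thread).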
    fn paeth_predictor_u8<const N: usize>(
        a: Simd<u8, N>,
        b: Simd<u8, N>,
        c: Simd<u8, N>,
    ) -> Simd<u8, N>
    where
        LaneCount<N>: SupportedLaneCount,
    {
        let mut out = [0; N];
        for i in 0..N {
            out[i] = super::filter_paeth_stbi(a[i].into(), b[i].into(), c[i].into());
        }
        out.into()
    }

    fn paeth_step_u8<const N: usize>(
        state: &mut PaethState<u8, N>,
        b: Simd<u8, N>,
        x: &mut Simd<u8, N>,
    ) where
        LaneCount<N>: SupportedLaneCount,
    {
        // Calculating the new value of the current pixel.
        *x += paeth_predictor_u8(state.a, b, state.c);

        // Preparing for the next step.
        state.c = b;
        state.a = *x;
    }

    pub fn unfilter_paeth_u8<const N: usize>(prev_row: &[u8], curr_row: &mut [u8])
    where
        LaneCount<N>: SupportedLaneCount,
    {
        debug_assert_eq!(prev_row.len(), curr_row.len());
        debug_assert_eq!(prev_row.len() % N, 0);
        assert!(matches!(N, 4 | 8));

        let mut state = PaethState::<u8, N>::default();
        for (prev_row, curr_row) in prev_row.chunks_exact(N).zip(curr_row.chunks_exact_mut(N)) {
            let b = Simd::from_slice(prev_row);
            let mut x = Simd::from_slice(curr_row);
            paeth_step_u8(&mut state, b, &mut x);
            curr_row[..N].copy_from_slice(&x.to_array()[..N]);
        }
    }
}

fn filter_paeth_stbi(a: u8, b: u8, c: u8) -> u8 {
    // Decoding optimizes better with this algorithm than with `filter_paeth`.
    //
    // This formulation looks very different from the reference in the PNG spec, but is
    // actually equivalent and has favorable data dependencies and admits straightforward
    // generation of branch-free code, which helps performance significantly.
    //
    // Adapted from public domain PNG implementation:
    // https://github.com/nothings/stb/blob/5c205738c191bcb0abc65c4febfa9bd25ff35234/stb_image.h#L4657-L4668
    let thresh = i16::from(c) * 3 - (i16::from(a) + i16::from(b));
    let lo = a.min(b);
    let hi = a.max(b);
    let t0 = if i16::from(hi) <= thresh { lo } else { c };
    let t1 = if thresh <= i16::from(lo) { hi } else { t0 };
    t1
}
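
// (Added here for illustration; not part of the original repro.) A brute-force
// check that the stb formulation above matches the PNG spec's reference Paeth
// predictor for every possible (a, b, c) triple; run with `cargo test --release`.
#[test]
fn stbi_matches_reference_paeth() {
    fn reference_paeth(a: u8, b: u8, c: u8) -> u8 {
        let p = i16::from(a) + i16::from(b) - i16::from(c);
        let pa = (p - i16::from(a)).abs();
        let pb = (p - i16::from(b)).abs();
        let pc = (p - i16::from(c)).abs();
        if pa <= pb && pa <= pc {
            a
        } else if pb <= pc {
            b
        } else {
            c
        }
    }
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            for c in 0..=255u8 {
                assert_eq!(filter_paeth_stbi(a, b, c), reference_paeth(a, b, c));
            }
        }
    }
}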

fn filter_paeth(a: u8, b: u8, c: u8) -> u8 {
    // On ARM this algorithm performs much better than the one above adapted from stb,
    // and this is the better-studied algorithm we've always used here,
    // so we default to it on all non-x86 platforms.
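    //
    // With the spec's initial estimate p = a + b - c, the distances computed
    // below match the reference formulation, since p - a = b - c,
    // p - b = a - c, and p - c = (a - c) + (b - c).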
    let pa = (i16::from(b) - i16::from(c)).abs();
    let pb = (i16::from(a) - i16::from(c)).abs();
    let pc = ((i16::from(a) - i16::from(c)) + (i16::from(b) - i16::from(c))).abs();

    let mut out = a;
    let mut min = pa;

    if pb < min {
        min = pb;
        out = b;
    }
    if pc < min {
        out = c;
    }

    out
}

#[allow(unreachable_code)]
pub(crate) fn unfilter(previous: &[u8], current: &mut [u8]) {
    #[cfg(all(feature = "unstable", target_arch = "x86_64"))]
    {
        simd::unfilter_paeth_u8::<4>(previous, current);
        return;
    }

    // Select the fastest Paeth filter implementation based on the target architecture.
    let filter_paeth_decode = if cfg!(target_arch = "x86_64") {
        filter_paeth_stbi
    } else {
        filter_paeth
    };

    let mut a_bpp = [0; 4];
    let mut c_bpp = [0; 4];
    for (chunk, b_bpp) in current.chunks_exact_mut(4).zip(previous.chunks_exact(4)) {
        let new_chunk = [
            chunk[0].wrapping_add(filter_paeth_decode(a_bpp[0], b_bpp[0], c_bpp[0])),
            chunk[1].wrapping_add(filter_paeth_decode(a_bpp[1], b_bpp[1], c_bpp[1])),
            chunk[2].wrapping_add(filter_paeth_decode(a_bpp[2], b_bpp[2], c_bpp[2])),
            chunk[3].wrapping_add(filter_paeth_decode(a_bpp[3], b_bpp[3], c_bpp[3])),
        ];
        *TryInto::<&mut [u8; 4]>::try_into(chunk).unwrap() = new_chunk;
        a_bpp = new_chunk;
        c_bpp = b_bpp.try_into().unwrap();
    }
}

fn main() {
    const ROW_SIZE: usize = 8192; // Need something that fits in L1 cache.
    const REPETITIONS: usize = 100_000;

    let (prev_row, curr_row) = {
        fn get_random_bytes(n: usize) -> Vec<u8> {
            use std::collections::hash_map::RandomState;
            use std::hash::{BuildHasher, Hasher};
            (0..n)
                .map(|_| RandomState::new().build_hasher().finish() as u8)
                .collect::<Vec<u8>>()
        }
        let prev_row = get_random_bytes(ROW_SIZE);
        let curr_row = get_random_bytes(ROW_SIZE);
        (prev_row, curr_row)
    };

    let mut results = Vec::with_capacity(REPETITIONS);
    for _ in 0..REPETITIONS {
        let mut curr_row = curr_row.clone();
        let start = std::time::Instant::now();
        let _ = std::hint::black_box(unfilter(prev_row.as_slice(), curr_row.as_mut_slice()));
        results.push(start.elapsed());
    }

    // Discard 20% outliers from the top and bottom.
    results.sort();
    let results = {
        let low = REPETITIONS * 20 / 100;
        let high = REPETITIONS * 80 / 100;
        results[low..high].iter().collect::<Vec<_>>()
    };

    let len = results.len();
    let sum = results.into_iter().sum::<std::time::Duration>();
    let avg = sum / len.try_into().unwrap();
    dbg!(avg);
}
```

The corresponding Chromium `BUILD.gn` (just for the record): […]
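(For the `cfg(feature = "unstable")` gates in the listing to have any effect, the repro's `Cargo.toml` presumably declares an `unstable` feature, enabled at build time with e.g. `cargo +nightly run --release --features unstable`; this mirrors the `unstable` feature of the `png` crate itself.)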

---
Could these builds have a different PGO implementation? E.g., BOLT is Unix-only.

---

Not to be flippant, but that example only uses a single SIMD arithmetic operation:

```rust
*x += paeth_predictor_u8(state.a, b, state.c);
```

The actual work is still scalar:

```rust
let mut out = [0; N];
for i in 0..N {
    out[i] = super::filter_paeth_stbi(a[i].into(), b[i].into(), c[i].into());
}
```

The rest of the code around it simply uses […]

I did play around with parallelizing it:

```rust
use std::simd::cmp::{SimdOrd, SimdPartialOrd}; // for simd_min/simd_max/simd_le

fn paeth_predictor_u8<const N: usize>(
    a: Simd<u8, N>,
    b: Simd<u8, N>,
    c: Simd<u8, N>,
) -> Simd<u8, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    let a = a.cast::<i16>();
    let b = b.cast::<i16>();
    let c = c.cast::<i16>();
    let thresh = c * Simd::splat(3) - (a + b);
    let lo = a.simd_min(b);
    let hi = a.simd_max(b);
    let t0 = hi.simd_le(thresh).select(lo, c);
    let t1 = thresh.simd_le(lo).select(hi, t0);
    t1.cast()
}
```

This is a pretty bad candidate for parallelization because the casts are such a significant portion of the function. In the scalar variation of the function, the compiler is able to merge some of the moves and casts into unpack instructions. It's probably possible to mimic this with explicit SIMD, but it's not going to gain you anything.

---

Also, when examining the asm, the SIMD version is still only one instruction longer, but some of the instructions move around. Since it's so similar, I'm surprised it's noticeably slower. If you compare the two with `perf report` at the instruction level, I bet you'll see a stall on some high-throughput instruction that is no longer adjacent to similar instructions.

---

Ok, I think what happened is that we added the rgba8 and rgba16 SIMD versions in #492 and then optimized the scalar versions of all Paeth filter sizes in #539. During that second PR, we noticed that certain sizes no longer benefited from SIMD, but they didn't seem to regress performance either, so there was no hurry to get rid of them.

My understanding is that the most important cases for SIMD are the 3-byte and 6-byte filters, because we've been unable to find a way to get the purely scalar version to use overlapping 4-byte / 8-byte loads and stores, which leads to a significant performance difference (see the sketch below).

This is all made trickier by the different CPU models everyone has, which may not have matching performance. We know that ARM and x86 have very different performance characteristics, but I somewhat suspect that there may be meaningful differences between Intel and AMD (and perhaps even across processor generations) as well.
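For concreteness, here is a minimal sketch of that overlapping-window idea for bpp = 3 (an illustration only, not the crate's actual code; it reuses `filter_paeth_stbi` from the repro above, and whether the compiler turns the 4-byte array copies into single loads and stores is exactly the open question):

```rust
// Sketch: unfilter a Paeth-filtered row with bpp = 3 using 4-byte windows
// that overlap by one byte, so each step could use a single 4-byte
// load/store instead of three 1-byte ones.
fn unfilter_paeth_3bpp_overlapping(prev: &[u8], curr: &mut [u8]) {
    debug_assert_eq!(prev.len(), curr.len());
    debug_assert_eq!(prev.len() % 3, 0);
    let mut a = [0u8; 3]; // previous pixel in the current row
    let mut c = [0u8; 3]; // previous pixel in the previous row
    let mut i = 0;
    // Loop while a full 4-byte window still fits; the 4th byte is a
    // don't-care that is written back unchanged and recomputed next step.
    while i + 4 <= curr.len() {
        let b: [u8; 4] = prev[i..i + 4].try_into().unwrap();
        let mut x: [u8; 4] = curr[i..i + 4].try_into().unwrap();
        for lane in 0..3 {
            x[lane] = x[lane].wrapping_add(filter_paeth_stbi(a[lane], b[lane], c[lane]));
        }
        curr[i..i + 4].copy_from_slice(&x); // overlapping 4-byte store
        a.copy_from_slice(&x[..3]);
        c.copy_from_slice(&b[..3]);
        i += 3;
    }
    // The final pixel has no spare byte after it, so handle it one byte at a time.
    if i < curr.len() {
        for lane in 0..3 {
            curr[i + lane] = curr[i + lane]
                .wrapping_add(filter_paeth_stbi(a[lane], prev[i + lane], c[lane]));
        }
    }
}
```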

---

Problem description

TL;DR:

- `rustc` auto-vectorization produces different binary code on Windows than on Linux
- turning off the `unstable` feature of the `png` crate (i.e. avoiding `std::simd`) helps

I have been running Chromium-based PNG benchmarks (see https://crrev.com/c/4860210/25). In the past I have been using my 2023 corpus of 1655 PNG images from the top 500 websites (see here for more details). This week I tried running these microbenchmarks against the 100 biggest PNG images from this corpus on Windows and on Linux, and discovered that the ratio total-decoding-runtime-when-using-Rust / total-decoding-runtime-when-using-C++ differs considerably across these OS platforms: 79% on Linux (i.e. Rust is ~20% faster) vs 101% on Windows (i.e. Rust is comparable).
I investigated further by looking at the one image with the biggest ratio (color type = 6 / RGBA; bits = 8; width x height = 740 x 3640; all 740 rows use Paeth filter). For this image, it turned out that unfiltering takes the most time, so I experimented with turning the `unstable` / `std::simd` feature of the `png` crate on and off:

- `png` features = `['unstable']`: […]
- `png` features = `[]`: […]
The profiling notes below show how often `png::decoder::unfiltering_buffer::UnfilteringBuffer::unfilter_curr_row` is present in profiling samples (ETW on Windows, `perf record -e cpu-clock` on Linux) when decoding the single file. I am also including `fdeflate::decompress::Decompressor::read` for comparison. The percentages are "inclusive", although this doesn't really matter for `unfilter_curr_row`, which IIUC doesn't call any other functions:

[table of sample percentages for the four OS x `unstable`-feature configurations; the numbers did not survive extraction]
FWIW, profiling C++ on Linux shows `png_read_filter_row_paeth4_sse2.cfi` at 55.1% and `Cr_z_inflate_fast_chunk_` at 11.3%. This suggests to me that Rust's auto-vectorization gives approximately the same speed as direct usage of SIMD intrinsics in the C/C++ code. I am not sure why the Windows numbers look different.

Discussion
AFAIK this is the second time we have found that using `std::simd` negatively affects performance - the last time we discovered this, we limited `std::simd` usage to x86 (see #539 (comment)).

In the last couple of months I tried measuring the impact of the `unstable` feature of the `png` crate, and the overall impact was modest but consistently negative: I recorded a 1.6% regression in row 1.9 from "2024-11-27 - 2024-12-02" and a 0.13% regression in row 1.1.1 from "2024-12-12 - 2024-12-17" in a document here. Back then I didn't act on this data, because the two results were different, the delta was relatively modest, and I don't fully trust my benchmarking methodology (leaning toward attributing small differences to noise).

Based on the results above, I plan to turn off the `unstable` feature of the `png` crate in Chromium (but keep using `crc32fast/nightly`).

Open questions:
- Whether to remove the `std::simd`-based code from the `png` crate. […]
- The problem might be in `rustc` or LLVM here. OTOH, 1) I am not sure how to efficiently narrow the problem down into something that others can debug further, and 2) I am not sure if `rustc` / LLVM wants to or can give any guarantees about when auto-vectorization occurs. It is also possible that the difference is not because Rust ends up slower on Windows, but because C++ ends up faster on Windows.
- Rust is still faster than C++ on Linux (even when the `unstable` feature of the `png` crate is off). From this perspective we can probably say that auto-vectorization works well.
- The `png` unfiltering code is quite complex, and different code paths are used on different platforms. Ideally this would be accompanied by automated tests that can catch functional and/or performance regressions on all special-cased platforms, but I am not sure if GitHub supports this.

/cc @calebzulawski from https://rust-lang.github.io/rfcs/2977-stdsimd.html (see also the related rust-lang/rfcs#2948)
/cc @veluca93, who has kindly helped me narrow down the microbenchmarks-vs-field-trial differences I observed into the Windows-vs-Linux investigation above