SIMD level selection #352

Zylann · 2022-12-08T00:09:42Z


I'm relatively new to SIMD management, so I have some questions.

It seems that choosing the SIMD levels when compiling Jolt requires knowing exactly which CPU the game will run on. That makes sense, and on consoles I guess it's easy to get the best performance. But on PC it is quite limiting, because users of a PC release can have a wide range of CPUs. According to the Steam hardware survey, AVX2 appears widely supported, but the roughly 5% of users without it would end up crashing. That means the distribution would have to target no more than SSE3 to cover 100% of users with the same executable, leaving everyone else with no way to exploit better performance.

So in order to get maximum performance, the game would have to be built not only for every platform, but also for multiple SIMD levels, and users would somehow have to install the right one.
I have seen/used other projects (FastNoise2 and ISPC) where multiple variants of the SIMD part of the code can be compiled into the same executable and selected dynamically at runtime based on cpuid. This allows the code to "just work" on as many platforms as possible, without requiring many versions of the executable. It seems unlikely that Jolt can do something similar though?
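
For illustration, here is a minimal sketch of that kind of runtime selection, written for GCC/Clang on x86. It is not Jolt code and not necessarily how FastNoise2 or ISPC do it; `DotScalar`, `DotAVX2`, `SelectDot` and the `SKETCH_X86_DISPATCH` macro are hypothetical names. One function is compiled twice, and the variant is picked once through a function pointer rather than on every call.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
#include <immintrin.h>
#else
#define SKETCH_X86_DISPATCH 0
#endif

// Portable fallback, built with whatever baseline the compiler targets.
float DotScalar(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#if SKETCH_X86_DISPATCH
// Only this function is compiled with AVX2 enabled; the rest of the file is not.
__attribute__((target("avx2")))
float DotAVX2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float lane : lanes) sum += lane;
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}
#endif

using DotFn = float (*)(const float*, const float*, std::size_t);

// Picks the best available variant for the CPU the program is running on.
DotFn SelectDot() {
#if SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return DotAVX2;
#endif
    return DotScalar;
}
```

The caller would store the result of `SelectDot()` once at startup and call through it afterwards. Even then, every call goes through a pointer and cannot be inlined, which is the cost to weigh for very small operations.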

Another way would be to compile just Jolt as multiple dynamic libraries, one per SIMD level, and load the best one dynamically at runtime. However, that means the game can't use Jolt directly anymore. It would have to be abstracted behind an interface, and the dynamic library would instantiate its implementation or fill in function pointers, which feels kinda meh.
Yet another option is to ship all versions for a given OS and have a launcher to choose the best version (assuming we can't always count on the means of distribution to download only the right one).

I'm inclined to downgrade to SSE3 if I have to ship just one executable, as it is the simplest and safest option, and maybe use AVX2 for my personal testing since I know my CPU supports it, but I feel like I'd be missing out.

Also, I'm wondering when I can enable the LZCNT, TZCNT, F16C and FMADD options. Is x86-64 enough or are there more things to check? (still in the scope of targeting PC in general)

Answered by jrouwe


Wunkolo · 2022-12-08T02:15:55Z


I personally favor multiple binaries for each architecture rather than dynamically dispatching AVX or SSE code for every Vec3::Abs or Vec3::DotV4 operation.

So rather than having conditional branches all over the code for certain SIMD features, it's just a static high-level decision (picking an applicable .dll, shipping multiple executables, etc.) that maintains performance and avoids all the branching and slowdowns required to dispatch arch-optimized functions at runtime.

Clear Linux takes this route: the host machine's capabilities are detected, and then the most applicable version of a package, compiled specifically for that architecture's features, is downloaded. I think Clear Linux delimits this by nehalem, haswell, and skylakex.

Linux also has formal names for particular microarchitecture feature-levels:

x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, OSFXSR, SCE, SSE, SSE2
x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
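
As a rough illustration of how these levels could be checked at runtime, the sketch below uses the GCC/Clang builtin `__builtin_cpu_supports`. `DetectX86FeatureLevel` is a hypothetical helper, and it only tests a subset of each level's features (for example it skips CMPXCHG16B, LAHF-SAHF, F16C, LZCNT and MOVBE), so treat the result as an approximation.

```cpp
#include <cassert>

// Returns the highest x86-64 feature level (1..4) whose tested features are
// all present, or 0 when the check is unavailable (non-x86 target, or a
// compiler without __builtin_cpu_supports) or SSE2 is missing.
int DetectX86FeatureLevel() {
    int level = 0;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) {
        level = 1;  // baseline x86-64
        const bool v2 = __builtin_cpu_supports("sse3") &&
                        __builtin_cpu_supports("ssse3") &&
                        __builtin_cpu_supports("sse4.1") &&
                        __builtin_cpu_supports("sse4.2") &&
                        __builtin_cpu_supports("popcnt");
        const bool v3 = v2 &&
                        __builtin_cpu_supports("avx") &&
                        __builtin_cpu_supports("avx2") &&
                        __builtin_cpu_supports("bmi") &&
                        __builtin_cpu_supports("bmi2") &&
                        __builtin_cpu_supports("fma");
        const bool v4 = v3 &&
                        __builtin_cpu_supports("avx512f") &&
                        __builtin_cpu_supports("avx512bw") &&
                        __builtin_cpu_supports("avx512cd") &&
                        __builtin_cpu_supports("avx512dq") &&
                        __builtin_cpu_supports("avx512vl");
        if (v2) level = 2;
        if (v3) level = 3;
        if (v4) level = 4;
    }
#endif
    assert(level >= 0 && level <= 4);
    return level;
}
```

An installer or launcher could map the returned number to a build directory. On MSVC the same information would have to come from `__cpuid`/`__cpuidex` instead, which this sketch does not cover.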

Some software, such as PCSX2, goes this route by releasing both an SSE4 and an AVX version.


jrouwe · 2022-12-08T20:04:49Z

Maintainer

I have the same preference as @Wunkolo.

Switching between versions for every call to Abs or Dot has too much overhead indeed. The alternative is to compile hot spots in the code in multiple versions (e.g. I know that some physics engines do this for the inner loop of the solver). Obviously that limits the benefit to only those portions of the code at the cost of a much more complex code base (basically have to support multiple vector classes, multiple matrix classes etc. or SIMD-ify everything in place which makes things much less readable).

I think you'll get most gain by compiling the executable in multiple flavors and have a 'main' executable whose only job it is to start the right executable.
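
A minimal sketch of what such a launcher's decision could look like, assuming GCC/Clang on x86 and two builds of the game. The `_avx2` and `_sse2` suffixes and both function names are made up for this example; Jolt does not prescribe any naming.

```cpp
#include <cassert>
#include <string>

// True when the CPU reports the features an AVX2 + FMA build would rely on.
bool CpuCanRunAvx2Build() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return false;  // unknown platform: fall back to the most conservative build
#endif
}

// Builds the name of the executable to start.
// The "_avx2" / "_sse2" suffixes are hypothetical.
std::string ChooseExecutable(const std::string& base_name, bool avx2_ok) {
    assert(!base_name.empty());
    return base_name + (avx2_ok ? "_avx2" : "_sse2");
}
```

The launcher's `main` would then start `ChooseExecutable("game", CpuCanRunAvx2Build())` with the platform's process API (for example `CreateProcess` on Windows or `execv` on POSIX) and forward its command line arguments. The launcher itself has to be compiled for the lowest common denominator, otherwise it can crash before it gets to make the choice.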

That said, I have to say that the performance benefit of AVX2 over SSE2 (Jolt doesn't have an SSE3 version) is much smaller than you may expect. See this discussion: #327 (reply in thread)

So if you don't want to go through the hassle, dropping down to SSE2 is not the end of the world.


jrouwe · 2022-12-08T20:13:42Z

Maintainer

By the way, LZCNT, TZCNT, F16C and FMADD are all newer than SSE2, so if you want to drop down to SSE2 you should disable these as well. In fact, the 32-bit build turns all of these off by default, see cmake_vs2022_cl_32bit.bat.
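
One way to double check what a given build actually assumes is to print the instruction sets the compiler was told to use. The sketch below relies on the predefined macros `__AVX2__`, `__LZCNT__`, `__BMI__` (TZCNT is part of BMI1), `__F16C__` and `__FMA__` that GCC and Clang set; MSVC defines `__AVX2__` under `/arch:AVX2` but, as far as I know, not the others. `CompiledFeatures`, `GetCompiledFeatures` and `Describe` are hypothetical names and not part of Jolt.

```cpp
#include <cassert>
#include <string>

struct CompiledFeatures {
    bool avx2;
    bool lzcnt;
    bool tzcnt;
    bool f16c;
    bool fma;
};

// Reports which instruction sets this translation unit was compiled to assume.
CompiledFeatures GetCompiledFeatures() {
    CompiledFeatures f{false, false, false, false, false};
#ifdef __AVX2__
    f.avx2 = true;
#endif
#ifdef __LZCNT__
    f.lzcnt = true;
#endif
#ifdef __BMI__
    f.tzcnt = true;
#endif
#ifdef __F16C__
    f.f16c = true;
#endif
#ifdef __FMA__
    f.fma = true;
#endif
    return f;
}

// Formats the enabled features as a space separated list, or "none".
std::string Describe(const CompiledFeatures& f) {
    std::string out;
    auto add = [&out](bool enabled, const char* name) {
        assert(name != nullptr);
        if (!enabled) return;
        if (!out.empty()) out += ' ';
        out += name;
    };
    add(f.avx2, "AVX2");
    add(f.lzcnt, "LZCNT");
    add(f.tzcnt, "TZCNT");
    add(f.f16c, "F16C");
    add(f.fma, "FMA");
    return out.empty() ? std::string("none") : out;
}
```

Logging `Describe(GetCompiledFeatures())` from an SSE2-only build should give `none`; anything else suggests that one of the newer options is still switched on.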

