-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Porting to MSVC #28
Porting to MSVC #28
Conversation
less_slow.cpp
Outdated
@@ -525,7 +552,7 @@ static void sorting_with_openmp(bm::State &state) { | |||
|
|||
#pragma omp parallel for | |||
// Sort each chunk in parallel | |||
for (std::size_t i = 0; i < chunks; i++) { | |||
for (int64_t i = 0; i < chunks; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why switch to to int64_t
here and in the lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned it in the notes, but, at least on msvc, openmp only allows signed integers as the parallel for loop index/iterator.
less_slow.cpp
Outdated
@@ -2245,6 +2316,9 @@ BENCHMARK(cblas_tops<double>)->RangeMultiplier(2)->Range(8, 65536)->Complexity(b | |||
|
|||
template <typename scalar_type_> | |||
static void eigen_tops(bm::State &state) { | |||
// Make sure Eigen uses all cores - also, Eigen can't multithread without openMP or GEMM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's GEMM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General Matrix-Matrix product, I guess the more correct comment would be to say the internal GEMM threadpool that Eigen manages. There seems to be a way to make eigen use that instead of openMP (there is a EIGEN_GEMM_THREADPOOL flag that is checked) but I'm not too familiar with eigen, so not sure how that would be done.
less_slow.cpp
Outdated
#if defined(_MSC_VER) | ||
// MSVC does not implement std::regex_constants::multiline yet | ||
auto regex_options = std::regex_constants::ECMAScript; | ||
#else | ||
// Use multiline mode so ^ and $ anchor to line breaks. | ||
auto regex_options = std::regex_constants::ECMAScript | std::regex_constants::multiline; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe replace this with a simple if
or ternary condition as opposed to #if
macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how that would work at runtime, because MSVC does not define std::regex_constants::multiline at all, so any mention of it would not compile.
less_slow.cpp
Outdated
// On MSVC, high_resolution_clock is steady_clock, which cannot have to_time_t applied to it | ||
auto now = std::chrono::system_clock::now(); | ||
#else | ||
auto now = std::chrono::high_resolution_clock::now(); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a good place to add some notes on how std::chrono
ambiguously wraps different system-level APIs without providing many guarantees 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I'll make some notes about that, and about how high_resolution_clock is implementation-defined.
# Fetch and build OpenBLAS | ||
FetchContent_Declare( | ||
OpenBLAS | ||
GIT_REPOSITORY https://github.com/xianyi/OpenBLAS.git | ||
GIT_TAG v0.3.29 | ||
) | ||
|
||
# Set OpenBLAS build options | ||
set(NOFORTRAN ON CACHE BOOL "Disable Fortran" FORCE) | ||
set(BUILD_WITHOUT_LAPACK OFF CACHE BOOL "Build without LAPACK" FORCE) | ||
set(USE_THREAD ON CACHE BOOL "Use threading" FORCE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, BLAS to map to OpenBLAS on all platforms is a probably a good idea. Agreed 🤝
Hi @RazielXYZ! This is awesome, epic job and an amazing PR 🔥 Minor Questions
Future Work
Yes, that can be discussed in #14. |
@ashvardanian thanks for giving this such an in-depth look! On MSVC, OpenMP explicitly requires signed integers for parallel for loops. I don't think there's a way around it, but we could ifdef to just MSVC if wanted (although, in the case where these are used, the upper limit will realistically never be high enough to require the extra bit of unsigned, unless we expect to get close to ten quintillion cores any time soon, so I'm not sure why we would use unsigned there). For the flags, they are not all necessary, but that is the collection of flags that results in maximum optimizations for speed. Realistically we could leave just O2 and GL. |
Wow, didn't know. Is it OK if we use |
Of course, I'll change that |
I've polished some little things and pushed to your branch. I hope that's fine 🤗
Agreed! Maybe we should dynamically fetch the RAM volume and skip all the benchmarks that require more than a quarter of RAM? Or just set a lower number? |
As for your benchmarks, the higher-level abstractions and bigger tasks definitely look interesting. Pipelines & Metaprogramming
On GCC,
TOPsAssuming the code was compiled with AVX2, I was expecting to see the numbers for #if defined(__AVX2__) |
…Also, use std::int64_t for consistency.
Merge branch 'msvc' of https://github.com/RazielXYZ/less_slow.cpp into msvc # Conflicts: # less_slow.cpp
Perfectly fine! I've merged it with my new changes. In particular, I changed the windows code for physical_cores to get actual physical_cores - the GetActiveProcessorCount implementation gets logical cores, and I realize now that we specifically want physical ones. Somewhat unfortunately, that is far more complicated than getting logical ones! Also, I appreciate sorting includes in logical and alphabetical order as much as the next guy, but including WinBase.h before Windows.h actually can not compile. Generally, Windows.h must always be included before any other windows header. |
The condition is valid, but I have ifdef'd out the benchmarks for asm kernels from MSVC, because I don't know how (or if) the ASM files can be linked in with MSVC. Just adding the file to the project like CMake is doing does not actually get the contents linked in by the linker. I can look into that further later, as I imagine it should be possible. |
Include order was changed by clang-format. We can either introduce a gap line or disable clang-format for that section of includes 🤔 |
Fair enough! I usually keep include re-ordering disabled in my formatters since I've been bitten a few times (and not just by windows headers). |
That's a good approach, although even if one has the RAM, it's still a long runtime. |
That's a good idea, worth patching in my other projects as well!
Yes, that may have practical importance beyond this repo as well. In SimSIMD I am tired of dealing with compiler compatibility problems for SIMD intrinsics and considering a shift towards pure Assembly. |
Disables automatic lexicographic ordering in favor of manual, putting lighter includes first. #28 (comment) Co-authored-by: RazielXYZ <[email protected]>
@RazielXYZ, I've resolved sorting includes. LMK if there is smth else you want to add to this PR or is it good to go? |
@ashvardanian Should be good to go, I can make a new PR for when I investigate getting the ASM kernels to work on msvc, and whatever else I can find/improve! |
I also just remembered, I wanted to also point out that this repo has no license - I assume it should be Apache 2 like the others? |
Will add a license soon. |
I'll have to revert BLAS settings, as it fails to compile on Linux 🤦♂️ |
I've built and ran the current version (with FetchContent openBLAS) on Linux too, what's it failing on? |
It's failing on AVX-512 intrinsics for me. |
@ashvardanian odd, which ones? The only ones I couldn't build/test here are the intel specific ones, since I'm on AMD. |
I am on Sapphire Rapids. Can try again during the week. |
@RazielXYZ, This is an amazing job, thank you very much; I've been looking at trying to properly get this to work on MSVC, too, for a bit of time, but I noticed this last night, and I was so happy. I am running the tests over the weekend on this, as i have both AMD + Intel, Windows workstations. |
Background:
I frequently write back-end code that I try to keep both high-performance and multi-platform.
However, my usual primary platform is Windows with MSVC, so I was somewhat disappointed to see that for this awesome project, the recommended way to build and run it on Windows is "use WSL".
I decided to see what changes would be needed to get it working on MSVC, and analyze what's going on in the process, as I believe Windows and MSVC are still important platforms, especially in the enterprise space, and I want code for them to be as less slow as possible too!
BLAS:
I'm honestly not sure how one would get CMake to find BLAS on Windows. Further, the official BLAS builds on Windows use mingw, which I generally prefer to pretend doesn't exist.
As such, both to make it consistent across platforms and to make it work more easily on Windows, I've switched to using OpenBLAS through FetchContent.
This works well on all platforms, but I have not yet benchmarked the performance differences between reference blas and openblas.
I would suggest considering reducing the default highest range for the TOPs benchmarks to something lower than 65536, as that requires slightly over 100GB of RAM, and takes quite long even on very fast machines like my Epyc 9654.
A side-note about FetchContent:
I don't like it. It seems very fragile, it depends on both git and cmake exclusively, and the tiniest change in the CMakeLists requires rebuilding everything.
While it seems suitable for a project like this, I would not use it in anything larger, that changes frequently, or that more than a couple people work on; conan and vcpkg seem much more suitable for that - I may be slightly biased since I frequently use conan though.
OpenMP:
OpenMP works fine on MSVC, and CMake knows what to do when it's linked in for a MSVC project to use it without any other additions.
It's also required for Eigen to use parallelism, so I removed the condition of only using OpenMP on Linux.
The only other related change was that
omp parallel for
loops explicitly require a signed index, so I changed them from size_t to int64_t.Compile options:
I added suitable compile options for MSVC, and removed the ones that would either not do anything or break MSVC. /Zc:__cplusplus is actually required for stringzilla to see that we have std::string_view available, among other things.
About high core counts on Windows:
Windows splits up logical CPUs into processor groups, which can have at most 64 processors each. In the past, using cores across groups would require going through the windows APIs and setting affinity across groups, as well as potentially manually managing thread affinities.
In recent Windows versions, applications are allowed and able to just implicitly use cores across groups without doing anything fancy.
However, both
std::thread::hardware_concurrency
andGetSystemInfo
only return, at most, a value of 64 (the limit of one processor group). In order to get the real number of logical processors, one has to useGetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
or iterate through the relations provided byGetLogicalProcessorInformationEx
.As such, I've implemented that in the
physical_cores
method.As an aside, Google Bench also has no idea what it's doing with this CPU on Windows, only reporting 64 cores and x32 L1 and L2 caches, as well as an L3 cache of 32MB x4 (when it's actually x12), which is very sad to see. In my experience, many libraries and projects just rely on
std::thread::hardware_concurrency
, and while it's not a problem for many users (not that many systems have more than 64 cores still), it's not great to encounter, and I think many people aren't even aware of the issue.Logging:
I could not, for the life of me, get either the std::format or the fmt logging benchmark implementations to compile on msvc.
For the std implementation, it seems like chrono is missing a parse implementation, so it refuses to use it in there. It also complains about the result being consteval rather than constexpr.
For the fmt implementation, I am encountering errors inside the fmt headers, and namespace mismatches. I'm not sure if this is FetchContent's fault, as I use fmt in all my projects with no such issues through conan. It is also still complaining about there being no parse implementation for chrono::time_point as well.
Other notes and small changes:
Due to the lack of
__builtin_popcountll
on msvc, I used a manual implementation foris_power_of_two
for now. We could consider using std::popcount or intrinsics as well, and micro-benchmarking between them might be interesting.I've also implemented
LESS_SLOW_ALWAYS_INLINE
to replace the instances ofgnu::always_inline
with something more portable, and exemplify how that is usually done.I've set the thread count for both blas/openblas and eigen to physical cores manually.
Openblas seems to need this, as it only uses 64 cores by default, but eigen actually does not, as openmp seems to give it the real number of logical cores (192 in this case) without help.
std::pair
appears to actually be trivially copyable in MSVC, so I've removed the assert on that for MSVC.The assembly files and inline assembly (obviously) won't easily work on MSVC, so they're ifdef'd out for MSVC for now. In the future, it would be interesting to implement those with intrinsics and/or a portable third-party SIMD library as well, and perhaps benchmark between those implementations.
Some things have been moved around, in particular the logging tests so that I can only ifdef out the fmt and format ones, and
is_power_of_two
so that it can be used inmemory_access
Building with AVX512 on MSVC is making linking extremely slow for some reason. Not sure how to identify what's causing it. AVX2 and others are building normally.
Intel/oneAPI TBB does not seem to be used anywhere or for anything yet, so I didn't change anything about it, but it can and should work on MSVC as well - there should be no reason to only add it on Linux. Using other modern libraries for task scheduling and parallelism would be interesting as well - taskflow and libfork are great candidates but there are many in this space.
Diving into libraries (and perhaps implementations) for optimized parallel and lockfree containers would also be of interest, I think, as well as exemplifying usage of queues (SPSC, MPMC, etc) and well-optimized asynchronous event loops and reactors.
Results:
For reference, here is one set of results from my machine - Windows 11 24H2, Epyc 9654, 512GB RAM. Built/running with AVX2. No particular optimizations were done (other things were running, SMT was on, did not use random interleaving, etc) so they're more preliminary and just for fun.