Porting to MSVC #28

Merged
merged 10 commits into from
Jan 26, 2025


@RazielXYZ RazielXYZ commented Jan 25, 2025

Background:

I frequently write back-end code that I try to keep both high-performance and multi-platform.
However, my usual primary platform is Windows with MSVC, so I was somewhat disappointed to see that for this awesome project, the recommended way to build and run it on Windows is "use WSL".
I decided to see what changes would be needed to get it working on MSVC, and analyze what's going on in the process, as I believe Windows and MSVC are still important platforms, especially in the enterprise space, and I want code for them to be as less slow as possible too!

BLAS:

I'm honestly not sure how one would get CMake to find BLAS on Windows. Further, the official BLAS builds on Windows use mingw, which I generally prefer to pretend doesn't exist.
As such, both to make it consistent across platforms and to make it work more easily on Windows, I've switched to using OpenBLAS through FetchContent.
This works well on all platforms, but I have not yet benchmarked the performance difference between reference BLAS and OpenBLAS.
I would suggest lowering the default upper bound for the TOPs benchmarks to something below 65536, as that size requires slightly over 100 GB of RAM and takes a long time even on very fast machines like my Epyc 9654.
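As a rough check on that memory figure, here is a minimal sketch of the arithmetic. It assumes the benchmark keeps three dense square matrices alive at once (the two inputs and the output), which is my assumption rather than something stated above; `gemm_bytes` is a hypothetical helper name.

```cpp
#include <cstdint>

// Bytes needed for `matrices` dense n-by-n matrices of a given scalar size.
// Assumption: three matrices are alive at once (A, B and C in C = A * B).
constexpr std::uint64_t gemm_bytes(std::uint64_t n, std::uint64_t scalar_size,
                                   std::uint64_t matrices = 3) {
    return matrices * n * n * scalar_size;
}

// 65536 x 65536 doubles: 3 * 2^32 * 8 bytes = 96 GiB, roughly 103 GB.
static_assert(gemm_bytes(65536, sizeof(double)) == 103079215104ull, "96 GiB");
// The same size in floats needs half of that.
static_assert(gemm_bytes(65536, sizeof(float)) == 51539607552ull, "48 GiB");
// Capping the range at 16384 would bring the double case down to 6 GiB.
static_assert(gemm_bytes(16384, sizeof(double)) == 6442450944ull, "6 GiB");
```

Under that assumption the double-precision case lands at about 103 GB, which is consistent with "slightly over 100 GB".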

A side-note about FetchContent:

I don't like it. It seems very fragile, it depends on both git and cmake exclusively, and the tiniest change in the CMakeLists requires rebuilding everything.
While it seems suitable for a project like this, I would not use it in anything larger, anything that changes frequently, or anything that more than a couple of people work on. Conan and vcpkg seem much better suited to that, though I may be slightly biased since I use Conan frequently.

OpenMP:

OpenMP works fine on MSVC, and CMake knows how to link it into an MSVC project without any extra configuration.
It's also required for Eigen to use parallelism, so I removed the condition of only using OpenMP on Linux.
The only other related change was that omp parallel for loops require a signed index on MSVC, so I changed them from size_t to int64_t.
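For illustration, a loop with a signed index might look like the sketch below. `parallel_sum` is a hypothetical function, not one from this PR. If OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums a vector with an OpenMP reduction. The loop counter is std::int64_t
// rather than std::size_t so that the same loop is accepted by MSVC.
double parallel_sum(std::vector<double> const &values) {
    double total = 0.0;
    std::int64_t const count = static_cast<std::int64_t>(values.size());
    assert(count >= 0); // the size must fit into the signed index type
#pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < count; ++i)
        total += values[static_cast<std::size_t>(i)];
    return total;
}
```

The cast back to `std::size_t` at the indexing site keeps sign-conversion warnings quiet without changing the loop variable's type.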

Compile options:

I added suitable compile options for MSVC, and removed the ones that would either not do anything or break MSVC. /Zc:__cplusplus is actually required for stringzilla to see that we have std::string_view available, among other things.
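To show why the flag matters: without /Zc:__cplusplus, MSVC reports `__cplusplus` as 199711L regardless of the selected standard, so any library that tests `__cplusplus` concludes it is on C++98. A minimal sketch of a check that works either way follows, using MSVC's `_MSVC_LANG` macro. The `EXAMPLE_CPLUSPLUS` macro and the two helper functions are hypothetical names, not taken from stringzilla or this PR.

```cpp
// Pick whichever macro carries the real language version.
// MSVC always sets _MSVC_LANG to the selected standard, while __cplusplus
// stays at 199711L unless /Zc:__cplusplus is passed.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
#define EXAMPLE_CPLUSPLUS _MSVC_LANG
#else
#define EXAMPLE_CPLUSPLUS __cplusplus
#endif

// Effective C++ standard version, e.g. 201703L for C++17.
constexpr long effective_cpp_version() { return EXAMPLE_CPLUSPLUS; }

// std::string_view arrived with C++17.
constexpr bool has_string_view() { return EXAMPLE_CPLUSPLUS >= 201703L; }

static_assert(effective_cpp_version() >= 201103L, "C++11 or newer expected");
```

Passing the flag is still the simpler fix when the check lives in a third-party header that cannot be changed.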

About high core counts on Windows:

Windows splits up logical CPUs into processor groups, which can have at most 64 processors each. In the past, using cores across groups would require going through the Windows APIs and setting affinity across groups, as well as potentially manually managing thread affinities.
In recent Windows versions, applications are allowed and able to just implicitly use cores across groups without doing anything fancy.
However, both std::thread::hardware_concurrency and GetSystemInfo return at most 64 (the limit of one processor group). To get the real number of logical processors, one has to call GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) or iterate through the relations provided by GetLogicalProcessorInformationEx.
As such, I've implemented that in the physical_cores method.
As an aside, Google Benchmark also has no idea what it's doing with this CPU on Windows: it reports only 64 cores, x32 L1 and L2 caches, and an L3 cache of 32MB x4 (when it's actually x12), which is very sad to see. In my experience, many libraries and projects just rely on std::thread::hardware_concurrency. That's not a problem for most users (few systems have more than 64 logical processors yet), but it's not great to encounter, and I think many people aren't even aware of the issue.
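A minimal sketch of the processor-group-aware query described above. This is not the code from the PR's physical_cores method: `logical_processor_count` is a hypothetical name, and the fallback policy (return 1 when nothing is known) is my own choice.

```cpp
#include <cassert>
#include <thread>
#if defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <windows.h>
#endif

// Number of logical processors across all processor groups.
unsigned logical_processor_count() {
#if defined(_WIN32)
    // Counts active logical processors in every group, not just one group.
    DWORD const count = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    if (count != 0) return static_cast<unsigned>(count);
#endif
    // Elsewhere, or if the Win32 call fails, use the standard library hint,
    // which is allowed to return 0 when the value is unknown.
    unsigned const hinted = std::thread::hardware_concurrency();
    return hinted != 0 ? hinted : 1u;
}

// Usage: the result is always at least one.
inline void logical_processor_count_example() {
    assert(logical_processor_count() >= 1u);
}
```

Note that this counts logical processors, so with SMT enabled it is expected to be larger than the number of physical cores.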

Logging:

I could not, for the life of me, get either the std::format or the fmt logging benchmark implementations to compile on MSVC.
For the std implementation, it seems like chrono is missing a parse implementation, so it refuses to use it in there. It also complains about the result being consteval rather than constexpr.
For the fmt implementation, I am encountering errors inside the fmt headers, as well as namespace mismatches. I'm not sure if this is FetchContent's fault, as I use fmt through Conan in all my projects with no such issues. It also still complains about there being no parse implementation for chrono::time_point.

Other notes and small changes:

Due to the lack of __builtin_popcountll on MSVC, I used a manual implementation for is_power_of_two for now. We could consider using std::popcount or intrinsics as well, and micro-benchmarking between them might be interesting.
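As a sketch of what a compiler-independent version can look like (not necessarily the implementation in this PR), the single-set-bit test needs no population count at all. A portable bit counter is shown next to it for comparison; `popcount_u64` is a hypothetical helper name.

```cpp
#include <cstdint>

// Portable population count: clears the lowest set bit once per iteration.
constexpr unsigned popcount_u64(std::uint64_t x) {
    unsigned bits = 0;
    for (; x != 0; x &= x - 1) ++bits;
    return bits;
}

// A power of two has exactly one set bit; zero has none.
constexpr bool is_power_of_two(std::uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

static_assert(is_power_of_two(1) && is_power_of_two(64), "powers of two");
static_assert(is_power_of_two(1ull << 63), "highest bit only");
static_assert(!is_power_of_two(0) && !is_power_of_two(6), "not powers of two");
static_assert(popcount_u64(0xFFull) == 8, "eight set bits");
```

With C++20 available, `std::popcount` from `<bit>` could replace the loop, which is one of the alternatives worth micro-benchmarking.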

I've also implemented LESS_SLOW_ALWAYS_INLINE to replace the instances of gnu::always_inline with something more portable, and exemplify how that is usually done.
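For readers unfamiliar with the pattern, such a macro is commonly built along the lines below. This is a sketch, not the definition from this PR: `EXAMPLE_ALWAYS_INLINE` and `square` are hypothetical names.

```cpp
#include <cassert>

// Map one macro onto each compiler's "inline this no matter what" spelling.
#if defined(_MSC_VER)
#define EXAMPLE_ALWAYS_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define EXAMPLE_ALWAYS_INLINE inline // plain hint on unknown compilers
#endif

// The macro replaces the `inline` keyword in a function definition.
EXAMPLE_ALWAYS_INLINE int square(int x) { return x * x; }

// Usage: call sites look the same on every compiler.
inline void square_example() { assert(square(7) == 49); }
```

The `_MSC_VER` branch comes first so that clang-cl, which defines both `_MSC_VER` and `__clang__`, takes the MSVC spelling.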

I've set the thread count for both BLAS/OpenBLAS and Eigen to physical cores manually.
OpenBLAS seems to need this, as it only uses 64 cores by default, but Eigen actually does not, as OpenMP seems to give it the real number of logical cores (192 in this case) without help.

std::pair appears to actually be trivially copyable in MSVC, so I've removed the assert on that for MSVC.
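A small sketch of how such a check can be written so that it holds on every standard library: assert the property for a hand-written aggregate, and only record the library-dependent answer for std::pair. The `custom_pair_t` type here is hypothetical and only illustrates the idea.

```cpp
#include <type_traits>
#include <utility>

// A plain aggregate of trivially copyable members is trivially copyable
// on every conforming implementation.
struct custom_pair_t {
    int first;
    float second;
};
static_assert(std::is_trivially_copyable<custom_pair_t>::value,
              "aggregates of scalars are trivially copyable");

// For std::pair the answer depends on how the standard library defines its
// copy and move operations, so it is recorded rather than asserted.
constexpr bool std_pair_is_trivially_copyable =
    std::is_trivially_copyable<std::pair<int, float>>::value;
```

A benchmark that wants to contrast the two can branch on `std_pair_is_trivially_copyable` instead of hard-coding the expectation per compiler.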

The assembly files and inline assembly (obviously) won't easily work on MSVC, so they're ifdef'd out for MSVC for now. In the future, it would be interesting to implement those with intrinsics and/or a portable third-party SIMD library as well, and perhaps benchmark between those implementations.
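The guard pattern can be sketched as follows: GNU-style inline assembly is compiled only where it is understood, and every other toolchain gets an equivalent plain C++ path. The macro and function names are hypothetical, and the single-instruction example is far simpler than the kernels in the project.

```cpp
#include <cassert>
#include <cstdint>

// GNU-style inline assembly: GCC or Clang targeting x86-64, excluding MSVC
// (which has no inline assembler for x64 targets).
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__) && !defined(_MSC_VER)
#define EXAMPLE_HAS_GNU_INLINE_ASM 1
#else
#define EXAMPLE_HAS_GNU_INLINE_ASM 0
#endif

inline std::int32_t add_i32(std::int32_t a, std::int32_t b) {
#if EXAMPLE_HAS_GNU_INLINE_ASM
    // AT&T syntax: adds the source operand %1 into the destination %0.
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b; // portable fallback for MSVC and other targets
#endif
}

// Usage: both paths must agree on the result.
inline void add_i32_example() { assert(add_i32(2, 40) == 42); }
```

Keeping a fallback behind the same function name means the benchmarks still build on MSVC, even though the assembly variant itself is skipped there.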

Some things have been moved around, in particular the logging tests, so that I can ifdef out only the fmt and std::format ones, and is_power_of_two, so that it can be used in memory_access.

Building with AVX512 on MSVC is making linking extremely slow for some reason. Not sure how to identify what's causing it. AVX2 and others are building normally.

Intel/oneAPI TBB does not seem to be used anywhere or for anything yet, so I didn't change anything about it, but it can and should work on MSVC as well - there should be no reason to only add it on Linux. Using other modern libraries for task scheduling and parallelism would be interesting as well - taskflow and libfork are great candidates but there are many in this space.

Diving into libraries (and perhaps implementations) for optimized parallel and lockfree containers would also be of interest, I think, as well as exemplifying usage of queues (SPSC, MPMC, etc) and well-optimized asynchronous event loops and reactors.

Results:

For reference, here is one set of results from my machine - Windows 11 24H2, Epyc 9654, 512GB RAM. Built/running with AVX2. No particular optimizations were done (other things were running, SMT was on, did not use random interleaving, etc) so they're more preliminary and just for fun.

Cache line width: 64 bytes
2025-01-25T15:20:50+02:00
Running D:\Dev\Proj\OSS\less_slow.cpp\build_release\Release\less_slow.exe
Run on (64 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 1024 KiB (x32)
  L3 Unified 32768 KiB (x4)
--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
i32_addition                                                                   0.000 ns        0.000 ns   1000000000000
i32_addition_random                                                             13.4 ns         13.5 ns     49777778
i32_addition_paused                                                              298 ns          307 ns      2036364
i32_addition_randomly_initialized                                               1.44 ns         1.46 ns    448000000
i32_addition_random/threads:192                                                 32.9 ns         61.0 ns     19200000
i32_addition_randomly_initialized/threads:192                                    905 ns          399 ns      1920000
sorting/3/0                                                                      308 ns          284 ns      2036364
sorting/3/1                                                                     11.9 ns         12.0 ns     56000000
sorting/4/0                                                                      340 ns          322 ns      2036364
sorting/4/1                                                                     17.2 ns         17.3 ns     40727273
sorting/1024/0                                                                  6593 ns         6562 ns       100000
sorting/1024/1                                                                  6302 ns         6278 ns       112000
sorting/8196/0                                                                 64085 ns        62779 ns         8960
sorting/8196/1                                                                 67305 ns        66961 ns         7467
sorting_with_executors/seq/1048576/min_time:10.000/real_time                13012314 ns     12959392 ns         1102
sorting_with_executors/seq/4194304/min_time:10.000/real_time                59507808 ns     59179688 ns          240
sorting_with_executors/seq/16777216/min_time:10.000/real_time              262529059 ns    260706019 ns           54
sorting_with_executors/seq/67108864/min_time:10.000/real_time             1119567150 ns   1114583333 ns           12
sorting_with_executors/seq/268435456/min_time:10.000/real_time            4741767767 ns   4739583333 ns            3
sorting_with_executors/seq/min_time:10.000/real_time_BigO                       0.63 NlgN       0.63 NlgN
sorting_with_executors/seq/min_time:10.000/real_time_RMS                           1 %             1 %
sorting_with_executors/par_unseq/1048576/iterations:1/real_time              4949900 ns        0.000 ns            1
sorting_with_executors/par_unseq/4194304/iterations:1/real_time             12920800 ns     15625000 ns            1
sorting_with_executors/par_unseq/16777216/iterations:1/real_time            44564900 ns     46875000 ns            1
sorting_with_executors/par_unseq/67108864/iterations:1/real_time           163719400 ns    171875000 ns            1
sorting_with_executors/par_unseq/268435456/iterations:1/real_time          648277000 ns    656250000 ns            1
sorting_with_executors/par_unseq/iterations:1/real_time_BigO                    0.09 NlgN       0.09 NlgN
sorting_with_executors/par_unseq/iterations:1/real_time_RMS                        4 %             6 %
sorting_with_openmp/1048576/min_time:10.000/real_time                        7238191 ns      6118993 ns         1956
sorting_with_openmp/4194304/min_time:10.000/real_time                       20118308 ns     18194444 ns          675
sorting_with_openmp/16777216/min_time:10.000/real_time                      71842451 ns     67012032 ns          187
sorting_with_openmp/67108864/min_time:10.000/real_time                     255427762 ns    235227273 ns           55
sorting_with_openmp/268435456/min_time:10.000/real_time                    799911782 ns    757352941 ns           17
sorting_with_openmp/min_time:10.000/real_time_BigO                              0.11 NlgN       0.10 NlgN
sorting_with_openmp/min_time:10.000/real_time_RMS                                 14 %            13 %
recursion_cost<recursive_sort_i32s, 16>                                          131 ns          129 ns      6400000
recursion_cost<iterative_sort_i32s, 16>                                          108 ns          109 ns      7466667
recursion_cost<recursive_sort_i32s, 256>                                       23377 ns        23577 ns        34462
recursion_cost<iterative_sort_i32s, 256>                                       22020 ns        21973 ns        32000
recursion_cost<recursive_sort_i32s, 4096>                                    4973032 ns      5000000 ns          100
recursion_cost<iterative_sort_i32s, 4096>                                    4745585 ns      4743304 ns          112
branch_cost/256                                                                 2.33 ns         2.34 ns    320000000
branch_cost/1024                                                                2.32 ns         2.30 ns    298666667
branch_cost/4096                                                                2.32 ns         2.30 ns    298666667
branch_cost/16384                                                               2.31 ns         2.30 ns    298666667
branch_cost/32768                                                               2.32 ns         2.30 ns    298666667
cache_misses_cost<sequential>/8192/min_time:2.000                              14584 ns        14519 ns       194783
cache_misses_cost<sequential>/32768/min_time:2.000                             57983 ns        57748 ns        48432
cache_misses_cost<sequential>/262144/min_time:2.000                           465532 ns       463021 ns         5973
cache_misses_cost<sequential>/2097152/min_time:2.000                         3747499 ns      3730745 ns          779
cache_misses_cost<sequential>/16777216/min_time:2.000                       28770821 ns     28756649 ns           94
cache_misses_cost<sequential>/134217728/min_time:2.000                     217800643 ns    217633929 ns           14
cache_misses_cost<random>/8192/min_time:2.000                                  14734 ns        14596 ns       192688
cache_misses_cost<random>/32768/min_time:2.000                                 57855 ns        57652 ns        47158
cache_misses_cost<random>/262144/min_time:2.000                               506304 ns       499442 ns         5600
cache_misses_cost<random>/2097152/min_time:2.000                             5062858 ns      5001081 ns          578
cache_misses_cost<random>/16777216/min_time:2.000                          181598376 ns    179227941 ns           17
cache_misses_cost<random>/134217728/min_time:2.000                        1649278050 ns   1640625000 ns            2
rvo_friendly                                                                    54.5 ns         54.7 ns     10000000
rvo_impossible                                                                   104 ns          103 ns      6400000
f64_sin                                                                         6.43 ns         6.42 ns    112000000 bytes_per_second=1.16099Gi/s
f64_sin_maclaurin                                                               35.0 ns         34.5 ns     19478261 bytes_per_second=221.183Mi/s
f64_sin_maclaurin_powless                                                       2.50 ns         2.51 ns    280000000 bytes_per_second=2.96699Gi/s
f64_sin_maclaurin_with_fast_math                                                2.50 ns         2.51 ns    280000000 bytes_per_second=2.96699Gi/s
integral_division                                                               1.96 ns         1.88 ns    373333333
integral_division_by_constexpr                                                  1.53 ns         1.53 ns    407272727
integral_division_by_const                                                      1.96 ns         1.95 ns    344615385
integral_division_with_doubles                                                  1.64 ns         1.65 ns    407272727
f32x4x4_matmul                                                                  18.4 ns         17.7 ns     34461538 TOP=6.33385G/s
f32x4x4_matmul_unrolled                                                         19.9 ns         19.0 ns     32000000 TOP=5.88144G/s
memory_access_unaligned/min_time:10.000                                      1542182 ns      1532559 ns         9237
memory_access_aligned/min_time:10.000                                        1393222 ns      1382646 ns         9956
spread_memory/gather_scalar/1024/min_time:5.000/min_warmup_time:1.000            323 ns          320 ns     22288557
spread_memory/gather_scalar/4096/min_time:5.000/min_warmup_time:1.000           1424 ns         1425 ns      5033708
spread_memory/gather_scalar/32768/min_time:5.000/min_warmup_time:1.000         18203 ns        18085 ns       386207
spread_memory/gather_scalar/262144/min_time:5.000/min_warmup_time:1.000       205096 ns       204233 ns        34198
spread_memory/gather_scalar/1048576/min_time:5.000/min_warmup_time:1.000     1023505 ns      1019547 ns         7111
spread_memory/scatter_scalar/1024/min_time:5.000/min_warmup_time:1.000           356 ns          355 ns     19911111
spread_memory/scatter_scalar/4096/min_time:5.000/min_warmup_time:1.000          1631 ns         1627 ns      4349515
spread_memory/scatter_scalar/32768/min_time:5.000/min_warmup_time:1.000        18909 ns        18935 ns       367213
spread_memory/scatter_scalar/262144/min_time:5.000/min_warmup_time:1.000      192876 ns       192186 ns        36423
spread_memory/scatter_scalar/1048576/min_time:5.000/min_warmup_time:1.000    1006542 ns      1004335 ns         6892
cblas_tops<float>/8                                                              376 ns          377 ns      1866667 TOP=2.54862G/s
cblas_tops<float>/16                                                            1278 ns         1256 ns       560000 TOP=6.32058G/s
cblas_tops<float>/32                                                            8183 ns         8161 ns        74667 TOP=7.90469G/s
cblas_tops<float>/64                                                           64924 ns        65569 ns        11200 TOP=7.93348G/s
cblas_tops<float>/128                                                         247439 ns       203321 ns         4073 TOP=20.5484G/s
cblas_tops<float>/256                                                        1290599 ns       767299 ns          896 TOP=43.6452G/s
cblas_tops<float>/512                                                        3659907 ns      2343750 ns          320 TOP=114.421G/s
cblas_tops<float>/1024                                                       7107774 ns      4743304 ns          112 TOP=452.519G/s
cblas_tops<float>/2048                                                      41423067 ns     33203125 ns           24 TOP=517.291G/s
cblas_tops<float>/4096                                                     263291567 ns    229166667 ns            3 TOP=599.66G/s
cblas_tops<float>/8192                                                    1873525700 ns   1531250000 ns            1 TOP=718.005G/s
cblas_tops<float>/16384                                                   1.4452e+10 ns   1.2578e+10 ns            1 TOP=699.295G/s
cblas_tops<float>/32768                                                   1.0966e+11 ns   9.6297e+10 ns            1 TOP=730.737G/s
cblas_tops<float>/65536                                                   8.7054e+11 ns   7.6606e+11 ns            1 TOP=734.856G/s
cblas_tops<float>_BigO                                                          0.00 N^3        0.00 N^3
cblas_tops<float>_RMS                                                              0 %             0 %
cblas_tops<double>/8                                                             427 ns          427 ns      1792000 TOP=2.24695G/s
cblas_tops<double>/16                                                           1509 ns         1507 ns       497778 TOP=5.26715G/s
cblas_tops<double>/32                                                           8754 ns         8894 ns        89600 TOP=7.25368G/s
cblas_tops<double>/64                                                          67465 ns        66267 ns         8960 TOP=7.84997G/s
cblas_tops<double>/128                                                        272437 ns       198778 ns         2987 TOP=21.018G/s
cblas_tops<double>/256                                                       1340431 ns       711178 ns          747 TOP=47.0893G/s
cblas_tops<double>/512                                                       3628079 ns      2188268 ns          407 TOP=122.55G/s
cblas_tops<double>/1024                                                      7489572 ns      4743304 ns          112 TOP=452.519G/s
cblas_tops<double>/2048                                                     44048300 ns     37006579 ns           19 TOP=464.125G/s
cblas_tops<double>/4096                                                    298876300 ns    260416667 ns            3 TOP=527.701G/s
cblas_tops<double>/8192                                                   2237111500 ns   2140625000 ns            1 TOP=513.609G/s
cblas_tops<double>/16384                                                  1.6801e+10 ns   1.4781e+10 ns            1 TOP=595.066G/s
cblas_tops<double>/32768                                                  1.2610e+11 ns   1.1206e+11 ns            1 TOP=627.932G/s
cblas_tops<double>/65536                                                  1.0268e+12 ns   9.1666e+11 ns            1 TOP=614.13G/s
cblas_tops<double>_BigO                                                         0.00 N^3        0.00 N^3
cblas_tops<double>_RMS                                                             1 %             1 %
eigen_tops<double>/8                                                             217 ns          213 ns      3446154 TOP=4.50493G/s
eigen_tops<double>/16                                                            950 ns          963 ns       746667 TOP=8.24424G/s
eigen_tops<double>/32                                                           5847 ns         5755 ns        89600 TOP=11.2102G/s
eigen_tops<double>/64                                                          27070 ns        27169 ns        23579 TOP=19.1463G/s
eigen_tops<double>/128                                                         50197 ns        50000 ns        10000 TOP=83.5584G/s
eigen_tops<double>/256                                                        172871 ns       167411 ns         4480 TOP=200.04G/s
eigen_tops<double>/512                                                       1766689 ns      1765971 ns          407 TOP=151.856G/s
eigen_tops<double>/1024                                                     79527122 ns     74652778 ns            9 TOP=28.7522G/s
eigen_tops<double>/2048                                                    614375700 ns    500000000 ns            1 TOP=34.3513G/s
eigen_tops<double>/4096                                                   1335294200 ns   1062500000 ns            1 TOP=129.339G/s
eigen_tops<double>/8192                                                   1.0451e+10 ns   9281250000 ns            1 TOP=118.459G/s
eigen_tops<double>/16384                                                  2.8860e+10 ns   2.5938e+10 ns            1 TOP=339.116G/s
eigen_tops<double>/32768                                                  2.2882e+11 ns   2.0462e+11 ns            1 TOP=343.886G/s
eigen_tops<double>/65536                                                  9.7846e+11 ns   8.7783e+11 ns            1 TOP=641.294G/s
eigen_tops<double>_BigO                                                         0.00 N^3        0.00 N^3
eigen_tops<double>_RMS                                                            32 %            32 %
eigen_tops<float>/8                                                              137 ns          138 ns      4977778 TOP=6.95079G/s
eigen_tops<float>/16                                                             539 ns          547 ns      1000000 TOP=14.5115G/s
eigen_tops<float>/32                                                            3145 ns         3139 ns       224000 TOP=20.5521G/s
eigen_tops<float>/64                                                           18643 ns        18834 ns        37333 TOP=27.62G/s
eigen_tops<float>/128                                                          71528 ns        71150 ns        11200 TOP=58.7203G/s
eigen_tops<float>/256                                                         268221 ns       266841 ns         2635 TOP=125.501G/s
eigen_tops<float>/512                                                        1319216 ns      1311384 ns          560 TOP=204.496G/s
eigen_tops<float>/1024                                                      39855005 ns     36184211 ns           19 TOP=59.3197G/s
eigen_tops<float>/2048                                                     302116567 ns    255208333 ns            3 TOP=67.3006G/s
eigen_tops<float>/4096                                                     665038400 ns    500000000 ns            1 TOP=274.844G/s
eigen_tops<float>/8192                                                    5202128400 ns   4500000000 ns            1 TOP=244.321G/s
eigen_tops<float>/16384                                                   1.3939e+10 ns   1.2016e+10 ns            1 TOP=732.032G/s
eigen_tops<float>/32768                                                   1.1089e+11 ns   9.7391e+10 ns            1 TOP=722.53G/s
eigen_tops<float>/65536                                                   4.5886e+11 ns   4.0717e+11 ns            1 TOP=1.38258T/s
eigen_tops<float>_BigO                                                          0.00 N^3        0.00 N^3
eigen_tops<float>_RMS                                                             34 %            33 %
eigen_tops<std::int16_t>/8                                                       514 ns          500 ns      1000000 TOP=1.92G/s
eigen_tops<std::int16_t>/16                                                     2981 ns         2916 ns       235789 TOP=2.72178G/s
eigen_tops<std::int16_t>/32                                                    22216 ns        21972 ns        29867 TOP=2.93605G/s
eigen_tops<std::int16_t>/64                                                    65951 ns        66267 ns         8960 TOP=7.84997G/s
eigen_tops<std::int16_t>/128                                                   86190 ns        87193 ns         8960 TOP=47.9157G/s
eigen_tops<std::int16_t>/256                                                  326154 ns       322316 ns         2133 TOP=103.901G/s
eigen_tops<std::int16_t>/512                                                 1524212 ns      1537400 ns          498 TOP=174.433G/s
eigen_tops<std::int16_t>/1024                                              220873467 ns    187500000 ns            3 TOP=11.4477G/s
eigen_tops<std::int16_t>/2048                                             1760802000 ns   1531250000 ns            1 TOP=11.2168G/s
eigen_tops<std::int16_t>/4096                                             3854859100 ns   3312500000 ns            1 TOP=41.4859G/s
eigen_tops<std::int16_t>/8192                                             3.0983e+10 ns   2.6859e+10 ns            1 TOP=40.9334G/s
eigen_tops<std::int16_t>/16384                                            8.4830e+10 ns   7.6594e+10 ns            1 TOP=114.837G/s
eigen_tops<std::int16_t>/32768                                            6.7654e+11 ns   6.0194e+11 ns            1 TOP=116.902G/s
eigen_tops<std::int16_t>/65536                                            2.8277e+12 ns   2.5493e+12 ns            1 TOP=220.821G/s
eigen_tops<std::int16_t>_BigO                                                   0.01 N^3        0.01 N^3
eigen_tops<std::int16_t>_RMS                                                      33 %            33 %
eigen_tops<std::int8_t>/8                                                        479 ns          476 ns      1445161 TOP=2.01797G/s
eigen_tops<std::int8_t>/16                                                      2815 ns         2783 ns       235789 TOP=2.85139G/s
eigen_tops<std::int8_t>/32                                                     20680 ns        20403 ns        34462 TOP=3.1619G/s
eigen_tops<std::int8_t>/64                                                     57506 ns        57199 ns        11200 TOP=9.09448G/s
eigen_tops<std::int8_t>/128                                                    80053 ns        80218 ns         8960 TOP=52.0823G/s
eigen_tops<std::int8_t>/256                                                   314987 ns       314991 ns         2133 TOP=106.317G/s
eigen_tops<std::int8_t>/512                                                  1489098 ns      1506024 ns          498 TOP=178.067G/s
eigen_tops<std::int8_t>/1024                                               218097833 ns    182291667 ns            3 TOP=11.7747G/s
eigen_tops<std::int8_t>/2048                                              1732975600 ns   1546875000 ns            1 TOP=11.1035G/s
eigen_tops<std::int8_t>/4096                                              3755390800 ns   3265625000 ns            1 TOP=42.0814G/s
eigen_tops<std::int8_t>/8192                                              3.0097e+10 ns   2.6297e+10 ns            1 TOP=41.8089G/s
eigen_tops<std::int8_t>/16384                                             8.3572e+10 ns   7.4781e+10 ns            1 TOP=117.621G/s
eigen_tops<std::int8_t>/32768                                             6.6638e+11 ns   5.8508e+11 ns            1 TOP=120.271G/s
eigen_tops<std::int8_t>/65536                                             2.8020e+12 ns   2.5104e+12 ns            1 TOP=224.243G/s
eigen_tops<std::int8_t>_BigO                                                    0.01 N^3        0.01 N^3
eigen_tops<std::int8_t>_RMS                                                       33 %            32 %
pipeline_cpp11_lambdas                                                           247 ns          246 ns      2800000
pipeline_cpp11_std_function                                                     1701 ns         1688 ns       407273
pipeline_cpp20_coroutine<toy_generator_t>                                       1403 ns         1395 ns       448000
pipeline_cpp20_coroutine<cppcoro_generator_t>                                   1498 ns         1475 ns       497778
pipeline_cpp20_coroutine<cppcoro_recursive_generator_t>                         1606 ns         1604 ns       448000
pipeline_cpp20_std_ranges                                                       1232 ns         1228 ns       560000
pipeline_virtual_functions                                                      1400 ns         1413 ns       497778
packaging_custom_pairs/min_time:2.000                                         163634 ns       163588 ns        16906
packaging_stl_pair/min_time:2.000                                             520598 ns       521723 ns         5271
packaging_stl_tuple/min_time:2.000                                            525833 ns       523711 ns         5430
packaging_stl_any/min_time:2.000                                             7368383 ns      7251603 ns          390
construct_string/length=/7                                                      5.71 ns         5.78 ns    100000000
construct_string/length=/8                                                      5.71 ns         5.72 ns    112000000
construct_string/length=/15                                                     5.76 ns         5.72 ns    112000000
construct_string/length=/16                                                     42.1 ns         41.4 ns     16592593
construct_string/length=/22                                                     41.8 ns         41.7 ns     17230769
construct_string/length=/23                                                     43.5 ns         42.6 ns     17230769
construct_string/length=/24                                                     42.9 ns         43.3 ns     16592593
construct_string/length=/25                                                     41.6 ns         41.7 ns     17230769
construct_string/length=/31                                                     40.3 ns         40.5 ns     16592593
construct_string/length=/32                                                     41.1 ns         41.0 ns     17920000
construct_string/length=/33                                                     41.2 ns         41.0 ns     17920000
parse_stl/short_/min_time:2.000                                                  368 ns          367 ns      7791304 bytes_per_second=309.233Mi/s pairs/s=8.17448M/s
parse_ranges/short_/min_time:2.000                                              1324 ns         1324 ns      2159036 bytes_per_second=85.6911Mi/s pairs/s=2.26522M/s
parse_sz/short_/min_time:2.000                                                   209 ns          209 ns     13784615 bytes_per_second=544.131Mi/s pairs/s=14.3839M/s
parse_stl/long_/min_time:2.000                                                  4352 ns         4355 ns       663704 bytes_per_second=335.023Mi/s pairs/s=4.82172M/s
parse_ranges/long_/min_time:2.000                                              17694 ns        17548 ns       155826 bytes_per_second=83.1521Mi/s pairs/s=1.19674M/s
parse_sz/long_/min_time:2.000                                                   3030 ns         3021 ns       853333 bytes_per_second=482.954Mi/s pairs/s=6.95079M/s
parse_regex/short_/min_time:2.000                                              18675 ns        18687 ns       157193 bytes_per_second=6.07299Mi/s pairs/s=107.025k/s
parse_regex/long_/min_time:2.000                                              365048 ns       365004 ns         7791 bytes_per_second=3.99754Mi/s pairs/s=52.0542k/s
parse_ctre/short_/min_time:2.000                                                4305 ns         4285 ns       663704 bytes_per_second=26.4868Mi/s pairs/s=466.781k/s
parse_ctre/long_/min_time:2.000                                                47568 ns        47032 ns        61793 bytes_per_second=31.024Mi/s pairs/s=403.98k/s
json_yyjson<malloc>/min_time:10.000                                              489 ns          488 ns     28903226 bytes_per_second=953.768Mi/s
json_yyjson<malloc>/min_time:10.000/threads:192                                 5780 ns         4446 ns      3518016 bytes_per_second=104.608Mi/s
json_yyjson<arena, prepend>/min_time:10.000                                      377 ns          376 ns     37178423 bytes_per_second=1.20746Gi/s max_alloc=642 mean_alloc=638.5 peak_usage=1.277k
json_yyjson<arena, prepend>/min_time:10.000/threads:192                         2412 ns         2364 ns      5857152 bytes_per_second=196.768Mi/s max_alloc=642 mean_alloc=638.5 peak_usage=1.277k
json_nlohmann<std::allocator, throw>/min_time:10.000                           10780 ns        10769 ns      1298551 bytes_per_second=43.1856Mi/s
json_nlohmann<arena_allocator, throw>/min_time:10.000                           8784 ns         8734 ns      1602862 bytes_per_second=53.2465Mi/s max_alloc=158 mean_alloc=53.1 peak_usage=2.655k
json_nlohmann<std::allocator, noexcept>/min_time:10.000                         8753 ns         8715 ns      1626134 bytes_per_second=53.3645Mi/s
json_nlohmann<arena_allocator, noexcept>/min_time:10.000                        6833 ns         6820 ns      2059770 bytes_per_second=68.1965Mi/s max_alloc=158 mean_alloc=53.1 peak_usage=2.655k
json_nlohmann<std::allocator, throw>/min_time:10.000/threads:192              137194 ns       124674 ns       192000 bytes_per_second=3.72993Mi/s
json_nlohmann<arena_allocator, throw>/min_time:10.000/threads:192              64747 ns        60465 ns       192000 bytes_per_second=7.69078Mi/s max_alloc=158 mean_alloc=53.1 peak_usage=2.655k
json_nlohmann<std::allocator, noexcept>/min_time:10.000/threads:192           124543 ns       123254 ns       125376 bytes_per_second=3.77293Mi/s
json_nlohmann<arena_allocator, noexcept>/min_time:10.000/threads:192           47922 ns        47118 ns       290496 bytes_per_second=9.86982Mi/s max_alloc=158 mean_alloc=53.1 peak_usage=2.655k
graph_make<std::unordered_maps>/min_time:10.000                                85991 us        85490 us          157
graph_make<std::map>/min_time:10.000                                          151168 us       150645 us           92
graph_make<absl::flat_set>/min_time:10.000                                    141170 us       140137 us           96
graph_rank<std::unordered_maps>/min_time:10.000                                50137 us        50032 us          292
graph_rank<std::map>/min_time:10.000                                           70341 us        69926 us          202
graph_rank<absl::flat_set>/min_time:10.000                                       225 us          225 us        66866
errors_throw/min_time:2.000                                                     1119 ns         1111 ns      2560000
errors_throw/min_time:2.000/threads:192                                        29527 ns       128581 ns        19200
errors_variants/min_time:2.000                                                  27.8 ns         27.8 ns     99555556
errors_variants/min_time:2.000/threads:192                                      2488 ns         2824 ns      1920000
errors_with_status/min_time:2.000                                               16.6 ns         16.6 ns    165925926
errors_with_status/min_time:2.000/threads:192                                   2528 ns         1481 ns      1920000
log_printf/min_time:2.000                                                        528 ns          530 ns      5780645 bytes_per_second=189.612Mi/s

less_slow.cpp Outdated
@@ -525,7 +552,7 @@ static void sorting_with_openmp(bm::State &state) {

#pragma omp parallel for
// Sort each chunk in parallel
- for (std::size_t i = 0; i < chunks; i++) {
+ for (int64_t i = 0; i < chunks; i++) {
Owner

Why switch to int64_t here and in the lines below?

Contributor Author

I mentioned it in the notes, but, at least on MSVC, OpenMP only allows signed integers as the index of a parallel for loop.
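A minimal sketch of the signed-index pattern described above, not the code from less_slow.cpp. The function name `sort_chunks_sketch` and the chunking scheme are hypothetical. The loop index is `std::int64_t`, so the same source is meant to be accepted by compilers whose OpenMP implementation requires a signed loop variable, which is the MSVC behaviour reported in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sorts each fixed-size chunk of `values` independently.
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
void sort_chunks_sketch(std::vector<int> &values, std::size_t chunk_size) {
    assert(chunk_size > 0 && "chunk_size must be positive");
    auto const count = static_cast<std::int64_t>(values.size());
    auto const step = static_cast<std::int64_t>(chunk_size);
    auto const chunks = (count + step - 1) / step; // round up

    // Signed index: required by MSVC's OpenMP according to the discussion above.
#pragma omp parallel for
    for (std::int64_t i = 0; i < chunks; i++) {
        auto const first = i * step;
        auto const last = std::min(first + step, count);
        std::sort(values.begin() + first, values.begin() + last);
    }
}
```

Each iteration touches a disjoint range, so no synchronization is needed between chunks.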

less_slow.cpp Outdated
@@ -2245,6 +2316,9 @@ BENCHMARK(cblas_tops<double>)->RangeMultiplier(2)->Range(8, 65536)->Complexity(b

template <typename scalar_type_>
static void eigen_tops(bm::State &state) {
// Make sure Eigen uses all cores - also, Eigen can't multithread without openMP or GEMM
Owner

What's GEMM?

Contributor Author

General Matrix-Matrix product. I guess the more correct comment would refer to the internal GEMM thread pool that Eigen manages. There seems to be a way to make Eigen use that instead of OpenMP (an EIGEN_GEMM_THREADPOOL flag is checked), but I'm not too familiar with Eigen, so I'm not sure how that would be done.

less_slow.cpp Outdated
Comment on lines 3396 to 3402
#if defined(_MSC_VER)
// MSVC does not implement std::regex_constants::multiline yet
auto regex_options = std::regex_constants::ECMAScript;
#else
// Use multiline mode so ^ and $ anchor to line breaks.
auto regex_options = std::regex_constants::ECMAScript | std::regex_constants::multiline;
#endif
Owner

Maybe replace this with a simple if or ternary condition as opposed to #if macro?

Contributor Author

I'm not sure how that would work at runtime, because MSVC does not define std::regex_constants::multiline at all, so any mention of it would not compile.
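To illustrate the point: a runtime `if` or ternary still needs both branches to compile, so the choice has to happen in the preprocessor. One possible compromise, sketched below, is to confine the `#if` to a single small helper. The names `line_anchor_options_sketch` and `has_line_equal_to_b_sketch` are hypothetical, and the sketch assumes a non-MSVC standard library that provides `std::regex_constants::multiline` (a C++17 addition).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: the only place that knows about the compiler difference.
inline std::regex_constants::syntax_option_type line_anchor_options_sketch() {
#if defined(_MSC_VER)
    // Per the discussion above, MSVC does not define `multiline`,
    // so even naming it here would fail to compile.
    return std::regex_constants::ECMAScript;
#else
    // Let ^ and $ anchor at line breaks.
    return std::regex_constants::ECMAScript | std::regex_constants::multiline;
#endif
}

// Hypothetical usage: true if some line of `text` is exactly "b".
inline bool has_line_equal_to_b_sketch(std::string const &text) {
    std::regex const pattern("^b$", line_anchor_options_sketch());
    return std::regex_search(text, pattern);
}
```

Call sites then stay free of macros, at the cost of one extra function.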

less_slow.cpp Outdated
Comment on lines 5088 to 5092
// On MSVC, high_resolution_clock is steady_clock, which cannot have to_time_t applied to it
auto now = std::chrono::system_clock::now();
#else
auto now = std::chrono::high_resolution_clock::now();
#endif
Owner

This is probably a good place to add some notes on how std::chrono ambiguously wraps different system-level APIs without providing many guarantees 🤔

Contributor Author

True, I'll make some notes about that, and about how high_resolution_clock is implementation-defined.
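As a rough sketch of what such notes might boil down to: `to_time_t` is only provided by `std::chrono::system_clock`, and `high_resolution_clock` may be an alias for either `system_clock` or `steady_clock` depending on the implementation. Picking the clock by purpose avoids the ambiguity. The function names below are hypothetical and not taken from less_slow.cpp.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Wall-clock timestamps (for logs, file names): system_clock is the only
// standard clock that guarantees a conversion to std::time_t.
inline std::time_t wall_clock_seconds_sketch() {
    return std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
}

// Interval measurements: steady_clock never goes backwards.
template <typename callable_>
std::chrono::nanoseconds measure_sketch(callable_ &&work) {
    auto const start = std::chrono::steady_clock::now();
    work();
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}

static_assert(std::chrono::steady_clock::is_steady, "steady_clock is monotonic by definition");
```

With this split, no `#if defined(_MSC_VER)` is needed around the timestamp code.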

Comment on lines +39 to +49
# Fetch and build OpenBLAS
FetchContent_Declare(
OpenBLAS
GIT_REPOSITORY https://github.com/xianyi/OpenBLAS.git
GIT_TAG v0.3.29
)

# Set OpenBLAS build options
set(NOFORTRAN ON CACHE BOOL "Disable Fortran" FORCE)
set(BUILD_WITHOUT_LAPACK OFF CACHE BOOL "Build without LAPACK" FORCE)
set(USE_THREAD ON CACHE BOOL "Use threading" FORCE)
Owner

In general, mapping BLAS to OpenBLAS on all platforms is probably a good idea. Agreed 🤝

@ashvardanian
Owner

Hi @RazielXYZ! This is awesome, epic job and an amazing PR 🔥
My nitpicks are mostly about styling and wording - super minor stuff.

Minor Questions

  • I'm curious if we have to use int64_t in OpenMP-powered sorts?
  • As I don't have much experience with MSVC, I'm curious about the O2, Ob2, Oi, Ot, GL and other flags. Do we need them all?

Future Work

Diving into libraries (and perhaps implementations) for optimized parallel and lockfree containers would also be of interest, I think, as well as exemplifying usage of queues (SPSC, MPMC, etc) and well-optimized asynchronous event loops and reactors.

Yes, that can be discussed in #14.

@RazielXYZ
Contributor Author

@ashvardanian thanks for giving this such an in-depth look!

On MSVC, OpenMP explicitly requires signed integers for parallel for loops. I don't think there's a way around it, but we could ifdef to just MSVC if wanted (although, in the case where these are used, the upper limit will realistically never be high enough to require the extra bit of unsigned, unless we expect to get close to ten quintillion cores any time soon, so I'm not sure why we would use unsigned there).

For the flags, they are not all necessary, but that is the collection of flags that results in maximum optimizations for speed. Realistically we could leave just O2 and GL.

@ashvardanian
Owner

On MSVC, OpenMP explicitly requires signed integers for parallel for loops. I don't think there's a way around it, but we could ifdef to just MSVC if wanted (although, in the case where these are used, the upper limit will realistically never be high enough to require the extra bit of unsigned, unless we expect to get close to ten quintillion cores any time soon, so I'm not sure why we would use unsigned there).

Wow, didn't know. Is it OK if we use std::int64_t for consistency with the rest of the repo?

@RazielXYZ
Contributor Author

Wow, didn't know. Is it OK if we use std::int64_t for consistency with the rest of the repo?

Of course, I'll change that

@ashvardanian
Owner

I've polished some little things and pushed to your branch. I hope that's fine 🤗

I would suggest considering reducing the default highest range for the TOPs benchmarks to something lower than 65536, as that requires slightly over 100GB of RAM, and takes quite long even on very fast machines like my Epyc 9654.

Agreed! Maybe we should dynamically fetch the RAM volume and skip all the benchmarks that require more than a quarter of RAM? Or just set a lower number?
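A back-of-the-envelope sketch of the "quarter of RAM" rule. It assumes each TOPs benchmark holds three dense square matrices (two inputs and one output), which is an assumption about the benchmark layout, although it matches the "slightly over 100GB" figure for 65536 with `double`. The function names are hypothetical, and the total RAM is passed in as a parameter because querying it is platform-specific.

```cpp
#include <cassert>
#include <cstdint>

// Assumed footprint: three side x side matrices of `scalar_bytes`-sized elements.
constexpr std::uint64_t gemm_bytes_sketch(std::uint64_t side, std::uint64_t scalar_bytes) {
    return 3 * side * side * scalar_bytes;
}

// True if the assumed footprint fits into a quarter of the machine's RAM.
constexpr bool fits_in_quarter_of_ram_sketch(std::uint64_t side, std::uint64_t scalar_bytes,
                                             std::uint64_t total_ram_bytes) {
    return gemm_bytes_sketch(side, scalar_bytes) <= total_ram_bytes / 4;
}

// 65536 x 65536 doubles: 3 * 32 GiB = 96 GiB, roughly 103 GB.
static_assert(gemm_bytes_sketch(65536, 8) == 103079215104ULL, "3 * 65536^2 * 8 bytes");
```

A benchmark could call such a predicate up front and skip the oversized configurations, though that only addresses memory and not the long runtime.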

@ashvardanian
Owner

As for your benchmarks, the higher-level abstractions and bigger tasks definitely look interesting.

Pipelines & Metaprogramming

pipeline_cpp11_lambdas                                                           247 ns          246 ns      2800000
pipeline_cpp11_std_function                                                     1701 ns         1688 ns       407273
pipeline_cpp20_coroutine<toy_generator_t>                                       1403 ns         1395 ns       448000
pipeline_cpp20_coroutine<cppcoro_generator_t>                                   1498 ns         1475 ns       497778
pipeline_cpp20_coroutine<cppcoro_recursive_generator_t>                         1606 ns         1604 ns       448000
pipeline_cpp20_std_ranges                                                       1232 ns         1228 ns       560000
pipeline_virtual_functions                                                      1400 ns         1413 ns       497778

On GCC, std_ranges perform the best. It's remarkable how poorly they execute here 🤔 Maybe worth checking if @ericniebler has any notes on using ranges on Windows. Seems like MSVC struggles with heavy template meta-programming, including the following numbers for CTRE parsers:

parse_stl/short_/min_time:2.000                                                  368 ns          367 ns      7791304 bytes_per_second=309.233Mi/s pairs/s=8.17448M/s
parse_ranges/short_/min_time:2.000                                              1324 ns         1324 ns      2159036 bytes_per_second=85.6911Mi/s pairs/s=2.26522M/s
parse_sz/short_/min_time:2.000                                                   209 ns          209 ns     13784615 bytes_per_second=544.131Mi/s pairs/s=14.3839M/s
parse_stl/long_/min_time:2.000                                                  4352 ns         4355 ns       663704 bytes_per_second=335.023Mi/s pairs/s=4.82172M/s
parse_ranges/long_/min_time:2.000                                              17694 ns        17548 ns       155826 bytes_per_second=83.1521Mi/s pairs/s=1.19674M/s
parse_sz/long_/min_time:2.000                                                   3030 ns         3021 ns       853333 bytes_per_second=482.954Mi/s pairs/s=6.95079M/s
parse_regex/short_/min_time:2.000                                              18675 ns        18687 ns       157193 bytes_per_second=6.07299Mi/s pairs/s=107.025k/s
parse_regex/long_/min_time:2.000                                              365048 ns       365004 ns         7791 bytes_per_second=3.99754Mi/s pairs/s=52.0542k/s
parse_ctre/short_/min_time:2.000                                                4305 ns         4285 ns       663704 bytes_per_second=26.4868Mi/s pairs/s=466.781k/s
parse_ctre/long_/min_time:2.000                                                47568 ns        47032 ns        61793 bytes_per_second=31.024Mi/s pairs/s=403.98k/s

TOPs

Assuming the code was compiled with AVX2, I was expecting to see the numbers for tops_f64_avx2ma_asm_kernel, tops_f64_avx2fma_asm_kernel, tops_f32_avx2ma_asm_kernel, and tops_f32_avx2fma_asm_kernel kernels. Are those available? Is the following a valid condition on MSVC?

#if defined(__AVX2__)

@RazielXYZ
Contributor Author

I've polished some little things and pushed to your branch. I hope that's fine 🤗

Perfectly fine! I've merged it with my new changes.

In particular, I changed the windows code for physical_cores to get actual physical_cores - the GetActiveProcessorCount implementation gets logical cores, and I realize now that we specifically want physical ones. Somewhat unfortunately, that is far more complicated than getting logical ones!

Also, I appreciate sorting includes in logical and alphabetical order as much as the next guy, but including WinBase.h before Windows.h does not actually compile. Generally, Windows.h must always be included before any other Windows header.
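For readers wondering why physical cores are harder to get than logical ones, here is a hedged sketch of one way to count them on Windows with `GetLogicalProcessorInformationEx`. It is not the implementation from this PR. The name `count_physical_cores_sketch` is hypothetical, and the non-Windows fallback returns the logical count, so it is only a placeholder there.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>
#if defined(_WIN32)
#include <Windows.h> // must precede any other Windows header
#endif

// Hypothetical helper: counts RelationProcessorCore records on Windows.
std::size_t count_physical_cores_sketch() {
#if defined(_WIN32)
    DWORD length = 0;
    // First call only reports the required buffer size.
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &length);
    if (length == 0) return std::thread::hardware_concurrency();

    std::vector<unsigned char> buffer(length);
    auto *first = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data());
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, first, &length))
        return std::thread::hardware_concurrency();

    // Records are variable-sized, so walk the buffer using each record's Size.
    std::size_t cores = 0;
    for (DWORD offset = 0; offset < length;) {
        auto *info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data() + offset);
        assert(info->Size > 0);
        if (info->Relationship == RelationProcessorCore) ++cores;
        offset += info->Size;
    }
    return cores;
#else
    // Placeholder: this is the logical count, not the physical one.
    return std::thread::hardware_concurrency();
#endif
}
```

The two-call pattern (query the size, then fill the buffer) is what makes this longer than a single `GetActiveProcessorCount` call.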

@RazielXYZ
Contributor Author

TOPs

Assuming the code was compiled with AVX2, I was expecting to see the numbers for tops_f64_avx2ma_asm_kernel, tops_f64_avx2fma_asm_kernel, tops_f32_avx2ma_asm_kernel, and tops_f32_avx2fma_asm_kernel kernels. Are those available? Is the following a valid condition on MSVC?

#if defined(__AVX2__)

The condition is valid, but I have ifdef'd out the benchmarks for asm kernels from MSVC, because I don't know how (or if) the ASM files can be linked in with MSVC. Just adding the file to the project like CMake is doing does not actually get the contents linked in by the linker. I can look into that further later, as I imagine it should be possible.

@ashvardanian
Owner

Include order was changed by clang-format. We can either introduce a gap line or disable clang-format for that section of includes 🤔

@RazielXYZ
Contributor Author

Include order was changed by clang-format. We can either introduce a gap line or disable clang-format for that section of includes 🤔

Fair enough! I usually keep include re-ordering disabled in my formatters since I've been bitten a few times (and not just by windows headers).

@RazielXYZ
Contributor Author

Agreed! Maybe we should dynamically fetch the RAM volume and skip all the benchmarks that require more than a quarter of RAM? Or just set a lower number?

That's a good approach, although even if one has the RAM, it's still a long runtime.

@ashvardanian
Owner

Fair enough! I usually keep include re-ordering disabled in my formatters since I've been bitten a few times (and not just by windows headers).

That's a good idea, worth patching in my other projects as well!

The condition is valid, but I have ifdef'd out the benchmarks for asm kernels from MSVC, because I don't know how (or if) the ASM files can be linked in with MSVC. Just adding the file to the project like CMake is doing does not actually get the contents linked in by the linker. I can look into that further later, as I imagine it should be possible.

Yes, that may have practical importance beyond this repo as well. In SimSIMD I am tired of dealing with compiler compatibility problems for SIMD intrinsics and considering a shift towards pure Assembly.

ashvardanian added a commit that referenced this pull request Jan 26, 2025
Disables automatic lexicographic ordering
in favor of manual, putting lighter includes
first.

#28 (comment)

Co-authored-by: RazielXYZ <[email protected]>
@ashvardanian
Owner

@RazielXYZ, I've resolved sorting includes. LMK if there is smth else you want to add to this PR or is it good to go?

@RazielXYZ
Contributor Author

@ashvardanian Should be good to go, I can make a new PR for when I investigate getting the ASM kernels to work on msvc, and whatever else I can find/improve!

@RazielXYZ
Contributor Author

I also just remembered, I wanted to also point out that this repo has no license - I assume it should be Apache 2 like the others?

@ashvardanian
Owner

Will add a license soon.

@ashvardanian ashvardanian changed the title from "Improve/add MSVC support, and other notes" to "Porting to MSVC" Jan 26, 2025
@ashvardanian ashvardanian merged commit dcb065b into ashvardanian:main Jan 26, 2025
@ashvardanian
Owner

ashvardanian commented Jan 26, 2025

I'll have to revert BLAS settings, as it fails to compile on Linux 🤦‍♂️
I'll add this patch, but have no way of checking if it works.

@RazielXYZ
Contributor Author

RazielXYZ commented Jan 26, 2025

I'll have to revert BLAS settings, as it fails to compile on Linux 🤦‍♂️ I'll add this patch, but have no way of checking if it works.

I've built and ran the current version (with FetchContent openBLAS) on Linux too, what's it failing on?

@ashvardanian
Owner

It's failing on AVX-512 intrinsics for me.

@RazielXYZ
Contributor Author

@ashvardanian odd, which ones? The only ones I couldn't build/test here are the intel specific ones, since I'm on AMD.

@ashvardanian
Owner

I am on Sapphire Rapids. Can try again during the week.

@Cosmin-B

@RazielXYZ, This is an amazing job, thank you very much; I've been looking at trying to properly get this to work on MSVC, too, for a bit of time, but I noticed this last night, and I was so happy.

I am running the tests on this over the weekend, as I have both AMD and Intel Windows workstations.
