Deep Learning Benchmark Manual CI Report #163

Open
xlinsist opened this issue Jan 14, 2025 · 1 comment

Motivation

Since PR #162 replaced llc with clang, I re-evaluated the performance of the deep learning benchmarks and am posting the results in this issue. I will update this issue whenever relevant changes (especially performance-related ones) are merged, so it can temporarily serve as a CI report.

Benchmark Testing Summary (After PR #162)

X86(Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz)

  • All models and operations work fine.

RISC-V(SpacemiT K1)

  • Models:
    • LLaMA: Successfully runs.
    • Whisper: Successfully runs but is killed during correctness verification.
  • Operations:
    • Running dl-op-linalg-reduceaddf-benchmark or dl-op-linalg-reducemaxf-benchmark causes a segmentation fault.

These issues are not related to the changes introduced in PR #162 (replacing llc with clang). Instead, they are bugs introduced during cross-platform migration and can be resolved in subsequent PRs.
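
The failures summarized above can also be picked out of a captured log mechanically. The following is a minimal sketch, not the project's tooling: the function name `find_failures` and the list of failure markers are assumptions chosen to match the messages that appear in this report, and the input is assumed to be Google Benchmark console output concatenated into one text file.

```python
import re

# Strips ANSI color sequences such as "\x1b[32m" that the benchmarks print around PASS/FAIL.
ANSI = re.compile(r"\x1b\[[0-9;]*m")
# Markers treated as failures; this list is an assumption based on the messages in this report.
FAILURE_MARKERS = ("FAIL", "Killed", "Segmentation fault")


def find_failures(log_text):
    """Return (benchmark, line) pairs for every log line that contains a failure marker."""
    current = None  # name of the benchmark whose output is being scanned
    failures = []
    for raw in log_text.splitlines():
        line = ANSI.sub("", raw).strip()
        if line.startswith("Running ./"):
            current = line[len("Running ./"):]
        elif any(marker in line for marker in FAILURE_MARKERS):
            failures.append((current, line))
    return failures


sample = """Running ./dl-model-whisper-benchmark
Correctness Verification for Output1: \x1b[32mPASS\x1b[0m
Correctness Verification for Output2: \x1b[31mFAIL\x1b[0m
Running ./dl-layer-ffn-benchmark
Correctness Verification: PASS
"""
for name, line in find_failures(sample):
    print(f"{name}: {line}")
```

A benchmark that crashes before printing anything (as with a segmentation fault reported only by the shell) would need the shell's stderr captured into the same log for this scan to see it.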

Benchmark Testing Results (TinyLlama and Whisper)

X86

2025-01-13T12:21:43+00:00
Running ./dl-model-tinyllama-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.26, 14.39, 8.40
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar         374448 ms       374445 ms            1
DL_MODEL_TINYLLAMA/matmul_opt     371304 ms       371280 ms            1
---------- Verification ----------
matmul_opt PASS
2025-01-13T12:46:04+00:00
Running ./dl-model-tinyllama-benchmark-before-fusion
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 23.97, 23.94, 20.78
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             339031 ms       338973 ms            1
DL_MODEL_TINYLLAMA/matmul_opt         330564 ms       330444 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp     326282 ms       324553 ms            1
---------- Verification ----------
matmul_opt PASS
matmul_opt_omp PASS
2025-01-13T13:22:34+00:00
Running ./dl-model-whisper-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 8.93, 32.28, 42.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_MODEL_Whisper/Auto_Vectorization      188571 ms       188525 ms            1
DL_MODEL_Whisper/Buddy_Vectorization      71893 ms        71879 ms            1
-----------------------------------------------------------
Correctness Verification for Output1: PASS
Correctness Verification for Output2: FAIL
-----------------------------------------------------------

RISC-V

Executing dl-model-tinyllama-benchmark at Mon Jan 13 22:36:07 CST 2025
2025-01-13T22:36:12+08:00
Running ./dl-model-tinyllama-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.77, 2.38
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar         650432 ms       650353 ms            1
DL_MODEL_TINYLLAMA/matmul_opt     646291 ms       646229 ms            1
---------- Verification ----------
matmul_opt PASS

real    43m15.453s
user    43m9.710s
sys     0m5.477s
Executing dl-model-whisper-benchmark at Mon Jan 13 23:19:22 CST 2025
2025-01-13T23:19:22+08:00
Running ./dl-model-whisper-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_MODEL_Whisper/Auto_Vectorization     4379395 ms      4379140 ms            1
DL_MODEL_Whisper/Buddy_Vectorization     957599 ms       957515 ms            1
bash: line 1: 142223 Killed                  ./"$file"

real    164m48.164s
user    164m23.712s
sys     0m23.662s
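
The tables above invite a direct comparison between variants. As a rough sketch (the helper names `parse_times` and `speedup` are hypothetical, and the row pattern assumes the Google Benchmark console layout shown in these logs with times reported in ms), the speedup of one variant over another can be computed from the pasted rows:

```python
import re

# Matches one Google Benchmark console row: name, real time, CPU time, iterations.
# Assumes both times are reported in ms, as in the logs in this report.
ROW = re.compile(r"^(\S+)\s+([\d.]+) ms\s+([\d.]+) ms\s+(\d+)$")


def parse_times(table_text):
    """Map each benchmark name to its reported real time in milliseconds."""
    times = {}
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


def speedup(times, baseline, candidate):
    """Return how many times faster `candidate` ran compared with `baseline`."""
    return times[baseline] / times[candidate]


# Rows copied from the RISC-V Whisper run above.
riscv_whisper = """
DL_MODEL_Whisper/Auto_Vectorization     4379395 ms      4379140 ms            1
DL_MODEL_Whisper/Buddy_Vectorization     957599 ms       957515 ms            1
"""
times = parse_times(riscv_whisper)
ratio = speedup(times, "DL_MODEL_Whisper/Auto_Vectorization",
                "DL_MODEL_Whisper/Buddy_Vectorization")
print(f"Buddy vs. auto vectorization on RISC-V: {ratio:.2f}x")
```

Because every model benchmark here ran for a single iteration, ratios derived this way are indicative only; the x86 runs additionally carry the CPU-scaling warning printed in the logs.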

Benchmark Testing Results (Others)

X86

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark$ cd /home/zhouxulin/intern/buddy-benchmark/build-x86/bin
(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ls
dl-layer-ffn-benchmark                      dl-op-linalg-batch-matmul-benchmark
dl-layer-rmsnorm-benchmark                  dl-op-linalg-conv2d-nchw-fchw-benchmark
dl-layer-selfattention-benchmark            dl-op-linalg-conv2d-nhwc-fhwc-benchmark
dl-model-bert-benchmark                     dl-op-linalg-conv2d-nhwc-hwcf-benchmark
dl-model-lenet-benchmark                    dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
dl-model-mobilenetv3-benchmark              dl-op-linalg-mathexp-benchmark
dl-model-resnet18-benchmark                 dl-op-linalg-mathfpow-benchmark
dl-model-tinyllama-benchmark                dl-op-linalg-mathrsqrt-benchmark
dl-model-tinyllama-benchmark-before-fusion  dl-op-linalg-matmul-benchmark
dl-model-whisper-benchmark                  dl-op-linalg-pooling-nhwc-sum-benchmark
dl-op-linalg-arithaddf-benchmark            dl-op-linalg-reduceaddf-benchmark
dl-op-linalg-arithdivf-benchmark            dl-op-linalg-reducemaxf-benchmark
dl-op-linalg-arithmulf-benchmark            dl-op-linalg-softmax-exp-sum-div-benchmark
dl-op-linalg-arithnegf-benchmark            log.txt
dl-op-linalg-arithsubf-benchmark


2025-01-13T12:21:13+00:00
Running ./dl-layer-ffn-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.79, 13.42, 7.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
DL_LAYER_FFN/Scalar                  0.176 ms        0.176 ms         3965
DL_LAYER_FFN/Auto_Vectorization      0.084 ms        0.084 ms         8171
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:15+00:00
Running ./dl-layer-rmsnorm-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.73, 13.60, 7.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
DL_LAYER_RMSNORM/Scalar                  0.005 ms        0.005 ms       125424
DL_LAYER_RMSNORM/Auto_Vectorization      0.002 ms        0.002 ms       311591
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:17+00:00
Running ./dl-layer-selfattention-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.73, 13.60, 7.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_LAYER_ATTENTION/Scalar                   12.9 ms         12.9 ms           56
DL_LAYER_ATTENTION/Auto_Vectorization       4.27 ms         4.27 ms          167
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:19+00:00
Running ./dl-model-bert-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.59, 13.76, 8.07
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_BERT/Auto_Vectorization        2447 ms         2446 ms            1
DL_MODEL_BERT/Buddy_Vectorization        656 ms          656 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:27+00:00
Running ./dl-model-lenet-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.62, 13.94, 8.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
DL_MODEL_LENET/Auto_Vectorization       0.565 ms        0.564 ms         1246
DL_MODEL_LENET/Buddy_Vectorization      0.581 ms        0.581 ms         1216
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T12:21:28+00:00
Running ./dl-model-mobilenetv3-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.62, 13.94, 8.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_MobileNet_V3/BM_MobileNet_V3_scalar          127 ms          127 ms            5
BM_MobileNet_V3/BM_MobileNet_V3_conv_opt        105 ms          105 ms            8
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T12:21:31+00:00
Running ./dl-model-resnet18-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.49, 14.09, 8.24
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_MODEL_Resnet18/Auto_Vectorization        2573 ms         2572 ms            1
DL_MODEL_Resnet18/Buddy_Vectorization       2533 ms         2533 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------


2025-01-13T13:31:12+00:00
Running ./dl-op-linalg-arithaddf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.06, 9.88, 26.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ADDF_SCALAR                 0.075 ms        0.075 ms         8983
BM_ADDF_AutoVectorization      0.011 ms        0.011 ms        60839
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:14+00:00
Running ./dl-op-linalg-arithdivf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_DIVF_SCALAR                 0.075 ms        0.075 ms         8491
BM_DIVF_AutoVectorization      0.014 ms        0.014 ms        50687
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:16+00:00
Running ./dl-op-linalg-arithmulf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_MULF_SCALAR                 0.075 ms        0.075 ms         8874
BM_MULF_AutoVectorization      0.011 ms        0.011 ms        60732
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:17+00:00
Running ./dl-op-linalg-arithnegf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_NEGF_SCALAR                 0.052 ms        0.052 ms        12582
BM_NEGF_AutoVectorization      0.006 ms        0.006 ms       114656
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:19+00:00
Running ./dl-op-linalg-arithsubf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_SUBF_SCALAR                 0.075 ms        0.075 ms         9301
BM_SUBF_AutoVectorization      0.012 ms        0.012 ms        60480
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:21+00:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                  26.8 ms         26.8 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5       6.22 ms         6.22 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5          0.922 ms        0.922 ms            5
DL_OPS_BATCH_MATMUL/Tile/iterations:5                   0.335 ms        0.335 ms            5
DL_OPS_BATCH_MATMUL/SCF/iterations:5                    0.473 ms        0.473 ms            5
---------- Verification ----------
Tile PASS
SCF PASS
2025-01-13T13:31:21+00:00
Running ./dl-op-linalg-conv2d-nchw-fchw-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Conv2DNchwFchw_SCALAR        712 ms          712 ms            1
BM_Conv2DNchwFchw_Im2col       11.3 ms         11.3 ms           61
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:24+00:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                    136 ms          136 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5       28.7 ms         28.7 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            5.84 ms         5.84 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vec_tile/iterations:5                 5.67 ms         5.67 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
vec_tile PASS
2025-01-13T13:31:25+00:00
Running ./dl-op-linalg-conv2d-nhwc-hwcf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_CONV_2D_NHWC_HWCF_SCALAR                  69.3 ms         69.3 ms            9
BM_CONV_2D_NHWC_HWCF_AutoVectorization       14.9 ms         14.9 ms           47
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:27+00:00
Running ./dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/scalar/iterations:5                   12.6 ms         12.6 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/auto_vectorization/iterations:5       4.11 ms         4.10 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/vectorization/iterations:5           0.220 ms        0.220 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
2025-01-13T13:31:27+00:00
Running ./dl-op-linalg-mathexp-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_EXP_SCALAR                 0.106 ms        0.106 ms         6404
BM_EXP_AutoVectorization      0.060 ms        0.060 ms        12277
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:29+00:00
Running ./dl-op-linalg-mathfpow-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FPOW_SCALAR                 0.206 ms        0.206 ms         3410
BM_FPOW_AutoVectorization      0.131 ms        0.131 ms         5390
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:31+00:00
Running ./dl-op-linalg-mathrsqrt-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_RSQRT_SCALAR                 0.106 ms        0.106 ms         6598
BM_RSQRT_AutoVectorization      0.027 ms        0.027 ms        25532
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:33+00:00
Running ./dl-op-linalg-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:5        108 ms          108 ms            5
DL_OPS_MATMUL/scalar_O3/iterations:5       35.2 ms         35.2 ms            5
DL_OPS_MATMUL/tile/iterations:5            3.03 ms         3.03 ms            5
---------- Verification ----------
tile PASS
2025-01-13T13:31:33+00:00
Running ./dl-op-linalg-pooling-nhwc-sum-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_POOLING_NHWC_SUM_SCALAR                 0.581 ms        0.581 ms         1091
BM_POOLING_NHWC_SUM_AutoVectorization      0.083 ms        0.083 ms         8462
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-pooling-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-reducemaxf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-softmax-exp-sum-div-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_SOFTMAXEXPSUMDIV_SCALAR                 0.015 ms        0.015 ms        47706
BM_SOFTMAXEXPSUMDIV_AutoVectorization      0.009 ms        0.009 ms        86129
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

RISC-V

muse-pi-3% ls
dl-layer-ffn-benchmark                   dl-op-linalg-conv2d-nchw-fchw-benchmark
dl-layer-rmsnorm-benchmark               dl-op-linalg-conv2d-nhwc-fhwc-benchmark
dl-layer-selfattention-benchmark         dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv
dl-model-bert-benchmark                  dl-op-linalg-conv2d-nhwc-hwcf-benchmark
dl-model-lenet-benchmark                 dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
dl-model-mobilenetv3-benchmark           dl-op-linalg-mathexp-benchmark
dl-model-resnet18-benchmark              dl-op-linalg-mathfpow-benchmark
dl-model-tinyllama-benchmark             dl-op-linalg-mathrsqrt-benchmark
dl-model-whisper-benchmark               dl-op-linalg-matmul-benchmark
dl-op-linalg-arithaddf-benchmark         dl-op-linalg-matmul-benchmark-rvv
dl-op-linalg-arithdivf-benchmark         dl-op-linalg-pooling-nhwc-sum-benchmark
dl-op-linalg-arithmulf-benchmark         dl-op-linalg-reduceaddf-benchmark
dl-op-linalg-arithnegf-benchmark         dl-op-linalg-reducemaxf-benchmark
dl-op-linalg-arithsubf-benchmark         dl-op-linalg-softmax-exp-sum-div-benchmark
dl-op-linalg-batch-matmul-benchmark      log.txt
dl-op-linalg-batch-matmul-benchmark-rvv  output.log
muse-pi-3% cat log.txt
Executing dl-layer-ffn-benchmark at Mon Jan 13 22:29:15 CST 2025
2025-01-13T22:29:15+08:00
Running ./dl-layer-ffn-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.00, 2.00, 2.00
--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
DL_LAYER_FFN/Scalar                   1.29 ms         1.29 ms          530
DL_LAYER_FFN/Auto_Vectorization      0.777 ms        0.777 ms          945
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m1.779s
user    0m1.539s
sys     0m0.240s
Executing dl-layer-rmsnorm-benchmark at Mon Jan 13 22:29:17 CST 2025
2025-01-13T22:29:17+08:00
Running ./dl-layer-rmsnorm-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.08, 2.02, 2.01
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
DL_LAYER_RMSNORM/Scalar                  0.029 ms        0.029 ms        24437
DL_LAYER_RMSNORM/Auto_Vectorization      0.009 ms        0.009 ms        78021
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m1.977s
user    0m1.305s
sys     0m0.673s
Executing dl-layer-selfattention-benchmark at Mon Jan 13 22:29:19 CST 2025
2025-01-13T22:29:19+08:00
Running ./dl-layer-selfattention-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.08, 2.02, 2.01
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_LAYER_ATTENTION/Scalar                    101 ms          101 ms            7
DL_LAYER_ATTENTION/Auto_Vectorization       40.9 ms         40.9 ms           16
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m3.009s
user    0m2.892s
sys     0m0.116s
Executing dl-model-bert-benchmark at Mon Jan 13 22:29:22 CST 2025
2025-01-13T22:29:22+08:00
Running ./dl-model-bert-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.15, 2.03, 2.01
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_BERT/Auto_Vectorization      158150 ms       158139 ms            1
DL_MODEL_BERT/Buddy_Vectorization      18938 ms        18936 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    5m56.495s
user    5m51.783s
sys     0m4.685s
Executing dl-model-lenet-benchmark at Mon Jan 13 22:35:19 CST 2025
2025-01-13T22:35:19+08:00
Running ./dl-model-lenet-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
DL_MODEL_LENET/Auto_Vectorization        2.56 ms         2.56 ms          267
DL_MODEL_LENET/Buddy_Vectorization       2.45 ms         2.45 ms          286
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.050s
user    0m1.646s
sys     0m0.404s
Executing dl-model-mobilenetv3-benchmark at Mon Jan 13 22:35:21 CST 2025
2025-01-13T22:35:21+08:00
Running ./dl-model-mobilenetv3-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_MobileNet_V3/BM_MobileNet_V3_scalar          778 ms          778 ms            1
BM_MobileNet_V3/BM_MobileNet_V3_conv_opt        929 ms          929 ms            1
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m4.056s
user    0m3.535s
sys     0m0.521s
Executing dl-model-resnet18-benchmark at Mon Jan 13 22:35:25 CST 2025
2025-01-13T22:35:25+08:00
Running ./dl-model-resnet18-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_MODEL_Resnet18/Auto_Vectorization       10344 ms        10343 ms            1
DL_MODEL_Resnet18/Buddy_Vectorization      10239 ms        10238 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m41.960s
user    0m40.787s
sys     0m1.168s

Executing dl-op-linalg-arithaddf-benchmark at Tue Jan 14 02:04:10 CST 2025
2025-01-14T02:04:10+08:00
Running ./dl-op-linalg-arithaddf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.08, 3.02, 3.01
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ADDF_SCALAR                 0.802 ms        0.802 ms          871
BM_ADDF_AutoVectorization      0.034 ms        0.034 ms        20463
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.003s
user    0m1.974s
sys     0m0.021s
Executing dl-op-linalg-arithdivf-benchmark at Tue Jan 14 02:04:12 CST 2025
2025-01-14T02:04:12+08:00
Running ./dl-op-linalg-arithdivf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_DIVF_SCALAR                 0.814 ms        0.814 ms          858
BM_DIVF_AutoVectorization      0.078 ms        0.078 ms         8967
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.709s
user    0m1.687s
sys     0m0.013s
Executing dl-op-linalg-arithmulf-benchmark at Tue Jan 14 02:04:14 CST 2025
2025-01-14T02:04:14+08:00
Running ./dl-op-linalg-arithmulf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_MULF_SCALAR                 0.802 ms        0.802 ms          871
BM_MULF_AutoVectorization      0.034 ms        0.034 ms        20015
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.995s
user    0m1.977s
sys     0m0.009s
Executing dl-op-linalg-arithnegf-benchmark at Tue Jan 14 02:04:16 CST 2025
2025-01-14T02:04:16+08:00
Running ./dl-op-linalg-arithnegf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_NEGF_SCALAR                 0.625 ms        0.625 ms         1119
BM_NEGF_AutoVectorization      0.023 ms        0.023 ms        30072
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.810s
user    0m1.785s
sys     0m0.017s
Executing dl-op-linalg-arithsubf-benchmark at Tue Jan 14 02:04:18 CST 2025
2025-01-14T02:04:18+08:00
Running ./dl-op-linalg-arithsubf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_SUBF_SCALAR                 0.803 ms        0.803 ms          871
BM_SUBF_AutoVectorization      0.034 ms        0.034 ms        20476
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.000s
user    0m1.979s
sys     0m0.013s
Executing dl-op-linalg-batch-matmul-benchmark at Tue Jan 14 02:04:20 CST 2025
2025-01-14T02:04:20+08:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                   314 ms          314 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5       73.0 ms         73.0 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5           11.8 ms         11.8 ms            5
DL_OPS_BATCH_MATMUL/Tile/iterations:5                    7.54 ms         7.54 ms            5
DL_OPS_BATCH_MATMUL/SCF/iterations:5                     10.3 ms         10.3 ms            5
---------- Verification ----------
Tile PASS
SCF PASS

real    0m2.469s
user    0m2.443s
sys     0m0.013s
Executing dl-op-linalg-batch-matmul-benchmark-rvv at Tue Jan 14 02:04:22 CST 2025
2025-01-14T02:04:23+08:00
Running ./dl-op-linalg-batch-matmul-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.06, 3.02, 3.00
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                 23496 ms        23495 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5      18139 ms        18138 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5           4029 ms         4029 ms            5
DL_OPS_BATCH_MATMUL/RVVVectorization/iterations:5         706 ms          706 ms            5
---------- Verification ----------
AutoVectorization PASS
Vectorization PASS
RVVVectorization PASS

real    4m44.103s
user    4m44.053s
sys     0m0.033s
Executing dl-op-linalg-conv2d-nchw-fchw-benchmark at Tue Jan 14 02:09:07 CST 2025
2025-01-14T02:09:07+08:00
Running ./dl-op-linalg-conv2d-nchw-fchw-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Conv2DNchwFchw_SCALAR       7446 ms         7445 ms            1
BM_Conv2DNchwFchw_Im2col        274 ms          274 ms            2
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m16.852s
user    0m16.783s
sys     0m0.057s
Executing dl-op-linalg-conv2d-nhwc-fhwc-benchmark at Tue Jan 14 02:09:23 CST 2025
2025-01-14T02:09:23+08:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                   1468 ms         1468 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5        138 ms          138 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            34.2 ms         34.2 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vec_tile/iterations:5                 34.2 ms         34.2 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
vec_tile PASS

real    0m10.100s
user    0m10.082s
sys     0m0.009s
Executing dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv at Tue Jan 14 02:09:34 CST 2025
2025-01-14T02:09:34+08:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                   1664 ms         1664 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5        110 ms          110 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            60.7 ms         60.7 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/rvv_vectorization/iterations:5         122 ms          122 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
rvv_vectorization PASS

real    0m11.771s
user    0m11.760s
sys     0m0.008s
Executing dl-op-linalg-conv2d-nhwc-hwcf-benchmark at Tue Jan 14 02:09:45 CST 2025
2025-01-14T02:09:45+08:00
Running ./dl-op-linalg-conv2d-nhwc-hwcf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_CONV_2D_NHWC_HWCF_SCALAR                   831 ms          831 ms            1
BM_CONV_2D_NHWC_HWCF_AutoVectorization        159 ms          159 ms            4
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.642s
user    0m2.621s
sys     0m0.013s
Executing dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark at Tue Jan 14 02:09:48 CST 2025
2025-01-14T02:09:48+08:00
Running ./dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/scalar/iterations:5                    104 ms          104 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/auto_vectorization/iterations:5       13.4 ms         13.4 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/vectorization/iterations:5            2.53 ms         2.53 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS

real    0m0.770s
user    0m0.752s
sys     0m0.016s
Executing dl-op-linalg-mathexp-benchmark at Tue Jan 14 02:09:49 CST 2025
2025-01-14T02:09:49+08:00
Running ./dl-op-linalg-mathexp-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_EXP_SCALAR                  1.11 ms         1.11 ms          631
BM_EXP_AutoVectorization      0.644 ms        0.644 ms         1088
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.674s
user    0m1.654s
sys     0m0.012s
Executing dl-op-linalg-mathfpow-benchmark at Tue Jan 14 02:09:50 CST 2025
2025-01-14T02:09:50+08:00
Running ./dl-op-linalg-mathfpow-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FPOW_SCALAR                  1.67 ms         1.67 ms          418
BM_FPOW_AutoVectorization       1.19 ms         1.19 ms          589
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.797s
user    0m1.772s
sys     0m0.017s
Executing dl-op-linalg-mathrsqrt-benchmark at Tue Jan 14 02:09:52 CST 2025
2025-01-14T02:09:52+08:00
Running ./dl-op-linalg-mathrsqrt-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_RSQRT_SCALAR                 0.786 ms        0.786 ms          889
BM_RSQRT_AutoVectorization      0.113 ms        0.113 ms         6185
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.693s
user    0m1.679s
sys     0m0.005s
Executing dl-op-linalg-matmul-benchmark at Tue Jan 14 02:09:54 CST 2025
2025-01-14T02:09:54+08:00
Running ./dl-op-linalg-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:5       2147 ms         2147 ms            5
DL_OPS_MATMUL/scalar_O3/iterations:5       1089 ms         1089 ms            5
DL_OPS_MATMUL/tile/iterations:5             175 ms          175 ms            5
---------- Verification ----------
tile PASS

real    0m18.381s
user    0m18.358s
sys     0m0.021s
Executing dl-op-linalg-matmul-benchmark-rvv at Tue Jan 14 02:10:12 CST 2025
2025-01-14T02:10:12+08:00
Running ./dl-op-linalg-matmul-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_OPS_MATMUL/scalar/iterations:5       5706 ms         5705 ms            5
DL_OPS_MATMUL/vec/iterations:5          1374 ms         1374 ms            5
DL_OPS_MATMUL/rvv/iterations:5           242 ms          242 ms            5
---------- Verification ----------
vec PASS
rvv PASS

real    0m45.064s
user    0m45.018s
sys     0m0.037s
Executing dl-op-linalg-pooling-nhwc-sum-benchmark at Tue Jan 14 02:10:57 CST 2025
2025-01-14T02:10:57+08:00
Running ./dl-op-linalg-pooling-nhwc-sum-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_POOLING_NHWC_SUM_SCALAR                  5.28 ms         5.28 ms          132
BM_POOLING_NHWC_SUM_AutoVectorization      0.397 ms        0.397 ms         1807
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.932s
user    0m1.912s
sys     0m0.013s
Executing dl-op-linalg-reduceaddf-benchmark at Tue Jan 14 02:10:59 CST 2025
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-reduceaddf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
bash: line 1: 142360 Segmentation fault      ./"$file"

real    0m0.022s
user    0m0.000s
sys     0m0.014s
Executing dl-op-linalg-reducemaxf-benchmark at Tue Jan 14 02:10:59 CST 2025
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-reducemaxf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
bash: line 1: 142362 Segmentation fault      ./"$file"

real    0m0.023s
user    0m0.007s
sys     0m0.008s
Executing dl-op-linalg-softmax-exp-sum-div-benchmark at Tue Jan 14 02:10:59 CST 2025
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-softmax-exp-sum-div-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_SOFTMAXEXPSUMDIV_SCALAR                 0.151 ms        0.151 ms         4646
BM_SOFTMAXEXPSUMDIV_AutoVectorization      0.127 ms        0.127 ms         5504
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.738s
user    0m1.709s
sys     0m0.021s
muse-pi-3%
@xlinsist
Collaborator Author

Results of TinyLlama with OpenMP after 5b6e665:

// X86 platform
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             237142 ms       237115 ms            1
DL_MODEL_TINYLLAMA/matmul_opt          15745 ms        15742 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp      13696 ms        13594 ms            1

// RISC-V platform
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             >1h
DL_MODEL_TINYLLAMA/matmul_opt         342555 ms       342539 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp     245805 ms       245794 ms            1
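
The OpenMP gain implied by the two tables above can be worked out directly from the reported wall-clock times. A minimal Python sketch, using only the numbers shown above; the dictionary layout and the `speedup` helper are illustrative names, not part of buddy-benchmark:

```python
# Wall-clock times in milliseconds, copied from the two tables above.
times_ms = {
    "x86": {"matmul_opt": 15745, "matmul_opt_omp": 13696},
    "riscv": {"matmul_opt": 342555, "matmul_opt_omp": 245805},
}


def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


for platform, t in times_ms.items():
    s = speedup(t["matmul_opt"], t["matmul_opt_omp"])
    print(f"{platform}: matmul_opt_omp is {s:.2f}x faster than matmul_opt")
```

On these single-iteration numbers, OpenMP gives roughly 1.15x on X86 and 1.39x on RISC-V over `matmul_opt`. Each row is one iteration, so the ratios are indicative rather than statistically robust.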

Results of matmul and batch_matmul with OpenMP

X86

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ./dl-op-linalg-matmul-benchmark 
2025-01-19T09:09:44+00:00
Running ./dl-op-linalg-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 10.80, 10.79, 9.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:1      11262 ms        11262 ms            1
DL_OPS_MATMUL/scalar_O3/iterations:1       5155 ms         5155 ms            1
DL_OPS_MATMUL/tile/iterations:1             162 ms          162 ms            1
DL_OPS_MATMUL/vec/iterations:1              209 ms          209 ms            1
DL_OPS_MATMUL/vec_omp/iterations:1          120 ms         88.6 ms            1
---------- Verification ----------
tile PASS
vec PASS
vec_omp PASS

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ./dl-op-linalg-batch-matmul-benchmark 
2025-01-19T09:10:21+00:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 14.60, 11.71, 10.33
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:1                  8672 ms         8672 ms            1
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:1       2649 ms         2649 ms            1
DL_OPS_BATCH_MATMUL/Vectorization/iterations:1            374 ms          374 ms            1
DL_OPS_BATCH_MATMUL/Tile/iterations:1                     205 ms          205 ms            1
DL_OPS_BATCH_MATMUL/SCF/iterations:1                      203 ms          203 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST/iterations:1                650 ms          650 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST_OMP/iterations:1           86.9 ms         61.2 ms            1
---------- Verification ----------
Tile PASS
SCF PASS
BROADCAST PASS
BROADCAST_OMP PASS

RISC-V

muse-pi-3% LD_LIBRARY_PATH=/home/user/buddy-benchmark/build-omp-shared-rv ./dl-op-linalg-matmul-benchmark
2025-01-19T16:53:13+08:00
Running ./dl-op-linalg-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.36, 2.68, 2.72
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:1     239818 ms       239805 ms            1
DL_OPS_MATMUL/scalar_O3/iterations:1     181199 ms       181190 ms            1
DL_OPS_MATMUL/tile/iterations:1            6748 ms         6747 ms            1
DL_OPS_MATMUL/vec/iterations:1             8412 ms         8411 ms            1
DL_OPS_MATMUL/vec_omp/iterations:1         1409 ms         1409 ms            1
---------- Verification ----------
tile PASS
vec PASS
vec_omp PASS

muse-pi-3% LD_LIBRARY_PATH=/home/user/buddy-benchmark/build-omp-shared-rv ./dl-op-linalg-batch-matmul-benchmark
2025-01-19T16:44:24+08:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.48, 2.68, 2.64
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:1                107096 ms       107087 ms            1
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:1      39651 ms        39650 ms            1
DL_OPS_BATCH_MATMUL/Vectorization/iterations:1           4016 ms         4016 ms            1
DL_OPS_BATCH_MATMUL/Tile/iterations:1                    2672 ms         2671 ms            1
DL_OPS_BATCH_MATMUL/SCF/iterations:1                     3317 ms         3316 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST/iterations:1               5848 ms         5848 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST_OMP/iterations:1           5503 ms         5502 ms            1
---------- Verification ----------
Tile PASS
SCF PASS
BROADCAST PASS
BROADCAST_OMP PASS
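
To compare runs like these without reading the tables by hand, the result rows can be pulled out with a regular expression. This is a sketch that assumes the row layout shown in the logs above (name, time with unit, CPU time with unit, iteration count); `parse_rows` is a hypothetical helper, not part of Google Benchmark or buddy-benchmark:

```python
import re

# One result row: name, wall time + unit, CPU time + unit, iteration count.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<time>[\d.]+) (?P<tunit>ns|us|ms|s)\s+"
    r"(?P<cpu>[\d.]+) (?P<cunit>ns|us|ms|s)\s+(?P<iters>\d+)\s*$"
)


def parse_rows(text):
    """Map benchmark name -> (wall time, unit, iterations); skip other lines."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[m["name"]] = (float(m["time"]), m["tunit"], int(m["iters"]))
    return rows


# Two rows copied from the RISC-V matmul log above.
sample = """\
DL_OPS_MATMUL/vec/iterations:1             8412 ms         8411 ms            1
DL_OPS_MATMUL/vec_omp/iterations:1         1409 ms         1409 ms            1
"""

rows = parse_rows(sample)
vec_ms = rows["DL_OPS_MATMUL/vec/iterations:1"][0]
omp_ms = rows["DL_OPS_MATMUL/vec_omp/iterations:1"][0]
print(f"vec_omp speedup over vec: {vec_ms / omp_ms:.2f}x")
```

Header, separator, and verification lines do not match the pattern and are skipped. Rows without numeric timings (for example, a crashed run) would simply be absent from the result.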
