Deep Learning Benchmark Manual CI Report #163

xlinsist · 2025-01-14T02:52:38Z

Motivation

Since PR #162 replaced llc with clang, I re-evaluated the performance of the deep learning benchmarks and am posting the results in this issue. I will update this issue whenever relevant changes (especially performance-related ones) land, so it can temporarily serve as a CI report.

Benchmark Testing Summary (After PR #162)

X86 (Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz)

All models and operations work fine.

RISC-V (SpacemiT K1)

Models:
- LLaMA: Successfully runs.
- Whisper: Successfully runs but is killed during correctness verification.
Operations:
- Running dl-op-linalg-reduceaddf-benchmark and dl-op-linalg-reducemaxf-benchmark causes a segmentation fault.

These issues are not related to the changes introduced in PR #162 (replacing llc with clang). Instead, they are bugs introduced during cross-platform migration and can be resolved in subsequent PRs.
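
The two failure modes above (a process that is killed and a segmentation fault) can be told apart from the exit status when the benchmarks are launched from a script. The following is a minimal Python sketch, not the script used for this report. The names `classify_exit` and `run_benchmark` are hypothetical, and it assumes a POSIX system, where `subprocess` reports termination by signal N as the return code -N.

```python
import signal
import subprocess


def classify_exit(returncode: int) -> str:
    """Map a subprocess return code to a short status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit code {returncode}"


def run_benchmark(path: str) -> str:
    """Run one benchmark binary (hypothetical path) and classify its status."""
    result = subprocess.run([path], check=False)
    return classify_exit(result.returncode)
```

With this, a segmentation fault would be reported as `killed by SIGSEGV`, and a process terminated with SIGKILL (which bash prints as `Killed`) as `killed by SIGKILL`.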

Benchmark Testing Results (TinyLlama and Whisper)

X86

2025-01-13T12:21:43+00:00
Running ./dl-model-tinyllama-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.26, 14.39, 8.40
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar         374448 ms       374445 ms            1
DL_MODEL_TINYLLAMA/matmul_opt     371304 ms       371280 ms            1
---------- Verification ----------
matmul_opt PASS
2025-01-13T12:46:04+00:00
Running ./dl-model-tinyllama-benchmark-before-fusion
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 23.97, 23.94, 20.78
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             339031 ms       338973 ms            1
DL_MODEL_TINYLLAMA/matmul_opt         330564 ms       330444 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp     326282 ms       324553 ms            1
---------- Verification ----------
matmul_opt PASS
matmul_opt_omp PASS
2025-01-13T13:22:34+00:00
Running ./dl-model-whisper-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 8.93, 32.28, 42.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_MODEL_Whisper/Auto_Vectorization      188571 ms       188525 ms            1
DL_MODEL_Whisper/Buddy_Vectorization      71893 ms        71879 ms            1
-----------------------------------------------------------
Correctness Verification for Output1: PASS
Correctness Verification for Output2: FAIL
-----------------------------------------------------------

RISC-V

Executing dl-model-tinyllama-benchmark at Mon Jan 13 22:36:07 CST 2025
2025-01-13T22:36:12+08:00
Running ./dl-model-tinyllama-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.77, 2.38
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar         650432 ms       650353 ms            1
DL_MODEL_TINYLLAMA/matmul_opt     646291 ms       646229 ms            1
---------- Verification ----------
matmul_opt PASS

real    43m15.453s
user    43m9.710s
sys     0m5.477s
Executing dl-model-whisper-benchmark at Mon Jan 13 23:19:22 CST 2025
2025-01-13T23:19:22+08:00
Running ./dl-model-whisper-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_MODEL_Whisper/Auto_Vectorization     4379395 ms      4379140 ms            1
DL_MODEL_Whisper/Buddy_Vectorization     957599 ms       957515 ms            1
bash: line 1: 142223 Killed                  ./"$file"

real    164m48.164s
user    164m23.712s
sys     0m23.662s
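
To make the Whisper numbers above easier to compare, the speedup of Buddy_Vectorization over Auto_Vectorization can be computed directly from the reported wall-clock times. The sketch below is only illustrative: `parse_times` and `speedup` are hypothetical helper names, and the regular expression assumes the console layout shown in the logs (name, time, CPU time, iterations, with times in ms).

```python
import re

# Matches result rows such as:
# DL_MODEL_Whisper/Auto_Vectorization      188571 ms       188525 ms            1
ROW = re.compile(r"^(\S+)\s+([\d.]+) ms\s+([\d.]+) ms\s+(\d+)\s*$")


def parse_times(log: str) -> dict[str, float]:
    """Return {benchmark name: wall-clock time in ms} for each result row."""
    times = {}
    for line in log.splitlines():
        m = ROW.match(line.strip())
        if m:
            times[m.group(1)] = float(m.group(2))
    return times


def speedup(times: dict[str, float], baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return times[baseline] / times[candidate]


# Rows copied from the x86 and RISC-V Whisper logs in this report.
x86_log = """
DL_MODEL_Whisper/Auto_Vectorization      188571 ms       188525 ms            1
DL_MODEL_Whisper/Buddy_Vectorization      71893 ms        71879 ms            1
"""
riscv_log = """
DL_MODEL_Whisper/Auto_Vectorization     4379395 ms      4379140 ms            1
DL_MODEL_Whisper/Buddy_Vectorization     957599 ms       957515 ms            1
"""

for label, log in (("x86", x86_log), ("RISC-V", riscv_log)):
    t = parse_times(log)
    s = speedup(t, "DL_MODEL_Whisper/Auto_Vectorization",
                "DL_MODEL_Whisper/Buddy_Vectorization")
    print(f"{label}: {s:.2f}x")
```

On the numbers in this report, this gives roughly 2.62x on x86 and 4.57x on RISC-V. Each benchmark ran for a single iteration, so these ratios are indicative rather than statistically robust.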

Benchmark Testing Results (Others)

X86

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark$ cd /home/zhouxulin/intern/buddy-benchmark/build-x86/bin
(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ls
dl-layer-ffn-benchmark                      dl-op-linalg-batch-matmul-benchmark
dl-layer-rmsnorm-benchmark                  dl-op-linalg-conv2d-nchw-fchw-benchmark
dl-layer-selfattention-benchmark            dl-op-linalg-conv2d-nhwc-fhwc-benchmark
dl-model-bert-benchmark                     dl-op-linalg-conv2d-nhwc-hwcf-benchmark
dl-model-lenet-benchmark                    dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
dl-model-mobilenetv3-benchmark              dl-op-linalg-mathexp-benchmark
dl-model-resnet18-benchmark                 dl-op-linalg-mathfpow-benchmark
dl-model-tinyllama-benchmark                dl-op-linalg-mathrsqrt-benchmark
dl-model-tinyllama-benchmark-before-fusion  dl-op-linalg-matmul-benchmark
dl-model-whisper-benchmark                  dl-op-linalg-pooling-nhwc-sum-benchmark
dl-op-linalg-arithaddf-benchmark            dl-op-linalg-reduceaddf-benchmark
dl-op-linalg-arithdivf-benchmark            dl-op-linalg-reducemaxf-benchmark
dl-op-linalg-arithmulf-benchmark            dl-op-linalg-softmax-exp-sum-div-benchmark
dl-op-linalg-arithnegf-benchmark            log.txt
dl-op-linalg-arithsubf-benchmark


2025-01-13T12:21:13+00:00
Running ./dl-layer-ffn-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.79, 13.42, 7.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
DL_LAYER_FFN/Scalar                  0.176 ms        0.176 ms         3965
DL_LAYER_FFN/Auto_Vectorization      0.084 ms        0.084 ms         8171
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:15+00:00
Running ./dl-layer-rmsnorm-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.73, 13.60, 7.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
DL_LAYER_RMSNORM/Scalar                  0.005 ms        0.005 ms       125424
DL_LAYER_RMSNORM/Auto_Vectorization      0.002 ms        0.002 ms       311591
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:17+00:00
Running ./dl-layer-selfattention-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.73, 13.60, 7.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_LAYER_ATTENTION/Scalar                   12.9 ms         12.9 ms           56
DL_LAYER_ATTENTION/Auto_Vectorization       4.27 ms         4.27 ms          167
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:19+00:00
Running ./dl-model-bert-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.59, 13.76, 8.07
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_BERT/Auto_Vectorization        2447 ms         2446 ms            1
DL_MODEL_BERT/Buddy_Vectorization        656 ms          656 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------
2025-01-13T12:21:27+00:00
Running ./dl-model-lenet-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.62, 13.94, 8.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
DL_MODEL_LENET/Auto_Vectorization       0.565 ms        0.564 ms         1246
DL_MODEL_LENET/Buddy_Vectorization      0.581 ms        0.581 ms         1216
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T12:21:28+00:00
Running ./dl-model-mobilenetv3-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.62, 13.94, 8.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_MobileNet_V3/BM_MobileNet_V3_scalar          127 ms          127 ms            5
BM_MobileNet_V3/BM_MobileNet_V3_conv_opt        105 ms          105 ms            8
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T12:21:31+00:00
Running ./dl-model-resnet18-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 24.49, 14.09, 8.24
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_MODEL_Resnet18/Auto_Vectorization        2573 ms         2572 ms            1
DL_MODEL_Resnet18/Buddy_Vectorization       2533 ms         2533 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------


2025-01-13T13:31:12+00:00
Running ./dl-op-linalg-arithaddf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.06, 9.88, 26.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ADDF_SCALAR                 0.075 ms        0.075 ms         8983
BM_ADDF_AutoVectorization      0.011 ms        0.011 ms        60839
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:14+00:00
Running ./dl-op-linalg-arithdivf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_DIVF_SCALAR                 0.075 ms        0.075 ms         8491
BM_DIVF_AutoVectorization      0.014 ms        0.014 ms        50687
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:16+00:00
Running ./dl-op-linalg-arithmulf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_MULF_SCALAR                 0.075 ms        0.075 ms         8874
BM_MULF_AutoVectorization      0.011 ms        0.011 ms        60732
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:17+00:00
Running ./dl-op-linalg-arithnegf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.13, 9.80, 26.38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_NEGF_SCALAR                 0.052 ms        0.052 ms        12582
BM_NEGF_AutoVectorization      0.006 ms        0.006 ms       114656
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:19+00:00
Running ./dl-op-linalg-arithsubf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_SUBF_SCALAR                 0.075 ms        0.075 ms         9301
BM_SUBF_AutoVectorization      0.012 ms        0.012 ms        60480
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:21+00:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                  26.8 ms         26.8 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5       6.22 ms         6.22 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5          0.922 ms        0.922 ms            5
DL_OPS_BATCH_MATMUL/Tile/iterations:5                   0.335 ms        0.335 ms            5
DL_OPS_BATCH_MATMUL/SCF/iterations:5                    0.473 ms        0.473 ms            5
---------- Verification ----------
Tile PASS
SCF PASS
2025-01-13T13:31:21+00:00
Running ./dl-op-linalg-conv2d-nchw-fchw-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.20, 9.72, 26.26
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Conv2DNchwFchw_SCALAR        712 ms          712 ms            1
BM_Conv2DNchwFchw_Im2col       11.3 ms         11.3 ms           61
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:24+00:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                    136 ms          136 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5       28.7 ms         28.7 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            5.84 ms         5.84 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vec_tile/iterations:5                 5.67 ms         5.67 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
vec_tile PASS
2025-01-13T13:31:25+00:00
Running ./dl-op-linalg-conv2d-nhwc-hwcf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_CONV_2D_NHWC_HWCF_SCALAR                  69.3 ms         69.3 ms            9
BM_CONV_2D_NHWC_HWCF_AutoVectorization       14.9 ms         14.9 ms           47
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:27+00:00
Running ./dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/scalar/iterations:5                   12.6 ms         12.6 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/auto_vectorization/iterations:5       4.11 ms         4.10 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/vectorization/iterations:5           0.220 ms        0.220 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
2025-01-13T13:31:27+00:00
Running ./dl-op-linalg-mathexp-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.27, 9.64, 26.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_EXP_SCALAR                 0.106 ms        0.106 ms         6404
BM_EXP_AutoVectorization      0.060 ms        0.060 ms        12277
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:29+00:00
Running ./dl-op-linalg-mathfpow-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FPOW_SCALAR                 0.206 ms        0.206 ms         3410
BM_FPOW_AutoVectorization      0.131 ms        0.131 ms         5390
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:31+00:00
Running ./dl-op-linalg-mathrsqrt-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_RSQRT_SCALAR                 0.106 ms        0.106 ms         6598
BM_RSQRT_AutoVectorization      0.027 ms        0.027 ms        25532
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:33+00:00
Running ./dl-op-linalg-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.17, 9.53, 26.02
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:5        108 ms          108 ms            5
DL_OPS_MATMUL/scalar_O3/iterations:5       35.2 ms         35.2 ms            5
DL_OPS_MATMUL/tile/iterations:5            3.03 ms         3.03 ms            5
---------- Verification ----------
tile PASS
2025-01-13T13:31:33+00:00
Running ./dl-op-linalg-pooling-nhwc-sum-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_POOLING_NHWC_SUM_SCALAR                 0.581 ms        0.581 ms         1091
BM_POOLING_NHWC_SUM_AutoVectorization      0.083 ms        0.083 ms         8462
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-pooling-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-reducemaxf-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
2025-01-13T13:31:35+00:00
Running ./dl-op-linalg-softmax-exp-sum-div-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 4.07, 9.43, 25.90
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_SOFTMAXEXPSUMDIV_SCALAR                 0.015 ms        0.015 ms        47706
BM_SOFTMAXEXPSUMDIV_AutoVectorization      0.009 ms        0.009 ms        86129
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------
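
When verification output is colorized for a terminal and then captured to a file, ANSI color sequences can end up in the log. A small filter, sketched here in Python, can remove them before the log is pasted into a report. The name `strip_ansi` is hypothetical, and the pattern only covers SGR (color/style) sequences of the form `ESC [ ... m`, which is an assumption about what the benchmarks emit.

```python
import re

# SGR sequences: ESC, '[', optional numeric parameters separated by ';', then 'm'.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Remove ANSI color/style escape sequences from captured output."""
    return ANSI_SGR.sub("", text)


print(strip_ansi("matmul_opt \x1b[32mPASS\x1b[0m"))
```

The same function can be applied line by line to a captured log file before it is shared.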

RISC-V

muse-pi-3% ls
dl-layer-ffn-benchmark                   dl-op-linalg-conv2d-nchw-fchw-benchmark
dl-layer-rmsnorm-benchmark               dl-op-linalg-conv2d-nhwc-fhwc-benchmark
dl-layer-selfattention-benchmark         dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv
dl-model-bert-benchmark                  dl-op-linalg-conv2d-nhwc-hwcf-benchmark
dl-model-lenet-benchmark                 dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
dl-model-mobilenetv3-benchmark           dl-op-linalg-mathexp-benchmark
dl-model-resnet18-benchmark              dl-op-linalg-mathfpow-benchmark
dl-model-tinyllama-benchmark             dl-op-linalg-mathrsqrt-benchmark
dl-model-whisper-benchmark               dl-op-linalg-matmul-benchmark
dl-op-linalg-arithaddf-benchmark         dl-op-linalg-matmul-benchmark-rvv
dl-op-linalg-arithdivf-benchmark         dl-op-linalg-pooling-nhwc-sum-benchmark
dl-op-linalg-arithmulf-benchmark         dl-op-linalg-reduceaddf-benchmark
dl-op-linalg-arithnegf-benchmark         dl-op-linalg-reducemaxf-benchmark
dl-op-linalg-arithsubf-benchmark         dl-op-linalg-softmax-exp-sum-div-benchmark
dl-op-linalg-batch-matmul-benchmark      log.txt
dl-op-linalg-batch-matmul-benchmark-rvv  output.log
muse-pi-3% cat log.txt
Executing dl-layer-ffn-benchmark at Mon Jan 13 22:29:15 CST 2025
2025-01-13T22:29:15+08:00
Running ./dl-layer-ffn-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.00, 2.00, 2.00
--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
DL_LAYER_FFN/Scalar                   1.29 ms         1.29 ms          530
DL_LAYER_FFN/Auto_Vectorization      0.777 ms        0.777 ms          945
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m1.779s
user    0m1.539s
sys     0m0.240s
Executing dl-layer-rmsnorm-benchmark at Mon Jan 13 22:29:17 CST 2025
2025-01-13T22:29:17+08:00
Running ./dl-layer-rmsnorm-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.08, 2.02, 2.01
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
DL_LAYER_RMSNORM/Scalar                  0.029 ms        0.029 ms        24437
DL_LAYER_RMSNORM/Auto_Vectorization      0.009 ms        0.009 ms        78021
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m1.977s
user    0m1.305s
sys     0m0.673s
Executing dl-layer-selfattention-benchmark at 2025-01-13 (Mon) 22:29:19 CST
2025-01-13T22:29:19+08:00
Running ./dl-layer-selfattention-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.08, 2.02, 2.01
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_LAYER_ATTENTION/Scalar                    101 ms          101 ms            7
DL_LAYER_ATTENTION/Auto_Vectorization       40.9 ms         40.9 ms           16
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m3.009s
user    0m2.892s
sys     0m0.116s
Executing dl-model-bert-benchmark at 2025-01-13 (Mon) 22:29:22 CST
2025-01-13T22:29:22+08:00
Running ./dl-model-bert-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.15, 2.03, 2.01
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_BERT/Auto_Vectorization      158150 ms       158139 ms            1
DL_MODEL_BERT/Buddy_Vectorization      18938 ms        18936 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    5m56.495s
user    5m51.783s
sys     0m4.685s
Executing dl-model-lenet-benchmark at 2025-01-13 (Mon) 22:35:19 CST
2025-01-13T22:35:19+08:00
Running ./dl-model-lenet-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
DL_MODEL_LENET/Auto_Vectorization        2.56 ms         2.56 ms          267
DL_MODEL_LENET/Buddy_Vectorization       2.45 ms         2.45 ms          286
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.050s
user    0m1.646s
sys     0m0.404s
Executing dl-model-mobilenetv3-benchmark at 2025-01-13 (Mon) 22:35:21 CST
2025-01-13T22:35:21+08:00
Running ./dl-model-mobilenetv3-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
BM_MobileNet_V3/BM_MobileNet_V3_scalar          778 ms          778 ms            1
BM_MobileNet_V3/BM_MobileNet_V3_conv_opt        929 ms          929 ms            1
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m4.056s
user    0m3.535s
sys     0m0.521s
Executing dl-model-resnet18-benchmark at 2025-01-13 (Mon) 22:35:25 CST
2025-01-13T22:35:25+08:00
Running ./dl-model-resnet18-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 2.72, 2.34
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
DL_MODEL_Resnet18/Auto_Vectorization       10344 ms        10343 ms            1
DL_MODEL_Resnet18/Buddy_Vectorization      10239 ms        10238 ms            1
-----------------------------------------------------------
Correctness Verification: PASS
-----------------------------------------------------------

real    0m41.960s
user    0m40.787s
sys     0m1.168s

Executing dl-op-linalg-arithaddf-benchmark at 2025-01-14 (Tue) 02:04:10 CST
2025-01-14T02:04:10+08:00
Running ./dl-op-linalg-arithaddf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.08, 3.02, 3.01
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ADDF_SCALAR                 0.802 ms        0.802 ms          871
BM_ADDF_AutoVectorization      0.034 ms        0.034 ms        20463
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.003s
user    0m1.974s
sys     0m0.021s
Executing dl-op-linalg-arithdivf-benchmark at 2025-01-14 (Tue) 02:04:12 CST
2025-01-14T02:04:12+08:00
Running ./dl-op-linalg-arithdivf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_DIVF_SCALAR                 0.814 ms        0.814 ms          858
BM_DIVF_AutoVectorization      0.078 ms        0.078 ms         8967
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.709s
user    0m1.687s
sys     0m0.013s
Executing dl-op-linalg-arithmulf-benchmark at 2025-01-14 (Tue) 02:04:14 CST
2025-01-14T02:04:14+08:00
Running ./dl-op-linalg-arithmulf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_MULF_SCALAR                 0.802 ms        0.802 ms          871
BM_MULF_AutoVectorization      0.034 ms        0.034 ms        20015
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.995s
user    0m1.977s
sys     0m0.009s
Executing dl-op-linalg-arithnegf-benchmark at 2025-01-14 (Tue) 02:04:16 CST
2025-01-14T02:04:16+08:00
Running ./dl-op-linalg-arithnegf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_NEGF_SCALAR                 0.625 ms        0.625 ms         1119
BM_NEGF_AutoVectorization      0.023 ms        0.023 ms        30072
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.810s
user    0m1.785s
sys     0m0.017s
Executing dl-op-linalg-arithsubf-benchmark at 2025-01-14 (Tue) 02:04:18 CST
2025-01-14T02:04:18+08:00
Running ./dl-op-linalg-arithsubf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_SUBF_SCALAR                 0.803 ms        0.803 ms          871
BM_SUBF_AutoVectorization      0.034 ms        0.034 ms        20476
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.000s
user    0m1.979s
sys     0m0.013s
Executing dl-op-linalg-batch-matmul-benchmark at 2025-01-14 (Tue) 02:04:20 CST
2025-01-14T02:04:20+08:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.07, 3.02, 3.00
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                   314 ms          314 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5       73.0 ms         73.0 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5           11.8 ms         11.8 ms            5
DL_OPS_BATCH_MATMUL/Tile/iterations:5                    7.54 ms         7.54 ms            5
DL_OPS_BATCH_MATMUL/SCF/iterations:5                     10.3 ms         10.3 ms            5
---------- Verification ----------
Tile PASS
SCF PASS

real    0m2.469s
user    0m2.443s
sys     0m0.013s
Executing dl-op-linalg-batch-matmul-benchmark-rvv at 2025-01-14 (Tue) 02:04:22 CST
2025-01-14T02:04:23+08:00
Running ./dl-op-linalg-batch-matmul-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.06, 3.02, 3.00
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:5                 23496 ms        23495 ms            5
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:5      18139 ms        18138 ms            5
DL_OPS_BATCH_MATMUL/Vectorization/iterations:5           4029 ms         4029 ms            5
DL_OPS_BATCH_MATMUL/RVVVectorization/iterations:5         706 ms          706 ms            5
---------- Verification ----------
AutoVectorization PASS
Vectorization PASS
RVVVectorization PASS

real    4m44.103s
user    4m44.053s
sys     0m0.033s
Executing dl-op-linalg-conv2d-nchw-fchw-benchmark at 2025-01-14 (Tue) 02:09:07 CST
2025-01-14T02:09:07+08:00
Running ./dl-op-linalg-conv2d-nchw-fchw-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Conv2DNchwFchw_SCALAR       7446 ms         7445 ms            1
BM_Conv2DNchwFchw_Im2col        274 ms          274 ms            2
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m16.852s
user    0m16.783s
sys     0m0.057s
Executing dl-op-linalg-conv2d-nhwc-fhwc-benchmark at 2025-01-14 (Tue) 02:09:23 CST
2025-01-14T02:09:23+08:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                   1468 ms         1468 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5        138 ms          138 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            34.2 ms         34.2 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vec_tile/iterations:5                 34.2 ms         34.2 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
vec_tile PASS

real    0m10.100s
user    0m10.082s
sys     0m0.009s
Executing dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv at 2025-01-14 (Tue) 02:09:34 CST
2025-01-14T02:09:34+08:00
Running ./dl-op-linalg-conv2d-nhwc-fhwc-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------
DL_OPS_CONV_2D_NHWC_FHWC/scalar/iterations:5                   1664 ms         1664 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/auto_vectorization/iterations:5        110 ms          110 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/vectorization/iterations:5            60.7 ms         60.7 ms            5
DL_OPS_CONV_2D_NHWC_FHWC/rvv_vectorization/iterations:5         122 ms          122 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS
rvv_vectorization PASS

real    0m11.771s
user    0m11.760s
sys     0m0.008s
Executing dl-op-linalg-conv2d-nhwc-hwcf-benchmark at 2025-01-14 (Tue) 02:09:45 CST
2025-01-14T02:09:45+08:00
Running ./dl-op-linalg-conv2d-nhwc-hwcf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
BM_CONV_2D_NHWC_HWCF_SCALAR                   831 ms          831 ms            1
BM_CONV_2D_NHWC_HWCF_AutoVectorization        159 ms          159 ms            4
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m2.642s
user    0m2.621s
sys     0m0.013s
Executing dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark at 2025-01-14 (Tue) 02:09:48 CST
2025-01-14T02:09:48+08:00
Running ./dl-op-linalg-depthwise-conv-2d-nhwc-hwc-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/scalar/iterations:5                    104 ms          104 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/auto_vectorization/iterations:5       13.4 ms         13.4 ms            5
DL_OPS_DEPTHWISE_CONV_2D_NHWC_HWC/vectorization/iterations:5            2.53 ms         2.53 ms            5
---------- Verification ----------
auto_vectorization PASS
vectorization PASS

real    0m0.770s
user    0m0.752s
sys     0m0.016s
Executing dl-op-linalg-mathexp-benchmark at 2025-01-14 (Tue) 02:09:49 CST
2025-01-14T02:09:49+08:00
Running ./dl-op-linalg-mathexp-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_EXP_SCALAR                  1.11 ms         1.11 ms          631
BM_EXP_AutoVectorization      0.644 ms        0.644 ms         1088
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.674s
user    0m1.654s
sys     0m0.012s
Executing dl-op-linalg-mathfpow-benchmark at 2025-01-14 (Tue) 02:09:50 CST
2025-01-14T02:09:50+08:00
Running ./dl-op-linalg-mathfpow-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FPOW_SCALAR                  1.67 ms         1.67 ms          418
BM_FPOW_AutoVectorization       1.19 ms         1.19 ms          589
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.797s
user    0m1.772s
sys     0m0.017s
Executing dl-op-linalg-mathrsqrt-benchmark at 2025-01-14 (Tue) 02:09:52 CST
2025-01-14T02:09:52+08:00
Running ./dl-op-linalg-mathrsqrt-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_RSQRT_SCALAR                 0.786 ms        0.786 ms          889
BM_RSQRT_AutoVectorization      0.113 ms        0.113 ms         6185
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.693s
user    0m1.679s
sys     0m0.005s
Executing dl-op-linalg-matmul-benchmark at 2025-01-14 (Tue) 02:09:54 CST
2025-01-14T02:09:54+08:00
Running ./dl-op-linalg-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:5       2147 ms         2147 ms            5
DL_OPS_MATMUL/scalar_O3/iterations:5       1089 ms         1089 ms            5
DL_OPS_MATMUL/tile/iterations:5             175 ms          175 ms            5
---------- Verification ----------
tile PASS

real    0m18.381s
user    0m18.358s
sys     0m0.021s
Executing dl-op-linalg-matmul-benchmark-rvv at 2025-01-14 (Tue) 02:10:12 CST
2025-01-14T02:10:12+08:00
Running ./dl-op-linalg-matmul-benchmark-rvv
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_OPS_MATMUL/scalar/iterations:5       5706 ms         5705 ms            5
DL_OPS_MATMUL/vec/iterations:5          1374 ms         1374 ms            5
DL_OPS_MATMUL/rvv/iterations:5           242 ms          242 ms            5
---------- Verification ----------
vec PASS
rvv PASS

real    0m45.064s
user    0m45.018s
sys     0m0.037s
Executing dl-op-linalg-pooling-nhwc-sum-benchmark at 2025-01-14 (Tue) 02:10:57 CST
2025-01-14T02:10:57+08:00
Running ./dl-op-linalg-pooling-nhwc-sum-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_POOLING_NHWC_SUM_SCALAR                  5.28 ms         5.28 ms          132
BM_POOLING_NHWC_SUM_AutoVectorization      0.397 ms        0.397 ms         1807
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.932s
user    0m1.912s
sys     0m0.013s
Executing dl-op-linalg-reduceaddf-benchmark at 2025-01-14 (Tue) 02:10:59 CST
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-reduceaddf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
bash: line 1: 142360 Segmentation fault      ./"$file"

real    0m0.022s
user    0m0.000s
sys     0m0.014s
Executing dl-op-linalg-reducemaxf-benchmark at 2025-01-14 (Tue) 02:10:59 CST
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-reducemaxf-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
bash: line 1: 142362 Segmentation fault      ./"$file"

real    0m0.023s
user    0m0.007s
sys     0m0.008s
Executing dl-op-linalg-softmax-exp-sum-div-benchmark at 2025-01-14 (Tue) 02:10:59 CST
2025-01-14T02:10:59+08:00
Running ./dl-op-linalg-softmax-exp-sum-div-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 3.00, 3.00, 3.00
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_SOFTMAXEXPSUMDIV_SCALAR                 0.151 ms        0.151 ms         4646
BM_SOFTMAXEXPSUMDIV_AutoVectorization      0.127 ms        0.127 ms         5504
-----------------------------------------------------------
Correctness Verification:
Transform case: PASS
-----------------------------------------------------------

real    0m1.738s
user    0m1.709s
sys     0m0.021s
muse-pi-3%
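
Since this log is long, a small script can help condense it into per-benchmark timings and a list of crashed binaries. The following is only a sketch: it assumes the Google Benchmark console layout shown above, and the names `ROW`, `TO_MS`, and `summarize` are hypothetical helpers, not part of buddy-benchmark.

```python
import re

# Matches one result row of the console output shown above, e.g.
# "DL_LAYER_FFN/Scalar    1.29 ms    1.29 ms    530".
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<time>[\d.]+) (?P<tunit>ns|us|ms|s)\s+"
    r"(?P<cpu>[\d.]+) (?P<cunit>ns|us|ms|s)\s+(?P<iters>\d+)\s*$"
)
TO_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3}


def summarize(log_text):
    """Return (results, crashed).

    results maps (binary, benchmark name) to the wall-clock time in ms;
    crashed lists binaries whose run ended in a segmentation fault.
    """
    results, crashed, current = {}, [], None
    for line in log_text.splitlines():
        if line.startswith("Running ./"):
            current = line[len("Running ./"):].strip()
        elif "Segmentation fault" in line and current:
            crashed.append(current)
        else:
            m = ROW.match(line.strip())
            if m:
                key = (current, m["name"])
                results[key] = float(m["time"]) * TO_MS[m["tunit"]]
    return results, crashed


sample = """\
Running ./dl-layer-ffn-benchmark
DL_LAYER_FFN/Scalar                   1.29 ms         1.29 ms          530
DL_LAYER_FFN/Auto_Vectorization      0.777 ms        0.777 ms          945
Running ./dl-op-linalg-reduceaddf-benchmark
bash: line 1: 142360 Segmentation fault      ./"$file"
"""
results, crashed = summarize(sample)
for (binary, name), ms in results.items():
    print(f"{binary}: {name} = {ms} ms")
print("crashed:", crashed)
```

Results are keyed by binary as well as benchmark name because the `-rvv` binaries reuse names such as `DL_OPS_BATCH_MATMUL/Scalar`.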


xlinsist · 2025-01-19T10:50:59Z

Results of TinyLlama with OpenMP after commit `5b6e665`:

// X86 platform
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             237142 ms       237115 ms            1
DL_MODEL_TINYLLAMA/matmul_opt          15745 ms        15742 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp      13696 ms        13594 ms            1

// RV platform
----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
DL_MODEL_TINYLLAMA/scalar             >1h
DL_MODEL_TINYLLAMA/matmul_opt         342555 ms       342539 ms            1
DL_MODEL_TINYLLAMA/matmul_opt_omp     245805 ms       245794 ms            1
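
To make the TinyLlama numbers easier to compare, the speedups can be worked out directly from the wall-clock column. This is a rough sketch using only the values reported above. The RISC-V scalar run is given only as `>1h`, so the script derives a lower bound for it rather than a ratio. The names `TIMES_MS` and `speedup` are illustrative.

```python
# Wall-clock times in ms, copied from the TinyLlama tables above.
# None marks the RISC-V scalar run, which is reported only as ">1h".
TIMES_MS = {
    "x86": {"scalar": 237142, "matmul_opt": 15745, "matmul_opt_omp": 13696},
    "riscv": {"scalar": None, "matmul_opt": 342555, "matmul_opt_omp": 245805},
}
ONE_HOUR_MS = 3_600_000


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for platform, t in TIMES_MS.items():
    omp_gain = speedup(t["matmul_opt"], t["matmul_opt_omp"])
    if t["scalar"] is None:
        # Only a lower bound is known: the scalar run took more than one hour.
        bound = speedup(ONE_HOUR_MS, t["matmul_opt"])
        print(f"{platform}: matmul_opt vs scalar > {bound:.1f}x, "
              f"OpenMP gain {omp_gain:.2f}x")
    else:
        opt_gain = speedup(t["scalar"], t["matmul_opt"])
        print(f"{platform}: matmul_opt vs scalar {opt_gain:.1f}x, "
              f"OpenMP gain {omp_gain:.2f}x")
```

Each figure comes from a single iteration, so these ratios are indicative rather than statistically robust.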

Results of matmul and batch_matmul with OpenMP

X86

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ./dl-op-linalg-matmul-benchmark 
2025-01-19T09:09:44+00:00
Running ./dl-op-linalg-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 10.80, 10.79, 9.99
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:1      11262 ms        11262 ms            1
DL_OPS_MATMUL/scalar_O3/iterations:1       5155 ms         5155 ms            1
DL_OPS_MATMUL/tile/iterations:1             162 ms          162 ms            1
DL_OPS_MATMUL/vec/iterations:1              209 ms          209 ms            1
DL_OPS_MATMUL/vec_omp/iterations:1          120 ms         88.6 ms            1
---------- Verification ----------
tile PASS
vec PASS
vec_omp PASS

(py310) zhouxulin@plct-gpu:~/intern/buddy-benchmark/build-x86/bin$ ./dl-op-linalg-batch-matmul-benchmark 
2025-01-19T09:10:21+00:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (80 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x40)
  L1 Instruction 32 KiB (x40)
  L2 Unified 1024 KiB (x40)
  L3 Unified 28160 KiB (x2)
Load Average: 14.60, 11.71, 10.33
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:1                  8672 ms         8672 ms            1
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:1       2649 ms         2649 ms            1
DL_OPS_BATCH_MATMUL/Vectorization/iterations:1            374 ms          374 ms            1
DL_OPS_BATCH_MATMUL/Tile/iterations:1                     205 ms          205 ms            1
DL_OPS_BATCH_MATMUL/SCF/iterations:1                      203 ms          203 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST/iterations:1                650 ms          650 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST_OMP/iterations:1           86.9 ms         61.2 ms            1
---------- Verification ----------
Tile PASS
SCF PASS
BROADCAST PASS
BROADCAST_OMP PASS

RISC-V

muse-pi-3% LD_LIBRARY_PATH=/home/user/buddy-benchmark/build-omp-shared-rv ./dl-op-linalg-matmul-benchmark
2025-01-19T16:53:13+08:00
Running ./dl-op-linalg-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.36, 2.68, 2.72
-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
DL_OPS_MATMUL/scalar_O0/iterations:1     239818 ms       239805 ms            1
DL_OPS_MATMUL/scalar_O3/iterations:1     181199 ms       181190 ms            1
DL_OPS_MATMUL/tile/iterations:1            6748 ms         6747 ms            1
DL_OPS_MATMUL/vec/iterations:1             8412 ms         8411 ms            1
DL_OPS_MATMUL/vec_omp/iterations:1         1409 ms         1409 ms            1
---------- Verification ----------
tile PASS
vec PASS
vec_omp PASS

muse-pi-3% LD_LIBRARY_PATH=/home/user/buddy-benchmark/build-omp-shared-rv ./dl-op-linalg-batch-matmul-benchmark
2025-01-19T16:44:24+08:00
Running ./dl-op-linalg-batch-matmul-benchmark
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.48, 2.68, 2.64
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
DL_OPS_BATCH_MATMUL/Scalar/iterations:1                107096 ms       107087 ms            1
DL_OPS_BATCH_MATMUL/AutoVectorization/iterations:1      39651 ms        39650 ms            1
DL_OPS_BATCH_MATMUL/Vectorization/iterations:1           4016 ms         4016 ms            1
DL_OPS_BATCH_MATMUL/Tile/iterations:1                    2672 ms         2671 ms            1
DL_OPS_BATCH_MATMUL/SCF/iterations:1                     3317 ms         3316 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST/iterations:1               5848 ms         5848 ms            1
DL_OPS_BATCH_MATMUL/BROADCAST_OMP/iterations:1           5503 ms         5502 ms            1
---------- Verification ----------
Tile PASS
SCF PASS
BROADCAST PASS
BROADCAST_OMP PASS
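
The operator-level runs above suggest that OpenMP helps very differently on the two platforms. The sketch below compares each OpenMP variant with its non-OpenMP counterpart using the wall-clock `Time` column. It avoids the `CPU` column because Google Benchmark by default counts only the main thread's CPU time there, which under-reports multi-threaded work. The dictionary `PAIRS` is an illustrative name, and pairing `vec` with `vec_omp` and `BROADCAST` with `BROADCAST_OMP` is an assumption based on the benchmark names.

```python
# (platform, comparison): (non-OpenMP time in ms, OpenMP time in ms),
# taken from the wall-clock "Time" column of the runs above.
PAIRS = {
    ("x86", "matmul vec -> vec_omp"): (209, 120),
    ("x86", "batch_matmul BROADCAST -> BROADCAST_OMP"): (650, 86.9),
    ("riscv", "matmul vec -> vec_omp"): (8412, 1409),
    ("riscv", "batch_matmul BROADCAST -> BROADCAST_OMP"): (5848, 5503),
}

# Speedup of the OpenMP variant over its single-threaded counterpart.
gains = {key: base / omp for key, (base, omp) in PAIRS.items()}

for (platform, name), gain in gains.items():
    print(f"{platform:6s} {name:42s} {gain:5.2f}x")
```

On these numbers the RISC-V board gains most on `matmul` and almost nothing on `batch_matmul`, while the X86 machine shows the opposite pattern. Both sets are single-iteration runs and would need repeated measurement to confirm.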

xlinsist mentioned this issue Jan 14, 2025

[DeepLearning] Update dependencies & replace llc with clang #162

Merged
