diff --git a/Changelog.txt b/Changelog.txt
index 03c3cfbd97..7f89a2eab7 100644
--- a/Changelog.txt
+++ b/Changelog.txt
@@ -1,4 +1,127 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.28
+ 8-Aug-2024
+
+general:
+- Reworked the unfinished implementation of HUGETLB from GotoBLAS
+  for allocating huge memory pages as buffers on suitable systems
+- Changed the unfinished implementation of GEMM3M for the generic
+  target on all architectures to at least forward to regular GEMM
+- Improved multithreaded GEMM performance for large non-skinny matrices
+- Improved BLAS3 performance on larger multicore systems through improved
+  parallelism
+- Improved performance of the initial memory allocation by reducing
+  locking overhead
+- Improved performance of GBMV at small problem sizes by introducing
+  a size barrier for the switch to multithreading
+- Added an implementation of the CBLAS_GEMM_BATCH extension
+- Fixed miscompilation of CAXPYC and ZAXPYC on all architectures in
+  CMAKE builds (error introduced in 0.3.27)
+- Fixed corner cases involving the handling of NAN and INFINITY
+  arguments in ?SCAL on all architectures
+- Added support for cross-compiling to WEBM with CMAKE (in addition
+  to the already present makefile support)
+- Fixed NAN handling and potential accuracy issues in compilations with
+  Intel ICX by supplying a suitable fp-model option by default
+- The contents of the github project wiki have been converted into
+  a new set of documentation included with the source code.
+- It is now possible to register a callback function that replaces
+  the built-in support for multithreading with an external backend
+  like TBB (openblas_set_threads_callback_function)
+- Fixed potential duplication of suffixes in shared library naming
+- Improved C compiler detection by the build system to tolerate more
+  naming variants for gcc builds
+- Fixed an unnecessary dependency of the utest on CBLAS
+- Fixed spurious error reports from the BLAS extensions utest
+- Fixed unwanted invocation of the GEMM3M tests in cross-compilation
+- Fixed a flaw in the makefile build that could lead to the pkgconfig
+  file containing an entry of UNKNOWN for the target cpu after installing
+- Integrated fixes from the Reference-LAPACK project:
+  - Fixed uninitialized variables in the LAPACK tests for ?QP3RK (PR 961)
+  - Fixed potential bounds error in ?UNHR_COL/?ORHR_COL (PR 1018)
+  - Fixed potential infinite loop in the LAPACK testsuite (PR 1024)
+  - Make the variable type used for hidden length arguments configurable (PR 1025)
+  - Fixed SYTRD workspace computation and various typos (PR 1030)
+  - Prevent compiler use of FMA that could increase numerical error in ?GEEVX (PR 1033)
+
+x86-64:
+- reverted thread management under Windows to its state before 0.3.26
+  due to signs of race conditions in some circumstances now under study
+- fixed accidental selection of the unoptimized generic SBGEMM kernel
+  in CMAKE builds for CooperLake and SapphireRapids targets
+- fixed a potential thread buffer overrun in SBSTOBF16 on small systems
+- fixed an accuracy issue in ZSCAL introduced in 0.3.26
+- fixed compilation with CMAKE and recent releases of LLVM
+- added support for Intel Emerald Rapids and Meteor Lake cpus
+- added autodetection support for the Zhaoxin KX-7000 cpu
+- fixed autodetection of Intel Prescott (probably broken since 0.3.19)
+- fixed compilation for older targets with the Yocto SDK
+- fixed compilation of the converter-generated C versions
+  of the LAPACK sources with gcc-14
+- improved compiler options when building with CMAKE and LLVM for
+  AVX512-capable targets
+- added support for supplying the L2 cache size via an environment
+  variable (OPENBLAS_L2_SIZE) in case it is not correctly reported
+  (as in some VM configurations)
+- improved the error message shown when thread creation fails on startup
+- fixed setting the rpath entry of the dylib in CMAKE builds on MacOS
+
+arm:
+- fixed building for baremetal targets with make
+
+arm64:
+- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
+  matrix to the corresponding GEMV kernel
+- added optimized SGEMV and DGEMV kernels for A64FX
+- added optimized SVE kernels for small-matrix GEMM
+- added A64FX to the cpu list for DYNAMIC_ARCH
+- fixed building with support for cpu affinity
+- worked around accuracy problems with C/ZNRM2 on NeoverseN1 and
+  Apple M targets
+- improved GEMM performance on Neoverse V1
+- fixed compilation for NEOVERSEN2 with older compilers
+- fixed potential miscompilation of the SVE SDOT and DDOT kernels
+- fixed potential miscompilation of the non-SVE CDOT and ZDOT kernels
+- fixed a potential overflow when using very large user-defined BUFFERSIZE
+- fixed setting the rpath entry of the dylib in CMAKE builds on MacOS
+
+power:
+- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
+  matrix to the corresponding GEMV kernel
+- significantly improved performance of SBGEMM on POWER10
+- fixed compilation with OpenMP and the XLF compiler
+- fixed building of the BLAS extension utests under AIX
+- fixed building of parts of the LAPACK testsuite with XLF
+- fixed CSWAP/ZSWAP on big-endian POWER10 targets
+- fixed a performance regression in SAXPY on POWER10 with OpenXL
+- fixed accuracy issues in CSCAL/ZSCAL when compiled with LLVM
+- fixed building for POWER9 under FreeBSD
+- fixed a potential overflow when using very large user-defined BUFFERSIZE
+- fixed an accuracy issue in the POWER6 kernels for GEMM and GEMV
+
+riscv64:
+- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
+  matrix to the corresponding GEMV kernel
+- fixed building for RISCV64_GENERIC with OpenMP enabled
+- added DYNAMIC_ARCH support (comprising GENERIC_RISCV64 and the two
+  RVV 1.0 targets with vector length of 128 and 256)
+- worked around the ZVL128B kernels for AXPBY mishandling the special
+  case of zero Y increment
+
+loongarch64:
+- improved GEMM performance on servers of the 3C5000 generation
+- improved performance and stability of DGEMM
+- improved GEMV and TRSM kernels for LSX and LASX vector ABIs
+- fixed CMAKE compilation with the INTERFACE64 option set
+- fixed compilation with CMAKE
+- worked around spurious errors flagged by the BLAS3 tests
+- worked around a miscompilation of the POTRS utest by gcc 14.1
+
+mips64:
+- fixed ASUM and SUM kernels to accept negative step sizes in X
+- fixed complex GEMV kernels for MSA
+
 ====================================================================
 Version 0.3.27
  4-Apr-2024