Releases: ROCm/rocBLAS
Releases · ROCm/rocBLAS
rocBLAS 2.43.0 for ROCm 5.1.1
rocBLAS code for ROCm 5.1.1 did not change. The library was rebuilt for the updated ROCm 5.1.1 stack.
rocBLAS 2.43.0 for ROCm 5.1.0
Added
- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs)
- Option to install script to build clients without using any Fortran (--clients_no_fortran)
- rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time.
- Added tests for output of reduction functions when given bad input
- Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench
Optimizations
- Improved performance of trsm with side == left and n == 1
- Improved perforamnce of trsm with side == left and m <= 32 along with side == right and n <= 32
Changed
- For syrkx and trmm internal API use rocblas_stride datatype for offset
- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
- Test client dependencies updated to GTest 1.11
- non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives.
- Help menu messages in install.sh
- For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
- Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench
Fixed
- For function trmv (non-transposed cases) avoid overflow in offset calculation
- Fixed cppcheck errors/warnings
- Fixed doxygen warnings
rocBLAS 2.42.0 for ROCm 5.0.2
rocBLAS code for ROCm 5.0.2 is unchanged from rocBLAS for ROCm 5.0.1. The library was rebuilt for the updated ROCm 5.0.2 stack.
rocBLAS 2.42.0 for ROCm 5.0.1
rocBLAS code for ROCm 5.0.1 is unchanged from rocBLAS for ROCm 5.0.0. The library was rebuilt for the updated ROCm 5.0.1 stack.
rocBLAS 2.42.0 for ROCm 5.0.0
Added
- Added rocblas_get_version_string_size convenience function
- Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm
- Added hpl and trig initialization for gemm_ex to rocblas-bench
- Added source code gemm. It can be used as an alternative to Tensile for debugging and development
- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
Optimizations
- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.
Changed
- Instantiate templated rocBLAS functions to reduce size of librocblas.so
- Removed static library dependency on msgpack
- Removed boost dependencies for clients
Fixed
- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input
- Fix for dgmm with side == rocblas_side_left and a negative incx
- Fixed out-of-bounds read for small trsm
- Fixed numerical checking for tbmv_strided_batched
rocBLAS 2.41.0 for ROCm 4.5.2
rocBLAS code for ROCm 4.5.2 is unchanged from rocBLAS for ROCm 4.5.0. The library was rebuilt for the updated ROCm 4.5.2 stack.
rocBLAS 2.41.0 for ROCm 4.5.0
Optimizations
- Improved performance of non-batched and batched syr for all sizes and data types
- Improved performance of non-batched and batched hemv for all sizes and data types
- Improved performance of non-batched and batched symv for all sizes and data types
- Improved memory utilization in rocblas-bench, rocblas-test gemm functions, increasing possible runtime sizes.
- Improved performance of non-batched and batched dot, dotc, and dot_ex for small n. e.g. sdot n <= 31000.
- Improved performance of non-batched and batched trmv for all sizes and matrix types.
- Improved performance of non-batched and batched gemv transpose case for all sizes and datatypes.
- Improved performance of sger and dger for all sizes, in particular the larger dger sizes.
- Improved performance of syrkx for for large size including those in rocBLAS Issue #1184.
Changed
- Update from C++14 to C++17.
- Packaging split into a runtime package (called rocblas) and a development package (called rocblas-dev for
.deb
packages, and rocblas-devel for.rpm
packages). The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future rocm release.
Fixed
- For function geam avoid overflow in offset calculation.
- For function syr avoid overflow in offset calculation.
- For function gemv (Transpose-case) avoid overflow in offset calculation.
- For functions ssyrk and dsyrk, allow conjugate-transpose case to match legacy BLAS. Behavior is the same as the transpose case.
rocBLAS 2.39.0 for ROCm 4.3.1
Fixed
- CI testing/benchmark issues.
rocBLAS 2.39.0 for ROCm 4.3.0
Optimizations
- Improved performance of non-batched and batched rocblas_Xgemv for gfx908 when m <= 15000 and n <= 15000
- Improved performance of non-batched and batched rocblas_sgemv and rocblas_dgemv for gfx906 when m <= 6000 and n <= 6000
- Improved the overall performance of non-batched and batched rocblas_cgemv for gfx906
Changed
- Internal use only APIs prefixed with rocblas_internal_ and deprecated to discourage use
rocBLAS-2.38.0 for ROCm 4.2.0
Added
- Added option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Supported gemm ext for unpacked int8 input layout on gfx908 GPUs
- Added new flags rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4 to specify if using the packed layout
- Set the rocblas_gemm_flags_pack_int8x4 when using packed int8x4, this should be always set on GPUs before gfx908.
- For gfx908 GPUs, unpacked int8 is supported so no need to set this flag.
- Notice the default flags 0 uses unpacked int8, this somehow changes the behaviour of int8 gemm from ROCm 4.1.0
- Added new flags rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4 to specify if using the packed layout
- Added a query function rocblas_query_int8_layout_flag to get the preferable layout of int8 for gemm by device
Optimizations
- Improved performance of single precision copy, swap, and scal when incx == 1 and incy == 1.
- Improved performance of single precision axpy when incx == 1, incy == 1 and batch_count =< 8192.
- Improved performance of trmm.
Changed
- Change cmake_minimum_required to VERSION 3.16.8