# DotOpt - Optimized matrix multiplication
- Update the container.
./scripts/update-all.sh
- Install the required dependencies.
./scripts/install-deps.sh
- Fetch third party libraries
./scripts/fetch-libs.sh
WARNING: Benchmarking requires
SYS_CAP_ADMIN
.
Main library
The main library is implemented as a python module which computes matrix multiplication operations using NumPy types.
Testing
Tests are implemented in Python, and added to test/CMakeLists.txt
so that they
are executed when testing.
Benchmarking
Benchmarks are implemented in C/C++ using Google's benchmarking library.
Warning: CPU frequency scaling should be disabled during the benchmarks.
# To query the current CPU frequency scaler:
sudo cpupower frequency-info
# To force the CPU to run at full power:
sudo cpupower frequency-set --governor performance
# To restore the previous CPU fequency scaler profile:
sudo cpupower frequency-set --governor <previous governor>
- Sequential algorithm.
OpenMP parallelism
-
Parallel algorithm using OpenMP's loop parallelism.
-
Parallel algorithm using OpenMP's task parallelism.
-
Parallel algorithm with row/col tiling using OpenMP's loop parallelism.
-
Parallel algorithm with row/col tiling using OpenMP's task parallelism.
-
Parallel algorithm with 2d tiling using OpenMP's loop parallelism.
-
Parallel algorithm with 2d tiling using OpenMP's task parallelism.
Our IMTS implementation
-
Parallel algorithm using 1 level IMTS with random tile ordering.
-
Parallel algorithm using 1 level IMTS with tile ordering.
-
Parallel algorithm using 2 level ITMS with random tile ordering.
-
Parallel algorithm using 2 level ITMS with tile ordering.
-
Parallel algorithm using 3 level ITMS with random tile ordering.
-
Parallel algorithm using 3 level ITMS with tile ordering.
Extras
- The best version to date with a hand-crafted matrix multiplication assembly code.
- The hand-crafted version with 4-level ITMS tiling, where the smallest tiles fit into the vector registers. In other words, a hand crafted tiling level for our hand-crafted matrix multiplication implementation.
GPU using Vulkan compute shaders
- (Not implemented) A GPU version where each thread computes an output.
- (Not implemented) A tiling GPU version where each thread computes a 4x4 tile (GLSL).
- (Not implemented) A tiling GPU version where each SM computes a tile using the shared memory as a scratchpad for inputs and/or outputs.
GPU using OpenCL (if we can make it work)
- (Not implemented) Same as #16.
- (Not implemented) Same as #17.
- (Not implemented) Same as #18.
External implementations (for reference when comparing)
- (Testing only) NumPy implementation.
- (Not used) TensorFlow implementation.
- (Not used) OpenBLAS (maybe?).
- (Not used) clBLAS (maybe?).
- (Not used) ArrayFire (OpenCL; maybe?).
The IMTS scheduler divides the workload into 2d tiles taking into account the sizes of the LLC. Those are then passed on to another tiler which further divides the workload into smaller tiles which fit into the next-to-last level cache. The same process is repeated until the first level cache is reached.
The order in which the tiles are computed is predefined to minify the number of transfers between the caches. For example, a cache will need both input tiles and the output. When switching tiles, only one of those will be loaded, and preferably one of the inputs, since the outputs will have to be transferred back to the next level.