If you don't mind the portability issue, you may try to opt it by working on the cblas_dgemm()
using:
- MKL on a Intel machine, or even
- GPU computing lib, for example, cuBLAS
cublasGemmEx
may draw your interests and you would have to convert the types and prepare (set up) the matrix beforehand