Identifying Performance Bottlenecks in BuildROM Function #11031

SADPR · 2023-04-18T14:30:10Z

Identifying the Performance Bottleneck in `BuildROM`

After profiling the BuildROM function, I found that the majority of the time was being spent on the second matrix product in the following three lines:

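```cpp
// Aux^e = J^e * Phi^e
noalias(rPreAlloc.aux) = prod(rPreAlloc.lhs, rPreAlloc.phiE);
// A^e = Phi^eT * Aux^e, scaled by h_rom_weight (the expensive product)
noalias(rPreAlloc.romA) = prod(trans(rPreAlloc.phiE), rPreAlloc.aux) * h_rom_weight;
// b^e = Phi^eT * R^e, scaled by h_rom_weight
noalias(rPreAlloc.romB) = prod(trans(rPreAlloc.phiE), rPreAlloc.rhs) * h_rom_weight;
```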

As we use more modes, the BuildROM function takes longer to execute due to this bottleneck. It is worth noting that the GetPhiElemental function, which I initially suspected to be the bottleneck, was not the issue.

These are the formulas:
First:
$\mathbf{Aux}^e = \mathbf{J}^e\mathbf{\Phi}^e$

Second:
$\mathbf{A}^e = \mathbf{\Phi}^{eT}\mathbf{Aux}^e \longleftarrow \text{Expensive}$

Third:
$\mathbf{b}^e = \mathbf{\Phi}^{eT}\mathbf{R}^e$

For example, a test was conducted with ndofs=9 and nromdofs=384, and the average times for each element were:

First product: 1.6759e-05 sec
Second product: 5.55855e-04 sec
Third product: 7.06e-06 sec

The total BuildROM time was 4.79021 sec for all 6144 elements. It was found that the second product takes up most of the time due to its matrix dimensions. For ndofs=9 and nromdofs=384, the dimensions of the matrices involved are:

$\mathbf{J}^e$: $9\times9$
$\mathbf{\Phi}^e$: $9\times384$
$\mathbf{Aux}^e$: $9\times384$
$\mathbf{\Phi}^{eT}$: $384\times9$
$\mathbf{A}^e$: $384\times384$
$\mathbf{R}^e$: $9\times1$
$\mathbf{b}^e$: $384\times1$
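
As a rough sanity check on why the second product dominates, the sketch below counts scalar multiply-adds for each product from the dimensions listed above. It assumes a naive dense product with no blocking and no use of symmetry. `MultiplyAdds`, `ElementalProductCosts`, and `CountElementalCosts` are hypothetical names used for illustration only; they are not Kratos functions.

```cpp
#include <cstddef>

// Number of scalar multiply-adds in a naive dense (m x k) * (k x n) product.
constexpr std::size_t MultiplyAdds(std::size_t m, std::size_t k, std::size_t n)
{
    return m * k * n;
}

// Hypothetical container for the operation counts of the three elemental products.
struct ElementalProductCosts
{
    std::size_t first;   // Aux^e = J^e * Phi^e
    std::size_t second;  // A^e   = Phi^eT * Aux^e
    std::size_t third;   // b^e   = Phi^eT * R^e
};

constexpr ElementalProductCosts CountElementalCosts(std::size_t n_dofs, std::size_t n_rom_dofs)
{
    return {
        MultiplyAdds(n_dofs, n_dofs, n_rom_dofs),      // (ndofs x ndofs)(ndofs x nromdofs)
        MultiplyAdds(n_rom_dofs, n_dofs, n_rom_dofs),  // (nromdofs x ndofs)(ndofs x nromdofs)
        MultiplyAdds(n_rom_dofs, n_dofs, 1)            // (nromdofs x ndofs)(ndofs x 1)
    };
}

// Compile-time check for the case reported above (ndofs=9, nromdofs=384).
static_assert(CountElementalCosts(9, 384).second == 1327104, "second product count");
```

For ndofs=9 and nromdofs=384 this gives 31104, 1327104, and 3456 multiply-adds. The second product carries about 43 times the arithmetic of the first, which is the same order as the measured time ratio of roughly 33, and its count grows quadratically with nromdofs.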

Conclusion

In conclusion, profiling the BuildROM function shows that most of the time is spent on the second matrix product. Timing each of the three products individually confirmed this: the second product takes significantly longer than the other two.
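
As a rough cross-check using only the figures reported above (the percentages are derived here, not measured separately), multiplying the per-element averages by the 6144 elements gives:

$6144 \times 1.6759\times10^{-5}\,\text{s} \approx 0.103\,\text{s}$ (first product, about 2% of the 4.79021 s total)
$6144 \times 5.55855\times10^{-4}\,\text{s} \approx 3.415\,\text{s}$ (second product, about 71%)
$6144 \times 7.06\times10^{-6}\,\text{s} \approx 0.043\,\text{s}$ (third product, about 1%)

The three products therefore add up to roughly 3.56 s, and the remaining 1.23 s or so of the BuildROM time is spent outside them.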

We took the time measurements with KRATOS_INFO_IF(), and I made sure that this did not add significant time to the overall execution. Furthermore, the profiling was done with a single thread, so parallelism does not influence these timings.
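
A minimal sketch of how per-element timings like these could be accumulated with `std::chrono` and reported once after the element loop, so that logging stays outside the timed region. `SectionTimer` is a hypothetical helper, not a Kratos utility, and this is not necessarily how the measurements in this post were taken.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of Kratos): accumulates wall-clock time over repeated calls.
class SectionTimer
{
public:
    // Runs rOperation once and adds its wall-clock duration to the running total.
    template <class TOperation>
    void Measure(TOperation&& rOperation)
    {
        const auto start = std::chrono::steady_clock::now();
        rOperation();
        const auto stop = std::chrono::steady_clock::now();
        mTotalSeconds += std::chrono::duration<double>(stop - start).count();
        ++mCalls;
    }

    std::size_t Calls() const { return mCalls; }

    double TotalSeconds() const { return mTotalSeconds; }

    // Average time per call; requires at least one measured call.
    double AverageSeconds() const
    {
        assert(mCalls > 0);
        return mTotalSeconds / static_cast<double>(mCalls);
    }

private:
    double mTotalSeconds = 0.0;
    std::size_t mCalls = 0;
};
```

One timer per product would be used inside the element loop, for example `second_product_timer.Measure([&]() { /* second product */ });`, with `AverageSeconds()` printed once after the loop.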

It seems that there is no obvious way to improve the performance of the second matrix product as the product seems to be already optimized, and as we increase the number of modes, the BuildROM time will increase significantly. This may become a bottleneck in larger simulations, and we should be aware of its impact on performance.

Further Steps

To continue improving the performance of the BuildROM function, we plan to take the following steps:

1. Use Intel VTune Amplifier to obtain more detailed information about the performance of the code, including identifying any hotspots and areas for improvement. This will also allow us to measure the performance without the overhead of KRATOS_INFO_IF() statements.
2. Try using the Intel Math Kernel Library (MKL) to perform the matrix products in the BuildROM function. MKL is highly optimized for performance and can potentially provide a significant speedup.
3. Compile the code using the Clang compiler, which may provide additional optimization opportunities compared to the current GCC compiler.

By taking these steps, we hope to further optimize the performance of the BuildROM function and reduce its impact on larger simulations.

loumalouomega · 2023-04-18T22:14:50Z

Have you thought about using sparse matrices instead of dense ones? Those sizes are already big.

4 replies

philbucher Apr 19, 2023
Collaborator

Fully agree, this can make differences of several orders of magnitude

But first, are the matrices dense or sparse?

SADPR Apr 24, 2023
Collaborator Author

The ROM matrices are currently dense because they are assembled element by element. However, we are testing a global approach where the Jacobian will become sparse, allowing us to perform operations between sparse and dense matrices, as well as between two dense matrices.

We tested the time difference between performing these operations globally and elementally using the LSPG approach, which was initially built globally. This test only involved a Dense by Dense operation, but it showed significant improvements. We also used the Eigen library to optimize this product. However, we must be careful because the HROM used so far works with ROMs assembled elementally using the Empirical Cubature Method. To tackle this bottleneck, we would need to either adapt the procedure to fit the global approach or change the algorithm, for instance, to Gauss-Newton with Approximated Tensors (GNAT).

Please note that the results we've shown are not fully optimized because they only involve a Dense by Dense operation. They could be further improved by computing the first product as a Sparse by Dense product.

loumalouomega Apr 24, 2023
Collaborator

I don't know if it may help, because I am not a ROM expert. But I have been thinking recently about the creation of "superelements", in our particular case because we are interested in performing a static condensation (see and this).

This may be interesting. In my design (in my mind), the "superelements" internally assemble a sparse matrix from the dense matrix contributions of the elements they aggregate. In addition, they would report the condensed DoFs, etc.

Tell me if this idea is interesting for ROM, and we may discuss together with the @KratosMultiphysics/technical-committee a possible design that may fit both "superelements" and ROM elements.

SADPR Apr 24, 2023
Collaborator Author

Thank you for sharing your idea with us. It definitely looks interesting and potentially helpful. I appreciate your suggestion and will take some time to think about how this can fit into our ROM approach. I will also take a look at the references you provided to better understand the concept of "superelements". We will keep you updated on our thoughts and potential next steps. Thank you again for your input!

SADPR · 2023-04-24T09:45:48Z

I wanted to share some images that show the consistency of the LSPG approach for different strategies. As you can see in the attached images, the LSPG approach exhibits good consistency across different strategies.

I hope these images help to illustrate the robustness of the LSPG approach with the new strategies.

SADPR · 2023-04-24T10:04:14Z

This image compares different strategies for solving our problem by plotting the Total time required for Building and Solving against the number of modes considered. The global approach with normal equations significantly reduces the Total time, even when performing a Dense_3 by Dense_2 matrix product. However, there is still room for further improvement by implementing a Sparse_1 by Dense_1 matrix product. Currently, the Dense_2 matrix is built directly in dense form instead of being generated from a Sparse_1 by Dense_1 product (that is, Dense_2 = Sparse_1@Dense_1). I know this can be confusing; I will try to formulate the problem properly if I find the time.
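
To make the Dense_2 = Sparse_1@Dense_1 step concrete, here is a minimal sketch of a sparse-by-dense product, assuming the sparse matrix is stored in compressed sparse row (CSR) format. `DenseMatrix`, `CsrMatrix`, and `SparseTimesDense` are illustrative names chosen here; this is not the Kratos or Eigen implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix stored in a flat vector (illustrative type, not a Kratos class).
struct DenseMatrix
{
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<double> values;  // size rows * cols

    DenseMatrix(std::size_t n_rows, std::size_t n_cols)
        : rows(n_rows), cols(n_cols), values(n_rows * n_cols, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) { return values[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return values[i * cols + j]; }
};

// Sparse matrix in CSR format (illustrative type).
struct CsrMatrix
{
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_start;  // size rows + 1
    std::vector<std::size_t> col_index;  // size nnz
    std::vector<double> values;          // size nnz
};

// Computes Dense_2 = Sparse_1 * Dense_1, visiting only the stored nonzeros of Sparse_1.
DenseMatrix SparseTimesDense(const CsrMatrix& rSparse, const DenseMatrix& rDense)
{
    assert(rSparse.cols == rDense.rows);
    assert(rSparse.row_start.size() == rSparse.rows + 1);

    DenseMatrix result(rSparse.rows, rDense.cols);
    for (std::size_t i = 0; i < rSparse.rows; ++i) {
        for (std::size_t k = rSparse.row_start[i]; k < rSparse.row_start[i + 1]; ++k) {
            const std::size_t col = rSparse.col_index[k];
            const double value = rSparse.values[k];
            // Add value * (row `col` of Dense_1) to row i of the result.
            for (std::size_t j = 0; j < rDense.cols; ++j) {
                result(i, j) += value * rDense(col, j);
            }
        }
    }
    return result;
}
```

The number of multiply-adds is the number of stored nonzeros times the number of columns of Dense_1, instead of rows times columns of the Jacobian times the number of columns of Dense_1 for a fully dense product.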

The figure illustrates a significant improvement in the Total time required for Building and Solving the ROM model as the number of modes increases. The cost has been reduced from a higher order (possibly quadratic) to linear. The use of Eigen libraries for optimization has also contributed to a reduction in time for the matrix product operations.

SADPR · 2023-04-25T09:24:55Z

In addition to the earlier comparisons, I ran further tests by integrating the Eigen library into both the Elemental Galerkin of the master branch and an older B&S implementation. The results are presented in the following figure:

It is evident that incorporating the Eigen library results in a significant improvement in the total computation time. Interestingly, the old ROM B&S, which was less complex and used pragma directives instead of `block_for_each`, outperformed the newer one.

These observations suggest that integrating the Eigen library can lead to substantial computational improvements.

1 reply

loumalouomega Apr 25, 2023
Collaborator

We discussed replacing Ublas a long time ago, but it is not trivial, as it is deeply integrated in Kratos. We discussed using Eigen (as already used in LinearSolverApp) or Blaze (https://bitbucket.org/blaze-lib/blaze/src). In the past we also tried to use AMatrix, created by @pooyan-dadvand, but we did not finish the implementation.

loumalouomega · 2023-04-25T09:40:10Z

FYI @pooyan-dadvand

SADPR · 2023-04-26T10:41:22Z

1 reply

loumalouomega Apr 26, 2023
Collaborator

And you are comparing with Eigen, which is obviously faster than Ublas, but it is not the fastest of all linear algebra libraries (see)

SADPR · 2023-04-26T10:45:31Z

Comparison of Galerkin Ublas (Elemental) vs Galerkin Eigen (Global)

In the previous approach, the Galerkin reduced order model (ROM) system of equations was built element by element using the following expression:

$\sum_e \mathbf{\Phi}^{eT} \mathbf{J}^e \mathbf{\Phi}^{e}$

However, when constructing the ROM system for a basis $\mathbf{\Phi}$ with a large number of modes, it was found that it is more cost-effective to build the system globally. This involves:

1. Building the global Jacobian $\mathbf{J}$ (using, for instance, the Residual Based Block Builder And Solver).
2. Building the global basis $\mathbf{\Phi}$.
3. Performing a sparse-by-dense product to obtain $\mathbf{Aux}=\mathbf{J} \mathbf{\Phi}$.
4. Performing a dense-by-dense product to obtain $\mathbf{\Phi}^T\mathbf{Aux}$.

This approach uses the Eigen library instead of Ublas for better performance.
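
Both routes should give the same reduced matrix when the global Jacobian is the plain assembly of the elemental ones. The sketch below checks this on small `std::vector` matrices. It is only an illustration under stated assumptions: the global Jacobian is stored densely for brevity (the approach above uses a sparse Jacobian and Eigen), no HROM weights or boundary conditions are applied, and all names (`Project`, `BuildElemental`, `BuildGlobal`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // dense, row-major (illustration only)
using DofList = std::vector<std::size_t>;         // global DoF indices of one element

Matrix Zeros(std::size_t rows, std::size_t cols)
{
    return Matrix(rows, std::vector<double>(cols, 0.0));
}

// Returns Phi^T * (J * Phi), computed as Aux = J * Phi followed by Phi^T * Aux.
Matrix Project(const Matrix& rJ, const Matrix& rPhi)
{
    assert(!rPhi.empty() && rJ.size() == rPhi.size());
    const std::size_t n = rPhi.size();     // number of DoFs
    const std::size_t m = rPhi[0].size();  // number of modes
    Matrix aux = Zeros(n, m);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < m; ++j)
                aux[i][j] += rJ[i][k] * rPhi[k][j];
    Matrix rom = Zeros(m, m);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < m; ++j)
                rom[i][j] += rPhi[k][i] * aux[k][j];
    return rom;
}

// Elemental route: sum over elements of Phi^eT J^e Phi^e,
// where Phi^e holds the rows of Phi at the element's DoFs.
Matrix BuildElemental(const std::vector<Matrix>& rElementJ,
                      const std::vector<DofList>& rElementDofs,
                      const Matrix& rPhi)
{
    assert(!rPhi.empty() && rElementJ.size() == rElementDofs.size());
    const std::size_t m = rPhi[0].size();
    Matrix rom = Zeros(m, m);
    for (std::size_t e = 0; e < rElementJ.size(); ++e) {
        Matrix phi_e;
        for (const std::size_t dof : rElementDofs[e]) phi_e.push_back(rPhi[dof]);
        const Matrix rom_e = Project(rElementJ[e], phi_e);
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < m; ++j)
                rom[i][j] += rom_e[i][j];
    }
    return rom;
}

// Global route: assemble J from the elemental Jacobians, then project once.
Matrix BuildGlobal(const std::vector<Matrix>& rElementJ,
                   const std::vector<DofList>& rElementDofs,
                   const Matrix& rPhi)
{
    assert(rElementJ.size() == rElementDofs.size());
    const std::size_t n = rPhi.size();
    Matrix global_j = Zeros(n, n);
    for (std::size_t e = 0; e < rElementJ.size(); ++e) {
        const DofList& dofs = rElementDofs[e];
        for (std::size_t a = 0; a < dofs.size(); ++a)
            for (std::size_t b = 0; b < dofs.size(); ++b)
                global_j[dofs[a]][dofs[b]] += rElementJ[e][a][b];
    }
    return Project(global_j, rPhi);
}
```

The two functions differ only in where the projection happens: once per element on small dense blocks, or once on the assembled system, which is where a sparse Jacobian and an optimized dense product can be exploited.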

The comparison of the two approaches is shown in the figure below:

From the results, it can be seen that the Galerkin Eigen (Global) approach is significantly faster for larger basis sizes. Therefore, it is recommended to use this approach when building the ROM system for a basis with a large number of modes.
