[MPPI]: Replace xtensor with Eigen Experiment #4237

guoxxiong · 2024-04-03T02:19:04Z

No description provided.

SteveMacenski · 2024-04-03T02:21:05Z

Can you back up that claim or provide any other detail?

guoxxiong · 2024-04-03T02:42:51Z

Because my hardware does not support xsimd acceleration, and xtensor does not support gcc-7, while other programs rely on gcc-7, so I want to know whether the performance will be weakened by using Eigen instead of xtensor for the tensor of 2000x50.

SteveMacenski · 2024-04-03T02:48:38Z

That's not sufficient to prevent us from looking into it as a potential course of action.

guoxxiong · 2024-04-03T03:21:10Z

For a 2000*60 array, run following code, it outputs:
dot_product_xt: 240000
xtensor dot product took 0.00535805 seconds.
dot_product_eigen: 240000
Eigen dot product took 0.000759429 seconds.

Code:
#include
#include
#include <xtensor/xarray.hpp>
#include "/usr/include/eigen3/Eigen/Dense"

constexpr int rows = 2000;
constexpr int cols = 60;

int main()
{

xt::xarray<double> A_xt({rows, cols}, 1.0);
xt::xarray<double> B_xt({rows, cols}, 2.0);
Eigen::MatrixXd A_eigen(rows, cols);
Eigen::MatrixXd B_eigen(rows, cols);
A_eigen.setOnes();
B_eigen= B_eigen.setOnes() * 2.0;


double dot_product_xt = 0.0;
double dot_product_eigen = 0.0;


auto start_xtensor = std::chrono::steady_clock::now();
dot_product_xt = xt::sum(A_xt * B_xt)();
std::cout << "dot_product_xt: " << dot_product_xt << std::endl;
auto end_xtensor = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_xtensor = end_xtensor - start_xtensor;
std::cout << "xtensor dot product took " << elapsed_xtensor.count() << " seconds." << std::endl;


auto start_eigen = std::chrono::steady_clock::now();
dot_product_eigen = (A_eigen.array() * B_eigen.array()).sum();
std::cout << "dot_product_eigen: " << dot_product_eigen << std::endl;
auto end_eigen = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_eigen = end_eigen - start_eigen;
std::cout << "Eigen dot product took " << elapsed_eigen.count() << " seconds." << std::endl;

return 0;

}

SteveMacenski · 2024-04-03T03:28:09Z

That would seem to evidence that Eigen is faster than xtensor, not slower by a factor of over 7x! That would point to the fact that we should consider using Eigen instead of xtensor - which is why its mentioned in the MPPI ticket as something that we should seriously look to evaluate instead of using xtensor for an acceleration boost.

I don't understand your ticket's claim then that xtensor is slower and shouldn't be analyzed?

guoxxiong · 2024-04-03T03:49:38Z

The xtensor should be analyzed. I just want to try to replace the xtensor in MPPI with eigen, and I do not know whether the original control frequency can be maintained.

SteveMacenski · 2024-04-03T03:54:16Z

That is not what your initial ticket title implies, but OK - maybe just a misunderstanding.

Yes, I definitely support some help in trying to migrate MPPI to Eigen to see if its an improvement.

and I do not know whether the original control frequency can be maintained.

No one knows to be able to tell you for certain. However, metrics I see make me think that its a worthwhile direction to explore and that it might actually make it run faster. Are you interested in spending some time working on this yourself?

guoxxiong · 2024-04-03T04:05:13Z

Thanks for your advice and help, I will try to migrate MPPI to Eigen.

SteveMacenski · 2024-04-03T04:26:46Z

OK. I renamed the ticket to be in line with our discussion.

I'd recommend outright ignoring the critics to get started and focus on the core Optimizer/Noise Generator/utils. I think that should make it clear from benchmarking whether its faster or not than xtensor without going through the more laborious task of migrating the critics (which some are easy, some are not). Also, don't fret if immediately the performance isn't as good; there may be steps we can take to improve things. There were several cycles of optimizations and testing on xtensor to get the performance we have now and I would expect the same from Eigen. However, the benchmark on Eigen views makes me think we can save some compute time.

Do you have a general sense of the priority of this task or when you expect to make some progress? I'd be happy to answer any questions or see how I can help if you work on this on a Nav2 fork that is public.

guoxxiong · 2024-04-11T03:36:47Z

Hi~, I have initially migrated MPPI to Eigen in Nav1 and conducted a rough comparison experiment. Specifically, Eigen::Array is used instead of xt::xtensor.
Params: model_dt = 0.1, time_step = 32, batch_size = 1800, all Critics
Processor: I5 12400F
Control loop execution time in Eigen Array (ROS1): average: 5 [ms], max: 7 [ms], min: 3 [ms] .
Control loop execution time in xt::xtensor (ROS1): average: 5 [ms], max: 8 [ms], min: 3 [ms].
The conclusion is that using Eigen Array instead of xtensor will not significantly improve the calculation speed.

SteveMacenski · 2024-04-11T04:09:05Z

Can you share the source? I’d like to spend a day or two playing around to make sure I find the same behavior and see if I can tweak it at all to improve things.

Did you look at using Eigen’s tensor library instead of Array? The tensor views are supposedly much faster from open source benchmarking I’ve seen

guoxxiong · 2024-04-11T08:24:00Z

source: https://github.com/guoxxiong/mppi_local_planner
Because the tensor used by MPPI is at most two-dimensional, I used Array. Eigen : : Tensor may be faster for high-dimensional tensor calculation, but it is more complicated to operate.

SteveMacenski · 2024-04-11T17:55:09Z

The tensor views have the ability to process information without direct copies, which array I believe lacks. There may be some efficiency gains using Eigen tensor views that arrays don't have. For example, I'm seeing some for loops in your code that could be handled with Tensors that would likely be more efficient by using simd vectorization or even later GPU support

I think it is worth trying to use Eigen Tensor & views over creating everything as a looping function on arrays. That could give you that edge in run-time speed

guoxxiong · 2024-04-12T02:37:04Z

Yes, when I increased the array to 2000 * 56, Eigen::Array was much slower than xtensor, xtensor(15 ms) vs Eigen::Array(30 ms).

guoxxiong · 2024-04-12T04:15:36Z

I wonder if Eigen::Ref works, Eigen::Array allows the use of references (Eigen::Ref) to modify elements in an array without data copying, which is similar to xt::view in xtensor.

SteveMacenski · 2024-04-12T17:23:26Z

I think its worth testing with Eigen Tensors themselves instead of trying to work around them with Arrays https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html

SteveMacenski · 2024-04-23T18:30:41Z

@guoxxiong any thoughts / have some cycles to look into it?

guoxxiong · 2024-04-30T02:52:53Z

@SteveMacenski ，I used Eigen::replicate, Eigen::Map, Eigen::TensorMap to improve performance, it can indeed achieve better results in CPU occupancy and speed.

SteveMacenski · 2024-04-30T15:42:26Z

Great! Can you share some metrics and and potential paths forward? 😄

This would be a great contribution to Nav2 / the community to speed things up!

Ayush1285 · 2024-06-11T18:14:56Z

@SteveMacenski We can go ahead and discuss about using Eigen for MPPI.
As mentioned earlier, I've implemented Eigen Array for the calculations in Optimizer, NoiseGenerator, MotionModel, and Util files. Link of the fork repo: https://github.com/Ayush1285/navigation2/tree/main. Changes are in eigen_mppi branch.
These are the benchmark results of Eigen and xtensor. (Note - Without any Critics plugin).

Results averaged over 5 runs:

Benchmark	xtensor	Eigen Array
BM_DiffDrivePointFootprint	7.82 ms	6.17 ms
BM_DiffDrive	7.31 ms	6.30 ms
BM_Omni	8.35 ms	8.30 ms
BM_Ackermann	7.51 ms	6.27 ms

Considering that this is my very first draft of the implementation, we can potentially gain significant improvement in compute time for the controller. Although, I'm not sure if we can reduce the time to half.

Few points on which we can improve:

We don't have any xt::roll() like function in Eigen. So, I've assigned the values to a temporary copy, and then I'm reassigning it back to the original variable. For the size of the 2000 * 50 matrix, it could be a very expensive operation. We can potentially write a custom function to use Eigen::Map or something else to improve on this.
Similarly we don't have any function in Eigen to generate random numbers from a normal distribution. I've used Eigen::NullaryExpr with lambda expression to generate noises. We can improve on this also.

There could be many more optimisations possible. We'll have to find out. Looking for your insights!!

SteveMacenski · 2024-06-11T18:26:28Z

Have you taken a look at Eigen tensors? https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html I think that should be a source of improvement as well rather than using Array. I think its definitely worth experimenting with that (maybe keep a second branch with tensor to build out in parallal to compare) since that's a big feature that could have major runtime differences. Particularly when striding through the tensors in the critics.

Note @guoxxiong 's comment #4237 (comment) -- though it looks like his repository is no longer public. @guoxxiong can you share this again so we can work on this from where you were at?

That's annoying that there's no roll... luckily that's one of the more simple ones to implement for a first-order level! It looks like the tensor library has a NormalRandomGenerator! But, googling around shows some different options for Eigen-esk things for Array/Matrix

Ayush1285 · 2024-06-11T18:41:14Z

Have you taken a look at Eigen tensors? https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html I think that should be a source of improvement as well rather than using Array.

I don't think Eigen Tensor will improve the performance since we are only dealing with 2 dimensions. It will only increase complexities. Even in the Medium Article that you shared link, they have used Eigen::Matrix for linear algebraic calculations.

SteveMacenski · 2024-06-11T18:56:06Z

I think its worth a try and important to benchmark to make sure its not an improvement. Striding and AVX I don't know about Matrix vs Eigen w.r.t. the benchmark, they only showed Tensor.

We also broke down the tensor into 2D from ND to optimize the performance for xtensor, so we could possibly readd the full dimensionality X,Y,Theta in a single tensor with Eigen if that helps performance in the Eigen migration. So this would either way be a necessary stepping stone. Part of this is experimenting with different configurations of Eigen and see what's the best!

Eigen Array
Eigen Tensor - 2D as is now
Eigen Tensor - ND as trajectories can be represented in the dense tensor rather than 3x 2D tensors
xtensor - 2D as is now as the baseline
~~xtensor - ND (bad, no need to replicate, striding and slices were slow)~~

We run this at 50-100hz with thousands of samples in each iteration, so small improvements are a big runtime change - especially in strides/views/slices that we cannot intuit without trying. That's the area a migration like this would possibly get improvements from - which the benchmark shows should be pretty major. I suspect that because the strides/slices are so much faster in Eigen, we can recollapse the tensors so that they contain the X,Y,Theta information (instead of 3 separate 2D tensors) and will be again another big boost to exploit the AVX optimizations for a single iteration through the tensor to do all 3x the work. But, it is definitely worth testing all 3 ways: Array, Tensor 2D, and Tensor ND

Ayush1285 · 2024-06-11T19:42:59Z

But, it is definitely worth testing all 3 ways: Array, Tensor 2D, and Tensor ND

Agreed, without testing it would be inconclusive. I'll move forward with Eigen::Tensor implementation.

they only showed Tensor.

They have mentioned that they have used Eigen::Matrix for benchmarking, not Eigen::Tensor.
Change “Tensor” to “Eigen::Matrix<double,num,num>” and you get an Eigen implementation (not exactly as sequences in Eigen are closed sets — they include the ends).
Snapshot of the blogpost:

guoxxiong · 2024-06-12T02:31:49Z

Agreed! In addition, I think CPU occupancy should also be taken into account.

guoxxiong · 2024-06-12T02:42:11Z

Note @guoxxiong 's comment #4237 (comment) -- though it looks like his repository is no longer public. @guoxxiong can you share this again so we can work on this from where you were at?
@SteveMacenski , So sorry, due to my own professional reasons, this part of the code can not be open source for the time being.

SteveMacenski · 2024-07-03T16:57:37Z

@Ayush1285 hey! Any updates or findings! 😄

Ayush1285 · 2024-07-04T18:28:00Z

@Ayush1285 hey! Any updates or findings! 😄

Hey @SteveMacenski , I wasn't able to work on it due to some personal reasons. I can resume working on it atleast after 2 weeks. Sorry for the delay :)

Ayush1285 · 2024-08-01T19:35:29Z

@SteveMacenski Hi, I'm available now to resume the work. I'll start implementing Eigen::Tensor. I'll start with the 3 * Tensor 2d case. 1 * Tensor ND might not offer much advantage because for the y-axis we will need to check if the model is holonomic or not.

SteveMacenski · 2024-08-01T20:39:20Z

Great! Let me know if I can help answer any questions or if you need me to review a branch!

Ayush1285 · 2024-08-06T19:04:13Z

Great! Let me know if I can help answer any questions or if you need me to review a branch!

Eigen::Tensor module comes under the eigen-unsupported section. Looks like basic operations like multiplication or addition of Tensor with a scalar are not supported. Broadcast operation is there, it will broadcast the scalar into the required tensor. But we will have to create a temporary tensor object to perform this operation.

Ayush1285 · 2024-08-09T18:32:45Z

@SteveMacenski It seems element-wise trigonometric operations are also not supported by default in Eigen Tensor :(

SteveMacenski · 2024-08-09T20:34:31Z

You should have able to write a function that can be applied elementwise across a tensor (whose implementation is trig). I experimented with xtensor and didn't find that simd was hitting those functions anyway so the cpu time spent on xt::sin and creating an xfunction using std::sin were exactly the same. Unsurprisingly the implemention of trig within xtensor was also that same thing

Ayush1285 · 2024-08-10T15:06:49Z

You should have able to write a function that can be applied elementwise across a tensor (whose implementation is trig). I experimented with xtensor and didn't find that simd was hitting those functions anyway so the cpu time spent on xt::sin and creating an xfunction using std::sin were exactly the same. Unsurprisingly the implemention of trig within xtensor was also that same thing

Eigen provides unaryExpr, NullaryExpr and binaryExpr functions to wrap custom element wise operations. I've tried benchmarking a few basic mathematical operations used in our optimizer class. Eigen::Tensor took 31 ms on avg while it was 5 ms on avg for Eigen::Array. So Eigen::Tensor turns out to be 6 times slower for these operations.
This is the sample code, you can test it on your machine if possible.

#include <iostream>
#include <math.h>
#include <chrono>
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>

template<class T>
auto applyCosine(T&& e)
{
  return e.unaryExpr([&](float val){ return cosf(val);});
}

template<class T>
auto applySine(T&& e)
{
  return e.unaryExpr([&](float val){ return sinf(val);});
}

int main()
{
  Eigen::Tensor<float, 2> yaws_tensor(4000, 500);
  yaws_tensor.setRandom();
  Eigen::Tensor<float, 2> vx_tensor(4000, 500);
  vx_tensor.setRandom();
  
  Eigen::ArrayXXf yaws_array(4000, 500);
  yaws_array.setRandom();
  Eigen::ArrayXXf vx_array(4000, 500);
  vx_array.setRandom();  

  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();

  auto yaw_cos_tensor = applyCosine(yaws_tensor);
  auto yaw_sin_tensor = applySine(vx_tensor);
  Eigen::Tensor<float, 2> dx_tensor = (vx_tensor * yaw_cos_tensor).eval();
  Eigen::Tensor<float, 2> dy_tensor = (vx_tensor * yaw_sin_tensor).eval();

  std::chrono::steady_clock::time_point now = std::chrono::steady_clock::now();
  std::cout << "Time taken for Tensor operations = " << std::chrono::duration_cast<std::chrono::milliseconds>(now- begin).count() << "[ms]" << std::endl;

  begin = std::chrono::steady_clock::now();

  auto yaw_cos_array = yaws_array.cos();
  auto yaw_sin_array = yaws_array.sin();
  Eigen::ArrayXXf dx_array = (vx_array * yaw_cos_array).eval();
  Eigen::ArrayXXf dy_array = (vx_array * yaw_sin_array).eval();

  now = std::chrono::steady_clock::now();
  std::cout << "Time taken for Array operations = " << std::chrono::duration_cast<std::chrono::milliseconds>(now - begin).count() << "[ms]" << std::endl;
}

Ayush1285 · 2024-08-12T19:07:03Z

Since Eigen::Tensor performed poorly for custom unary expressions, I tried optimizing Eigen::Array implementation.
After minimizing L1 cache misses(in for loops) and reducing some overheads. These are the benchmarks:

Results are averaged over 5 runs and full footprint is enabled.
batch size = 2000
time steps = 50
path points = 70
iteration count = 2
lookahead distance = 20.0

Benchmark	xtensor	Eigen Array	% improvement
BM_DiffDrive	11.56 ms	8.58 ms	25.77 %
BM_Omni	11.82 ms	9.51 ms	19.54 %
BM_Ackermann	11.59 ms	8.71 ms	24.84 %

Ayush1285 · 2024-08-12T20:17:33Z

@SteveMacenski any remarks? Should I create a PR?

SteveMacenski · 2024-08-14T02:03:31Z

Ooooooo yes please! Open a draft and lets look over it together. I wouldn't be surprised if I found some places we could improve further.

I'd also be interested if you could profile that to and provide the kcachegrind file so I can review the new hot spots. here's a tutorial: https://docs.nav2.org/tutorials/docs/get_profile.html. But, I can also do that myself later - it might be useful for you to go through that and maybe find a few spots yourself

20-25% is nothing to snuff at, that's a huge difference maker, I've spent weeks and weeks working on optimizing various things to get back 5-10%. This is a nice good step function jump and amazing to see that we can run this at under 10ms (which was my target) -- which equates to 100hz! Many GPU enabled MPPI implementations can't do 100hz. I assume you have some laptop i5 or i7 CPU?

This is impressive work indeed and a very non-trivial contribution and effort! I will definitely make sure to announce / promote this once we get it in the stack!

Ayush1285 · 2024-08-14T20:13:29Z

I assume you have some laptop i5 or i7 CPU?

Yes, it is i7 8th gen with SSE4.1/4.2 and AVX2 Instruction set extensions.

I'd also be interested if you could profile that to and provide the kcachegrind file so I can review the new hot spots. here's a tutorial: https://docs.nav2.org/tutorials/docs/get_profile.html. But, I can also do that myself later - it might be useful for you to go through that and maybe find a few spots yourself

Sure, I'll do that. Earlier, I tried profiling with GNU Profiler. I'll try with Valgrind then as mentioned in the tutorial.

SteveMacenski · 2024-08-19T18:39:02Z

@Ayush1285 it looks like those metrics were taken when not using some of the more runtime-costly critics (not yet ported). I see that you finished some porting, when using the default configs here, do you still see the improvement?

I'm going to give it a review in a few minutes!

Ayush1285 · 2024-08-19T18:50:43Z

@Ayush1285 it looks like those metrics were taken when not using some of the more runtime-costly critics (not yet ported). I see that you finished some porting, when using the default configs here, do you still see the improvement?

I'm going to give it a review in a few minutes!

I used optimizer_benchmark.cpp file to benchmark, so I first ported those critics to eigen. Let me benchmark with all critics loaded, and check the result.

Ayush1285 · 2024-08-19T19:27:34Z

@Ayush1285 it looks like those metrics were taken when not using some of the more runtime-costly critics (not yet ported). I see that you finished some porting, when using the default configs here, do you still see the improvement?

I'm going to give it a review in a few minutes!

After loading all the critics(except VelocityDeadband and TwirlingCritic), I'm getting this from optimizer_benchmark. All params for critics are set to default value.

Benchmark	xtensor	Eigen Array
BM_DiffDrive	11.58 ms	9.17 ms
BM_Omni	12.48 ms	10.09 ms
BM_Ackermann	11.66 ms	8.25 ms

SteveMacenski closed this as completed Apr 3, 2024

SteveMacenski reopened this Apr 3, 2024

SteveMacenski changed the title ~~[MPPI]: If Eigen is used instead of xtensor, will the speed be significantly slower~~ [MPPI]: Replace xtensor with Eigen Experiment Apr 3, 2024

SteveMacenski mentioned this issue Jun 4, 2024

Add acceleration constraints in the MPPI controller #4352

Merged

7 tasks

Ayush1285 mentioned this issue Aug 14, 2024

20-25% performance improvement in MPPI controller using Eigen library for computation. #4621

Open

7 tasks

[MPPI]: Replace xtensor with Eigen Experiment #4237

[MPPI]: Replace xtensor with Eigen Experiment #4237

Comments

guoxxiong commented Apr 3, 2024

SteveMacenski commented Apr 3, 2024

guoxxiong commented Apr 3, 2024

SteveMacenski commented Apr 3, 2024

guoxxiong commented Apr 3, 2024

SteveMacenski commented Apr 3, 2024

guoxxiong commented Apr 3, 2024

SteveMacenski commented Apr 3, 2024

guoxxiong commented Apr 3, 2024

SteveMacenski commented Apr 3, 2024

guoxxiong commented Apr 11, 2024

SteveMacenski commented Apr 11, 2024

guoxxiong commented Apr 11, 2024

SteveMacenski commented Apr 11, 2024

guoxxiong commented Apr 12, 2024

guoxxiong commented Apr 12, 2024

SteveMacenski commented Apr 12, 2024

SteveMacenski commented Apr 23, 2024

guoxxiong commented Apr 30, 2024

SteveMacenski commented Apr 30, 2024 • edited Loading

Ayush1285 commented Jun 11, 2024 • edited Loading

SteveMacenski commented Jun 11, 2024 • edited Loading

Ayush1285 commented Jun 11, 2024 • edited Loading

SteveMacenski commented Jun 11, 2024 • edited Loading

Ayush1285 commented Jun 11, 2024 • edited Loading

guoxxiong commented Jun 12, 2024

guoxxiong commented Jun 12, 2024

SteveMacenski commented Jul 3, 2024

Ayush1285 commented Jul 4, 2024

Ayush1285 commented Aug 1, 2024

SteveMacenski commented Aug 1, 2024

Ayush1285 commented Aug 6, 2024

Ayush1285 commented Aug 9, 2024

SteveMacenski commented Aug 9, 2024

Ayush1285 commented Aug 10, 2024 • edited Loading

Ayush1285 commented Aug 12, 2024

Ayush1285 commented Aug 12, 2024

SteveMacenski commented Aug 14, 2024 • edited Loading

Ayush1285 commented Aug 14, 2024 • edited Loading

SteveMacenski commented Aug 19, 2024 • edited Loading

Ayush1285 commented Aug 19, 2024

Ayush1285 commented Aug 19, 2024

SteveMacenski commented Apr 30, 2024 •

edited

Loading

Ayush1285 commented Jun 11, 2024 •

edited

Loading

SteveMacenski commented Jun 11, 2024 •

edited

Loading

Ayush1285 commented Jun 11, 2024 •

edited

Loading

SteveMacenski commented Jun 11, 2024 •

edited

Loading

Ayush1285 commented Jun 11, 2024 •

edited

Loading

Ayush1285 commented Aug 10, 2024 •

edited

Loading

SteveMacenski commented Aug 14, 2024 •

edited

Loading

Ayush1285 commented Aug 14, 2024 •

edited

Loading

SteveMacenski commented Aug 19, 2024 •

edited

Loading