Replies: 17 comments
-
Miguel de Icaza already confirmed it.
-
RyuJIT CTP4 was released a few weeks ago, and it supports Win7 / Server 2008 R2, in case you don't have a Win8 machine available. CTP4 also includes support for mutable vectors (which were immutable in previous previews).
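For context, "mutable" here means the fixed-size vector types expose settable components. With the System.Numerics types that eventually shipped, that looks roughly like this (a minimal sketch):

using System;
using System.Numerics;

class MutableVectorDemo
{
    static void Main()
    {
        var v = new Vector3(1f, 2f, 3f);
        v.X = 42f;                // components are public, mutable fields
        v += new Vector3(1f);     // operators are still SIMD-accelerated
        Console.WriteLine(v);     // prints <43, 3, 4>
    }
}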
-
Great, thanks for the update!
-
Hi there, I've been looking around these past few weeks at SIMD support in .NET. Here are some resources I found on the web:

Lionel
-
At the moment RyuJIT is only being used for x64, so it might be better to wait until it's used for both x86 and x64. If you're interested, though, a while back I added a few simple SIMD implementations to DenseVectorAdd in the performance project and did some benchmarking. They're in my performance branch on GitHub. There are also some native versions using auto-vectorization, auto-parallelization and OpenMP for comparison. The .NET SIMD version performs fairly well, but whether it beats the native versions or MKL depends a lot on the size of the vector (array). Here's sample output on my computer:
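For a sense of what those simple SIMD implementations look like, here is a minimal sketch of a vectorized dense-vector add with System.Numerics — illustrative only, not the actual kernel from the performance branch:

using System.Numerics;

static class SimdKernels
{
    // Element-wise add of two double arrays, Vector<double>.Count elements
    // per iteration, with a scalar loop for the remainder.
    public static void Add(double[] a, double[] b, double[] result)
    {
        int width = Vector<double>.Count; // 2 with SSE2, 4 with AVX2
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<double>(a, i);
            var vb = new Vector<double>(b, i);
            (va + vb).CopyTo(result, i);
        }
        for (; i < a.Length; i++) // scalar tail for the leftover elements
        {
            result[i] = a[i] + b[i];
        }
    }
}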
-
@kjbartel Which processor did you run this on? I'm curious whether it supports AVX2. Thanks
-
@cuda It's an Intel i7-4770 (4 cores, 8 threads), 3.4 / 3.9 GHz (base / turbo), with 16 GB RAM. So yes, it supports AVX2, and the C++ project used it for the benchmark below:

Performance.LinearAlgebra.DenseVectorAdd: Small (1'000) - 100x1000 iterations

I can't really see any significant difference between the AVX2 and SSE2 versions, as all the results are slightly lower this time around. For comparison, MKL would be using AVX2 and OpenMP, but I generally found the auto-vectorized and auto-parallelized versions generated by the VC compiler to beat MKL, at least on my hardware. I started looking at this a while back, as the OpenBLAS provider didn't have native vector functions like MKL.
-
Interesting, I would have thought the 256-bit vs 128-bit registers would have made a difference (processing 4 doubles at a time vs 2). Thanks
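One quick way to see which register width the JIT actually picked is to print Vector<T>.Count at runtime; if Vector<double>.Count comes back as 2, only the 128-bit paths are being used. A minimal check:

using System;
using System.Numerics;

class VectorWidthCheck
{
    static void Main()
    {
        Console.WriteLine("Hardware accelerated: {0}", Vector.IsHardwareAccelerated);
        Console.WriteLine("Vector<double>.Count: {0}", Vector<double>.Count); // 2 with SSE2, 4 with AVX2
        Console.WriteLine("Vector<float>.Count:  {0}", Vector<float>.Count);  // 4 with SSE2, 8 with AVX2
    }
}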
-
Hi, thanks for sharing these interesting benchmarks. So apparently there's almost a 2x speedup between the managed Math.NET add and the System.Numerics (SIMD) vector add operation. @kjbartel: would you mind posting the results on your machine for small vectors (10 elements)? It's always interesting to know how these results vary with the vector length, as there are always several ways to model a given problem (see AoS vs SoA). I'll try to run your bench on my laptop. To get back to my question: do you think Math.NET could benefit from SIMD in the actual managed provider, and how complex would that be to implement, in your opinion? Cheers,
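On the AoS vs SoA point: with System.Numerics, the SoA layout is the one that vectorizes naturally, since a Vector<float> loads straight from a contiguous float[]. A hypothetical 2D-point example (names illustrative):

using System.Numerics;

// Array-of-Structures: coordinates interleaved in memory (x0,y0,x1,y1,...),
// which a Vector<float> cannot load without shuffling first.
struct PointAoS { public float X, Y; }

// Structure-of-Arrays: each coordinate is contiguous and loads straight
// into a SIMD register.
class PointsSoA
{
    public float[] X, Y;

    public PointsSoA(int n) { X = new float[n]; Y = new float[n]; }

    // Translate all points by (dx, dy), Vector<float>.Count lanes at a time.
    public void Translate(float dx, float dy)
    {
        int w = Vector<float>.Count;
        var vdx = new Vector<float>(dx);
        var vdy = new Vector<float>(dy);
        int i = 0;
        for (; i <= X.Length - w; i += w)
        {
            (new Vector<float>(X, i) + vdx).CopyTo(X, i);
            (new Vector<float>(Y, i) + vdy).CopyTo(Y, i);
        }
        for (; i < X.Length; i++) { X[i] += dx; Y[i] += dy; } // scalar tail
    }
}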
-
@cuda The question is always how the JIT ends up optimizing specific accesses. I have been working for 2 years already on making sure the JIT emits the most optimized code for very low-level routines, and how you write the code really matters. There are millions of tricks you can apply to achieve what you want. For example, if you study the Bond code you will find they achieve raw performance so freakishly high that they beat by 4.4x a hand-optimized routine we had in place to read/write variable-size structures over pointers. In another case, we wrote the fastest memory copy in .NET by tweaking how certain loops are written, staying in managed code and resorting to p/invoke only once the size is big enough for it to make sense. We also wrote an xxHash hashing function with pretty good raw performance (10x the next fastest provider), and we were able to achieve a further 10% improvement on top of that just by serializing the loads/stores in the emitted assembly. Achieving C++ raw performance is possible; it's just a question of whether you are willing (and it makes economic sense) to tweak the code to get there.
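The size-threshold idea for the memory copy can be sketched as follows — this is not the actual code from that project, and the cutoff value is purely illustrative:

using System;
using System.Runtime.InteropServices;

static unsafe class FastCopy
{
    // CRT memcpy via p/invoke: heavily optimized (SSE/AVX), but each call
    // pays the managed-to-native transition cost.
    [DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl)]
    static extern void* Memcpy(void* dest, void* src, UIntPtr count);

    const int PInvokeThreshold = 1024; // illustrative; tune by benchmarking

    public static void Copy(byte* dest, byte* src, int count)
    {
        if (count >= PInvokeThreshold)
        {
            Memcpy(dest, src, (UIntPtr)(uint)count); // big block: transition cost amortized
            return;
        }
        // Small block: stay managed, move 8 bytes per iteration, then the tail.
        int i = 0;
        for (; i <= count - 8; i += 8)
            *(long*)(dest + i) = *(long*)(src + i);
        for (; i < count; i++)
            dest[i] = src[i];
    }
}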
-
@redknightlois Thanks for the info. I've been meaning to look at Bond.
-
@lionpeloux Sorry, I was a bit busy. Here are some results for 10, 16, 100, 1000 and 10000 elements. The NativePerformance project was compiled with AVX2. I had to change the implementation of the Numerics.Vector loops to handle the length-10 case, as it's not a multiple of 4 (Vector<double>.Count), by adding a scalar remainder loop after the vectorized one. I disabled most of the benchmarks, as I noticed that benchmarks which were run later were significantly slower than those run first. As can be seen, a plain old loop holds up well at the smallest sizes; overall it looks like using Numerics.Vector pays off more as the vectors get larger.

Performance.LinearAlgebra.DenseVectorAdd: X-Tiny (10) - 1000x4000 iterations
Performance.LinearAlgebra.DenseVectorAdd: Tiny (16) - 1000x4000 iterations
Performance.LinearAlgebra.DenseVectorAdd: X-Small (100) - 1000x2000 iterations
Performance.LinearAlgebra.DenseVectorAdd: Small (1'000) - 1000x1000 iterations
Performance.LinearAlgebra.DenseVectorAdd: Medium (10'000) - 1000x100 iterations
-
Hi kjbartel, thank you for these results. I tried to run your Perf branch, but without success; I'm struggling to build all the native providers... I've learned a bit about System.Numerics from the DotNetSamples. I'm performing a simple Add on float arrays (see the code below). My results are still encouraging for small vectors, with an overall speedup between 2x and 3.5x and no loss of performance for vectors of length < buffer size (4 in my case):
Here's my code:

using System;
using System.Numerics;
using Binarysharp.Benchmark; // BenchShark benchmarking library (namespace assumed)
using SIMD = System.Numerics; // alias used for the IsHardwareAccelerated check below

class BenchAddSIMD
{
public int size, rounds;
public float[] val1, val2, val;
public Vector<float>[] v1, v2, v;
public BenchAddSIMD(int size, int rounds)
{
this.size = size;
this.rounds = rounds;
val1 = new float[size];
val2 = new float[size];
val = new float[size];
for (int i = 0; i < size; i++)
{
val1[i] = i;
val2[i] = 1;
}
}
public EvaluationResultCollection Run(uint N)
{
var shark = new BenchShark(true);
var results = shark.EvaluateDecoratedTasks(this, N);
return results;
}
[BenchSharkTask("NO SIMD)")]
public void noSIMD()
{
for (int k = 0; k < rounds; k++)
{
for (int i = 0; i < size; i++)
{
val[i] = val1[i] + val2[i];
}
}
}
[BenchSharkTask("SIMD)")]
public void SIMD()
{
for (int k = 0; k < rounds; k++)
{
int buffer_size = Vector<float>.Count; // elements per SIMD register (4 for float with SSE)
int q = size / buffer_size;
int n = q * buffer_size; // largest multiple of buffer_size <= size
for (int i = 0; i < n; i += buffer_size)
{
var va = new Vector<float>(val1, i);
var vb = new Vector<float>(val2, i);
var vc = va + vb;
vc.CopyTo(val, i);
}
for (int i = n; i < size; i++) // scalar remainder loop
{
val[i] = val1[i] + val2[i];
}
}
}
public void Show()
{
for (int i = 0; i < size; i++)
{
Console.WriteLine("{0} + {1} = {2}", val1[i], val2[i], val[i]);
}
}
}

class Program
{
static void Main(string[] args)
{
if (!SIMD.Vector.IsHardwareAccelerated)
{
Console.WriteLine("SIMD isn't enabled for the current process");
}
else
{
Console.WriteLine("SIMD is enabled");
}
Benchmark_AddSIMD(2, 100000);
Benchmark_AddSIMD(4, 100000);
Benchmark_AddSIMD(8, 100000);
Benchmark_AddSIMD(9, 100000);
Benchmark_AddSIMD(10, 100000);
Benchmark_AddSIMD(11, 100000);
Benchmark_AddSIMD(12, 100000);
Benchmark_AddSIMD(20, 100000);
Benchmark_AddSIMD(50, 100000);
Console.ReadKey();
}
public static void Benchmark_AddSIMD(int size, int rounds)
{
var Nop = size * rounds;
Console.WriteLine("\n\n====================================\n");
Console.WriteLine(" vector size = {0} | rounds = {1}", size, rounds);
Console.WriteLine("\n====================================\n\n");
var benchadd = new BenchAddSIMD(size, rounds);
var results = benchadd.Run(100);
int i=0;
var t = new double[2];
foreach (var evaluation in results.Evaluations)
{
Console.WriteLine("{0} : \t{1:F2} ns/el", evaluation.Name, evaluation.AverageExecutionTime.TotalMilliseconds * 1e6 / Nop);
t[i] = evaluation.AverageExecutionTime.TotalMilliseconds;
i++;
}
Console.WriteLine("\nSIMD is faster by {0:F2}",t[0]/t[1]);
//benchadd.Show();
}
}
-
What I don't really understand is how this approach performs with respect to memory allocation.
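For what it's worth, Vector<T> is a value type, so constructing vectors inside the loop shouldn't allocate on the heap. One way to check is to watch the GC collection counts around a tight loop (a sketch; the sizes and iteration counts are illustrative):

using System;
using System.Numerics;

class AllocationCheck
{
    static void Main()
    {
        var a = new float[10000];
        var b = new float[10000];
        var c = new float[10000];
        int w = Vector<float>.Count;

        int gen0Before = GC.CollectionCount(0);
        for (int k = 0; k < 100000; k++)
        {
            for (int i = 0; i <= a.Length - w; i += w)
            {
                // Vector<float> is a struct: these temporaries live in
                // registers or on the stack, so no GC pressure.
                (new Vector<float>(a, i) + new Vector<float>(b, i)).CopyTo(c, i);
            }
        }
        int gen0After = GC.CollectionCount(0);

        Console.WriteLine("Gen0 collections during the loop: {0}", gen0After - gen0Before);
    }
}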
-
Any progress on this, now that SIMD support is in the official releases of .NET Framework and .NET Core?
-
Leveraging SIMD in the managed providers will make a huge difference in terms of performance. Mono will likely support this at some point as well (in place of the existing Mono.Simd).