Leverage SIMD support with RyuJIT #846

cdrnet · 2014-04-05T21:09:46Z

cdrnet
Apr 5, 2014
Maintainer

Leveraging SIMD in the managed providers will make a huge difference in terms of performance. Mono will likely support this at some point as well (in favor of the existing Mono SIMD).

tibel · 2014-04-06T05:37:51Z

tibel
Apr 6, 2014

Miguel de Icaza already confirmed it:
https://twitter.com/migueldeicaza/statuses/452099923157065728

0 replies

cdrnet · 2014-04-06T08:23:11Z

cdrnet
Apr 6, 2014
Maintainer Author

RyuJIT CTP3: How to use SIMD

0 replies

cdrnet · 2014-04-08T10:21:13Z

cdrnet
Apr 8, 2014
Maintainer Author

The JIT finally proposed. JIT and SIMD are getting married.

0 replies

jack-pappas · 2014-06-04T22:33:18Z

jack-pappas
Jun 4, 2014

RyuJIT CTP4 was released a few weeks ago, and it supports Win7 / Server2008R2, in case you don't have a Win8 machine available. CTP4 also includes support for mutable vectors (which were only immutable in previous versions).

http://blogs.msdn.com/b/clrcodegeneration/archive/2014/05/12/ryujit-ctp4-now-with-more-simd-types-and-better-os-support.aspx

0 replies

cdrnet · 2014-06-04T23:22:24Z

cdrnet
Jun 4, 2014
Maintainer Author

Great, thanks for the update!

0 replies

lionpeloux · 2015-11-30T12:16:37Z

lionpeloux
Nov 30, 2015

Hi there,

I've been looking aroud this weeks about SIMD support in .Net.
Is there any plans to support SIMD via System.Numerics and the last JIT in .Net 4.6 ?

Here are some ressources I found on the web :

Lionel

0 replies

kjbartel · 2015-12-01T06:57:45Z

kjbartel
Dec 1, 2015

At the moment RyuJIT is only being used for x64 so it might be better to wait until it's used for both x86 and x64. If you're interested though, a while back I added a few simple SIMD implementations to DenseVectorAdd in the performance project and did some benchmarking. They're in my performance branch on github. There's also some native versions using auto vectorization, auto parallelization and OpenMP for comparison. The .Net SIMD version performs fairly well but depends a lot on the size of the vector (array) as to whether it will perform better than the native versions or MKL.

Here's sample output on my computer:
Performance.LinearAlgebra.DenseVectorAdd: Small (1'000) - 100x1000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
NativeProvider_Vector	2445	82	2325	2379	2430	2497	2737	1	2.3
NativeProvider_Optimized2	2471	127	2316	2404	2447	2518	3314	1.01	2.28
MklProvider	2720	163	2571	2622	2686	2756	3863	1.11	2.08
NativeProvider	2812	164	2730	2752	2775	2826	4231	1.14	2.01
VectorLoop	2909	213	2729	2807	2881	2945	4622	1.19	1.94
NativeProvider_Optimized	3283	445	2365	3384	3470	3533	3949	1.43	1.61
Loop	3560	648	3277	3330	3352	3421	6205	1.38	1.66
ParallelVectorLoop4096	3730	171	3533	3569	3744	3803	4678	1.54	1.49
ParallelVectorLoop32768	3735	163	3526	3579	3774	3826	4358	1.55	1.48
AddOperator	5352	301	5049	5145	5262	5413	6295	2.17	1.06
ParallelLoop4096	5415	325	5198	5277	5312	5398	7938	2.19	1.05
ParallelLoop32768	5441	494	5235	5278	5318	5394	9925	2.19	1.05
ManagedProvider	6380	1481	5470	5508	5581	6333	10492	2.3	1
NativeProvider_Parallel	9816	488	8509	9430	9752	10217	11093	4.01	0.57
NativeProvider_OpenMP	10583	1545	8381	9310	10166	11564	16730	4.18	0.55
Map2	11193	2029	10005	10328	10506	10870	18171	4.32	0.53

0 replies

cuda · 2015-12-01T07:29:55Z

cuda
Dec 1, 2015

@kjbartel Which processor did you run this on? I'm curious if it supports AVX2. Thanks

0 replies

kjbartel · 2015-12-01T09:22:35Z

kjbartel
Dec 1, 2015

@cuda It's an Intel i7-4770 (4 core, 8 threads) 3.4 / 3.9 GHz (base / turbo), with 16Gb ram. So yes it supports AVX2 and the C++ project used for the the NativeProvider tests all used /arch:AVX2. Changing instead to /arch:SSE2 produced the following results:

Performance.LinearAlgebra.DenseVectorAdd: Small (1'000) - 100x1000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
NativeProvider_Vector	2478	126	2325	2401	2448	2510	3104	1	2.25
NativeProvider_Optimized2	2518	262	2347	2392	2460	2506	3917	1	2.24
MklProvider	2773	312	2530	2620	2697	2786	4274	1.1	2.05
NativeProvider	2819	172	2698	2742	2770	2809	3591	1.13	1.99
VectorLoop	2946	243	2727	2833	2898	2986	4831	1.18	1.9
NativeProvider_Optimized	3233	454	2324	2880	3464	3523	3657	1.41	1.59
ParallelVectorLoop4096	3707	132	3539	3594	3700	3762	4261	1.51	1.49
ParallelVectorLoop32768	3755	174	3546	3573	3780	3834	4575	1.54	1.46
Loop	3793	852	3272	3327	3360	3514	5947	1.37	1.64
ParallelLoop4096	5427	473	5230	5264	5292	5388	8583	2.16	1.04
ParallelLoop32768	5460	436	5217	5259	5314	5453	8340	2.17	1.04
ManagedProvider	5769	816	5450	5485	5521	5589	9938	2.25	1
AddOperator	7957	1821	5283	5518	8871	9397	10593	3.62	0.62
Map2	10164	1372	9608	9706	9753	9937	17512	3.98	0.57
NativeProvider_OpenMP	10273	3231	9453	9673	9803	10002	41757	4	0.56
NativeProvider_Parallel	10432	489	9760	10209	10366	10558	13507	4.23	0.53

I can't really see any significant difference between the AVX2 and SSE2 versions as all the results are slightly lower this time around. Comparing VectorLoop to NativeProvider_Vector you can see that even with the p/invoke the native auto vectorized version is almost 20% faster that the System.Numerics version. This may just be that my implementation is suboptimal though.

MKL would be using AVX2 and OpenMP but I generally found the auto vectorized and auto parallelized version generated by the VC compiler to beat MKL. At least on my hardware. I started looking at this a while back as the OpenBLAS provider didn't have native vector functions like MKL.

0 replies

cuda · 2015-12-01T09:58:44Z

cuda
Dec 1, 2015

I can't really see any significant difference between the AVX2 and SSE2 versions

Interesting, I would have thought the 256-bit vs 128-bit registers would have made a difference (process 4 doubles at a time vs 2). Thanks

0 replies

lionpeloux · 2015-12-01T13:48:44Z

lionpeloux
Dec 1, 2015

Hi,

Thanks for sharing these interesting benchmarks. So apparently there's almost a x2 speedup between the managed mathnet and the System.Numeric (with SIMD) vector add operation.

@kjbartel : would you mind posting the results on your machine for small vectors (10 elements) ?

=> It's always interresting to know the variability of these results regarding the vector length as there is always sevral ways to model a given problem (see AoS vs SoA)

I'll try to run your bench on my laptop.
Would it be interesting to publish these results on the website ? It's kind of hard to find quantitative data on how fast C# compete against C++ / MKL / OpenBLAS ...

To get back to my question : do you think mathnet could benefit from SIMD in the actual managed provider and is it complex to implement in your opinion ?

Cheers,
Lionel

0 replies

redknightlois · 2015-12-01T14:16:27Z

redknightlois
Dec 1, 2015

@cuda Always the question is how the JIT end up optimizing specific accesses. I have been working for 2 years already making sure the JIT emits the most optimized code for very low level routines and how you write the code really matters. There are million of tricks you can apply to achieve what you want.

For example, if you study the Bond code you will find they achieve a raw performance that is so freaking high that they beat by 4.4x a hand optimized routine that we had in place to read-write variable-size structures over pointers. On another case, we wrote the fastest memory copy in .net tweaking how to write certain loops and resort to managed only after the size is big enough to make sense for the p/invoke. We also wrote an xxHash hashing function which has pretty good raw performance (10x the next fastest provider) but we were able to achieve 10% improvement on top of that just serializing the load/stores in the emitted assembly.

Achieving C++ raw performance is possible, it's just a question if you are willing (and make economic sense) to tweak to code to get there.

0 replies

cuda · 2015-12-02T06:11:25Z

cuda
Dec 2, 2015

@redknightlois Thanks for the info. I've been meaning to look at Bond.

0 replies

kjbartel · 2015-12-04T08:17:56Z

kjbartel
Dec 4, 2015

@lionpeloux Sorry was a bit busy. Here's some results for 10, 16, 100, 1000 and 10000. NativePerformance project was compiled with AVX2. I had to change the implementation of the Numerics.Vector loops to handle the 10 length case as it's not a multiple of 4 (Numerics.Vector<double>.Count). It now uses a normal while loop for remaining items.

I disabled most of the benchmarks as I noticed that benchmarks which were run later were significantly slower than those run first. For example, the ManagedProvider should effectively be the same as the ParallelLoop4096 benchmark, the only difference being that there are null and length checks, however ParallelLoop4096 was 1.8x faster. Putting the ManagedProvider first in the file made it much faster. Seems there must be a memory and garbage collection problem so later benchmark tasks suffer. With the smaller number of tasks it gives much more believable results but I'd still not 100% trust the results.

As can be seen, a plain old for loop will beat the for loop using Vector<T> for small length arrays. Loop is consistently faster for 16 length than 10 length which to me suggests that the JIT compiler is actually doing some auto-vectorisation.

Overall it looks like using Numerics.Vector<T> gives a significant improvement over the current parallel loop implementation. The same or faster than the current implementation for small arrays and almost 2x as fast for larger arrays, about the same as MKL.

Performance.LinearAlgebra.DenseVectorAdd: X-Tiny (10) - 1000x4000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
Loop	181	56	118	124	206	213	548	1	1.07
VectorLoop	187	54	129	132	224	228	541	1.09	0.98
ParallelVectorLoop4096	234	43	213	218	219	222	820	1.06	1
ManagedProvider	235	45	208	215	220	231	944	1.07	1
NativeProvider_Vector	245	60	183	186	278	282	790	1.35	0.79
NativeProvider_Optimized2	261	56	187	194	284	288	570	1.38	0.77
MklProvider	340	60	281	283	376	381	752	1.83	0.59

Performance.LinearAlgebra.DenseVectorAdd: Tiny (16) - 1000x4000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
Loop	153	24	137	143	144	149	344	1	1.74
VectorLoop	155	27	148	150	151	152	1452	1.05	1.66
ParallelVectorLoop4096	248	20	238	241	242	244	491	1.68	1.03
NativeProvider_Vector	258	66	188	191	309	313	589	2.15	0.81
NativeProvider_Optimized2	260	68	190	192	310	315	682	2.15	0.81
ManagedProvider	265	38	239	248	250	265	595	1.74	1
MklProvider	378	70	306	310	429	436	875	2.98	0.58

Performance.LinearAlgebra.DenseVectorAdd: X-Small (100) - 1000x2000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
NativeProvider_Vector	336	26	317	327	330	334	671	1	2.1
NativeProvider_Optimized2	340	23	324	332	335	338	611	1.02	2.07
VectorLoop	349	28	326	339	342	350	637	1.04	2.03
Loop	455	55	412	423	433	467	926	1.31	1.6
ParallelVectorLoop4096	538	29	520	527	530	533	971	1.61	1.31
MklProvider	539	38	518	527	531	535	1149	1.61	1.31
ManagedProvider	734	99	665	686	694	753	1553	2.1	1

Performance.LinearAlgebra.DenseVectorAdd: Small (1'000) - 1000x1000 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
NativeProvider_Vector	2471	117	2373	2410	2452	2493	3853	1	2.21
NativeProvider_Optimized2	2485	127	2369	2422	2470	2500	3847	1.01	2.19
MklProvider	2780	116	2637	2728	2760	2793	4495	1.13	1.96
VectorLoop	2904	142	2785	2831	2891	2925	4681	1.18	1.87
Loop	3577	360	3241	3358	3438	3668	6411	1.4	1.57
ParallelVectorLoop4096	4144	239	4040	4074	4091	4136	7547	1.67	1.32
ManagedProvider	5600	522	5062	5231	5414	5768	10022	2.21	1

Performance.LinearAlgebra.DenseVectorAdd: Medium (10'000) - 1000x100 iterations

Name	Mean	StdDev	Min	Q1	Median	Q3	Max	TopSlowdown	ManagedSpeedup
NativeProvider_Optimized2	27845	3045	21202	25151	28688	30044	33438	1	2.49
NativeProvider_Vector	30580	759	29557	30276	30430	30608	36645	1.06	2.35
MklProvider	30672	639	30177	30360	30520	30712	34818	1.06	2.34
VectorLoop	31547	1197	30902	31164	31363	31680	42854	1.09	2.28
Loop	35218	2706	32649	34132	34896	35762	59507	1.22	2.05
ParallelVectorLoop4096	36229	2281	35251	35478	35656	36000	55114	1.24	2
ManagedProvider	71543	1493	68817	70528	71402	72170	78742	2.49	1

0 replies

lionpeloux · 2015-12-04T10:32:38Z

lionpeloux
Dec 4, 2015

Hi kjbartel,

Thank you for these results. I tried to run your Perf branch but without sucess. I'm strugulling to build all the native providers ...

I've learn a bit about System.Numerics from the DotNetSamples.

I'm performing simple Add on float[] arrays of various length.
My VMWare (windows 7) can adress SIMD-128 (4 floats eachtime).

My results are still encouraging for small vectors with a overall speedup between 2 to 3,5 and no loss of performance for vectors of length < buffer size (4 in my case) :

SIMD is enabled

====================================
 vector size = 2 | rounds = 100000
====================================
NO SIMD) :      2,32 ns/el
SIMD) :         2,40 ns/el

SIMD is faster by 0,97

====================================
 vector size = 4 | rounds = 100000
====================================
NO SIMD) :      1,91 ns/el
SIMD) :         0,69 ns/el

SIMD is faster by 2,77

====================================
 vector size = 8 | rounds = 100000
====================================
NO SIMD) :      1,96 ns/el
SIMD) :         0,59 ns/el

SIMD is faster by 3,33

====================================
 vector size = 9 | rounds = 100000
====================================
NO SIMD) :      1,81 ns/el
SIMD) :         0,77 ns/el

SIMD is faster by 2,34

====================================
 vector size = 10 | rounds = 100000
====================================
NO SIMD) :      1,78 ns/el
SIMD) :         0,80 ns/el

SIMD is faster by 2,23

====================================
 vector size = 11 | rounds = 100000
====================================
NO SIMD) :      1,68 ns/el
SIMD) :         0,83 ns/el

SIMD is faster by 2,02

====================================
 vector size = 12 | rounds = 100000
====================================
NO SIMD) :      1,75 ns/el
SIMD) :         0,53 ns/el

SIMD is faster by 3,32

====================================
 vector size = 20 | rounds = 100000
====================================
NO SIMD) :      1,62 ns/el
SIMD) :         0,45 ns/el

SIMD is faster by 3,61

====================================
 vector size = 50 | rounds = 100000
====================================
NO SIMD) :      1,74 ns/el
SIMD) :         0,48 ns/el

SIMD is faster by 3,61

Here's my code

class BenchAddSIMD
    {
        public int size, rounds;

        public float[] val1, val2, val;
        public Vector<float>[] v1, v2, v;

        public BenchAddSIMD(int size, int rounds)
        {
            this.size = size;
            this.rounds = rounds;


            val1 = new float[size];
            val2 = new float[size];
            val = new float[size];
            for (int i = 0; i < size; i++)
            {
                val1[i] = i;
                val2[i] = 1;
            }
        }
        public EvaluationResultCollection Run(uint N)
        {
            var shark = new BenchShark(true);
            var results = shark.EvaluateDecoratedTasks(this, N);
            return results;
        }

        [BenchSharkTask("NO SIMD)")]
        public void noSIMD()
        {
            for (int k = 0; k < rounds; k++)
            {
                for (int i = 0; i < size; i++)
                {
                    val[i] = val1[i] + val2[i];
                }     
            }
        }

        [BenchSharkTask("SIMD)")]
        public void SIMD()
        {
            for (int k = 0; k < rounds; k++)
            {
                int buffer_size = Vector<float>.Count;
                int q = size / buffer_size;
                int n = q * buffer_size;
                for (int i = 0; i < n; i += buffer_size)
                {
                    var va = new Vector<float>(val1, i);
                    var vb = new Vector<float>(val2, i);
                    var vc = va + vb;
                    vc.CopyTo(val, i);
                }
                for (int i = n; i < size; i++)
                {
                    val[i] = val1[i] + val2[i];
                }
            }
        }

        public void Show()
        {
            for (int i = 0; i < size; i++)
            {
                Console.WriteLine("{0} + {1} = {2}", val1[i], val2[i], val[i]);
            }
        }
    }

class Program
    {
        static void Main(string[] args)
        {
            if (!SIMD.Vector.IsHardwareAccelerated)
            {
                Console.WriteLine("SIMD isn't enabled for the current process");
            }
            else{Console.WriteLine("SIMD is enabled");}
            Benchmark_AddSIMD(2, 100000);
            Benchmark_AddSIMD(4, 100000);
            Benchmark_AddSIMD(8, 100000);
            Benchmark_AddSIMD(9, 100000);
            Benchmark_AddSIMD(10, 100000);
            Benchmark_AddSIMD(11, 100000);
            Benchmark_AddSIMD(12, 100000);
            Benchmark_AddSIMD(20, 100000);
            Benchmark_AddSIMD(50, 100000);
            Console.ReadKey();
        }

        public static void Benchmark_AddSIMD(int size, int rounds)
        {
            var Nop = size * rounds;

            Console.WriteLine("\n\n====================================\n");
            Console.WriteLine(" vector size = {0} | rounds = {1}", size, rounds);
            Console.WriteLine("\n====================================\n\n");

            var benchadd = new BenchAddSIMD(size, rounds);
            var results = benchadd.Run(100);

            int i=0;
            var t = new double[2];
            foreach (var evaluation in results.Evaluations)
            {
                Console.WriteLine(evaluation.Name + " : \t" + "{0:F2}" + " ns/el", evaluation.AverageExecutionTime.TotalMilliseconds * 1e6 / Nop);
                t[i] =  evaluation.AverageExecutionTime.TotalMilliseconds;
                i++;
            }
            Console.WriteLine("\nSIMD is faster by {0:F2}",t[0]/t[1]);
            //benchadd.Show();
        }
    }

0 replies

lionpeloux · 2015-12-04T10:40:13Z

lionpeloux
Dec 4, 2015

What I don't really understand is the performance of this approach with memory allocation.
Are the termporary Vector subject to GC ? Or are they allocated on the stack ?

I'm doing a structural analysis solver (mass-spring system) for my thesis which basically runs 1e5 iterations. For each iterations, I'll be doing lots of vector works with arrays of points.
I can managed to vectorized a bit my data structure from arrays of Vector3d points with (2 to 50 Vector3d in each array) to 3 double[2-50] arrays, one for each coordinate (X,Y,Z).

0 replies

colgreen · 2017-09-21T09:52:24Z

colgreen
Sep 21, 2017

Any progress on this now that the SIMD support is in official releases of .NET Framework and .NET Core?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leverage SIMD support with RyuJIT #846

{{title}}

Replies: 17 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Leverage SIMD support with RyuJIT #846

cdrnet Apr 5, 2014 Maintainer

Replies: 17 comments

cdrnet Apr 6, 2014 Maintainer Author

cdrnet Apr 8, 2014 Maintainer Author

cdrnet Jun 4, 2014 Maintainer Author

cdrnet
Apr 5, 2014
Maintainer

cdrnet
Apr 6, 2014
Maintainer Author

cdrnet
Apr 8, 2014
Maintainer Author

cdrnet
Jun 4, 2014
Maintainer Author