Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

devedse · 2022-11-09T16:16:40Z

Description

I've been working on a Maze Generator as a personal hobby project for a number of years now and always try to keep it up to date with the latest .NET stuff.
Yesterday .NET 7.0 came out and I immediately upgraded my project to see if it would bring performance benefits.

Sadly the opposite seems to be true.

On my main computer:
Processor: AMD 5950x
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

Before .NET 7 Upgrade:

After .NET 7 upgrade:

On my laptop:
Processor: Intel Core i7 12700H
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

I didn't make screenshots but the times went from:

Before: 2.1 seconds
After: 2.6 seconds

My code

The following file contains the implementation of the Algorithm I'm using. Besides the code in this file I'm using my own implementation of a Random generator + my own implementation of a BitArray to store the maze.

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs

Reproduce

I've created 2 branches that show the difference quite easily by simply running the ConsoleApp:

dotnet 6:
https://github.com/devedse/DeveMazeGeneratorCore

dotnet 7:
https://github.com/devedse/DeveMazeGeneratorCore/tree/dotnet7

Additional info

Even running them at the same time side by side you can clearly see the performance difference:

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

I did some more investigation and compiled my code like this:

DeveMazeGeneratorCore\DeveMazeGeneratorCore.ConsoleApp> dotnet publish -r win-x64 -c Release -p:PublishReadyToRun=true -p:PublishSingleFile=true --self-contained

This results in 2 exe files that for some reason also differ quite significantly in performance:

DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\DeveMazeGenerator.ConsoleApp.exe (9.265kb):
DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\publish\DeveMazeGenerator.ConsoleApp.exe (100.360kb):

The second file (in the publish folder) does seem to go slightly faster then the original .NET 6.0.

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2022-11-09T16:16:44Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

danmoseley · 2022-11-09T16:45:54Z

@devedse Thanks for the report, that's not expected. Have you tried using a profiler to see what changed? That might help us route this.
BTW can you check they're both built in release configuration?

AndyAyersMS · 2022-11-09T17:34:16Z

@devedse can you add repro steps? I would like to look into this and want to make sure I'm looking at the right things.

jkotas · 2022-11-09T17:36:25Z

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs#L47-L50 is invalid Unsafe code. Bool is 1 byte. Int is 4 bytes. It is not ok to cast managed pointer to bool to managed pointer to int. It can be reading random memory from stack.

devedse · 2022-11-09T21:57:40Z

Hey all, thanks for the suggestions. @jkotas I've indeed run into that issue when running my program on android. Previously though for C# (with dotnet 6.0) it seemed to work fine.

I did write a "byte" version of the algorithm and have changed to that:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2_AsByte.cs

I've now changed to this maze generation algorithm:

Dotnet 6:

Dotnet 7:

It does indeed seem there's a huge performance improvement. 😄
I wonder if the last 0.2 second difference is something that can be explained though.

Edit1:
I can confirm that both builds are done on Release

Edit2:
The reproduction steps are basically cloning the branches and doing a CTRL+F5 when the configuration is set to Release (in VisualStudio) which starts the ConsoleApp project.

AndyAyersMS · 2022-11-10T04:00:48Z

As with #78127 I think the remaining performance difference can be explained by OSR. You can disable this in your .csproj via

  <PropertyGroup>
    <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>

The generation is dominated by one long-running method, AlgorithmBacktrack2Deluxe2:GoGenerateInternal. In .NET 7 by default this method is first quickly optimized and then the inner loop is re-optimized via OSR. The codegen for this loop is similar to the optimized .NET 6 version, except that the invocations of the get_Item methods require reloading the object before each call:

;; .NET 6

G_M000_IG07:                ;; offset=011AH
       418D5424FE           lea      edx, [r12-02H]
       488BCE               mov      rcx, rsi
       458BC5               mov      r8d, r13d
       41FF5718             call     [r15+18H]DeveMazeGeneratorCore.InnerMaps.InnerMap:get_Item(int,int):bool:this

;; .NET 7

G_M000_IG04:                ;; offset=0099H
       418D57FE             lea      edx, [r15-02H]
       488BCE               mov      rcx, rsi
       458BC4               mov      r8d, r12d
       488B06               mov      rax, qword ptr [rsi]
       4C8B6848             mov      r13, qword ptr [rax+48H]
       41FF5518             call     [r13+18H]DeveMazeGeneratorCore.InnerMaps.InnerMap:get_Item(int,int):bool:this

I suspect in the .NET 7 OSR case there is no dominating deref of rsi at the top of the loop or outside, so this sequence cannot be safely hoisted?

ghost · 2022-11-10T04:01:38Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I've been working on a Maze Generator as a personal hobby project for a number of years now and always try to keep it up to date with the latest .NET stuff.
Yesterday .NET 7.0 came out and I immediately upgraded my project to see if it would bring performance benefits.

Sadly the opposite seems to be true.

On my main computer:
Processor: AMD 5950x
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

Before .NET 7 Upgrade:

After .NET 7 upgrade:

On my laptop:
Processor: Intel Core i7 12700H
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

I didn't make screenshots but the times went from:

Before: 2.1 seconds
After: 2.6 seconds

My code

The following file contains the implementation of the Algorithm I'm using. Besides the code in this file I'm using my own implementation of a Random generator + my own implementation of a BitArray to store the maze.

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs

Reproduce

I've created 2 branches that show the difference quite easily by simply running the ConsoleApp:

dotnet 6:
https://github.com/devedse/DeveMazeGeneratorCore

dotnet 7:
https://github.com/devedse/DeveMazeGeneratorCore/tree/dotnet7

Additional info

Even running them at the same time side by side you can clearly see the performance difference:

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

I did some more investigation and compiled my code like this:

DeveMazeGeneratorCore\DeveMazeGeneratorCore.ConsoleApp> dotnet publish -r win-x64 -c Release -p:PublishReadyToRun=true -p:PublishSingleFile=true --self-contained

This results in 2 exe files that for some reason also differ quite significantly in performance:

DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\DeveMazeGenerator.ConsoleApp.exe (9.265kb):
DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\publish\DeveMazeGenerator.ConsoleApp.exe (100.360kb):

The second file (in the publish folder) does seem to go slightly faster then the original .NET 6.0.

Author:	devedse
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2022-11-10T04:02:00Z

Also @EgorBo FYI PGO doesn't help here at all.

devedse · 2022-11-11T11:02:17Z

@AndyAyersMS

  <PropertyGroup>
    <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>

I added this setting to the ConsoleApp project and this seemed to remove the overhead and even improve the performance slightly! :)

Dotnet 6:

Dotnet 7 <TieredCompilationQuickJitForLoops>true</TieredCompilationQuickJitForLoops>:

Dotnet 7 <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>:

I'm curious though if this is something that should be improved in the csharp runtime as I would expect the TieredCompilation to automatically improve the loop once it's been run a few times.

AndyAyersMS · 2023-03-31T01:50:40Z

I'm curious though if this is something that should be improved in the csharp runtime as I would expect the TieredCompilation to automatically improve the loop once it's been run a few times.

Yes it is something that should be improved -- if you set TieredCompilationQuickJitForLoops false then we will eagerly optimize the method instead of tiering it up, and in this particular case (though not most other cases) that gives more well-optimized code.

The issue here is that in .NET 7 with OSR, we no longer see some of the code that executed before the main loop in AlgorithmBacktrack2Deluxe2:GoGenerateInternal, in particular the fetch of a method table slot off of V02. This fetch is potentially faulting but also invariant:

N013 ( 33, 23) [000070] --CXG------                         *  CALLV vt-ind void   DeveMazeGeneratorCore.InnerMaps.InnerMap.set_Item $VN.Void
N012 (  9,  8) [000665] n--X------- control expr            \--*  IND       long   <l:$3d1, c:$3d0>
N011 (  7,  6) [000664] ---X---N---                            \--*  ADD       long   $3cf
N009 (  6,  5) [000662] #--X-------                               +--*  IND       long   $3cd
N008 (  4,  3) [000661] ---X---N---                               |  \--*  ADD       long   $3cc
N006 (  3,  2) [000659] #--X-------                               |     +--*  IND       long   $3ca
N005 (  1,  1) [000658] -----------                               |     |  \--*  LCL_VAR   ref    V02 arg1         u:1 $100
N007 (  1,  1) [000660] -----------                               |     \--*  CNS_INT   long   80 $14c
N010 (  1,  1) [000663] -----------                               \--*  CNS_INT   long   32 $14d

so in .NET 6 the dominating pre-loop case lets all the in-loop cases get CSE'd, while in .NET 7 and current .NET 8 we can only CSE within the loop, and that double indir within the loop is apparently costly.

Possible fixes include loop peeing and/or cloning to expose an invariant INDIR tree (which we don't do right now) gated by a null check of V02.

I would expect PGO to help as we should be able to clone based on type; let me look into that next.

AndyAyersMS · 2023-03-31T02:02:44Z

With .NET 8, PGO does seem to help, I get considerably faster times. So that is another possible fix. We indeed clone based on type the cloned loop is free of calls.

In .NET 7 we seem to be unable to properly resolve the call targets for .get_Item, not sure why, and this inhibits devirt.

AndyAyersMS · 2023-03-31T16:48:05Z

Without PGO it seems like we could clone the loop to expose the invariant method table fetch and make it non-faulting in the hot loop, and then hoist it out.

I prototyped the cloning part and it's not too bad. But these loads are conditional within the loop and so we won't hoist them. Fixing that may be a bit more involved. Also note that any OSR loop is quite likely to be very hot, as we know that it's already iterated several thousand times. So we might want to be more aggressive optimizing these loops (however we have to be careful not to do a better job than Tier1 or we will see some really wonky perf behaviors).

Considering loop L00 to clone for optimizations.
Checking loop L00 for optimization candidates (invariant loads)
Loop L00 has invariant faulting load [000652] on V02
Loop L00 has invariant faulting load [000660] on V02
Loop L00 has invariant faulting load [000668] on V02
Loop L00 has invariant faulting load [000676] on V02
Loop L00 has invariant faulting load [000747] on V02
Loop L00 has invariant faulting load [000755] on V02
------------------------------------------------------------

------------------------------------------------------------
Deriving cloning conditions for L00
Conditions: (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
No array deref conditions
Block conditions:
0 = (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
Evaluating 6 loop cloning conditions for loop L00
Considering condition 0: (V02 NE null), could not be evaluated
Considering condition 1: (V02 NE null), could not be evaluated
Considering condition 2: (V02 NE null), could not be evaluated
Considering condition 3: (V02 NE null), could not be evaluated
Considering condition 4: (V02 NE null), could not be evaluated
Considering condition 5: (V02 NE null), could not be evaluated
Evaluation result allTrue = false, anyFalse = false
Before optimizing cloning conditions
	(V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
After optimizing cloning conditions
	(V02 NE null)
After optimizing block-level cloning conditions
	(V02 NE null)

Motivating example is the OSR loop in dotnet#78110 where in OSR we don't see a deref before the loop, so can't hoist repeated virtual method lookups. Turns out that these indirs are conditional so we end up not hoisting, but at least this does the cloning part. Still needs some polish, but should be functionally correct.

AndyAyersMS · 2023-06-09T02:11:23Z

As noted above, PGO in .NET 8 will provide a 20% performance boost.

With .NET 8 Preview 5 (coming out next week) PGO is now enabled by default. So I am going to close this issue.

;; net6.0

Generation time: 00:00:02.4523739 <<<<<<<< new fastest time          Fastest: 00:00:02.4523739
Generation time: 00:00:02.4772369                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4934081                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4947292                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4584699                                    Fastest: 00:00:02.4523739

;; net7.0

Generation time: 00:00:02.5462457 <<<<<<<< new fastest time          Fastest: 00:00:02.5462457
Generation time: 00:00:02.5450569 <<<<<<<< new fastest time          Fastest: 00:00:02.5450569
Generation time: 00:00:02.6279318                                    Fastest: 00:00:02.5450569
Generation time: 00:00:02.5372493 <<<<<<<< new fastest time          Fastest: 00:00:02.5372493
Generation time: 00:00:02.5083985 <<<<<<<< new fastest time          Fastest: 00:00:02.5083985

;; net8.0 (Preview 4)

Generation time: 00:00:02.5601764 <<<<<<<< new fastest time          Fastest: 00:00:02.5601764
Generation time: 00:00:02.5641837                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.5902517                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.7199203                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.5658560                                    Fastest: 00:00:02.5601764

;; net8.0 (Preview4 + DOTNET_TieredPGO=1)

Generation time: 00:00:01.9386467 <<<<<<<< new fastest time          Fastest: 00:00:01.9386467
Generation time: 00:00:01.9029803 <<<<<<<< new fastest time          Fastest: 00:00:01.9029803
Generation time: 00:00:01.9701224                                    Fastest: 00:00:01.9029803
Generation time: 00:00:01.9371083                                    Fastest: 00:00:01.9029803
Generation time: 00:00:01.9046991                                    Fastest: 00:00:01.9029803

devedse · 2023-06-09T10:56:34Z

Thanks @AndyAyersMS for the elaborate explanation and work on making dotnet even faster 😄

devedse added the tenet-performance Performance related issue label Nov 9, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Nov 9, 2022

AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Nov 10, 2022

danmoseley added the regression-from-last-release label Nov 10, 2022

JulieLeeMSFT assigned AndyAyersMS Nov 10, 2022

JulieLeeMSFT added this to the 8.0.0 milestone Dec 27, 2022

AndyAyersMS mentioned this issue Mar 30, 2023

On Stack Replacement Next Steps #33658

Open

72 tasks

AndyAyersMS mentioned this issue Mar 31, 2023

Improve JIT loop optimizations #65342

Open

24 tasks

AndyAyersMS closed this as completed Jun 9, 2023

ghost locked as resolved and limited conversation to collaborators Jul 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

devedse commented Nov 9, 2022 •

edited

Loading

dotnet-issue-labeler bot commented Nov 9, 2022

danmoseley commented Nov 9, 2022

AndyAyersMS commented Nov 9, 2022

jkotas commented Nov 9, 2022

devedse commented Nov 9, 2022 •

edited

Loading

AndyAyersMS commented Nov 10, 2022 •

edited

Loading

ghost commented Nov 10, 2022

Description

My code

Reproduce

Additional info

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

AndyAyersMS commented Nov 10, 2022

devedse commented Nov 11, 2022

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Jun 9, 2023

devedse commented Jun 9, 2023 •

edited

Loading

Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

Comments

devedse commented Nov 9, 2022 • edited Loading

Description

My code

Reproduce

Additional info

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

dotnet-issue-labeler bot commented Nov 9, 2022

danmoseley commented Nov 9, 2022

AndyAyersMS commented Nov 9, 2022

jkotas commented Nov 9, 2022

devedse commented Nov 9, 2022 • edited Loading

AndyAyersMS commented Nov 10, 2022 • edited Loading

ghost commented Nov 10, 2022

Description

My code

Reproduce

Additional info

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

AndyAyersMS commented Nov 10, 2022

devedse commented Nov 11, 2022

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Mar 31, 2023

AndyAyersMS commented Jun 9, 2023

devedse commented Jun 9, 2023 • edited Loading

devedse commented Nov 9, 2022 •

edited

Loading

devedse commented Nov 9, 2022 •

edited

Loading

AndyAyersMS commented Nov 10, 2022 •

edited

Loading

devedse commented Jun 9, 2023 •

edited

Loading