Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of my Maze Generator went from 2.1 seconds to 3.8 after upgrading to .NET 7.0 #78110

Closed
Tracked by #33658
devedse opened this issue Nov 9, 2022 · 14 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI regression-from-last-release tenet-performance Performance related issue
Milestone

Comments

@devedse
Copy link

devedse commented Nov 9, 2022

Description

I've been working on a Maze Generator as a personal hobby project for a number of years now and always try to keep it up to date with the latest .NET stuff.
Yesterday .NET 7.0 came out and I immediately upgraded my project to see if it would bring performance benefits.

Sadly the opposite seems to be true.

On my main computer:
Processor: AMD 5950x
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

Before .NET 7 Upgrade:
image

After .NET 7 upgrade:
image

On my laptop:
Processor: Intel Core i7 12700H
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

I didn't make screenshots but the times went from:

Before: 2.1 seconds
After: 2.6 seconds

My code

The following file contains the implementation of the Algorithm I'm using. Besides the code in this file I'm using my own implementation of a Random generator + my own implementation of a BitArray to store the maze.

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs

Reproduce

I've created 2 branches that show the difference quite easily by simply running the ConsoleApp:

dotnet 6:
https://github.com/devedse/DeveMazeGeneratorCore

dotnet 7:
https://github.com/devedse/DeveMazeGeneratorCore/tree/dotnet7

Additional info

Even running them at the same time side by side you can clearly see the performance difference:
image

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

I did some more investigation and compiled my code like this:

DeveMazeGeneratorCore\DeveMazeGeneratorCore.ConsoleApp> dotnet publish -r win-x64 -c Release -p:PublishReadyToRun=true -p:PublishSingleFile=true --self-contained

This results in 2 exe files that for some reason also differ quite significantly in performance:

  1. DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\DeveMazeGenerator.ConsoleApp.exe (9.265kb):
    image

  2. DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\publish\DeveMazeGenerator.ConsoleApp.exe (100.360kb):
    image

The second file (in the publish folder) does seem to go slightly faster then the original .NET 6.0.

@devedse devedse added the tenet-performance Performance related issue label Nov 9, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@danmoseley
Copy link
Member

@devedse Thanks for the report, that's not expected. Have you tried using a profiler to see what changed? That might help us route this.
BTW can you check they're both built in release configuration?

@AndyAyersMS
Copy link
Member

@devedse can you add repro steps? I would like to look into this and want to make sure I'm looking at the right things.

@jkotas
Copy link
Member

jkotas commented Nov 9, 2022

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs#L47-L50 is invalid Unsafe code. Bool is 1 byte. Int is 4 bytes. It is not ok to cast managed pointer to bool to managed pointer to int. It can be reading random memory from stack.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 9, 2022
@devedse
Copy link
Author

devedse commented Nov 9, 2022

Hey all, thanks for the suggestions. @jkotas I've indeed run into that issue when running my program on android. Previously though for C# (with dotnet 6.0) it seemed to work fine.

I did write a "byte" version of the algorithm and have changed to that:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2_AsByte.cs

I've now changed to this maze generation algorithm:

Dotnet 6:
image

Dotnet 7:
image

It does indeed seem there's a huge performance improvement. 😄
I wonder if the last 0.2 second difference is something that can be explained though.

Edit1:
I can confirm that both builds are done on Release

Edit2:
The reproduction steps are basically cloning the branches and doing a CTRL+F5 when the configuration is set to Release (in VisualStudio) which starts the ConsoleApp project.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Nov 10, 2022

As with #78127 I think the remaining performance difference can be explained by OSR. You can disable this in your .csproj via

  <PropertyGroup>
    <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>

The generation is dominated by one long-running method, AlgorithmBacktrack2Deluxe2:GoGenerateInternal. In .NET 7 by default this method is first quickly optimized and then the inner loop is re-optimized via OSR. The codegen for this loop is similar to the optimized .NET 6 version, except that the invocations of the get_Item methods require reloading the object before each call:

;; .NET 6

G_M000_IG07:                ;; offset=011AH
       418D5424FE           lea      edx, [r12-02H]
       488BCE               mov      rcx, rsi
       458BC5               mov      r8d, r13d
       41FF5718             call     [r15+18H]DeveMazeGeneratorCore.InnerMaps.InnerMap:get_Item(int,int):bool:this

;; .NET 7

G_M000_IG04:                ;; offset=0099H
       418D57FE             lea      edx, [r15-02H]
       488BCE               mov      rcx, rsi
       458BC4               mov      r8d, r12d
       488B06               mov      rax, qword ptr [rsi]
       4C8B6848             mov      r13, qword ptr [rax+48H]
       41FF5518             call     [r13+18H]DeveMazeGeneratorCore.InnerMaps.InnerMap:get_Item(int,int):bool:this

I suspect in the .NET 7 OSR case there is no dominating deref of rsi at the top of the loop or outside, so this sequence cannot be safely hoisted?

@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Nov 10, 2022
@ghost
Copy link

ghost commented Nov 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I've been working on a Maze Generator as a personal hobby project for a number of years now and always try to keep it up to date with the latest .NET stuff.
Yesterday .NET 7.0 came out and I immediately upgraded my project to see if it would bring performance benefits.

Sadly the opposite seems to be true.

On my main computer:
Processor: AMD 5950x
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

Before .NET 7 Upgrade:
image

After .NET 7 upgrade:
image

On my laptop:
Processor: Intel Core i7 12700H
RAM: 128GB
Operating System: Windows 11
dotnet SDK: 7.0.100

I didn't make screenshots but the times went from:

Before: 2.1 seconds
After: 2.6 seconds

My code

The following file contains the implementation of the Algorithm I'm using. Besides the code in this file I'm using my own implementation of a Random generator + my own implementation of a BitArray to store the maze.

https://github.com/devedse/DeveMazeGeneratorCore/blob/master/DeveMazeGeneratorCore/Generators/AlgorithmBacktrack2Deluxe2.cs

Reproduce

I've created 2 branches that show the difference quite easily by simply running the ConsoleApp:

dotnet 6:
https://github.com/devedse/DeveMazeGeneratorCore

dotnet 7:
https://github.com/devedse/DeveMazeGeneratorCore/tree/dotnet7

Additional info

Even running them at the same time side by side you can clearly see the performance difference:
image

Some more research and doing a NativeAOT compilation to see if this differs (it does?!?)

I did some more investigation and compiled my code like this:

DeveMazeGeneratorCore\DeveMazeGeneratorCore.ConsoleApp> dotnet publish -r win-x64 -c Release -p:PublishReadyToRun=true -p:PublishSingleFile=true --self-contained

This results in 2 exe files that for some reason also differ quite significantly in performance:

  1. DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\DeveMazeGenerator.ConsoleApp.exe (9.265kb):
    image

  2. DeveMazeGeneratorCore.ConsoleApp\bin\Release\net7.0\win-x64\publish\DeveMazeGenerator.ConsoleApp.exe (100.360kb):
    image

The second file (in the publish folder) does seem to go slightly faster then the original .NET 6.0.

Author: devedse
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member

Also @EgorBo FYI PGO doesn't help here at all.

@devedse
Copy link
Author

devedse commented Nov 11, 2022

@AndyAyersMS

  <PropertyGroup>
    <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>

I added this setting to the ConsoleApp project and this seemed to remove the overhead and even improve the performance slightly! :)

Dotnet 6:
image

Dotnet 7 <TieredCompilationQuickJitForLoops>true</TieredCompilationQuickJitForLoops>:
image

Dotnet 7 <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>:
image

I'm curious though if this is something that should be improved in the csharp runtime as I would expect the TieredCompilation to automatically improve the loop once it's been run a few times.

@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Dec 27, 2022
@AndyAyersMS
Copy link
Member

I'm curious though if this is something that should be improved in the csharp runtime as I would expect the TieredCompilation to automatically improve the loop once it's been run a few times.

Yes it is something that should be improved -- if you set TieredCompilationQuickJitForLoops false then we will eagerly optimize the method instead of tiering it up, and in this particular case (though not most other cases) that gives more well-optimized code.

The issue here is that in .NET 7 with OSR, we no longer see some of the code that executed before the main loop in AlgorithmBacktrack2Deluxe2:GoGenerateInternal, in particular the fetch of a method table slot off of V02. This fetch is potentially faulting but also invariant:

N013 ( 33, 23) [000070] --CXG------                         *  CALLV vt-ind void   DeveMazeGeneratorCore.InnerMaps.InnerMap.set_Item $VN.Void
N012 (  9,  8) [000665] n--X------- control expr            \--*  IND       long   <l:$3d1, c:$3d0>
N011 (  7,  6) [000664] ---X---N---                            \--*  ADD       long   $3cf
N009 (  6,  5) [000662] #--X-------                               +--*  IND       long   $3cd
N008 (  4,  3) [000661] ---X---N---                               |  \--*  ADD       long   $3cc
N006 (  3,  2) [000659] #--X-------                               |     +--*  IND       long   $3ca
N005 (  1,  1) [000658] -----------                               |     |  \--*  LCL_VAR   ref    V02 arg1         u:1 $100
N007 (  1,  1) [000660] -----------                               |     \--*  CNS_INT   long   80 $14c
N010 (  1,  1) [000663] -----------                               \--*  CNS_INT   long   32 $14d

so in .NET 6 the dominating pre-loop case lets all the in-loop cases get CSE'd, while in .NET 7 and current .NET 8 we can only CSE within the loop, and that double indir within the loop is apparently costly.

Possible fixes include loop peeing and/or cloning to expose an invariant INDIR tree (which we don't do right now) gated by a null check of V02.

I would expect PGO to help as we should be able to clone based on type; let me look into that next.

@AndyAyersMS
Copy link
Member

With .NET 8, PGO does seem to help, I get considerably faster times. So that is another possible fix. We indeed clone based on type the cloned loop is free of calls.

In .NET 7 we seem to be unable to properly resolve the call targets for .get_Item, not sure why, and this inhibits devirt.

@AndyAyersMS
Copy link
Member

Without PGO it seems like we could clone the loop to expose the invariant method table fetch and make it non-faulting in the hot loop, and then hoist it out.

I prototyped the cloning part and it's not too bad. But these loads are conditional within the loop and so we won't hoist them. Fixing that may be a bit more involved. Also note that any OSR loop is quite likely to be very hot, as we know that it's already iterated several thousand times. So we might want to be more aggressive optimizing these loops (however we have to be careful not to do a better job than Tier1 or we will see some really wonky perf behaviors).

Considering loop L00 to clone for optimizations.
Checking loop L00 for optimization candidates (invariant loads)
Loop L00 has invariant faulting load [000652] on V02
Loop L00 has invariant faulting load [000660] on V02
Loop L00 has invariant faulting load [000668] on V02
Loop L00 has invariant faulting load [000676] on V02
Loop L00 has invariant faulting load [000747] on V02
Loop L00 has invariant faulting load [000755] on V02
------------------------------------------------------------

------------------------------------------------------------
Deriving cloning conditions for L00
Conditions: (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
No array deref conditions
Block conditions:
0 = (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
Evaluating 6 loop cloning conditions for loop L00
Considering condition 0: (V02 NE null), could not be evaluated
Considering condition 1: (V02 NE null), could not be evaluated
Considering condition 2: (V02 NE null), could not be evaluated
Considering condition 3: (V02 NE null), could not be evaluated
Considering condition 4: (V02 NE null), could not be evaluated
Considering condition 5: (V02 NE null), could not be evaluated
Evaluation result allTrue = false, anyFalse = false
Before optimizing cloning conditions
	(V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null) && (V02 NE null)
After optimizing cloning conditions
	(V02 NE null)
After optimizing block-level cloning conditions
	(V02 NE null)

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Mar 31, 2023
Motivating example is the OSR loop in dotnet#78110 where in OSR we don't see
a deref before the loop, so can't hoist repeated virtual method lookups.

Turns out that these indirs are conditional so we end up not hoisting,
but at least this does the cloning part.

Still needs some polish, but should be functionally correct.
@AndyAyersMS
Copy link
Member

As noted above, PGO in .NET 8 will provide a 20% performance boost.

With .NET 8 Preview 5 (coming out next week) PGO is now enabled by default. So I am going to close this issue.

;; net6.0

Generation time: 00:00:02.4523739 <<<<<<<< new fastest time          Fastest: 00:00:02.4523739
Generation time: 00:00:02.4772369                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4934081                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4947292                                    Fastest: 00:00:02.4523739
Generation time: 00:00:02.4584699                                    Fastest: 00:00:02.4523739

;; net7.0

Generation time: 00:00:02.5462457 <<<<<<<< new fastest time          Fastest: 00:00:02.5462457
Generation time: 00:00:02.5450569 <<<<<<<< new fastest time          Fastest: 00:00:02.5450569
Generation time: 00:00:02.6279318                                    Fastest: 00:00:02.5450569
Generation time: 00:00:02.5372493 <<<<<<<< new fastest time          Fastest: 00:00:02.5372493
Generation time: 00:00:02.5083985 <<<<<<<< new fastest time          Fastest: 00:00:02.5083985

;; net8.0 (Preview 4)

Generation time: 00:00:02.5601764 <<<<<<<< new fastest time          Fastest: 00:00:02.5601764
Generation time: 00:00:02.5641837                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.5902517                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.7199203                                    Fastest: 00:00:02.5601764
Generation time: 00:00:02.5658560                                    Fastest: 00:00:02.5601764

;; net8.0 (Preview4 + DOTNET_TieredPGO=1)

Generation time: 00:00:01.9386467 <<<<<<<< new fastest time          Fastest: 00:00:01.9386467
Generation time: 00:00:01.9029803 <<<<<<<< new fastest time          Fastest: 00:00:01.9029803
Generation time: 00:00:01.9701224                                    Fastest: 00:00:01.9029803
Generation time: 00:00:01.9371083                                    Fastest: 00:00:01.9029803
Generation time: 00:00:01.9046991                                    Fastest: 00:00:01.9029803

@devedse
Copy link
Author

devedse commented Jun 9, 2023

Thanks @AndyAyersMS for the elaborate explanation and work on making dotnet even faster 😄

@ghost ghost locked as resolved and limited conversation to collaborators Jul 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI regression-from-last-release tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants