Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recognize more "local addr" trees as invariant #68484

Merged
merged 3 commits into from
Apr 25, 2022

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Apr 25, 2022

This will ensure we do not evaluate e.g. the retbuffer into a temp when
unnecessary.

I first used IsLocalAddrExpr here but it did not bring any extra diffs and it was much more expensive.

This will ensure we do not evaluate e.g. the retbuffer into a temp when
unnecessary.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 25, 2022
@ghost ghost assigned jakobbotsch Apr 25, 2022
@ghost
Copy link

ghost commented Apr 25, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This will ensure we do not evaluate e.g. the retbuffer into a temp when
unnecessary.

I first used IsLocalAddrExpr here but it did not bring any extra diffs and it was much more expensive.

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@jakobbotsch
Copy link
Member Author

I first used IsLocalAddrExpr here but it did not bring any extra diffs and it was much more expensive.

Hmm, this does not really make sense now that I think about it. Will need to look again.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 25, 2022

I switched to IsLocalAddrExpr. It's a bit more expensive due to looking into GT_ADD nodes (which does not seem to help much on the improvements in this case, it appears), but it does make the code clearer. Furthermore it handles GT_ADDR(GT_LCL_FLD) that seems to be created more often on arm32.

@@ -22865,7 +22865,7 @@ bool GenTreeLclFld::IsOffsetMisaligned() const

bool GenTree::IsInvariant() const
{
return OperIsConst() || Compiler::impIsAddressInLocal(this);
return OperIsConst() || IsLocalAddrExpr();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's notable that combined with #68469, this will bind IsLocalAddrExpr "the simple version" to LocalAddressVisitor like DefinesLocalAddr and IsLocalAddrExpr "the not simple version" are bound right now (which may also not be a new thing, it may have been like this already due to struct promotion requirements).

BTW, this is also the reason we must recognize ADDs in all these functions, they're produced for structs with fields at offsets larger than GenTreeLclFld (and the emitter) can represent.

Copy link
Member Author

@jakobbotsch jakobbotsch Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even without #68469 there are/were non-trivial invariants between the two. While before this PR we 'supported' call morphing introducing a local store for the address, there was already an implicit assumption that no extra uses of this temp were ever created. So in a way we are just trading one assumption for another and have a slightly more explicit assert about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsatisfying, but I am hard pressed to think about how to improve the situation.
Probably it would be ideal if we could unify DefinesLocalAddr, the two variants of IsLocalAddrExpr and what LocalAddressVisitor is doing somehow, but that also seems difficult.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, this is what #11057 is about :).

The original "north star" design was that we'd only have STORE_LCL_VAR|FLD for tracked locals, this return buffer optimization makes that "hard" unfortunately, but even with it (assuming I won't find a way to represent it as a store after all), we'd have:

  1. DefinesLocalAddr becomes equivalent to OperIsLocalAddr.
  2. All its variations (IsLocalAddrExpr "simple" and "not simple", as well as fgIsIndirOfAddrOfLocal) are deleted.
  3. LocalAddressVisitor transforms all ASGs into STORE_LCL_VAR|FLD, and for the return buffer optimization, LCL_VAR|FLD_ADDR.
  4. All indirect uses become LCL_VAR|FLD (this is Remove the OBJ(ADDR) wrapping created for calls by the importer #51569)

As a point-in-time improvement, the "not simple" IsLocalAddrExpr will be deleted once/if physical VN changes are in.

@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib

@jakobbotsch
Copy link
Member Author

Diffs

@AndyAyersMS
Copy link
Member

The diffs are largely from better staging of arguments to calls? Can you share an example or two?

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 25, 2022

The diffs are largely from better staging of arguments to calls? Can you share an example or two?

Yes, previously we would end up creating temps for a lot of common address trees when a later argument required spilling previous arguments. In particular we did not recognize GT_LCL_VAR_ADDR as invariant, so now we get improvements like:

diff --git "a/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\base\\15134.dasm" "b/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\diff\\15134.dasm"
index fb40ace..a78f69d 100644
--- "a/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\base\\15134.dasm"
+++ "b/C:\\dev\\temp\\foobar\\asm.libraries.pmi.windows.x64.checked\\diff\\15134.dasm"
@@ -4,49 +4,46 @@
 ; rbp based frame
 ; partially interruptible
 ; No matching PGO data
 ; Final local variable assignments
 ;
 ;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+00H]   do-not-enreg[] "OutgoingArgSpace"
 ;  V01 tmp1         [V01    ] (  1,  1   )  struct (72) [rbp-58H]   do-not-enreg[XS] must-init addr-exposed "NewObj constructor temp"
 ;  V02 tmp2         [V02    ] (  1,  1   )  struct (24) [rbp-70H]   do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
 ;  V03 tmp3         [V03    ] (  1,  1   )  struct (24) [rbp-88H]   do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
 ;  V04 tmp4         [V04    ] (  1,  1   )  struct (24) [rbp-A0H]   do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
-;  V05 tmp5         [V05    ] (  1,  1   )    long  ->  [rbp-A8H]   do-not-enreg[] "argument with side effect"
 ;
-; Lcl frame size = 192
+; Lcl frame size = 176
 
 G_M9930_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
        push     rbp
        push     rdi
        push     rsi
-       sub      rsp, 192
+       sub      rsp, 176
        vzeroupper 
-       lea      rbp, [rsp+D0H]
+       lea      rbp, [rsp+C0H]
        vxorps   xmm4, xmm4
        mov      rax, -144
        vmovdqa  xmmword ptr [rbp+rax-10H], xmm4
        vmovdqa  xmmword ptr [rax+rbp], xmm4
        vmovdqa  xmmword ptr [rbp+rax+10H], xmm4
        add      rax, 48
        jne      SHORT  -5 instr
 						;; size=62 bbWeight=1    PerfScore 12.58
 G_M9930_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
-       lea      rcx, [rbp-58H]
-       mov      qword ptr [rbp-A8H], rcx
        mov      rcx, 0xD1FFAB1E
        mov      edx, 232
        call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
        mov      rcx, 0xD1FFAB1E
        mov      rcx, gword ptr [rcx]
        ; gcrRegs +[rcx]
-						;; size=44 bbWeight=1    PerfScore 5.25
+						;; size=33 bbWeight=1    PerfScore 3.75
 G_M9930_IG03:        ; , nogc, extend
        vmovdqu  xmm0, xmmword ptr [rcx+8]
        vmovdqu  xmmword ptr [rbp-70H], xmm0
        mov      rdx, qword ptr [rcx+24]
        mov      qword ptr [rbp-60H], rdx
 						;; size=20 bbWeight=1    PerfScore 8.00
 G_M9930_IG04:        ; , extend
        mov      rcx, 0xD1FFAB1E
        ; gcrRegs -[rcx]
        mov      edx, 232
@@ -77,58 +74,58 @@ G_M9930_IG07:        ; , nogc, extend
        mov      qword ptr [rbp-90H], r8
 						;; size=26 bbWeight=1    PerfScore 8.00
 G_M9930_IG08:        ; , extend
        lea      rdx, bword ptr [rbp-70H]
        ; gcrRegs -[rdx]
        ; byrRegs +[rdx]
        lea      r8, bword ptr [rbp-88H]
        ; byrRegs +[r8]
        lea      r9, bword ptr [rbp-A0H]
        ; byrRegs +[r9]
-       mov      rcx, qword ptr [rbp-A8H]
+       lea      rcx, [rbp-58H]
        call     [hackishModuleName:hackishMethodName(Microsoft.CodeAnalysis.SymbolInfo,Microsoft.CodeAnalysis.SymbolInfo,Microsoft.CodeAnalysis.SymbolInfo):this]
        ; byrRegs -[rdx r8-r9]
        mov      rax, 0xD1FFAB1E
        mov      rdi, gword ptr [rax]
        ; gcrRegs +[rdi]
        add      rdi, 8
        ; gcrRegs -[rdi]
        ; byrRegs +[rdi]
        lea      rsi, bword ptr [rbp-58H]
        ; byrRegs +[rsi]
        call     CORINFO_HELP_ASSIGN_BYREF
        movsq    
        call     CORINFO_HELP_ASSIGN_BYREF
        call     CORINFO_HELP_ASSIGN_BYREF
        movsq    
        call     CORINFO_HELP_ASSIGN_BYREF
        call     CORINFO_HELP_ASSIGN_BYREF
        movsq    
        call     CORINFO_HELP_ASSIGN_BYREF
        nop      
-						;; size=89 bbWeight=1    PerfScore 17.75
+						;; size=86 bbWeight=1    PerfScore 17.25
 G_M9930_IG09:        ; , epilog, nogc, extend
-       add      rsp, 192
+       add      rsp, 176
        pop      rsi
        pop      rdi
        pop      rbp
        ret      

or e.g.

        vxorps   xmm0, xmm0
        vmovdqu  xmmword ptr [rsp+100H], xmm0
        mov      dword ptr [rsp+110H], r8d
-       lea      r8, [rsp+100H]
-       mov      dword ptr [rsp+48H], ecx
-       mov      dword ptr [rsp+4CH], edx
+       mov      dword ptr [rsp+48H], edx
+       mov      dword ptr [rsp+4CH], ecx
        mov      qword ptr [rsp+50H], rax
-       mov      rcx, r8
        lea      rdx, bword ptr [rsp+48H]
        ; byrRegs +[rdx]
+       lea      rcx, [rsp+100H]
        call     [hackishModuleName:hackishMethodName(System.Decimal):this]

The regressions seem to be worse register allocation due to the movement of those trees:

 ;  V16 tmp10        [V16    ] (  4,  8   )  struct (16) [rsp+28H]   do-not-enreg[XSF] must-init addr-exposed "by-value struct argument"
-;* V17 tmp11        [V17,T08] (  0,  0   )     ref  ->  zero-ref    single-def "argument with side effect"
-;  V18 tmp12        [V18,T04] (  2,  4   )    long  ->  rdx         "argument with side effect"
+;* V17 tmp11        [V17,T07] (  0,  0   )     ref  ->  zero-ref    single-def "argument with side effect"
 ;
 ; Lcl frame size = 72
 
@@ -56,19 +55,19 @@ G_M42744_IG02:
 						;; size=23 bbWeight=1    PerfScore 5.25
 G_M42744_IG03:
        mov      ecx, ebx
-       mov      edx, ebp
-       add      rcx, rdx
-       mov      edx, dword ptr [rsi+8]
-       cmp      rcx, rdx
+       mov      r8d, ebp
+       add      rcx, r8
+       mov      r8d, dword ptr [rsi+8]
+       cmp      rcx, r8
        ja       SHORT G_M42744_IG07
-						;; size=15 bbWeight=0.50 PerfScore 2.00
+						;; size=17 bbWeight=0.50 PerfScore 2.00
 G_M42744_IG04:
-       lea      rdx, [rsp+38H]
        mov      gword ptr [rsp+28H], rsi
        mov      dword ptr [rsp+30H], ebx
        mov      dword ptr [rsp+34H], ebp
        mov      rcx, rdi
        lea      r8, bword ptr [rsp+28H]
+       lea      rdx, [rsp+38H]
        mov      r9, gword ptr [rsp+90H]
        call     [System.Security.Cryptography.CryptoStream:ReadAsyncInternal(System.Memory`1[Byte],System.Threading.CancellationToken):System.Threading.Tasks.ValueTask`1[Int32]:this]
        lea      rcx, [rsp+38H]
@@ -97,11 +96,11 @@ G_M42744_IG07:
        int3     
 						;; size=7 bbWeight=0    PerfScore 0.00
 
-; Total bytes of code 162, prolog size 30, PerfScore 38.90, instruction count 53, allocated bytes for code 164 (MethodHash=8dce5907) for method System.Security.Cryptography.CryptoStream:ReadAsync(System.Byte[],int,int,System.Threading.CancellationToken):System.Threading.Tasks.Task`1[Int32]:this
+; Total bytes of code 164, prolog size 30, PerfScore 39.10, instruction count 53, allocated bytes for code 166 (MethodHash=8dce5907) for method System.Security.Cryptography.CryptoStream:ReadAsync(System.Byte[],int,int,System.Threading.CancellationToken):System.Threading.Tasks.Task`1[Int32]:this

In some of the bigger regressions we end up with new spills due to it.

I think the aspnet improvements are so much bigger because we have tier-0 contexts there and avoiding the creation of these temps is much more impactful when the temps are not tracked.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Apr 25, 2022

Here's another regression where we now need to zero a smaller area of stack and we end up with a bunch of extra instructions due to it:

 ; Assembly listing for method RuntimeCustomAttributeData:get_ConstructorArguments():IList`1:this
 ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
 ; Tier-0 compilation
 ; MinOpts code
 ; rbp based frame
 ; partially interruptible
 ; Final local variable assignments
 ;
 ;  V00 this         [V00    ] (  1,  1   )     ref  ->  [rbp+10H]   do-not-enreg[] this class-hnd
 ;  V01 loc0         [V01    ] (  1,  1   )     ref  ->  [rbp-40H]   do-not-enreg[] must-init class-hnd exact
 ;  V02 loc1         [V02    ] (  1,  1   )     int  ->  [rbp-44H]   do-not-enreg[]
 ;  V03 loc2         [V03    ] (  1,  1   )  struct (48) [rbp-78H]   do-not-enreg[XS] must-init addr-exposed
 ;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+00H]   do-not-enreg[] "OutgoingArgSpace"
 ;  V05 tmp1         [V05    ] (  1,  1   )  struct (16) [rbp-88H]   do-not-enreg[XS] must-init addr-exposed "NewObj constructor temp"
 ;  V06 tmp2         [V06    ] (  1,  1   )     int  ->  [rbp-90H]   do-not-enreg[X] addr-exposed "patchpoint counter"
 ;  V07 tmp3         [V07    ] (  1,  1   )  struct (48) [rbp-C0H]   do-not-enreg[XS] must-init addr-exposed "by-value struct argument"
-;  V08 tmp4         [V08    ] (  1,  1   )    long  ->  [rbp-C8H]   do-not-enreg[] "argument with side effect"
-;  V09 tmp5         [V09    ] (  1,  1   )     ref  ->  [rbp-D0H]   do-not-enreg[] must-init "argument with side effect"
+;  V08 tmp4         [V08    ] (  1,  1   )     ref  ->  [rbp-C8H]   do-not-enreg[] must-init "argument with side effect"
 ;
 ; Lcl frame size = 224

 G_M33983_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
        push     rbp
        push     rdi
        push     rsi
        sub      rsp, 224
        vzeroupper
        lea      rbp, [rsp+F0H]
+       xor      eax, eax
+       mov      qword ptr [rbp-C8H], rax
        vxorps   xmm4, xmm4
-       mov      rax, -144
+       vmovdqa  xmmword ptr [rbp-C0H], xmm4
+       vmovdqa  xmmword ptr [rbp-B0H], xmm4
+       mov      rax, -96
        vmovdqa  xmmword ptr [rbp+rax-40H], xmm4
        vmovdqa  xmmword ptr [rbp+rax-30H], xmm4
        vmovdqa  xmmword ptr [rbp+rax-20H], xmm4
        add      rax, 48
        jne      SHORT  -5 instr
        mov      qword ptr [rbp-40H], rax
        mov      gword ptr [rbp+10H], rcx

Might be worth it to zero more than strictly necessary to avoid handling alignment?

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I'm not all that worried about the regressions, any time there different/fewer temps there are ripple effects.

@jakobbotsch jakobbotsch merged commit 7adf454 into dotnet:main Apr 25, 2022
@jakobbotsch jakobbotsch deleted the more-invariant-trees branch April 25, 2022 20:32
@ghost ghost locked as resolved and limited conversation to collaborators May 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants