Lea followed by an add is generated #121064

Open
DenisYaroshevskiy opened this issue Dec 24, 2024 · 3 comments

@DenisYaroshevskiy

Hi

I observe the following codegen (the input code is tricky to share)

 11.79 │570:   vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
  5.93 │       vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
  9.08 │       vpand        ymm0,ymm1,ymm0
  2.11 │       vpmovmskb    esi,ymm0
  0.03 │       lea          r12,[rcx+rdx*1]
  2.13 │       add          r12,0x20
 11.17 │       test         esi,esi
       │     ↓ jne          641

The "add" immediately following the "lea" instruction looks odd to me. Is this expected?

@topperc
Collaborator

topperc commented Dec 24, 2024

Prior to Icelake, on Intel CPUs, 3-source LEAs (base + index + displacement) have a latency of 3 cycles and a reciprocal throughput of 1. 2-source LEAs and add have a latency of 1 cycle and a reciprocal throughput of 0.25. So we split 3-source LEAs into two instructions: a 2-source LEA followed by an add. I think -mtune=icelake or newer, or tuning for AMD CPUs, will disable this.
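
To make the trade-off in this comment concrete, here is a toy Python model of the decision. It is only a sketch, not LLVM's implementation: the names `Address`, `lower_address`, `chain_latency` and the `slow_3op_lea` flag are hypothetical, the latencies are the ones quoted in the comment, and the displacement is assumed to be non-negative.

```python
from dataclasses import dataclass

# Latencies (in cycles) quoted in the comment for Intel CPUs before Icelake.
LATENCY_3_SOURCE_LEA = 3
LATENCY_SIMPLE = 1  # 2-source LEA or add


@dataclass(frozen=True)
class Address:
    base: str
    index: str
    disp: int  # displacement in bytes, assumed non-negative in this sketch


def lower_address(dst, addr, slow_3op_lea):
    """Return the instructions that compute base + index + disp into dst.

    slow_3op_lea is a hypothetical flag standing in for "the tuned-for CPU
    has slow 3-source LEAs".
    """
    two_source = f"lea {dst},[{addr.base}+{addr.index}*1]"
    if addr.disp == 0:
        return [two_source]
    if slow_3op_lea:
        # Split: 2-source LEA, then add the displacement separately.
        return [two_source, f"add {dst},{hex(addr.disp)}"]
    return [f"lea {dst},[{addr.base}+{addr.index}*1+{hex(addr.disp)}]"]


def chain_latency(instructions):
    """Latency of the dependent chain on a pre-Icelake CPU, per the quoted figures."""
    total = 0
    for ins in instructions:
        # Two '+' signs inside a lea operand means base + index + displacement.
        is_3_source_lea = ins.startswith("lea") and ins.count("+") == 2
        total += LATENCY_3_SOURCE_LEA if is_3_source_lea else LATENCY_SIMPLE
    return total


addr = Address("rcx", "rdx", 0x20)
for flag in (True, False):
    seq = lower_address("r12", addr, slow_3op_lea=flag)
    print(flag, seq, chain_latency(seq))
```

Under these assumptions the split sequence costs 2 cycles of latency on an older Intel core, against 3 cycles for the single 3-source LEA, at the price of one extra instruction.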

@DenisYaroshevskiy
Author

I ran the experiment.

My CPU:

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping        : 13
microcode       : 0x100
cpu MHz         : 4759.331
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit srbds mmio_stale_data retbleed eibrs_pbrsb gds bhi
bogomips        : 7200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

With -march=native, -mtune=native:

 11.70 │570:┌─→vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
  5.38 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
 10.17 │    │  vpand        ymm0,ymm1,ymm0
  2.26 │    │  vpmovmskb    esi,ymm0
  0.03 │    │  lea          r12,[rcx+rdx*1]
  2.05 │    │  add          r12,0x20
 12.23 │    │  test         esi,esi
       │    │↓ jne          641
  0.72 │    │  cmp          r12,r10
       │    │↓ je           650
  0.05 │    │  vpcmpeqb     ymm0,ymm4,YMMWORD PTR [r12]
 13.62 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1+0x20]
  0.53 │    │  vpand        ymm0,ymm1,ymm0
  0.03 │    │  vpmovmskb    esi,ymm0
  3.06 │    │  test         esi,esi
       │    │↓ jne          664
 10.51 │    │  lea          rsi,[rcx+rdx*1]
  0.50 │    │  add          rsi,0x40
  0.03 │    │  add          rdx,0x40
  0.03 │    ├──cmp          rsi,r10
  2.93 │    └──jne          570

With -march=native -mtune=icelake-client:

 12.39 │560:┌─→vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
 12.32 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
  0.36 │    │  vpand        ymm0,ymm1,ymm0
 12.00 │    │  vpmovmskb    esi,ymm0
  0.05 │    │  lea          r12,[rcx+rdx*1+0x20]
 11.62 │    │  test         esi,esi
       │    │↓ jne          631
  0.31 │    │  cmp          r12,r10
       │    │↓ je           63e
  0.99 │    │  vpcmpeqb     ymm0,ymm4,YMMWORD PTR [r12]
 11.55 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1+0x20]
  0.28 │    │  vpand        ymm0,ymm1,ymm0
  0.94 │    │  vpmovmskb    esi,ymm0
  0.88 │    │  test         esi,esi
       │    │↓ jne          64d
 10.72 │    │  lea          rsi,[rcx+rdx*1+0x40]
  0.42 │    │  add          rdx,0x40
  0.03 │    ├──cmp          rsi,r10
  1.58 │    └──jne          560

The latter is slower than the default on my machine, so I guess the lea split is the right choice for this CPU.

Numbers (the padding indicates different code alignment; I test alignments from 0 to 56 bytes in increments of 8):

-mtune=native

padding:0         228 ns
padding:8         234 ns
padding:16        247 ns
padding:24        235 ns
padding:32        228 ns
padding:40        255 ns
padding:48        247 ns
padding:56        256 ns

-mtune=icelake-client

padding:0         279 ns
padding:8         295 ns
padding:16        248 ns
padding:24        222 ns
padding:32        247 ns
padding:40        255 ns
padding:48        247 ns
padding:56        260 ns

Curiously, the best case for -mtune=icelake-client (222 ns) is slightly better than the best case for the default tuning (228 ns), but that's probably just noise.
The code alignment effects with -mtune=icelake-client are also much worse, probably because the branches end up closer together.
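
For anyone who wants to compare the two runs at a glance, here is a small Python sketch that summarizes the timings listed in this comment. The numbers are copied from the tables; the `summarize` helper is a hypothetical name introduced only for this illustration.

```python
from statistics import mean

# Timings in ns for paddings 0, 8, ..., 56, copied from the tables in this comment.
timings_ns = {
    "-mtune=native": [228, 234, 247, 235, 228, 255, 247, 256],
    "-mtune=icelake-client": [279, 295, 248, 222, 247, 255, 247, 260],
}


def summarize(samples):
    """Best, worst, mean, and best-to-worst spread of one run (all in ns)."""
    best, worst = min(samples), max(samples)
    return {"best": best, "worst": worst, "mean": mean(samples), "spread": worst - best}


for tuning, samples in timings_ns.items():
    s = summarize(samples)
    print(f"{tuning:24} best={s['best']} worst={s['worst']} "
          f"mean={s['mean']:.1f} spread={s['spread']}")
```

On these eight samples per tuning, the best-to-worst spread is 28 ns for the default tuning and 73 ns for -mtune=icelake-client, and the mean is about 241 ns versus about 257 ns. With so few samples this is only a rough comparison.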
