Lea followed by an `add` is generated #121064

DenisYaroshevskiy · 2024-12-24T17:46:03Z

Hi

I observe the following codegen (the input code is tricky to share):

 11.79 │570:   vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
  5.93 │       vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
  9.08 │       vpand        ymm0,ymm1,ymm0
  2.11 │       vpmovmskb    esi,ymm0
  0.03 │       lea          r12,[rcx+rdx*1]
  2.13 │       add          r12,0x20
 11.17 │       test         esi,esi
       │     ↓ jne          641

This `add` following the `lea` instruction looks weird to me. Is this expected?

topperc · 2024-12-24T20:10:25Z

On Intel CPUs prior to Icelake, 3-source LEAs (base + index + displacement, like `[rcx+rdx*1+0x20]`) have a latency of 3 and a reciprocal throughput of 1, while 2-source LEAs and ADD have a latency of 1 and a reciprocal throughput of 0.25. So we split 3-source LEAs into 2 instructions. I think -mtune=icelake or newer, or tuning for AMD CPUs, will disable this.
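A rough back-of-the-envelope model of this trade-off for the `r12 = rcx + rdx + 0x20` computation in the loop is sketched below. It uses only the latency and reciprocal-throughput figures quoted in the comment above, treats reciprocal throughputs as simply additive (a crude assumption that ignores which execution ports each instruction can use), and the names `COSTS`, `chain_latency`, `issue_cost`, `fused_addr`, and `split_addr` are hypothetical. It also checks that both forms produce the same 64-bit value.

```python
# Rough model of r12 = rcx + rdx + 0x20 on pre-Icelake Intel cores, using only
# the figures quoted in the comment: (latency in cycles, reciprocal throughput).
COSTS = {
    "lea3": (3, 1.0),   # lea r12,[rcx+rdx*1+0x20]  base + index + displacement
    "lea2": (1, 0.25),  # lea r12,[rcx+rdx*1]       base + index
    "add":  (1, 0.25),  # add r12,0x20
}

FUSED = ["lea3"]         # form emitted with -mtune=icelake-client
SPLIT = ["lea2", "add"]  # form emitted with the default tuning on this CPU

def chain_latency(seq):
    """Latency of a chain in which each instruction consumes the previous result."""
    return sum(COSTS[op][0] for op in seq)

def issue_cost(seq):
    """Sum of reciprocal throughputs; crudely assumes all ops share the same ports."""
    return sum(COSTS[op][1] for op in seq)

MASK64 = (1 << 64) - 1

def fused_addr(rcx, rdx):
    """Value produced by lea r12,[rcx+rdx*1+0x20] (64-bit wraparound)."""
    return (rcx + rdx + 0x20) & MASK64

def split_addr(rcx, rdx):
    """Value produced by lea r12,[rcx+rdx*1] followed by add r12,0x20."""
    r12 = (rcx + rdx) & MASK64
    return (r12 + 0x20) & MASK64

for name, seq in (("fused", FUSED), ("split", SPLIT)):
    print(f"{name}: latency={chain_latency(seq)} cycles, "
          f"throughput cost={issue_cost(seq)} cycles")
# fused: latency=3 cycles, throughput cost=1.0 cycles
# split: latency=2 cycles, throughput cost=0.5 cycles
```

One caveat on the rewrite itself: unlike LEA, ADD writes EFLAGS, so replacing part of an LEA with an ADD is only valid where the flags are dead. In this loop the `test esi,esi` that follows overwrites them anyway.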

llvmbot · 2024-12-24T20:10:50Z

@llvm/issue-subscribers-backend-x86

DenisYaroshevskiy · 2024-12-25T12:17:06Z

Ran the experiment.

My CPU:

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping        : 13
microcode       : 0x100
cpu MHz         : 4759.331
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit srbds mmio_stale_data retbleed eibrs_pbrsb gds bhi
bogomips        : 7200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

With -march=native, -mtune=native:

 11.70 │570:┌─→vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
  5.38 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
 10.17 │    │  vpand        ymm0,ymm1,ymm0
  2.26 │    │  vpmovmskb    esi,ymm0
  0.03 │    │  lea          r12,[rcx+rdx*1]
  2.05 │    │  add          r12,0x20
 12.23 │    │  test         esi,esi
       │    │↓ jne          641
  0.72 │    │  cmp          r12,r10
       │    │↓ je           650
  0.05 │    │  vpcmpeqb     ymm0,ymm4,YMMWORD PTR [r12]
 13.62 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1+0x20]
  0.53 │    │  vpand        ymm0,ymm1,ymm0
  0.03 │    │  vpmovmskb    esi,ymm0
  3.06 │    │  test         esi,esi
       │    │↓ jne          664
 10.51 │    │  lea          rsi,[rcx+rdx*1]
  0.50 │    │  add          rsi,0x40
  0.03 │    │  add          rdx,0x40
  0.03 │    ├──cmp          rsi,r10
  2.93 │    └──jne          570

With -march=native -mtune=icelake-client:

 12.39 │560:┌─→vpcmpeqb     ymm0,ymm4,YMMWORD PTR [rcx+rdx*1]
 12.32 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1]
  0.36 │    │  vpand        ymm0,ymm1,ymm0
 12.00 │    │  vpmovmskb    esi,ymm0
  0.05 │    │  lea          r12,[rcx+rdx*1+0x20]
 11.62 │    │  test         esi,esi
       │    │↓ jne          631
  0.31 │    │  cmp          r12,r10
       │    │↓ je           63e
  0.99 │    │  vpcmpeqb     ymm0,ymm4,YMMWORD PTR [r12]
 11.55 │    │  vpcmpeqb     ymm1,ymm3,YMMWORD PTR [r9+rdx*1+0x20]
  0.28 │    │  vpand        ymm0,ymm1,ymm0
  0.94 │    │  vpmovmskb    esi,ymm0
  0.88 │    │  test         esi,esi
       │    │↓ jne          64d
 10.72 │    │  lea          rsi,[rcx+rdx*1+0x40]
  0.42 │    │  add          rdx,0x40
  0.03 │    ├──cmp          rsi,r10
  1.58 │    └──jne          560

The latter is slower than the default on my machine, so I guess the LEA split is correct.

Numbers (the padding indicates different code alignment; I test alignments from 0 to 56 bytes in increments of 8):

padding (bytes)   -mtune=native   -mtune=icelake-client
0                 228 ns          279 ns
8                 234 ns          295 ns
16                247 ns          248 ns
24                235 ns          222 ns
32                228 ns          247 ns
40                255 ns          255 ns
48                247 ns          247 ns
56                256 ns          260 ns

Curiously, the best case for icelake-client (222 ns at padding 24) is slightly better than the best default-tuning result (228 ns), but that's probably just noise.
The code-alignment effects with -mtune=icelake-client are also much worse, probably because the branches end up closer together.
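To put numbers on the two observations above, the sketch below summarizes the timings exactly as reported (best, worst, spread across paddings, and mean per tuning) and lists the paddings at which icelake-client tuning was strictly faster. It only restates the measured data; the helper `summarize` and the variable names are hypothetical, and eight samples per tuning are not enough to establish significance.

```python
from statistics import mean

# Timings in ns, copied from the measurements above; index i is padding PADDINGS[i].
PADDINGS = [0, 8, 16, 24, 32, 40, 48, 56]
NATIVE = [228, 234, 247, 235, 228, 255, 247, 256]    # -mtune=native
ICELAKE = [279, 295, 248, 222, 247, 255, 247, 260]   # -mtune=icelake-client

def summarize(times):
    """Return (best, worst, spread, mean) across all paddings for one tuning."""
    best, worst = min(times), max(times)
    return best, worst, worst - best, mean(times)

# Paddings at which icelake-client tuning was strictly faster than native tuning.
icelake_wins = [p for p, n, i in zip(PADDINGS, NATIVE, ICELAKE) if i < n]

for name, times in (("native", NATIVE), ("icelake-client", ICELAKE)):
    best, worst, spread, avg = summarize(times)
    print(f"{name}: best={best} worst={worst} spread={spread} mean={avg:.3f} (ns)")
print("icelake-client faster at paddings:", icelake_wins)
# native: best=228 worst=256 spread=28 mean=241.250 (ns)
# icelake-client: best=222 worst=295 spread=73 mean=256.625 (ns)
# icelake-client faster at paddings: [24]
```

The spread is roughly 2.6 times larger with icelake-client tuning (73 ns versus 28 ns), and that tuning wins at only one of the eight paddings. This is consistent with reading its 222 ns best case as noise or an alignment artifact rather than a real improvement.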

github-actions bot added the new issue label Dec 24, 2024

topperc added backend:X86 and removed new issue labels Dec 24, 2024
