
Speedup _observed_ with dynamic broadcasting #40

Open
navidcy opened this issue Nov 20, 2022 · 5 comments

Comments

@navidcy

navidcy commented Nov 20, 2022

Despite the README's claim that FastBroadcast is slower than base broadcasting when the broadcast is dynamic, I actually get:

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  45.125 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  18.375 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
@chriselrod
Collaborator

chriselrod commented Nov 20, 2022

What's the problem? Your fast_foo9 is over 2x faster.

EDIT: oh, even when broadcasting b. Huh.

@chriselrod
Collaborator

chriselrod commented Nov 20, 2022

FWIW, I got

julia> using FastBroadcast

julia> function fast_foo9(a, b, c, d, e, f, g, h, i)
           @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
fast_foo9 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(100, 100, 2) for i in 1:9];

julia> using BenchmarkTools

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  38.674 μs (0 allocations: 0 bytes)

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  83.503 μs (0 allocations: 0 bytes)

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  85.732 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  30.452 μs (0 allocations: 0 bytes)

So I can reproduce.
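For readers following along: the surprise is that FastBroadcast's advantage is usually attributed to its static fast path, where every argument has full-size axes. When extents differ (as with `b = [1.0]`), `@..` falls back to a dynamic broadcasting branch, which the README says should be slower than base broadcasting. A minimal sketch of that two-path idea (illustrative names and structure only, not FastBroadcast's actual internals):

```julia
# Hypothetical sketch of a FastBroadcast-style two-path strategy.
# This is illustrative, not the package's real implementation.
function twopath_add!(a, b, c)
    if axes(a) == axes(b) == axes(c)
        # Fast path: all extents match, so plain linear indexing suffices.
        @inbounds @simd for I in eachindex(a, b, c)
            a[I] = b[I] + c[I]
        end
    else
        # Dynamic path: fall back to Base broadcasting, which stretches
        # length-1 dimensions to match the destination.
        broadcast!(+, a, b, c)
    end
    return a
end
```

Calling this with a length-1 `b` takes the dynamic branch; what this issue shows is that even that branch can beat `@.`.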

@chriselrod
Collaborator

Comparing 30k evaluations, where b is fullsize and bs is the small version:
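(`@pstats` comes from LinuxPerf.jl; `foreachf` and `bs` are not defined anywhere in the thread. A plausible reconstruction of that setup, guessed rather than copied from the actual session:)

```julia
# Guess at the undefined helpers: `foreachf` presumably calls `f` N times,
# so @pstats samples many evaluations; `bs` is presumably the length-1
# array from the earlier comment. Not the exact code from this thread.
function foreachf(f::F, N::Int, args::Vararg{Any,A}) where {F,A}
    for _ in 1:N
        f(args...)
    end
    return nothing
end

bs = [1.0];
```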

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.60e+09   49.9%  #  3.6 cycles per ns
┌ instructions             2.96e+09   50.0%  #  0.8 insns per cycle
│ branch-instructions      2.27e+08   50.0%  #  7.7% of insns
└ branch-misses            1.85e+06   50.0%  #  0.8% of branch insns
┌ task-clock               1.01e+09  100.0%  #  1.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┌ L1-dcache-load-misses    6.09e+08   25.0%  # 48.6% of dcache loads
│ L1-dcache-loads          1.25e+09   25.0%
└ L1-icache-load-misses    8.45e+06   25.0%
┌ dTLB-load-misses         1.23e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.25e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.90e+09   49.9%  #  3.6 cycles per ns
┌ instructions             1.43e+09   50.0%  #  0.4 insns per cycle
│ branch-instructions      7.52e+07   50.0%  #  5.3% of insns
└ branch-misses            3.01e+04   50.0%  #  0.0% of branch insns
┌ task-clock               1.09e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.76e+08   25.0%  # 112.4% of dcache loads
│ L1-dcache-loads          6.02e+08   25.0%
└ L1-icache-load-misses    1.71e+04   25.0%
┌ dTLB-load-misses         4.01e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.02e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               9.86e+09   50.0%  #  3.8 cycles per ns
┌ instructions             3.07e+10   50.0%  #  3.1 insns per cycle
│ branch-instructions      6.37e+08   50.0%  #  2.1% of insns
└ branch-misses            6.58e+06   50.0%  #  1.0% of branch insns
┌ task-clock               2.59e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.80e+01  100.0%
┌ L1-dcache-load-misses    6.80e+08   25.0%  #  5.5% of dcache loads
│ L1-dcache-loads          1.24e+10   25.0%
└ L1-icache-load-misses    1.13e+06   25.0%
┌ dTLB-load-misses         7.47e+03   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.24e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.12e+10   50.0%  #  3.8 cycles per ns
┌ instructions             3.18e+10   50.0%  #  2.8 insns per cycle
│ branch-instructions      8.62e+08   50.0%  #  2.7% of insns
└ branch-misses            9.41e+06   50.0%  #  1.1% of branch insns
┌ task-clock               2.93e+09  100.0%  #  2.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.26e+03  100.0%
┌ L1-dcache-load-misses    6.16e+08   25.0%  #  4.3% of dcache loads
│ L1-dcache-loads          1.45e+10   25.0%
└ L1-icache-load-misses    1.59e+07   25.0%
┌ dTLB-load-misses         2.51e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.45e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

fast_foo9 needs roughly twice as many instructions for the small b, but its performance is dominated by memory bandwidth, so the extra instructions hardly matter. Regular foo9, meanwhile, requires 10-20x as many instructions for some reason.
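One way to see why the small-`b` case can win despite running more instructions: with a length-1 `b`, the kernel streams eight full arrays from memory instead of nine, and the single repeated value can be hoisted into a register. A hand-written kernel illustrating what the dynamic path effectively achieves (my own sketch, not FastBroadcast internals):

```julia
# Illustrative hand-written kernel: when `b` has length 1, hoist its one
# value out of the loop, turning nine streamed array reads into eight.
function foo9_hoisted!(a, b, c, d, e, f, g, h, i)
    b1 = b[1]  # read once, reused for every element
    @inbounds @simd for I in eachindex(a, c, d, e, f, g, h, i)
        a[I] = b1 + 0.1 * (0.2c[I] + 0.3d[I] + 0.4e[I] +
                           0.5f[I] + 0.6g[I] + 0.6h[I] + 0.6i[I])
    end
    nothing
end
```

Since the benchmark is bandwidth-bound, removing one of the nine memory streams is worth more than the extra integer bookkeeping the dynamic branch performs.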

@YingboMa
Owner

Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

@chriselrod
Collaborator

> Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

Probably better to update the README instead, as the README claims FastBroadcast is slower than base broadcasting for dynamic broadcasts.
