
Speedup _observed_ with dynamic broadcasting #40

Open
navidcy opened this issue Nov 20, 2022 · 5 comments

Comments

@navidcy

navidcy commented Nov 20, 2022

Despite the README's claim that FastBroadcast is slower than base broadcasting when the broadcast is dynamic, I actually get:

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  45.125 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  18.375 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
@chriselrod
Collaborator

chriselrod commented Nov 20, 2022

What's the problem? Your fast_foo9 is over 2x faster.

EDIT: oh, even when broadcasting b. Huh.

@chriselrod
Collaborator

chriselrod commented Nov 20, 2022

FWIW, I got

julia> using FastBroadcast

julia> function fast_foo9(a, b, c, d, e, f, g, h, i)
           @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
fast_foo9 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(100, 100, 2) for i in 1:9];

julia> using BenchmarkTools

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  38.674 μs (0 allocations: 0 bytes)

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  83.503 μs (0 allocations: 0 bytes)

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  85.732 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  30.452 μs (0 allocations: 0 bytes)

So I can reproduce.
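For readers following along: the surprise is that FastBroadcast's advantage is usually attributed to its static fast path, where every argument has full-size axes. When extents differ (as with `b = [1.0]`), `@..` falls back to a dynamic broadcasting branch, which the README says should be slower than base broadcasting. A minimal sketch of that two-path idea (illustrative names and structure only, not FastBroadcast's actual internals):

```julia
# Hypothetical sketch of a FastBroadcast-style two-path strategy.
# This is illustrative, not the package's real implementation.
function twopath_add!(a, b, c)
    if axes(a) == axes(b) == axes(c)
        # Fast path: all extents match, so plain linear indexing suffices.
        @inbounds @simd for I in eachindex(a, b, c)
            a[I] = b[I] + c[I]
        end
    else
        # Dynamic path: fall back to Base broadcasting, which stretches
        # length-1 dimensions to match the destination.
        broadcast!(+, a, b, c)
    end
    return a
end
```

Calling this with a length-1 `b` takes the dynamic branch; what this issue shows is that even that branch can beat `@.`.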

@chriselrod
Collaborator

Comparing 30k evaluations, where b is fullsize and bs is the small version:
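(`@pstats` comes from LinuxPerf.jl; `foreachf` and `bs` are not defined anywhere in the thread. A plausible reconstruction of that setup, guessed rather than copied from the actual session:)

```julia
# Guess at the undefined helpers: `foreachf` presumably calls `f` N times,
# so @pstats samples many evaluations; `bs` is presumably the length-1
# array from the earlier comment. Not the exact code from this thread.
function foreachf(f::F, N::Int, args::Vararg{Any,A}) where {F,A}
    for _ in 1:N
        f(args...)
    end
    return nothing
end

bs = [1.0];
```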

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.60e+09   49.9%  #  3.6 cycles per ns
┌ instructions             2.96e+09   50.0%  #  0.8 insns per cycle
│ branch-instructions      2.27e+08   50.0%  #  7.7% of insns
└ branch-misses            1.85e+06   50.0%  #  0.8% of branch insns
┌ task-clock               1.01e+09  100.0%  #  1.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┌ L1-dcache-load-misses    6.09e+08   25.0%  # 48.6% of dcache loads
│ L1-dcache-loads          1.25e+09   25.0%
└ L1-icache-load-misses    8.45e+06   25.0%
┌ dTLB-load-misses         1.23e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.25e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.90e+09   49.9%  #  3.6 cycles per ns
┌ instructions             1.43e+09   50.0%  #  0.4 insns per cycle
│ branch-instructions      7.52e+07   50.0%  #  5.3% of insns
└ branch-misses            3.01e+04   50.0%  #  0.0% of branch insns
┌ task-clock               1.09e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.76e+08   25.0%  # 112.4% of dcache loads
│ L1-dcache-loads          6.02e+08   25.0%
└ L1-icache-load-misses    1.71e+04   25.0%
┌ dTLB-load-misses         4.01e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.02e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               9.86e+09   50.0%  #  3.8 cycles per ns
┌ instructions             3.07e+10   50.0%  #  3.1 insns per cycle
│ branch-instructions      6.37e+08   50.0%  #  2.1% of insns
└ branch-misses            6.58e+06   50.0%  #  1.0% of branch insns
┌ task-clock               2.59e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.80e+01  100.0%
┌ L1-dcache-load-misses    6.80e+08   25.0%  #  5.5% of dcache loads
│ L1-dcache-loads          1.24e+10   25.0%
└ L1-icache-load-misses    1.13e+06   25.0%
┌ dTLB-load-misses         7.47e+03   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.24e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.12e+10   50.0%  #  3.8 cycles per ns
┌ instructions             3.18e+10   50.0%  #  2.8 insns per cycle
│ branch-instructions      8.62e+08   50.0%  #  2.7% of insns
└ branch-misses            9.41e+06   50.0%  #  1.1% of branch insns
┌ task-clock               2.93e+09  100.0%  #  2.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.26e+03  100.0%
┌ L1-dcache-load-misses    6.16e+08   25.0%  #  4.3% of dcache loads
│ L1-dcache-loads          1.45e+10   25.0%
└ L1-icache-load-misses    1.59e+07   25.0%
┌ dTLB-load-misses         2.51e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.45e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

fast_foo9 needs roughly twice as many instructions for the small b, but its performance is dominated by memory bandwidth, so the extra instructions hardly matter. Regular foo9, meanwhile, requires 10-20x as many instructions for some reason.
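One way to see why the small-`b` case can win despite running more instructions: with a length-1 `b`, the kernel streams eight full arrays from memory instead of nine, and the single repeated value can be hoisted into a register. A hand-written kernel illustrating what the dynamic path effectively achieves (my own sketch, not FastBroadcast internals):

```julia
# Illustrative hand-written kernel: when `b` has length 1, hoist its one
# value out of the loop, turning nine streamed array reads into eight.
function foo9_hoisted!(a, b, c, d, e, f, g, h, i)
    b1 = b[1]  # read once, reused for every element
    @inbounds @simd for I in eachindex(a, c, d, e, f, g, h, i)
        a[I] = b1 + 0.1 * (0.2c[I] + 0.3d[I] + 0.4e[I] +
                           0.5f[I] + 0.6g[I] + 0.6h[I] + 0.6i[I])
    end
    nothing
end
```

Since the benchmark is bandwidth-bound, removing one of the nine memory streams is worth more than the extra integer bookkeeping the dynamic branch performs.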

@YingboMa
Owner

Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

@chriselrod
Collaborator

> Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

Probably better to update the README instead, as the README claims FastBroadcast is slower than base broadcasting for dynamic broadcasts.
