This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: update to use test_gradients macro #161
Merged
avik-pal force-pushed the ap/up_test branch from 034ef47 to e92209e on September 18, 2024 03:58
avik-pal force-pushed the ap/up_test branch 2 times, most recently from 93bde63 to 2c77ccb on September 18, 2024 04:01
LuxLib Benchmarks
Benchmark suite | Current: 749aa81 | Previous: 0df09fa | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5791 ns | 6938 ns | 0.83 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5959 ns | 7438 ns | 0.80 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8459 ns | 7541 ns | 1.12 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5688 ns | 5750 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 118315 ns | 133931 ns | 0.88 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2608336 ns | 2868757 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 3336084 ns | 741167 ns | 4.50 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 410709 ns | 407074 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10083.5 ns | 9916.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9792 ns | 9625 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9958 ns | 9937.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 9916.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 536714 ns | 536526 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17930214 ns | 17845684 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2513792 ns | 2422500 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 11630401 ns | 678976 ns | 17.13 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1458.5 ns | 1583 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1375 ns | 3145.5 ns | 0.44 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1666 ns | 2812.5 ns | 0.59 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1667 ns | 1541.5 ns | 1.08 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 20966 ns | 21370 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1232007 ns | 1416739 ns | 0.87 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 206854 ns | 237500 ns | 0.87 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 29122 ns | 29161 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4083 ns | 4166 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4041 ns | 4291 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4271 ns | 4417 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3542 ns | 4104 ns | 0.86 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 141025 ns | 143094 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8593659 ns | 9766798.5 ns | 0.88 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1574083 ns | 1569250 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 145156 ns | 144301 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58083 ns | 58000 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39792 ns | 46834 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39750 ns | 46584 ns | 0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83625 ns | 82333 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36108.5 ns | 36625 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 809203.5 ns | 686115 ns | 1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1029417 ns | 1069291 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77888 ns | 78821 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2041709 ns | 2031375 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2088479 ns | 2084708 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2081875 ns | 2090291 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2003583.5 ns | 1985542 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 221640 ns | 225038 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8023876 ns | 8235886 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5473375 ns | 5106125 ns | 1.07 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1554303 ns | 987279 ns | 1.57 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 147042 ns | 174500 ns | 0.84 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148792 ns | 162104.5 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 159833 ns | 165229 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 157979 ns | 145875 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165109 ns | 165145 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6993410 ns | 8411274 ns | 0.83 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1577458 ns | 1520666 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 201107.5 ns | 209957 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1116791 ns | 1119979 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1123500 ns | 1112166.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1115937.5 ns | 1117709 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1121458.5 ns | 1107125 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 682529 ns | 687949 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25183376 ns | 35372606 ns | 0.71 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6340229 ns | 6112291 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1027084.5 ns | 1024164.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5333 ns | 4625.5 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4291.5 ns | 5104 ns | 0.84 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4937.5 ns | 5583 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5042 ns | 1.14 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 90041.5 ns | 92273 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5138284.5 ns | 5823843 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 440125 ns | 499583.5 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 68083 ns | 67701 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9000 ns | 9000 ns | 1 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 8500 ns | 1 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8875 ns | 9187.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8209 ns | 8417 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 586953.5 ns | 600949 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 32463791.5 ns | 36561430 ns | 0.89 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5763833 ns | 5960250 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 384805 ns | 389274 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19312.5 ns | 19625 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17292 ns | 17791 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19542 ns | 20291 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18812.5 ns | 16645.5 ns | 1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 65813 ns | 65239 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2762320.5 ns | 3323140 ns | 0.83 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1267542 ns | 1293104 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76953 ns | 73656 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214041 ns | 220959 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212083 ns | 212333 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 213271 ns | 212541 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 239625 ns | 212000 ns | 1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 346222 ns | 347340 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 10319532 ns | 13974103 ns | 0.74 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5686708 ns | 5755333 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 470837 ns | 462604 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 750 ns | 666 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 833.5 ns | 0.75 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 ns | 875 ns | 0.91 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 666.5 ns | 584 ns | 1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20174 ns | 20357 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1159971 ns | 1288251 ns | 0.90 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 289709 ns | 292667 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 31351 ns | 31491 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1625 ns | 1416.5 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 ns | 1416 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1500 ns | 1625 ns | 0.92 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1416 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 122010.5 ns | 123399.5 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9092998 ns | 9450809 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1444834 ns | 1493229 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 133079.5 ns | 135231 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7458 ns | 7500 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 6042 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 6000 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10333 ns | 10125 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23165 ns | 23818 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1198319.5 ns | 1331154.5 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 675854 ns | 628937.5 ns | 1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47562 ns | 46911 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220167 ns | 219750 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 235791 ns | 265167 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228354 ns | 264416 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 258625 ns | 249854 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 188279 ns | 189311.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29478229 ns | 33158982 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8824000 ns | 9299979.5 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 644673 ns | 643876 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23288.5 ns | 23427 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1877196 ns | 2124740.5 ns | 0.88 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 220187.5 ns | 222770.5 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 46182 ns | 46290 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16833 ns | 16833 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16500 ns | 16792 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17292 ns | 16750 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16833 ns | 16792 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 190682 ns | 191493 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 11595194 ns | 11757211 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 951312.5 ns | 955313 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 174872 ns | 171341.5 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511625 ns | 511167 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 332125 ns | 405458 ns | 0.82 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 332333.5 ns | 405000 ns | 0.82 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 864542 ns | 858250 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113037 ns | 113156 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 390383 ns | 448835 ns | 0.87 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 495854 ns | 471209 ns | 1.05 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 241899 ns | 240532 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2269062.5 ns | 2268250 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1751687.5 ns | 2031416 ns | 0.86 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1758583.5 ns | 2030917 ns | 0.87 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3193062.5 ns | 3275750 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 237180 ns | 236871 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11204627.5 ns | 10359638.5 ns | 1.08 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1874125 ns | 1993250 ns | 0.94 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 743275.5 ns | 739142 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6542 ns | 6583 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6458 ns | 6875 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8292 ns | 7709 ns | 1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6750 ns | 6292 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 89032 ns | 90224.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5316353 ns | 5882879 ns | 0.90 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 774000 ns | 771000 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65767 ns | 65250 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11750 ns | 12333.5 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10459 ns | 11375 ns | 0.92 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11375 ns | 11312.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11708 ns | 11833.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 626029 ns | 622443 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39909481 ns | 41746922 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5517500 ns | 5637750 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 411686 ns | 407854 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 22999 ns | 22944 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2336392 ns | 2423476.5 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 220167 ns | 326750 ns | 0.67 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47349 ns | 48960 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2084 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2083 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 224074 ns | 217144 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11718228.5 ns | 12060454 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 1959542 ns | 1960083 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 174425 ns | 180236.5 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9209 ns | 8625 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8562.5 ns | 9646 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10209 ns | 11229 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8459 ns | 8792 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 100989 ns | 103267 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3052989 ns | 3427494 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 873833 ns | 875083 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73188 ns | 73431 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17708 ns | 17834 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17416.5 ns | 17916 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18583 ns | 17333 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17438 ns | 18000 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 592788 ns | 586862 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 12675498 ns | 17435012.5 ns | 0.73 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5159749.5 ns | 5223458 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 381661 ns | 377954 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 458 ns | 541 ns | 0.85 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34522 ns | 34849.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1143136.5 ns | 1279718 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 386000 ns | 435291 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 46209 ns | 45841 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9520.5 ns | 8979.5 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 9250 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 8917 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9666.5 ns | 8146 ns | 1.19 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 247762 ns | 260579 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 17492397 ns | 19733483 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5085208 ns | 4985875 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 370118 ns | 366004 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398834 ns | 398667 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215167 ns | 287958 ns | 0.75 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 215291 ns | 287750 ns | 0.75 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756208 ns | 756458 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111119 ns | 111261.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 324528 ns | 376549 ns | 0.86 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 483083 ns | 367583.5 ns | 1.31 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 76409 ns | 74430 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1402459 ns | 1400375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 858958 ns | 1135375 ns | 0.76 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 859209 ns | 1132354 ns | 0.76 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2358375 ns | 2440958 ns | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 203517 ns | 203910 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 9265632 ns | 9225527 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1557542 ns | 1662875 ns | 0.94 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322265 ns | 321818 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7375.5 ns | 7604.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7333.5 ns | 8083 ns | 0.91 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8479.5 ns | 8729 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7187.5 ns | 7437.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 135285 ns | 142785 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5475873.5 ns | 6299176.5 ns | 0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 448604 ns | 521292 ns | 0.86 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 66419 ns | 65420 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15792 ns | 12583 ns | 1.26 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14146 ns | 12437.5 ns | 1.14 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15042 ns | 14521 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14937.5 ns | 14979.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 890292 ns | 943733.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41273046 ns | 47612069 ns | 0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5852584 ns | 5885062.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 432034.5 ns | 417444 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 29854.5 ns | 30395.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26354 ns | 29604 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30125 ns | 27709 ns | 1.09 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24958.5 ns | 25083.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 190182.5 ns | 195905 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7912508 ns | 8216412 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 986625.5 ns | 990125 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116199 ns | 116401 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 152875 ns | 154583.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 146458 ns | 155500 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 147854.5 ns | 114042 ns | 1.30 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103959 ns | 113187.5 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1006113 ns | 1061855 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42468130 ns | 46328998 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5912542 ns | 5883041 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 588675 ns | 586901 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77000 ns | 74459 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76458 ns | 75833 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77042 ns | 78208 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 79458 ns | 75958 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 199450 ns | 203068 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7017337 ns | 7813436 ns | 0.90 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 532167 ns | 533437.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 130879 ns | 127391 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 303375 ns | 298166 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 319563 ns | 303208 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 288062.5 ns | 306041.5 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 312021 ns | 295666 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1064896 ns | 1104226 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 40177222.5 ns | 44772773.5 ns | 0.90 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6376583.5 ns | 6766000 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 697067.5 ns | 694176 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16750 ns | 17000 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17417 ns | 17292 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18625 ns | 18375 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 18292 ns | 16792 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 140061 ns | 145201.5 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5763073.5 ns | 6348029 ns | 0.91 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 711250 ns | 448000.5 ns | 1.59 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235139 ns | 231113 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27937.5 ns | 27208 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 28583 ns | 28625 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26958.5 ns | 27187.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26458.5 ns | 26145.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 923284.5 ns | 972527 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41401667 ns | 44334727.5 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5833417 ns | 5935916 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 694159 ns | 684627 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11146 ns | 11375 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11666 ns | 11625 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12084 ns | 14042 ns | 0.86 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10667 ns | 10416 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 117789.5 ns | 123261.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3540129 ns | 3725175 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 897145.5 ns | 904958 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 235970 ns | 233272 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21958 ns | 22000 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21666.5 ns | 21666 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22834 ns | 21542 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21625 ns | 21916 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 693239 ns | 697545 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21263436 ns | 22814286 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5472291 ns | 5479812.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 679571 ns | 668531 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 63209 ns | 67459 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 65084 ns | 63625 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68084 ns | 65084 ns | 1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 62541.5 ns | 62667 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104514.5 ns | 105558.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3308618 ns | 3699497 ns | 0.89 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1322854 ns | 1336625 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 235301 ns | 231652 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 450042 ns | 450250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 440750.5 ns | 451792 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 467375 ns | 446041.5 ns | 1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 436875 ns | 484250 ns | 0.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 506753 ns | 508079 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20425451.5 ns | 22280153.5 ns | 0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6092458 ns | 6164479 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 717923 ns | 712097 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7958.5 ns | 7667 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7834 ns | 8458 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9167 ns | 8041.5 ns | 1.14 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8020.5 ns | 7083.5 ns | 1.13 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 141656 ns | 142974 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5744993.5 ns | 5983895.5 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 737709 ns | 687104.5 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 65540 ns | 68961 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15542 ns | 14333 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15625 ns | 14312 ns | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14292 ns | 15021 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 12792 ns | 15250 ns | 0.84 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 926648 ns | 941966 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 38778177 ns | 40659493.5 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5540041 ns | 5744375 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 404172 ns | 395784 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6158208.5 ns | 6161520.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3222792 ns | 6378125.5 ns | 0.51 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 3226479 ns | 6377708.5 ns | 0.51 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11922958 ns | 11920959 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 349729 ns | 347985 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 326262 ns | 320268 ns | 1.02 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19145000 ns | 19132416 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11139458 ns | 20009458 ns | 0.56 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 11142750 ns | 19937708 ns | 0.56 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36483021 ns | 36464229.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1024792.5 ns | 1013485 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1155998.5 ns | 1165921 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 959 ns | 917 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1041 ns | 1000 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 917 ns | 1.09 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 958 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 22915 ns | 23221 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2061332 ns | 2197390 ns | 0.94 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 216166 ns | 332458.5 ns | 0.65 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 208941 ns | 205762 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3667 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3625 ns | 3667 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 274731 ns | 277792 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11037575.5 ns | 12494000 ns | 0.88 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2079354 ns | 2076312.5 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 625946 ns | 624236 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8021 ns | 8792 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8250 ns | 8875.5 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9958 ns | 9875 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8563 ns | 7625 ns | 1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 117470.5 ns | 119047.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3291500 ns | 3910252.5 ns | 0.84 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 806687.5 ns | 795416.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72880 ns | 65320 ns | 1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12083.5 ns | 11374.5 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11958 ns | 12208 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12583 ns | 11792 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12167 ns | 11979.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 628343.5 ns | 629697.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21030497 ns | 23515262 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5058833 ns | 5019875 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 362883 ns | 352263 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 250 ns | 1.33 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22172 ns | 22203 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2032725 ns | 2289294 ns | 0.89 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 222333 ns | 228916 ns | 0.97 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 49091 ns | 46161 ns | 1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3208 ns | 3084 ns | 1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2958 ns | 2959 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3500 ns | 2917 ns | 1.20 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3125 ns | 2875 ns | 1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 198272 ns | 200155 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9410897.5 ns | 9757264 ns | 0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1547188 ns | 1632083 ns | 0.95 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 166212 ns | 153411.5 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10416 ns | 11563 ns | 0.90 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11125.5 ns | 11334 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13333 ns | 12292 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11125 ns | 10854 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 119897 ns | 120519 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3353945 ns | 3640370.5 ns | 0.92 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 872167 ns | 897667 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 241608 ns | 232282 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21834 ns | 20750 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20708.5 ns | 21083 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21292 ns | 21959 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21021 ns | 21458.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 585466 ns | 590202 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19722619 ns | 22574638 ns | 0.87 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4787292 ns | 4746958.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 654628 ns | 639216 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4417 ns | 4375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 23791 ns | 23877 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2193197 ns | 2442376 ns | 0.90 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 223375 ns | 225708 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 49361 ns | 46800 ns | 1.05 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16250 ns | 16291 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16250 ns | 16625 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16459 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16083 ns | 16500 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 326259.5 ns | 326023.5 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12072256 ns | 13171553 ns | 0.92 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1574083.5 ns | 1188229 ns | 1.32 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 212818 ns | 205042 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2083 ns | 2042 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2167 ns | 2083 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2167 ns | 2083 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2041 ns | 2084 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35274 ns | 35572 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1199050 ns | 1338351 ns | 0.90 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 367834 ns | 435459 ns | 0.84 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 204943 ns | 202812 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17437.5 ns | 16520.5 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 18375.5 ns | 17104.5 ns | 1.07 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16541 ns | 18375 ns | 0.90 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 17167 ns | 18770.5 ns | 0.91 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 287962 ns | 291395 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21039131 ns | 23003699 ns | 0.91 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5565375 ns | 5678333 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 690990 ns | 682086 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59500 ns | 58979 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 61125 ns | 67125 ns | 0.91 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 62834 ns | 66917 ns | 0.94 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51458 ns | 51625 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66712 ns | 66452 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 115812 ns | 114721 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 161312.5 ns | 162292 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 122750.5 ns | 147229 ns | 0.83 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 131083 ns | 130229 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 226583 ns | 296770.5 ns | 0.76 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 214529.5 ns | 213701 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 628329 ns | 607926 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 110166.5 ns | 84250 ns | 1.31 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 117500 ns | 124729 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85750 ns | 85875 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 106854.5 ns | 123833 ns | 0.86 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193539 ns | 193440 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5446551 ns | 7291287 ns | 0.75 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2001021 ns | 1831167 ns | 1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 219713 ns | 203522 ns | 1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1914063 ns | 1928271 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1889833 ns | 1891125 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1914396 ns | 1902250 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1926312.5 ns | 1914749.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 525988 ns | 525346 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25282704.5 ns | 26967967.5 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8964750 ns | 9298209 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1069581.5 ns | 927389 ns | 1.15 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21663 ns | 21417 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2103825.5 ns | 2392141 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 340417 ns | 342188 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 43831 ns | 42200 ns | 1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1833 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1791 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 249420 ns | 249016 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9776488 ns | 10390055 ns | 0.94 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1520500 ns | 1093187.5 ns | 1.39 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 185088 ns | 179602 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8792 ns | 9667 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8750 ns | 10125 ns | 0.86 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11041 ns | 10249.5 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8875 ns | 9375 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 116662 ns | 118409 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3389039 ns | 3710566 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 865583.5 ns | 886083.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 234874 ns | 231452 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9833 ns | 9209 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9083 ns | 10000 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9666 ns | 9770.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 9500 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 513048 ns | 517575.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20325597 ns | 21956361 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 3946334 ns | 4314937.5 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 633311 ns | 624606 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59167 ns | 58209 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39917 ns | 46542 ns | 0.86 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39667 ns | 46750 ns | 0.85 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83291 ns | 83000 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39225 ns | 39682 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1317436 ns | 1450337.5 ns | 0.91 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1124396 ns | 1115958 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77022 ns | 74661 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1828958.5 ns | 1939500 ns | 0.94 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1972875 ns | 1983125 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1973979 ns | 1951312.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1902583.5 ns | 1897667 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 216838.5 ns | 216819.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33688603 ns | 37812796.5 ns | 0.89 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10618958 ns | 10968478.5 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1176400 ns | 1185212 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 418542 ns | 417625 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 418666 ns | 419834 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 420437.5 ns | 420958 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 420375 ns | 417208 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 205056 ns | 204963.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7972049 ns | 8983027 ns | 0.89 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 530229 ns | 546875 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 284905 ns | 280603 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 669688 ns | 669791.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 734562 ns | 780667 ns | 0.94 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 681104 ns | 689645.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 770604 ns | 725292 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1036899 ns | 1038703 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 47133005 ns | 49679972 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6149083 ns | 6487209 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 916011 ns | 909389 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3452417 ns | 3413542 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3455583 ns | 3417875 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3425042 ns | 3420479 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3460771 ns | 3414187 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169776 ns | 168543 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 9104393 ns | 8597060 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1379604.5 ns | 1366458.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 450698 ns | 434404 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6210334 ns | 6191104 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6233042 ns | 6232645.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6231833 ns | 6213854 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6206916.5 ns | 6216250 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 988034.5 ns | 979877 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50671907 ns | 50928344 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7136416 ns | 7557875 ns | 0.94 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1711422 ns | 1538944 ns | 1.11 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 474458 ns | 471584 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 253375 ns | 341687.5 ns | 0.74 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 253417 ns | 340375 ns | 0.74 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 903167 ns | 902500 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46475 ns | 46568 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 386217.5 ns | 450349 ns | 0.86 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 443708 ns | 504562.5 ns | 0.88 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 242304 ns | 241952 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2281792 ns | 2276916 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1759521.5 ns | 2038666 ns | 0.86 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1761416.5 ns | 2034583 ns | 0.87 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3194417 ns | 3280958 ns | 0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 267186.5 ns | 253153 ns | 1.06 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 16005829 ns | 14086050 ns | 1.14 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2190667 ns | 2208291.5 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 769044 ns | 765407 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58542 ns | 57959 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39583 ns | 46250 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39667 ns | 46250 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83583 ns | 82792 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27928.5 ns | 28134 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1374875 ns | 1575508 ns | 0.87 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1148500 ns | 1135958 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73812 ns | 73405.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2038270.5 ns | 1962520.5 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2092792 ns | 2093312.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2093666 ns | 2086834 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2001542 ns | 2000458.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231752 ns | 229351 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35917944.5 ns | 38934321 ns | 0.92 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11278917 ns | 11662250 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1043680.5 ns | 1196771 ns | 0.87 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58333 ns | 58208 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 40042 ns | 46812.5 ns | 0.86 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39458 ns | 46708 ns | 0.84 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83541 ns | 82375 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 49529 ns | 49491 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 836997 ns | 947062 ns | 0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1076416 ns | 1068833 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77896.5 ns | 77751 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1887708.5 ns | 1937792 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1953250.5 ns | 1974209 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1969146 ns | 1960000 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896416 ns | 1899959 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 239764.5 ns | 235535 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 16726563.5 ns | 22349832.5 ns | 0.75 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9964895.5 ns | 9994166 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1048441 ns | 915999 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 333 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 292 ns | 1.43 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34451 ns | 34420 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1206965.5 ns | 1328125 ns | 0.91 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 397229.5 ns | 278292 ns | 1.43 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49261 ns | 45880 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6250 ns | 6541 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6708 ns | 6917 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7250 ns | 6584 ns | 1.10 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 6458 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 210250 ns | 209753 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20421243.5 ns | 22541718 ns | 0.91 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5307333.5 ns | 4971437.5 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 373842.5 ns | 368183 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32507 ns | 31457 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1221670 ns | 1340759 ns | 0.91 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 253708 ns | 258291 ns | 0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40970 ns | 36451 ns | 1.12 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2666 ns | 3458 ns | 0.77 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3042 ns | 3292 ns | 0.92 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3125 ns | 2917 ns | 1.07 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3584 ns | 2917 ns | 1.23 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 188962 ns | 185714.5 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9532662 ns | 8803725 ns | 1.08 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1060333 ns | 950374.5 ns | 1.12 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 163034 ns | 150601 ns | 1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 425125 ns | 424603.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 456833 ns | 425000 ns | 1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 426083 ns | 430459 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 456458 ns | 443562.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 138588 ns | 136540 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6071449 ns | 6325011.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2788209 ns | 2056896 ns | 1.36 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 366947 ns | 365713 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3814167 ns | 3790417 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3815333.5 ns | 3803834 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3816834 ns | 3804250 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3803187.5 ns | 3813000 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 704291 ns | 699295 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33600498 ns | 34149887.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10285104 ns | 11037916.5 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1461255 ns | 1464794 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49948750 ns | 49877979 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 25988208 ns | 35522250 ns | 0.73 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 26007292 ns | 35535229 ns | 0.73 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97079833 ns | 96934583 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594546 ns | 1591242 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1044656.5 ns | 1047550 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154846562.5 ns | 154708541.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 88751208 ns | 112454083.5 ns | 0.79 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 89372584 ns | 112480333 ns | 0.79 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 295780062.5 ns | 296379229 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6552162 ns | 6494323.5 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5594625 ns | 5551012.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 18167 ns | 19062.5 ns | 0.95 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 15542 ns | 17833.5 ns | 0.87 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 13917 ns | 17041 ns | 0.82 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15125 ns | 15875 ns | 0.95 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20123 ns | 21028 ns | 0.96 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1097277.5 ns | 1230713 ns | 0.89 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 215458 ns | 219604.5 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26061 ns | 25950 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11209 ns | 10958 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7583 ns | 9041 ns | 0.84 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 8042 ns | 9041.5 ns | 0.89 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17083 ns | 17375 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 258757 ns | 257331 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9624727.5 ns | 10803925 ns | 0.89 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1541646 ns | 1552917 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 149533 ns | 147801 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8812.5 ns | 9354.5 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9541 ns | 10000 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10833 ns | 10458 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8270.5 ns | 7458 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 124166.5 ns | 114779 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3724156 ns | 3881100 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 739667 ns | 797833 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 235009.5 ns | 233502 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9146 ns | 9916 ns | 0.92 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9708.5 ns | 9708 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 9334 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9437.5 ns | 9709 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 615832.5 ns | 616669 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22440306.5 ns | 25342914 ns | 0.89 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5011583 ns | 4989750 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 656304 ns | 651926 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9146 ns | 10583 ns | 0.86 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9375 ns | 9146 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11083 ns | 10584 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9625 ns | 9875 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 118765 ns | 120200.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3425413 ns | 3758991 ns | 0.91 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 896229 ns | 905750 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73031.5 ns | 71611 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13083.5 ns | 13541 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13792 ns | 15500 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14083 ns | 15458.5 ns | 0.91 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833.5 ns | 18125 ns | 0.71 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 588412 ns | 585824 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19699495 ns | 21389400 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4785917 ns | 4649750 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 347898 ns | 343933 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34901 ns | 34550 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1157159 ns | 1371228 ns | 0.84 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 326875 ns | 447645.5 ns | 0.73 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 205325 ns | 203956.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7291 ns | 8270.5 ns | 0.88 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9145.5 ns | 8708 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7709 ns | 9167 ns | 0.84 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7333.5 ns | 10625 ns | 0.69 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 228635 ns | 231015.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21730930.5 ns | 24528595.5 ns | 0.89 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5008062.5 ns | 5171458.5 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 663024 ns | 654796 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 16333 ns | 16167 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 14167 ns | 15895.5 ns | 0.89 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13521 ns | 15979 ns | 0.85 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10833 ns | 11875 ns | 0.91 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 21021 ns | 21988 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1210830.5 ns | 1304948 ns | 0.93 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 206167 ns | 257646 ns | 0.80 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 187184 ns | 184412 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 31583 ns | 32084 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 31958 ns | 31875 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32209 ns | 32250 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 31792 ns | 31708 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 271653 ns | 271511.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 10857362.5 ns | 12350146 ns | 0.88 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1802125 ns | 1659167 ns | 1.09 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 592182.5 ns | 587425 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 469958 ns | 504958 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 507208 ns | 481520.5 ns | 1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 444250 ns | 443208 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 456708 ns | 488374.5 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194607 ns | 195092 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6126515 ns | 6520561 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1972959 ns | 1945520.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 377228 ns | 367668 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3835666 ns | 3839417 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3835645.5 ns | 3824437.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3840583 ns | 3828250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3839145.5 ns | 3827604.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 538632 ns | 535436 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29473777 ns | 32985580 ns | 0.89 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8953209 ns | 9639667 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1358598 ns | 1204966.5 ns | 1.13 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 782285709 ns | 781980875 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 418895375.5 ns | 543423875 ns | 0.77 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 417745166.5 ns | 542625875 ns | 0.77 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1560235646 ns | 1559677978.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22544059.5 ns | 22745322 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14748380 ns | 14786409 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2522304416 ns | 2528971583 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1512463959 ns | 2254450917 ns | 0.67 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1520064875 ns | 2476668541 ns | 0.61 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4768759708 ns | 6300456542 ns | 0.76 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 367694257 ns | 366701385 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 89045527 ns | 88751089 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 82937.5 ns | 75666 ns | 1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76292 ns | 79041.5 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 82396 ns | 79458.5 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77000 ns | 76208 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 205651.5 ns | 203948 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7768857 ns | 9083475 ns | 0.86 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 531708 ns | 526062.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 121769 ns | 106536 ns | 1.14 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 249792 ns | 270270.5 ns | 0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 195958.5 ns | 292875 ns | 0.67 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 257583 ns | 198312 ns | 1.30 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 287041 ns | 194667 ns | 1.47 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1042639.5 ns | 1034833 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43696222 ns | 46783284 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5997000 ns | 6115521 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 664536.5 ns | 633286 ns | 1.05 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199683166.5 ns | 199771000 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 103959542 ns | 138674666 ns | 0.75 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 104163542 ns | 138669167 ns | 0.75 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389356166 ns | 388512334 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842082 ns | 5812826 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3517132.5 ns | 3596784 ns | 0.98 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 621465854 ns | 621035604.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 353116166.5 ns | 439829542 ns | 0.80 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 355147917 ns | 440801667 ns | 0.81 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1181014666 ns | 1196350375 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26729826 ns | 26769444 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22245968 ns | 21887487 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7084 ns | 7291 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 6042 ns | 0.89 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5417 ns | 6125 ns | 0.88 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 9875 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27701 ns | 27497 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1191974.5 ns | 1432348 ns | 0.83 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 593375 ns | 374083 ns | 1.59 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48750 ns | 46690 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212229.5 ns | 216042 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221833 ns | 224375 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 249042 ns | 220687.5 ns | 1.13 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213750 ns | 208062.5 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 221585 ns | 218341 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 35257824.5 ns | 35298334.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9130229 ns | 9155708 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 531864 ns | 528325 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8458.5 ns | 9354.5 ns | 0.90 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8604 ns | 9396 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10791.5 ns | 9750 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9083.5 ns | 7958.5 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 115111.5 ns | 118295 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3617059 ns | 3790588 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 879333 ns | 873834 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 71345 ns | 69600 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7584 ns | 8562.5 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8333 ns | 9834 ns | 0.85 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8166 ns | 9500 ns | 0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7520.5 ns | 12312.5 ns | 0.61 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 517298 ns | 512184 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18428378 ns | 21450760 ns | 0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4662000 ns | 4433459 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 319680 ns | 315553 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 750 ns | 709 ns | 1.06 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 500 ns | 1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26044 ns | 26098 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1150773 ns | 1299422.5 ns | 0.89 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 462250 ns | 479708.5 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47150 ns | 46840 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9167 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13333 ns | 11416 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9562.5 ns | 11062.5 ns | 0.86 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8666 ns | 9416 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 251795.5 ns | 251250 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24087876 ns | 25886430.5 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5990833 ns | 5832146.5 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 391380.5 ns | 387883 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 105959 ns | 107916 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 85375 ns | 99250 ns | 0.86 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 87209 ns | 100645.5 ns | 0.87 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146625 ns | 146583 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 23920 ns | 24989 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1133082 ns | 1282751 ns | 0.88 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 261562.5 ns | 267229.5 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190780 ns | 189842 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 516833 ns | 514208 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 478000 ns | 478541.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 524458 ns | 478375 ns | 1.10 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 478250 ns | 482875 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 230382 ns | 229903 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11487991 ns | 12990087 ns | 0.88 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2229792 ns | 2133042 ns | 1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 609430 ns | 608146 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 6000 ns | 5666 ns | 1.06 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6584 ns | 7250 ns | 0.91 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7666.5 ns | 6291 ns | 1.22 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 6292 ns | 6625 ns | 0.95 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16007 ns | 16240.5 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 80481 ns | 79631 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11500 ns | 12417 ns | 0.93 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10417 ns | 11167 ns | 0.93 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10500 ns | 12041.5 ns | 0.87 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16750 ns | 16416.5 ns | 1.02 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 213463 ns | 211157 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 368790 ns | 375234 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39833 ns | 39750 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51208 ns | 52000 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52958 ns | 53021 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 16083 ns | 16042 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 19843 ns | 19539 ns | 1.02 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 86350 ns | 90780.5 ns | 0.95 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 37062.5 ns | 42917 ns | 0.86 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 29000 ns | 32167 ns | 0.90 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 32500 ns | 32875 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57041 ns | 57042 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 193192 ns | 190769.5 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 400156.5 ns | 392564 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1812.5 ns | 1833.5 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 2000 ns | 1875 ns | 1.07 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2083 ns | 1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1791 ns | 1792 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20303 ns | 20462 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1098861 ns | 1239481 ns | 0.89 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 306812.5 ns | 307042 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 33030 ns | 31870 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2209 ns | 2125 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2208 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2292 ns | 2291 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2312.5 ns | 2208 ns | 1.05 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 202420 ns | 201344.5 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9070892 ns | 10165131 ns | 0.89 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1521041.5 ns | 1570917 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 136810 ns | 136316.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 6520.5 ns | 0.82 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4792 ns | 5000 ns | 0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5583 ns | 5625 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4958 ns | 5500 ns | 0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 144381.5 ns | 143896 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5648025.5 ns | 6277095.5 ns | 0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 683167 ns | 750374.5 ns | 0.91 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69500 ns | 69261 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8645.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 8583.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 9291.5 ns | 0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8062.5 ns | 8020.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 866731 ns | 867420 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 42922265 ns | 42275328 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5538771 ns | 5663374.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 390451 ns | 387123 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56833 ns | 56875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56875 ns | 57833 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56875 ns | 57750 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58209 ns | 58375 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37311.5 ns | 36655 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1236176 ns | 1241845 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 556708 ns | 541750 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 202581 ns | 202922 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 492104.5 ns | 468937.5 ns | 1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 509208 ns | 477229.5 ns | 1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 509000 ns | 464541 ns | 1.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 438583 ns | 433625 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 265396 ns | 263574 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26630610.5 ns | 28829027 ns | 0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8076791.5 ns | 8162250 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 829974 ns | 827187.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3316104.5 ns | 3317521 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1768416 ns | 2329500 ns | 0.76 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1771459 ns | 2336167 ns | 0.76 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6317854.5 ns | 6302416.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204724 ns | 204892 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 215316.5 ns | 208562 ns | 1.03 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11544229 ns | 11517062.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6581916.5 ns | 8328812.5 ns | 0.79 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6594792 ns | 8342500 ns | 0.79 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21187791.5 ns | 21059354.5 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 745392 ns | 734814.5 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1063266 ns | 1048679.5 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5604.5 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4625 ns | 5875 ns | 0.79 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6333 ns | 6395.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6375 ns | 4750 ns | 1.34 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 137026.5 ns | 136624.5 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5889063 ns | 6038921 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 757125 ns | 813000 ns | 0.93 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56800 ns | 56330 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 9834 ns | 0.75 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 11375 ns | 0.76 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7709 ns | 10792 ns | 0.71 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 7083 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 750422.5 ns | 751768 ns | 1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 33915039 ns | 37322808 ns | 0.91 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5239145.5 ns | 5368750 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 370687.5 ns | 366754 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 122667 ns | 126417 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 101167 ns | 101833 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 98583 ns | 97167 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 99458 ns | 135458.5 ns | 0.73 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 152082 ns | 149617 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6135228 ns | 6377317.5 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2138604 ns | 2013729 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203902 ns | 203027 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1941125 ns | 1956250 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2014375.5 ns | 2025708 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2034875.5 ns | 2023583 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2033875.5 ns | 2023875 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 705006 ns | 699728 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31629524 ns | 32486459.5 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11052146 ns | 11144687.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1185578.5 ns | 1109856 ns | 1.07 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34000 ns | 33708 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 35250 ns | 36250 ns | 0.97 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 33250 ns | 35292 ns | 0.94 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 541.5 ns | 667 ns | 0.81 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15256 ns | 15147 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 79870 ns | 78750 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2584 ns | 3166 ns | 0.82 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3625 ns | 3292 ns | 1.10 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 3916 ns | 0.77 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2167 ns | 2125 ns | 1.02 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 139305.5 ns | 138043.5 ns | 1.01 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 342553 ns | 341483.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7333 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5416.5 ns | 6125 ns | 0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5417 ns | 5959 ns | 0.91 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10416 ns | 10083 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36720 ns | 36390.5 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1185072 ns | 1443013 ns | 0.82 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 402125 ns | 577687.5 ns | 0.70 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48010 ns | 48291 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 240167 ns | 217083 ns | 1.11 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222562.5 ns | 233729 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 251312 ns | 220875 ns | 1.14 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213500 ns | 206583 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 243524 ns | 241954 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26052385 ns | 28863209.5 ns | 0.90 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7843208 ns | 8063584 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 577185 ns | 578495 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3916 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22084 ns | 21377 ns | 1.03 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2144156 ns | 2296242.5 ns | 0.93 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 244500 ns | 246729.5 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42581 ns | 42010 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14625 ns | 14750 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14667 ns | 15000 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14666 ns | 14834 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14625 ns | 14937.5 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 310198 ns | 306378 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11470797 ns | 12904688 ns | 0.89 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1020291 ns | 1048854 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 191302 ns | 192742 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 128709 ns | 128750 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 101459 ns | 128042 ns | 0.79 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 105333 ns | 102500 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 100917 ns | 128458 ns | 0.79 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148106 ns | 133598 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6083211 ns | 6098969 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2535500 ns | 1992062.5 ns | 1.27 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204437 ns | 203872 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1778292 ns | 1913375 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1838417 ns | 1918875.5 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1926667 ns | 1920354.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1926625.5 ns | 1922729.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 690367 ns | 684636 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31757588.5 ns | 31678268 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10251583.5 ns | 10983583.5 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1085040 ns | 1217291 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18833 ns | 19708 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19500 ns | 18000 ns | 1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20750 ns | 21250 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18708 ns | 17541 ns | 1.07 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 109386.5 ns | 107089 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3594861 ns | 3668703 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1361959 ns | 1366125 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79071 ns | 79431 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229167 ns | 222250 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 223958 ns | 227291.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219667 ns | 221667 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225792 ns | 217604.5 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 517573.5 ns | 512942 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19977378 ns | 20906293 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6209166.5 ns | 6227916.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 477899.5 ns | 478915 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24937.5 ns | 24625 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 29792 ns | 32084 ns | 0.93 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 27500 ns | 29583.5 ns | 0.93 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1375 ns | 1354 ns | 1.02 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 15934 ns | 15775 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82261 ns | 87130 ns | 0.94 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4667 ns | 5208 ns | 0.90 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5375 ns | 4937.5 ns | 1.09 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5167 ns | 6250 ns | 0.83 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4792 ns | 4208 ns | 1.14 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 207098.5 ns | 205104 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 382274 ns | 375704 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 307083 ns | 305500 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 306624.5 ns | 305958 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 309875 ns | 308125 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 308125 ns | 304792 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 227232 ns | 224810.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7383776.5 ns | 8473173 ns | 0.87 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1119667 ns | 1064042 ns | 1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 274402.5 ns | 272523 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 535375 ns | 588083 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 535458 ns | 540979 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 542041 ns | 532187.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 532375 ns | 530000 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1072987 ns | 1066787 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42757773 ns | 49056863 ns | 0.87 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6023750 ns | 6401167 ns | 0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 857209 ns | 857918.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18875 ns | 20292 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19791.5 ns | 21021 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21666 ns | 21584 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20084 ns | 19459 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 114799 ns | 111914.5 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3514225 ns | 3915484 ns | 0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1441875 ns | 1445124.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78441 ns | 79161 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213833 ns | 259584 ns | 0.82 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213375 ns | 218709 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219417 ns | 213833 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217583 ns | 221709 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 759467 ns | 729277 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26177214.5 ns | 28351086 ns | 0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7223062.5 ns | 7519125 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 538626 ns | 535735 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6667 ns | 7542 ns | 0.88 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7125 ns | 6750 ns | 1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8208 ns | 7854 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7375 ns | 6416 ns | 1.15 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 139107 ns | 139596.5 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5697682 ns | 6386332.5 ns | 0.89 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 769875 ns | 812791.5 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 66301 ns | 64971 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9792 ns | 12937 ns | 0.76 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9604 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10958 ns | 10479 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8708 ns | 9042 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 824252 ns | 821389 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39780613.5 ns | 41212440 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5397083 ns | 5394125 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 390564 ns | 376673 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5125 ns | 5542 ns | 0.92 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5083.5 ns | 6041 ns | 0.84 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5979 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 4125 ns | 1.32 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 142825 ns | 143159 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5800418 ns | 6135330 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 796708.5 ns | 841208 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67781 ns | 66410 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 8125 ns | 0.91 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7625 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7709 ns | 7333 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7458 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 778801 ns | 779803.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 38427967 ns | 42489606.5 ns | 0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5585958 ns | 5806041.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 398294.5 ns | 385114 ns | 1.03 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14567167 ns | 14517875 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7736292 ns | 10107833 ns | 0.77 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7735645.5 ns | 10123375 ns | 0.76 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27854417 ns | 27737959 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 534183 ns | 529900 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 386044.5 ns | 392854 ns | 0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46615145.5 ns | 46502041.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26633209 ns | 33504375 ns | 0.79 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26517167 ns | 33527167 ns | 0.79 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85829958 ns | 85258875 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2633578.5 ns | 2630210 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3289341 ns | 3305402 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66875 ns | 68083 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67145.5 ns | 66021 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 67792 ns | 69042 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 66791 ns | 66875 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 119543.5 ns | 120187.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3541492 ns | 3913619.5 ns | 0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1456541.5 ns | 1439458.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 232457.5 ns | 224532 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 462666 ns | 502375 ns | 0.92 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 440917 ns | 452542 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 448666.5 ns | 441146 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442459 ns | 444833 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 727570 ns | 732944 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26439577 ns | 29462542.5 ns | 0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7463770.5 ns | 7794083 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 787560 ns | 779447 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 625 ns | 0.87 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 583 ns | 1.14 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 500 ns | 1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32374 ns | 33084 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1183808 ns | 1348590.5 ns | 0.88 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 293771 ns | 458416.5 ns | 0.64 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47460 ns | 47291 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 9209 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8687.5 ns | 9500 ns | 0.91 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10021 ns | 8666 ns | 1.16 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 9167 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 283836 ns | 289186 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21930590 ns | 24166950 ns | 0.91 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5114333 ns | 5210708.5 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 379495 ns | 381324 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9833 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9834 ns | 9834 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9833 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9833 ns | 9792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23230 ns | 23519 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2044307 ns | 2258808.5 ns | 0.91 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 219042 ns | 221041.5 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 209443 ns | 207272 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45792 ns | 45959 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45875 ns | 45959 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45958 ns | 46041 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45708 ns | 46375 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 289262 ns | 292709.5 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 8827470 ns | 13279604 ns | 0.66 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 1045542 ns | 963562.5 ns | 1.09 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 601338 ns | 601736 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56459 ns | 56834 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56459 ns | 57208 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 56458 ns | 57000 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58125 ns | 57791 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28542.5 ns | 28797 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1178901 ns | 1296667 ns | 0.91 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 584250 ns | 599375 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203033 ns | 214467.5 ns | 0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 487250.5 ns | 488583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 468833.5 ns | 506875 ns | 0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 512312 ns | 467854 ns | 1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 448625 ns | 444854 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 245048 ns | 247966 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32363620 ns | 35422277.5 ns | 0.91 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9380875 ns | 9625250 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 884842 ns | 889783 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 665833 ns | 662791 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 655458 ns | 645583 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 651562.5 ns | 641458 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 641604 ns | 654708 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 207953 ns | 204631.5 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8553419 ns | 9404311.5 ns | 0.91 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1360709 ns | 1366041 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 305729 ns | 307612.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2241666 ns | 2256146 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2263750 ns | 2230917 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2261250 ns | 2237292 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2258666 ns | 2235916 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 954468 ns | 983378 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49076465 ns | 51532010 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6739354 ns | 7223667 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1311013 ns | 1360743 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20250 ns | 21208 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19458 ns | 21895.5 ns | 0.89 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22146 ns | 24000 ns | 0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19875 ns | 18708 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111761.5 ns | 113606 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3759777 ns | 4029922 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1447333.5 ns | 1470375 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78961 ns | 81911 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 245042 ns | 263833 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220646 ns | 230917 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223250 ns | 221375 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219375 ns | 261833.5 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 721992 ns | 732293.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25713634 ns | 28666996 ns | 0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7609771 ns | 7932292 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 556328 ns | 557920 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 584 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 584 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 583 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22645 ns | 23564 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1249581.5 ns | 1402930.5 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 464250 ns | 479854.5 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47691 ns | 49551 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 10042 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9833 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9208 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 8625 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262905.5 ns | 271175.5 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24154524 ns | 27354439 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6051062 ns | 5706584 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 404516 ns | 399053 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10083 ns |
9709 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9000 ns |
9104 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11667 ns |
9437.5 ns |
1.24 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9292 ns |
8375 ns |
1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
117395 ns |
122324.5 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
3584331 ns |
3848922 ns |
0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
877416 ns |
890083 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
70681 ns |
69951 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7792 ns |
7417 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7750 ns |
7500 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7833 ns |
7625 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7833.5 ns |
7333 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
500398 ns |
514534 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
16602320 ns |
19594222 ns |
0.85 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
4523729 ns |
4165479 ns |
1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
323165 ns |
320028 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1666 ns |
1562.5 ns |
1.07 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1666 ns |
1708.5 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2209 ns |
1833.5 ns |
1.20 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1500 ns |
1333 ns |
1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
20417 ns |
21964 ns |
0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI |
1289220 ns |
1238732.5 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal |
303041 ns |
302542 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
189802 ns |
188582 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3375 ns |
3333 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3292 ns |
3458 ns |
0.95 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3625 ns |
3334 ns |
1.09 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3333 ns |
3250 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
216794.5 ns |
224397.5 ns |
0.97 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10346867 ns |
10897600 ns |
0.95 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal |
1614833 ns |
1688875 ns |
0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
579879 ns |
578505.5 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
149188 ns |
148875 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
106396 ns |
132708 ns |
0.80 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
107770.5 ns |
130750 ns |
0.82 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
233687 ns |
225250 ns |
1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
23438 ns |
24103 ns |
0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI |
1212311.5 ns |
1297180 ns |
0.93 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal |
293459 ns |
269833 ns |
1.09 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU |
39901 ns |
40231 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
156812.5 ns |
162604 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
114708.5 ns |
127166 ns |
0.90 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
108583.5 ns |
112750 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
254000 ns |
265229 ns |
0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
213612 ns |
219287 ns |
0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI |
10573257 ns |
11195277.5 ns |
0.94 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal |
2022750 ns |
1990375 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU |
267524 ns |
267987.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7042 ns |
7375 ns |
0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5334 ns |
5959 ns |
0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5333 ns |
6000 ns |
0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10458 ns |
10209 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
32622 ns |
33200 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1216978 ns |
1323539 ns |
0.92 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
349791 ns |
615604 ns |
0.57 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
50170 ns |
50040 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
255041.5 ns |
260750 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228146 ns |
234833 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
242396 ns |
265125 ns |
0.91 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
213145.5 ns |
221333 ns |
0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
257391 ns |
264591 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27184775 ns |
29454390 ns |
0.92 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8241166 ns |
8466083 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
589609 ns |
592630 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
15709 ns |
15750 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
15583 ns |
15667 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
16791 ns |
16167 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
16167 ns |
14541 ns |
1.11 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
137465 ns |
140225 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
5393762.5 ns |
6115964 ns |
0.88 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
765959 ns |
798333 ns |
0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
233163 ns |
232492 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24125 ns |
23708 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
23375 ns |
23479 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
24583 ns |
23562.5 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
23958 ns |
22667 ns |
1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
855768 ns |
872247 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
40529822 ns |
42738683 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5795375 ns |
5646770.5 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
680810 ns |
676987 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
10416 ns |
10041 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
9271 ns |
10187.5 ns |
0.91 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
11125 ns |
11666 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
9375 ns |
8792 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
121395 ns |
125357.5 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
3528807.5 ns |
3857738.5 ns |
0.91 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
857604.5 ns |
898625 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
73701 ns |
75221 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13917 ns |
14000 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
13875 ns |
13812.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14687.5 ns |
14062.5 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
13959 ns |
14292 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
655994 ns |
675390 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
21978368 ns |
23526980.5 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5304604 ns |
5359958.5 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
366965.5 ns |
365113 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10125 ns |
10292 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9750.5 ns |
9646 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11292 ns |
10958 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9250 ns |
8542 ns |
1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
120411 ns |
124246 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
3481885.5 ns |
3650341 ns |
0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
896958 ns |
890042 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
73011 ns |
72050 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12958 ns |
13084 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12417 ns |
12896 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13083 ns |
12542 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12687 ns |
12667 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
544787 ns |
557269 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19530675 ns |
20940364 ns |
0.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
4633292 ns |
4415208 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
344335 ns |
341913.5 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
27604.5 ns |
30438 ns |
0.91 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
33124.5 ns |
32771 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
30833 ns |
32145.5 ns |
0.96 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
1750 ns |
1875 ns |
0.93 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
15866 ns |
16382 ns |
0.97 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU |
81132 ns |
80651 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
5333.5 ns |
5375 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
5104 ns |
4937 ns |
1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
5188 ns |
5208 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
6375 ns |
6292 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
137988 ns |
141456.5 ns |
0.98 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU |
371616 ns |
382544 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
25253 ns |
26188 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
1186857 ns |
1349689 ns |
0.88 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
410583 ns |
455771 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
47471 ns |
48850 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6667 ns |
6583 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6625 ns |
6375 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6833 ns |
6250 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6167 ns |
6250 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
183528 ns |
190177 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
22718575.5 ns |
25715880 ns |
0.88 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5561041 ns |
5628084 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
395306.5 ns |
388664 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
2042 ns |
2042 ns |
1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
2042 ns |
2042 ns |
1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
2083 ns |
2125 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
2042 ns |
1958 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
26234 ns |
26944 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1191724.5 ns |
1363088 ns |
0.87 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
468125 ns |
471437.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
206183 ns |
205032 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16729.5 ns |
16958 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
16084 ns |
16250 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
16750 ns |
16749.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
16750 ns |
16250 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
269544.5 ns |
278717.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
22860307 ns |
26543319 ns |
0.86 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
6059500 ns |
6143666 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
707392 ns |
701356 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
176833 ns |
193791 ns |
0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
155875 ns |
174166.5 ns |
0.89 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
151750 ns |
151875 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
152375 ns |
161458 ns |
0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
198561.5 ns |
200117.5 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8216832 ns |
8677326 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1419958 ns |
1431250 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
221808.5 ns |
224822 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1327687.5 ns |
1332708 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1329500 ns |
1313042 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1324959 ns |
1321250 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1337687 ns |
1320542 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
893954 ns |
914262.5 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
45120426 ns |
52072722 ns |
0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6332833 ns |
6865145.5 ns |
0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1110068.5 ns |
1099471 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
24583.5 ns |
25270.5 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
25833 ns |
25750 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
27895.5 ns |
28167 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24792 ns |
24645.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
233633.5 ns |
236681 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7675411 ns |
8520645.5 ns |
0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1108375 ns |
960167 ns |
1.15 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
115432 ns |
114711 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
119062.5 ns |
128833.5 ns |
0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
130250 ns |
184437.5 ns |
0.71 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
146791 ns |
126541.5 ns |
1.16 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
175812.5 ns |
117313 ns |
1.50 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1046246.5 ns |
1084581 ns |
0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
46302046 ns |
48584064.5 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6122291 ns |
6244708 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
614765 ns |
609766 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
334 ns |
1.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
334 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
22330 ns |
23179.5 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1165828.5 ns |
1352649.5 ns |
0.86 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
383709 ns |
470375 ns |
0.82 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
47131 ns |
47251 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6979.5 ns |
6875 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6625 ns |
6667 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7042 ns |
6250 ns |
1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6770.5 ns |
6604 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
199505 ns |
206812 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25886485 ns |
26430531.5 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
6031292 ns |
5939666 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
397337 ns |
393154 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6542 ns |
6750 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6687.5 ns |
6416.5 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6792 ns |
7042 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6791 ns |
6750 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
143150 ns |
147041 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5705092 ns |
6204224 ns |
0.92 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
446083.5 ns |
711062.5 ns |
0.63 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
233934 ns |
232702 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10209 ns |
10250 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9959 ns |
9875 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10458.5 ns |
10250 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9791.5 ns |
9792 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
886014.5 ns |
908474 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
40692476 ns |
42229280 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5868792 ns |
6135833 ns |
0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
674436.5 ns |
665637 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
667 ns |
667 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
667 ns |
667 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
667 ns |
667 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
667 ns |
625 ns |
1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
22372 ns |
22806 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2141517 ns |
2183221 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal |
223812.5 ns |
228667 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
207744 ns |
206602 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4625 ns |
4625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4584 ns |
4666 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4792 ns |
4625 ns |
1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4625 ns |
4584 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
222546.5 ns |
229835 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10092882.5 ns |
10794904 ns |
0.93 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal |
1625125 ns |
1685770.5 ns |
0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
580171 ns |
577495 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8459 ns |
9042 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8292 ns |
9083.5 ns |
0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10208.5 ns |
9354 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7875 ns |
7834 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
119498 ns |
124219 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
3650353 ns |
3899985 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
798291 ns |
810375 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
74642 ns |
74040.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8500 ns |
9000 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8854 ns |
8291 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9500 ns |
8750 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8875 ns |
8375 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
581446.5 ns |
596441 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
21695216 ns |
23179785 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
4828146 ns |
4819896 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
350806 ns |
338953 ns |
1.03 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
126666.5 ns |
127000 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
95937.5 ns |
131000 ns |
0.73 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
96708.5 ns |
129584 ns |
0.75 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
183500 ns |
180958.5 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
45553 ns |
46329 ns |
0.98 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU |
101962 ns |
104561 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
336291 ns |
341167 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
178667 ns |
333583 ns |
0.54 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
193458 ns |
325333 ns |
0.59 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
618041.5 ns |
588354 ns |
1.05 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
190160.5 ns |
194256.5 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU |
503774 ns |
512055 ns |
0.98 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
399083 ns |
399208 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
215250 ns |
288166.5 ns |
0.75 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
215125 ns |
287875 ns |
0.75 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
757209 ns |
755750 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
43473 ns |
43515 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI |
1429246 ns |
1420150 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal |
416375 ns |
420292 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU |
80291 ns |
81701 ns |
0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1413062.5 ns |
1396437 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
862000 ns |
1134500 ns |
0.76 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
862479.5 ns |
1133416.5 ns |
0.76 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2361500 ns |
2443791.5 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
244571 ns |
250930 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI |
10826132 ns |
12447603 ns |
0.87 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal |
1747854.5 ns |
1797500 ns |
0.97 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
351066 ns |
352383.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
675417 ns |
658917 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
661000 ns |
647083.5 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
650312.5 ns |
625729 ns |
1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
662770.5 ns |
629562.5 ns |
1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
191607 ns |
202467 ns |
0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8072960 ns |
9193261 ns |
0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1387000 ns |
1344749.5 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
303135 ns |
311273 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2467812.5 ns |
2486625 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2472417 ns |
2447229 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2475208 ns |
2446229 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2490750 ns |
2455167 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
976394 ns |
999287 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
53431680.5 ns |
61254580 ns |
0.87 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7206917 ns |
10164208 ns |
0.71 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1432035 ns |
1302412 ns |
1.10 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
32812.5 ns |
33437.5 ns |
0.98 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
34708.5 ns |
35145.5 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
32791.5 ns |
33896 ns |
0.97 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
792 ns |
875 ns |
0.91 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15302 ns |
15909 ns |
0.96 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU |
79522 ns |
84991 ns |
0.94 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
3062.5 ns |
3250 ns |
0.94 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
3416 ns |
3083.5 ns |
1.11 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
3583 ns |
3333 ns |
1.08 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
3166 ns |
3041 ns |
1.04 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
136303 ns |
139820.5 ns |
0.97 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU |
339621 ns |
335653 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
406875 ns |
409291 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
402791 ns |
408167 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
401958 ns |
408916 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
420875 ns |
420042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
42510 ns |
43861 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1479206.5 ns |
1610692 ns |
0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1158541.5 ns |
1146937.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
239864 ns |
241802 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3851125 ns |
3890500 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3994000 ns |
3991792 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3993875 ns |
3995938 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3792771 ns |
3777541.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
237916 ns |
245384 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
36230288 ns |
40053105 ns |
0.90 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11479542 ns |
11890208 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1431880.5 ns |
1427303 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3958 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3875 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33442 ns |
33956 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1218517.5 ns |
1415999 ns |
0.86 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
252896 ns |
180646 ns |
1.40 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
39781 ns |
39530 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15417 ns |
15583 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15417 ns |
15708 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15667 ns |
15708 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15375 ns |
15625 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
251521 ns |
256980 ns |
0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
10228549 ns |
9741901 ns |
1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
874875 ns |
867771 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
170133 ns |
177356.5 ns |
0.96 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
405000 ns |
403959 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
221250 ns |
295875 ns |
0.75 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
220875 ns |
295292 ns |
0.75 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
761000 ns |
760750 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
112902 ns |
113403.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1067286 ns |
1056307 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
489792 ns |
458041 ns |
1.07 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
89352 ns |
89041 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1435479.5 ns |
1445458 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
886666.5 ns |
1158000 ns |
0.77 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
884042 ns |
1156604 ns |
0.76 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2386041 ns |
2464729.5 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
234432 ns |
241604 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10695349 ns |
12919628 ns |
0.83 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1931229 ns |
1936541.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
354056 ns |
353843 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
583 ns |
583 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
584 ns |
584 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
583 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
584 ns |
459 ns |
1.27 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
25409 ns |
26174 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1175972 ns |
1343237.5 ns |
0.88 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
340875 ns |
430334 ns |
0.79 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
210769 ns |
209062 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7875 ns |
7875 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7562 ns |
7708 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8083 ns |
7625 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7708 ns |
7250 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
205553.5 ns |
214822.5 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
29432470.5 ns |
28436000 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6069083 ns |
5825750 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
697433 ns |
684816 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
833666 ns |
836604 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
468875 ns |
618875 ns |
0.76 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
472250 ns |
620167 ns |
0.76 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1541729 ns |
1552792 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129094 ns |
130046 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
232694 ns |
229912 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2700166.5 ns |
2694187.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1538208 ns |
2000104.5 ns |
0.77 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1535458 ns |
1999042 ns |
0.77 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4928542 ns |
4936792 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
259306 ns |
251857 ns |
1.03 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
841325 ns |
837543 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
375 ns |
0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
334 ns |
1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31418.5 ns |
32688 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1174154.5 ns |
1331487 ns |
0.88 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
282250 ns |
447625 ns |
0.63 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
47321 ns |
46711 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6417 ns |
6666 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6291 ns |
6458 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6667 ns |
6208 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6709 ns |
6417 ns |
1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
219381 ns |
232857 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
23531704 ns |
24854567 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4812750 ns |
5311167 ns |
0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
367896 ns |
359813.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2435625 ns |
2405750 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2436833 ns |
2416666 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2396542 ns |
2377375 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2408167 ns |
2392666 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
196586 ns |
201638 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8033275 ns |
8402298 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1439500 ns |
1416500 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
378696 ns |
372683.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4650084 ns |
4654167 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4658666 ns |
4665479 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4665604 ns |
4644229.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4658000 ns |
4648583 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
886973.5 ns |
902404.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
46758472 ns |
52065462 ns |
0.90 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6718062.5 ns |
6861875 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1388836 ns |
1391004 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7145.5 ns |
6708.5 ns |
1.07 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
16833 ns |
7208 ns |
2.34 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7208 ns |
7645.5 ns |
0.94 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7771 ns |
13396 ns |
0.58 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
22490.5 ns |
23661 ns |
0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1201659 ns |
1330674.5 ns |
0.90 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
266125 ns |
266208 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
39881 ns |
39961 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
67854.5 ns |
51604 ns |
1.31 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
69000.5 ns |
49000 ns |
1.41 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
34000 ns |
45750 ns |
0.74 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
66938 ns |
45375 ns |
1.48 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
212927 ns |
218958 ns |
0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10377988 ns |
11575244 ns |
0.90 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2027979 ns |
2067250 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
269205 ns |
264843 ns |
1.02 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
21666.5 ns |
21396 ns |
1.01 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
24917 ns |
25667 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
22000 ns |
24249.5 ns |
0.91 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5000 ns |
7375 ns |
0.68 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
16349 ns |
17124 ns |
0.95 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
83941 ns |
84151 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
11896 ns |
12229 ns |
0.97 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
9167 ns |
10687 ns |
0.86 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
9625 ns |
10229 ns |
0.94 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18125 ns |
17792 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
225254.5 ns |
229557 ns |
0.98 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
371231.5 ns |
371578.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
406875 ns |
406750 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
223500 ns |
297125 ns |
0.75 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
222958 ns |
296834 ns |
0.75 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762709 ns |
762417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
45942 ns |
46955 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1367160.5 ns |
1453711 ns |
0.94 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
430125 ns |
484187.5 ns |
0.89 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
88672 ns |
88881 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1430000 ns |
1431645.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
893542 ns |
1166209 ns |
0.77 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
895041 ns |
1164750 ns |
0.77 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2388417 ns |
2471229 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
285687.5 ns |
294082.5 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
12748458.5 ns |
12353848 ns |
1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2059583.5 ns |
2093020.5 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
380437 ns |
380814 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
434375 ns |
434500 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
430667 ns |
437125 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
430167 ns |
437250 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
448083 ns |
447542 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
53382 ns |
54894 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
999319 ns |
1083139 ns |
0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1124000 ns |
1087416 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
235214 ns |
233642 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3915375 ns |
3902292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4016417 ns |
4012625 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4025417 ns |
4016541 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3814625.5 ns |
3808250 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
258018 ns |
266487.5 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30663314.5 ns |
35233900 ns |
0.87 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10573708 ns |
10616978.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1363895.5 ns |
1364063 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8750 ns |
8750 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
6917 ns |
7667 ns |
0.90 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
6875 ns |
7667 ns |
0.90 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12417 ns |
12375 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23861 ns |
24395 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2097838.5 ns |
2388137.5 ns |
0.88 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
221750 ns |
229041 ns |
0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
210084 ns |
209122 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
44750 ns |
44875 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
45084 ns |
45000 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
44958 ns |
45000 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
45000 ns |
45292 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
343533 ns |
350021 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13202058 ns |
14581645 ns |
0.91 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1729395.5 ns |
1777208 ns |
0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
655242.5 ns |
655627 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88604 ns |
124000 ns |
0.71 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
107854 ns |
96270.5 ns |
1.12 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
88666 ns |
86562.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
149354 ns |
86958.5 ns |
1.72 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189838 ns |
189446 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5846123.5 ns |
6078785 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1995729 ns |
1983729 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
219674 ns |
221122 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2014958 ns |
2025375 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2026229.5 ns |
2011792 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2021687.5 ns |
2010229 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2026875 ns |
2013666.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
525687 ns |
536819 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27091285.5 ns |
29198754 ns |
0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9642083.5 ns |
9376375 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1042630 ns |
967839 ns |
1.08 |
This comment was automatically generated by workflow using github-action-benchmark.