This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix: update to use test_gradients macro #161

Merged 2 commits on Sep 18, 2024

avik-pal (Member)

No description provided.

test/common_ops/dropout_tests.jl (outdated, resolved)
test/normalization/batchnorm_tests.jl (outdated, resolved)
@avik-pal force-pushed the ap/up_test branch 2 times, most recently from 93bde63 to 2c77ccb (September 18, 2024 04:01)
@github-actions bot (Contributor) left a comment

LuxLib Benchmarks

Benchmark suite Current: 749aa81 Previous: 0df09fa Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5791 ns 6938 ns 0.83
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5959 ns 7438 ns 0.80
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8459 ns 7541 ns 1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5688 ns 5750 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 118315 ns 133931 ns 0.88
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2608336 ns 2868757 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 3336084 ns 741167 ns 4.50
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 410709 ns 407074 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10083.5 ns 9916.5 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9792 ns 9625 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9958 ns 9937.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9708 ns 9916.5 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 536714 ns 536526 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17930214 ns 17845684 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2513792 ns 2422500 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 11630401 ns 678976 ns 17.13
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458.5 ns 1583 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1375 ns 3145.5 ns 0.44
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1666 ns 2812.5 ns 0.59
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1667 ns 1541.5 ns 1.08
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 20966 ns 21370 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1232007 ns 1416739 ns 0.87
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 206854 ns 237500 ns 0.87
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 29122 ns 29161 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4083 ns 4166 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4041 ns 4291 ns 0.94
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4271 ns 4417 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3542 ns 4104 ns 0.86
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 141025 ns 143094 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8593659 ns 9766798.5 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1574083 ns 1569250 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 145156 ns 144301 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58083 ns 58000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39792 ns 46834 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39750 ns 46584 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83625 ns 82333 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36108.5 ns 36625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 809203.5 ns 686115 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1029417 ns 1069291 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77888 ns 78821 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2041709 ns 2031375 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2088479 ns 2084708 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2081875 ns 2090291 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003583.5 ns 1985542 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 221640 ns 225038 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8023876 ns 8235886 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5473375 ns 5106125 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1554303 ns 987279 ns 1.57
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147042 ns 174500 ns 0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148792 ns 162104.5 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 159833 ns 165229 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 157979 ns 145875 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165109 ns 165145 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6993410 ns 8411274 ns 0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1577458 ns 1520666 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 201107.5 ns 209957 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1116791 ns 1119979 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1123500 ns 1112166.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1115937.5 ns 1117709 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1121458.5 ns 1107125 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 682529 ns 687949 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25183376 ns 35372606 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6340229 ns 6112291 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1027084.5 ns 1024164.5 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5333 ns 4625.5 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4291.5 ns 5104 ns 0.84
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4937.5 ns 5583 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5750 ns 5042 ns 1.14
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 90041.5 ns 92273 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5138284.5 ns 5823843 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 440125 ns 499583.5 ns 0.88
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 68083 ns 67701 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9000 ns 9000 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 8500 ns 1
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8875 ns 9187.5 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8209 ns 8417 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 586953.5 ns 600949 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 32463791.5 ns 36561430 ns 0.89
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5763833 ns 5960250 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 384805 ns 389274 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19312.5 ns 19625 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17292 ns 17791 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19542 ns 20291 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18812.5 ns 16645.5 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 65813 ns 65239 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2762320.5 ns 3323140 ns 0.83
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1267542 ns 1293104 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76953 ns 73656 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214041 ns 220959 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212083 ns 212333 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213271 ns 212541 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 239625 ns 212000 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 346222 ns 347340 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 10319532 ns 13974103 ns 0.74
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5686708 ns 5755333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 470837 ns 462604 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 750 ns 666 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 833.5 ns 0.75
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 875 ns 0.91
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 666.5 ns 584 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20174 ns 20357 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1159971 ns 1288251 ns 0.90
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 289709 ns 292667 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 31351 ns 31491 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1625 ns 1416.5 ns 1.15
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1416 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1625 ns 0.92
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1416 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 122010.5 ns 123399.5 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 9092998 ns 9450809 ns 0.96
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1444834 ns 1493229 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 133079.5 ns 135231 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7458 ns 7500 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5417 ns 6042 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5375 ns 6000 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10333 ns 10125 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23165 ns 23818 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1198319.5 ns 1331154.5 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 675854 ns 628937.5 ns 1.07
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47562 ns 46911 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220167 ns 219750 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 235791 ns 265167 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228354 ns 264416 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 258625 ns 249854 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 188279 ns 189311.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29478229 ns 33158982 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8824000 ns 9299979.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 644673 ns 643876 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4084 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4083 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23288.5 ns 23427 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1877196 ns 2124740.5 ns 0.88
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 220187.5 ns 222770.5 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46182 ns 46290 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16833 ns 16833 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16500 ns 16792 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17292 ns 16750 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16833 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 190682 ns 191493 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 11595194 ns 11757211 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 951312.5 ns 955313 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 174872 ns 171341.5 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511625 ns 511167 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 332125 ns 405458 ns 0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 332333.5 ns 405000 ns 0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 864542 ns 858250 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113037 ns 113156 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 390383 ns 448835 ns 0.87
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 495854 ns 471209 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 241899 ns 240532 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2269062.5 ns 2268250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1751687.5 ns 2031416 ns 0.86
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1758583.5 ns 2030917 ns 0.87
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3193062.5 ns 3275750 ns 0.97
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 237180 ns 236871 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11204627.5 ns 10359638.5 ns 1.08
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1874125 ns 1993250 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 743275.5 ns 739142 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6542 ns 6583 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6458 ns 6875 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8292 ns 7709 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6750 ns 6292 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 89032 ns 90224.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5316353 ns 5882879 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 774000 ns 771000 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 65767 ns 65250 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11750 ns 12333.5 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10459 ns 11375 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11375 ns 11312.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11708 ns 11833.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 626029 ns 622443 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39909481 ns 41746922 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5517500 ns 5637750 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 411686 ns 407854 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 22999 ns 22944 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2336392 ns 2423476.5 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 220167 ns 326750 ns 0.67
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 47349 ns 48960 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2083 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 224074 ns 217144 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11718228.5 ns 12060454 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1959542 ns 1960083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 174425 ns 180236.5 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9209 ns 8625 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8562.5 ns 9646 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10209 ns 11229 ns 0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8459 ns 8792 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 100989 ns 103267 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3052989 ns 3427494 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 873833 ns 875083 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 73188 ns 73431 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17708 ns 17834 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17416.5 ns 17916 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18583 ns 17333 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17438 ns 18000 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 592788 ns 586862 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 12675498 ns 17435012.5 ns 0.73
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5159749.5 ns 5223458 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 381661 ns 377954 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 458 ns 541 ns 0.85
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34522 ns 34849.5 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1143136.5 ns 1279718 ns 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 386000 ns 435291 ns 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46209 ns 45841 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9520.5 ns 8979.5 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 9250 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 8917 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9666.5 ns 8146 ns 1.19
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 247762 ns 260579 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 17492397 ns 19733483 ns 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5085208 ns 4985875 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 370118 ns 366004 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398834 ns 398667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215167 ns 287958 ns 0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215291 ns 287750 ns 0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756208 ns 756458 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111119 ns 111261.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 324528 ns 376549 ns 0.86
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 483083 ns 367583.5 ns 1.31
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 76409 ns 74430 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1402459 ns 1400375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 858958 ns 1135375 ns 0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 859209 ns 1132354 ns 0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2358375 ns 2440958 ns 0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 203517 ns 203910 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 9265632 ns 9225527 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1557542 ns 1662875 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322265 ns 321818 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7375.5 ns 7604.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7333.5 ns 8083 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8479.5 ns 8729 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7187.5 ns 7437.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 135285 ns 142785 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5475873.5 ns 6299176.5 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 448604 ns 521292 ns 0.86
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 66419 ns 65420 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15792 ns 12583 ns 1.26
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14146 ns 12437.5 ns 1.14
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15042 ns 14521 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14937.5 ns 14979.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 890292 ns 943733.5 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41273046 ns 47612069 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5852584 ns 5885062.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 432034.5 ns 417444 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 29854.5 ns 30395.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 26354 ns 29604 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30125 ns 27709 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24958.5 ns 25083.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 190182.5 ns 195905 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7912508 ns 8216412 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 986625.5 ns 990125 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 116199 ns 116401 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 152875 ns 154583.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 146458 ns 155500 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 147854.5 ns 114042 ns 1.30
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103959 ns 113187.5 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1006113 ns 1061855 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42468130 ns 46328998 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5912542 ns 5883041 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 588675 ns 586901 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77000 ns 74459 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76458 ns 75833 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77042 ns 78208 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 79458 ns 75958 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 199450 ns 203068 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7017337 ns 7813436 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 532167 ns 533437.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 130879 ns 127391 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 303375 ns 298166 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 319563 ns 303208 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 288062.5 ns 306041.5 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 312021 ns 295666 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1064896 ns 1104226 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 40177222.5 ns 44772773.5 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6376583.5 ns 6766000 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 697067.5 ns 694176 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16750 ns 17000 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17417 ns 17292 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18625 ns 18375 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 18292 ns 16792 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 140061 ns 145201.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5763073.5 ns 6348029 ns 0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 711250 ns 448000.5 ns 1.59
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235139 ns 231113 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27937.5 ns 27208 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 28583 ns 28625 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26958.5 ns 27187.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26458.5 ns 26145.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 923284.5 ns 972527 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41401667 ns 44334727.5 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5833417 ns 5935916 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 694159 ns 684627 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11146 ns 11375 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11666 ns 11625 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12084 ns 14042 ns 0.86
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10667 ns 10416 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 117789.5 ns 123261.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3540129 ns 3725175 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 897145.5 ns 904958 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 235970 ns 233272 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21958 ns 22000 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21666.5 ns 21666 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22834 ns 21542 ns 1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21625 ns 21916 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 693239 ns 697545 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21263436 ns 22814286 ns 0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5472291 ns 5479812.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 679571 ns 668531 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 63209 ns 67459 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 65084 ns 63625 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68084 ns 65084 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 62541.5 ns 62667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104514.5 ns 105558.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3308618 ns 3699497 ns 0.89
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1322854 ns 1336625 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 235301 ns 231652 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 450042 ns 450250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 440750.5 ns 451792 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 467375 ns 446041.5 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 436875 ns 484250 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 506753 ns 508079 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20425451.5 ns 22280153.5 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6092458 ns 6164479 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 717923 ns 712097 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7958.5 ns 7667 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7834 ns 8458 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9167 ns 8041.5 ns 1.14
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8020.5 ns 7083.5 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 141656 ns 142974 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5744993.5 ns 5983895.5 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 737709 ns 687104.5 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 65540 ns 68961 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15542 ns 14333 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15625 ns 14312 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14292 ns 15021 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 12792 ns 15250 ns 0.84
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 926648 ns 941966 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38778177 ns 40659493.5 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5540041 ns 5744375 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 404172 ns 395784 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6158208.5 ns 6161520.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3222792 ns 6378125.5 ns 0.51
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3226479 ns 6377708.5 ns 0.51
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11922958 ns 11920959 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 349729 ns 347985 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 326262 ns 320268 ns 1.02
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19145000 ns 19132416 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11139458 ns 20009458 ns 0.56
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11142750 ns 19937708 ns 0.56
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36483021 ns 36464229.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1024792.5 ns 1013485 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1155998.5 ns 1165921 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 959 ns 917 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1041 ns 1000 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 917 ns 1.09
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 22915 ns 23221 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2061332 ns 2197390 ns 0.94
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 216166 ns 332458.5 ns 0.65
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 208941 ns 205762 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3708 ns 3709 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3667 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3625 ns 3667 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 274731 ns 277792 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11037575.5 ns 12494000 ns 0.88
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2079354 ns 2076312.5 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625946 ns 624236 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8021 ns 8792 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8250 ns 8875.5 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9958 ns 9875 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8563 ns 7625 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 117470.5 ns 119047.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3291500 ns 3910252.5 ns 0.84
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 806687.5 ns 795416.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 72880 ns 65320 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12083.5 ns 11374.5 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11958 ns 12208 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12583 ns 11792 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12167 ns 11979.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 628343.5 ns 629697.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21030497 ns 23515262 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5058833 ns 5019875 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 362883 ns 352263 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 250 ns 1.33
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22172 ns 22203 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2032725 ns 2289294 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 222333 ns 228916 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 49091 ns 46161 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3208 ns 3084 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2958 ns 2959 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3500 ns 2917 ns 1.20
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3125 ns 2875 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 198272 ns 200155 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9410897.5 ns 9757264 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1547188 ns 1632083 ns 0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 166212 ns 153411.5 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10416 ns 11563 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11125.5 ns 11334 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13333 ns 12292 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11125 ns 10854 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 119897 ns 120519 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3353945 ns 3640370.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 872167 ns 897667 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 241608 ns 232282 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21834 ns 20750 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20708.5 ns 21083 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21292 ns 21959 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21021 ns 21458.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 585466 ns 590202 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19722619 ns 22574638 ns 0.87
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4787292 ns 4746958.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 654628 ns 639216 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4417 ns 4375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 23791 ns 23877 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2193197 ns 2442376 ns 0.90
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223375 ns 225708 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 49361 ns 46800 ns 1.05
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16250 ns 16291 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16250 ns 16625 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16500 ns 16459 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16083 ns 16500 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 326259.5 ns 326023.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12072256 ns 13171553 ns 0.92
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1574083.5 ns 1188229 ns 1.32
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 212818 ns 205042 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2083 ns 2042 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2167 ns 2083 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2167 ns 2083 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2041 ns 2084 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 35274 ns 35572 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1199050 ns 1338351 ns 0.90
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 367834 ns 435459 ns 0.84
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 204943 ns 202812 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17437.5 ns 16520.5 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 18375.5 ns 17104.5 ns 1.07
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16541 ns 18375 ns 0.90
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 17167 ns 18770.5 ns 0.91
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 287962 ns 291395 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21039131 ns 23003699 ns 0.91
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5565375 ns 5678333 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 690990 ns 682086 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59500 ns 58979 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 61125 ns 67125 ns 0.91
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 62834 ns 66917 ns 0.94
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51458 ns 51625 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66712 ns 66452 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 115812 ns 114721 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 161312.5 ns 162292 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 122750.5 ns 147229 ns 0.83
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 131083 ns 130229 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 226583 ns 296770.5 ns 0.76
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 214529.5 ns 213701 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 628329 ns 607926 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 110166.5 ns 84250 ns 1.31
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 117500 ns 124729 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85750 ns 85875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 106854.5 ns 123833 ns 0.86
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193539 ns 193440 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5446551 ns 7291287 ns 0.75
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2001021 ns 1831167 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 219713 ns 203522 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914063 ns 1928271 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1889833 ns 1891125 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1914396 ns 1902250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1926312.5 ns 1914749.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 525988 ns 525346 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25282704.5 ns 26967967.5 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8964750 ns 9298209 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1069581.5 ns 927389 ns 1.15
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21663 ns 21417 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2103825.5 ns 2392141 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 340417 ns 342188 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 43831 ns 42200 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1791 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 249420 ns 249016 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9776488 ns 10390055 ns 0.94
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1520500 ns 1093187.5 ns 1.39
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 185088 ns 179602 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8792 ns 9667 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8750 ns 10125 ns 0.86
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11041 ns 10249.5 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8875 ns 9375 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 116662 ns 118409 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3389039 ns 3710566 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 865583.5 ns 886083.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 234874 ns 231452 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9833 ns 9209 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9083 ns 10000 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9666 ns 9770.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9708 ns 9500 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 513048 ns 517575.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20325597 ns 21956361 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 3946334 ns 4314937.5 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 633311 ns 624606 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59167 ns 58209 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39917 ns 46542 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39667 ns 46750 ns 0.85
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83291 ns 83000 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39225 ns 39682 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1317436 ns 1450337.5 ns 0.91
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1124396 ns 1115958 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77022 ns 74661 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1828958.5 ns 1939500 ns 0.94
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1972875 ns 1983125 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1973979 ns 1951312.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1902583.5 ns 1897667 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 216838.5 ns 216819.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33688603 ns 37812796.5 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10618958 ns 10968478.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1176400 ns 1185212 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 418542 ns 417625 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418666 ns 419834 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 420437.5 ns 420958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 420375 ns 417208 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 205056 ns 204963.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7972049 ns 8983027 ns 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 530229 ns 546875 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 284905 ns 280603 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 669688 ns 669791.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 734562 ns 780667 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 681104 ns 689645.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 770604 ns 725292 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1036899 ns 1038703 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 47133005 ns 49679972 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6149083 ns 6487209 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 916011 ns 909389 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3452417 ns 3413542 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3455583 ns 3417875 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3425042 ns 3420479 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3460771 ns 3414187 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169776 ns 168543 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 9104393 ns 8597060 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1379604.5 ns 1366458.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 450698 ns 434404 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6210334 ns 6191104 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6233042 ns 6232645.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6231833 ns 6213854 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6206916.5 ns 6216250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 988034.5 ns 979877 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50671907 ns 50928344 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7136416 ns 7557875 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1711422 ns 1538944 ns 1.11
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 474458 ns 471584 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 253375 ns 341687.5 ns 0.74
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 253417 ns 340375 ns 0.74
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 903167 ns 902500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46475 ns 46568 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 386217.5 ns 450349 ns 0.86
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 443708 ns 504562.5 ns 0.88
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242304 ns 241952 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2281792 ns 2276916 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1759521.5 ns 2038666 ns 0.86
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1761416.5 ns 2034583 ns 0.87
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3194417 ns 3280958 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 267186.5 ns 253153 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 16005829 ns 14086050 ns 1.14
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2190667 ns 2208291.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 769044 ns 765407 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58542 ns 57959 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39583 ns 46250 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39667 ns 46250 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83583 ns 82792 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27928.5 ns 28134 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1374875 ns 1575508 ns 0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1148500 ns 1135958 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73812 ns 73405.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2038270.5 ns 1962520.5 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2092792 ns 2093312.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093666 ns 2086834 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001542 ns 2000458.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 231752 ns 229351 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35917944.5 ns 38934321 ns 0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11278917 ns 11662250 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1043680.5 ns 1196771 ns 0.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58333 ns 58208 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 40042 ns 46812.5 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39458 ns 46708 ns 0.84
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83541 ns 82375 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 49529 ns 49491 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 836997 ns 947062 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1076416 ns 1068833 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77896.5 ns 77751 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1887708.5 ns 1937792 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1953250.5 ns 1974209 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1969146 ns 1960000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896416 ns 1899959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 239764.5 ns 235535 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 16726563.5 ns 22349832.5 ns 0.75
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9964895.5 ns 9994166 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1048441 ns 915999 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34451 ns 34420 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1206965.5 ns 1328125 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 397229.5 ns 278292 ns 1.43
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 49261 ns 45880 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6250 ns 6541 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708 ns 6917 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7250 ns 6584 ns 1.10
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 6458 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 210250 ns 209753 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20421243.5 ns 22541718 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5307333.5 ns 4971437.5 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 373842.5 ns 368183 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32507 ns 31457 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1221670 ns 1340759 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 253708 ns 258291 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 40970 ns 36451 ns 1.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2666 ns 3458 ns 0.77
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 3292 ns 0.92
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3125 ns 2917 ns 1.07
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3584 ns 2917 ns 1.23
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 188962 ns 185714.5 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 9532662 ns 8803725 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1060333 ns 950374.5 ns 1.12
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 163034 ns 150601 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 425125 ns 424603.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 456833 ns 425000 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 426083 ns 430459 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 456458 ns 443562.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138588 ns 136540 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6071449 ns 6325011.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2788209 ns 2056896 ns 1.36
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 366947 ns 365713 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3814167 ns 3790417 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3815333.5 ns 3803834 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3816834 ns 3804250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3803187.5 ns 3813000 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 704291 ns 699295 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33600498 ns 34149887.5 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10285104 ns 11037916.5 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1461255 ns 1464794 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49948750 ns 49877979 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 25988208 ns 35522250 ns 0.73
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 26007292 ns 35535229 ns 0.73
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97079833 ns 96934583 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1594546 ns 1591242 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1044656.5 ns 1047550 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154846562.5 ns 154708541.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 88751208 ns 112454083.5 ns 0.79
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 89372584 ns 112480333 ns 0.79
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295780062.5 ns 296379229 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6552162 ns 6494323.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5594625 ns 5551012.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 18167 ns 19062.5 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 15542 ns 17833.5 ns 0.87
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 13917 ns 17041 ns 0.82
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15125 ns 15875 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20123 ns 21028 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1097277.5 ns 1230713 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 215458 ns 219604.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26061 ns 25950 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11209 ns 10958 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7583 ns 9041 ns 0.84
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 8042 ns 9041.5 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17083 ns 17375 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 258757 ns 257331 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 9624727.5 ns 10803925 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1541646 ns 1552917 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 149533 ns 147801 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8812.5 ns 9354.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9541 ns 10000 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10833 ns 10458 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8270.5 ns 7458 ns 1.11
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 124166.5 ns 114779 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3724156 ns 3881100 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 739667 ns 797833 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 235009.5 ns 233502 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9146 ns 9916 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9708.5 ns 9708 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 9334 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9437.5 ns 9709 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 615832.5 ns 616669 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22440306.5 ns 25342914 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5011583 ns 4989750 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 656304 ns 651926 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9146 ns 10583 ns 0.86
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9375 ns 9146 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11083 ns 10584 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9625 ns 9875 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 118765 ns 120200.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3425413 ns 3758991 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 896229 ns 905750 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 73031.5 ns 71611 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13083.5 ns 13541 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13792 ns 15500 ns 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14083 ns 15458.5 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12833.5 ns 18125 ns 0.71
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 588412 ns 585824 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19699495 ns 21389400 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4785917 ns 4649750 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 347898 ns 343933 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34901 ns 34550 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1157159 ns 1371228 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 326875 ns 447645.5 ns 0.73
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 205325 ns 203956.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7291 ns 8270.5 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9145.5 ns 8708 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7709 ns 9167 ns 0.84
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333.5 ns 10625 ns 0.69
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 228635 ns 231015.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21730930.5 ns 24528595.5 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5008062.5 ns 5171458.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 663024 ns 654796 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16333 ns 16167 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 14167 ns 15895.5 ns 0.89
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 13521 ns 15979 ns 0.85
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10833 ns 11875 ns 0.91
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 21021 ns 21988 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1210830.5 ns 1304948 ns 0.93
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 206167 ns 257646 ns 0.80
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 187184 ns 184412 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 31583 ns 32084 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 31958 ns 31875 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32209 ns 32250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 31792 ns 31708 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 271653 ns 271511.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10857362.5 ns 12350146 ns 0.88
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1802125 ns 1659167 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 592182.5 ns 587425 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 469958 ns 504958 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 507208 ns 481520.5 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 444250 ns 443208 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 456708 ns 488374.5 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194607 ns 195092 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6126515 ns 6520561 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1972959 ns 1945520.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 377228 ns 367668 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3835666 ns 3839417 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3835645.5 ns 3824437.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3840583 ns 3828250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3839145.5 ns 3827604.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 538632 ns 535436 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29473777 ns 32985580 ns 0.89
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8953209 ns 9639667 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1358598 ns 1204966.5 ns 1.13
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 782285709 ns 781980875 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 418895375.5 ns 543423875 ns 0.77
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 417745166.5 ns 542625875 ns 0.77
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1560235646 ns 1559677978.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22544059.5 ns 22745322 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14748380 ns 14786409 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2522304416 ns 2528971583 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1512463959 ns 2254450917 ns 0.67
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1520064875 ns 2476668541 ns 0.61
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4768759708 ns 6300456542 ns 0.76
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 367694257 ns 366701385 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 89045527 ns 88751089 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 82937.5 ns 75666 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76292 ns 79041.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 82396 ns 79458.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 77000 ns 76208 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 205651.5 ns 203948 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7768857 ns 9083475 ns 0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 531708 ns 526062.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 121769 ns 106536 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 249792 ns 270270.5 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 195958.5 ns 292875 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 257583 ns 198312 ns 1.30
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 287041 ns 194667 ns 1.47
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1042639.5 ns 1034833 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43696222 ns 46783284 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5997000 ns 6115521 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 664536.5 ns 633286 ns 1.05
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199683166.5 ns 199771000 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 103959542 ns 138674666 ns 0.75
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 104163542 ns 138669167 ns 0.75
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389356166 ns 388512334 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5842082 ns 5812826 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3517132.5 ns 3596784 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 621465854 ns 621035604.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 353116166.5 ns 439829542 ns 0.80
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 355147917 ns 440801667 ns 0.81
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1181014666 ns 1196350375 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26729826 ns 26769444 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 22245968 ns 21887487 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7084 ns 7291 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 6042 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5417 ns 6125 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 9875 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27701 ns 27497 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1191974.5 ns 1432348 ns 0.83
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 593375 ns 374083 ns 1.59
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48750 ns 46690 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212229.5 ns 216042 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221833 ns 224375 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 249042 ns 220687.5 ns 1.13
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213750 ns 208062.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 221585 ns 218341 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 35257824.5 ns 35298334.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9130229 ns 9155708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 531864 ns 528325 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8458.5 ns 9354.5 ns 0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8604 ns 9396 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10791.5 ns 9750 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9083.5 ns 7958.5 ns 1.14
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 115111.5 ns 118295 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3617059 ns 3790588 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 879333 ns 873834 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 71345 ns 69600 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7584 ns 8562.5 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8333 ns 9834 ns 0.85
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8166 ns 9500 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7520.5 ns 12312.5 ns 0.61
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 517298 ns 512184 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18428378 ns 21450760 ns 0.86
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4662000 ns 4433459 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 319680 ns 315553 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 750 ns 709 ns 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26044 ns 26098 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1150773 ns 1299422.5 ns 0.89
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 462250 ns 479708.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 47150 ns 46840 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9542 ns 9167 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13333 ns 11416 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9562.5 ns 11062.5 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8666 ns 9416 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 251795.5 ns 251250 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 24087876 ns 25886430.5 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5990833 ns 5832146.5 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 391380.5 ns 387883 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 105959 ns 107916 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 85375 ns 99250 ns 0.86
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 87209 ns 100645.5 ns 0.87
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146625 ns 146583 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 23920 ns 24989 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1133082 ns 1282751 ns 0.88
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 261562.5 ns 267229.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190780 ns 189842 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 516833 ns 514208 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 478000 ns 478541.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 524458 ns 478375 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478250 ns 482875 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 230382 ns 229903 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11487991 ns 12990087 ns 0.88
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2229792 ns 2133042 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 609430 ns 608146 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 6000 ns 5666 ns 1.06
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6584 ns 7250 ns 0.91
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7666.5 ns 6291 ns 1.22
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6292 ns 6625 ns 0.95
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16007 ns 16240.5 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 80481 ns 79631 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11500 ns 12417 ns 0.93
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10417 ns 11167 ns 0.93
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 10500 ns 12041.5 ns 0.87
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16750 ns 16416.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 213463 ns 211157 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 368790 ns 375234 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39833 ns 39750 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51208 ns 52000 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52958 ns 53021 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 16083 ns 16042 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/CUDA 19843 ns 19539 ns 1.02
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86350 ns 90780.5 ns 0.95
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 37062.5 ns 42917 ns 0.86
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 29000 ns 32167 ns 0.90
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32500 ns 32875 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57041 ns 57042 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 193192 ns 190769.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 400156.5 ns 392564 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1812.5 ns 1833.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 2000 ns 1875 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2083 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1791 ns 1792 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 20303 ns 20462 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1098861 ns 1239481 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 306812.5 ns 307042 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 33030 ns 31870 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2208 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2292 ns 2291 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2312.5 ns 2208 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 202420 ns 201344.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9070892 ns 10165131 ns 0.89
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1521041.5 ns 1570917 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136810 ns 136316.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 6520.5 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4792 ns 5000 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5583 ns 5625 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4958 ns 5500 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 144381.5 ns 143896 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5648025.5 ns 6277095.5 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 683167 ns 750374.5 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 69500 ns 69261 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8645.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 8583.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 9291.5 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8062.5 ns 8020.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 866731 ns 867420 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 42922265 ns 42275328 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5538771 ns 5663374.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390451 ns 387123 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56833 ns 56875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56875 ns 57833 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56875 ns 57750 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58209 ns 58375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37311.5 ns 36655 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1236176 ns 1241845 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 556708 ns 541750 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202581 ns 202922 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 492104.5 ns 468937.5 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 509208 ns 477229.5 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 509000 ns 464541 ns 1.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 438583 ns 433625 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 265396 ns 263574 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26630610.5 ns 28829027 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8076791.5 ns 8162250 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 829974 ns 827187.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3316104.5 ns 3317521 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1768416 ns 2329500 ns 0.76
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1771459 ns 2336167 ns 0.76
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6317854.5 ns 6302416.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204724 ns 204892 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 215316.5 ns 208562 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11544229 ns 11517062.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6581916.5 ns 8328812.5 ns 0.79
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6594792 ns 8342500 ns 0.79
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21187791.5 ns 21059354.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 745392 ns 734814.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1063266 ns 1048679.5 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5604.5 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 5875 ns 0.79
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6333 ns 6395.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6375 ns 4750 ns 1.34
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 137026.5 ns 136624.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5889063 ns 6038921 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 757125 ns 813000 ns 0.93
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56800 ns 56330 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7375 ns 9834 ns 0.75
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 11375 ns 0.76
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7709 ns 10792 ns 0.71
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 7083 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 750422.5 ns 751768 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 33915039 ns 37322808 ns 0.91
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5239145.5 ns 5368750 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 370687.5 ns 366754 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 122667 ns 126417 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 101167 ns 101833 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98583 ns 97167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99458 ns 135458.5 ns 0.73
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 152082 ns 149617 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6135228 ns 6377317.5 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2138604 ns 2013729 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203902 ns 203027 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1941125 ns 1956250 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2014375.5 ns 2025708 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2034875.5 ns 2023583 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2033875.5 ns 2023875 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 705006 ns 699728 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31629524 ns 32486459.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11052146 ns 11144687.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1185578.5 ns 1109856 ns 1.07
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34000 ns 33708 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35250 ns 36250 ns 0.97
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 33250 ns 35292 ns 0.94
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 541.5 ns 667 ns 0.81
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15256 ns 15147 ns 1.01
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 79870 ns 78750 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2584 ns 3166 ns 0.82
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3625 ns 3292 ns 1.10
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3000 ns 3916 ns 0.77
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2167 ns 2125 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 139305.5 ns 138043.5 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 342553 ns 341483.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7291 ns 7333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5416.5 ns 6125 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5417 ns 5959 ns 0.91
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10416 ns 10083 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36720 ns 36390.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1185072 ns 1443013 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 402125 ns 577687.5 ns 0.70
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48010 ns 48291 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 240167 ns 217083 ns 1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222562.5 ns 233729 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 251312 ns 220875 ns 1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213500 ns 206583 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243524 ns 241954 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26052385 ns 28863209.5 ns 0.90
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7843208 ns 8063584 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 577185 ns 578495 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3916 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22084 ns 21377 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2144156 ns 2296242.5 ns 0.93
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 244500 ns 246729.5 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 42581 ns 42010 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14625 ns 14750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14667 ns 15000 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14666 ns 14834 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14625 ns 14937.5 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 310198 ns 306378 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11470797 ns 12904688 ns 0.89
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1020291 ns 1048854 ns 0.97
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 191302 ns 192742 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 128709 ns 128750 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 101459 ns 128042 ns 0.79
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 105333 ns 102500 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 100917 ns 128458 ns 0.79
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148106 ns 133598 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6083211 ns 6098969 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2535500 ns 1992062.5 ns 1.27
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204437 ns 203872 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1778292 ns 1913375 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1838417 ns 1918875.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926667 ns 1920354.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1926625.5 ns 1922729.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 690367 ns 684636 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31757588.5 ns 31678268 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10251583.5 ns 10983583.5 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1085040 ns 1217291 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18833 ns 19708 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19500 ns 18000 ns 1.08
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20750 ns 21250 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18708 ns 17541 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 109386.5 ns 107089 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3594861 ns 3668703 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1361959 ns 1366125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79071 ns 79431 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229167 ns 222250 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223958 ns 227291.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219667 ns 221667 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225792 ns 217604.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 517573.5 ns 512942 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19977378 ns 20906293 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6209166.5 ns 6227916.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 477899.5 ns 478915 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24937.5 ns 24625 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 29792 ns 32084 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 27500 ns 29583.5 ns 0.93
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1375 ns 1354 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/CUDA 15934 ns 15775 ns 1.01
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82261 ns 87130 ns 0.94
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4667 ns 5208 ns 0.90
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5375 ns 4937.5 ns 1.09
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5167 ns 6250 ns 0.83
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4792 ns 4208 ns 1.14
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 207098.5 ns 205104 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 382274 ns 375704 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 307083 ns 305500 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 306624.5 ns 305958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 309875 ns 308125 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 308125 ns 304792 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 227232 ns 224810.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7383776.5 ns 8473173 ns 0.87
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1119667 ns 1064042 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 274402.5 ns 272523 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 535375 ns 588083 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 535458 ns 540979 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 542041 ns 532187.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 532375 ns 530000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1072987 ns 1066787 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42757773 ns 49056863 ns 0.87
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6023750 ns 6401167 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 857209 ns 857918.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18875 ns 20292 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19791.5 ns 21021 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21666 ns 21584 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20084 ns 19459 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 114799 ns 111914.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3514225 ns 3915484 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1441875 ns 1445124.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78441 ns 79161 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213833 ns 259584 ns 0.82
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213375 ns 218709 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219417 ns 213833 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217583 ns 221709 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 759467 ns 729277 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26177214.5 ns 28351086 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7223062.5 ns 7519125 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538626 ns 535735 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6667 ns 7542 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7125 ns 6750 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8208 ns 7854 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7375 ns 6416 ns 1.15
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 139107 ns 139596.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5697682 ns 6386332.5 ns 0.89
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 769875 ns 812791.5 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 66301 ns 64971 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9792 ns 12937 ns 0.76
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 9604 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10958 ns 10479 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8708 ns 9042 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 824252 ns 821389 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 39780613.5 ns 41212440 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5397083 ns 5394125 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 390564 ns 376673 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5125 ns 5542 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5083.5 ns 6041 ns 0.84
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6000 ns 5979 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5458 ns 4125 ns 1.32
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 142825 ns 143159 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5800418 ns 6135330 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 796708.5 ns 841208 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 67781 ns 66410 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 8125 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7625 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7709 ns 7333 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7458 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 778801 ns 779803.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 38427967 ns 42489606.5 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5585958 ns 5806041.5 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 398294.5 ns 385114 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14567167 ns 14517875 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7736292 ns 10107833 ns 0.77
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7735645.5 ns 10123375 ns 0.76
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27854417 ns 27737959 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 534183 ns 529900 ns 1.01
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 386044.5 ns 392854 ns 0.98
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46615145.5 ns 46502041.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26633209 ns 33504375 ns 0.79
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26517167 ns 33527167 ns 0.79
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85829958 ns 85258875 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2633578.5 ns 2630210 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3289341 ns 3305402 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66875 ns 68083 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67145.5 ns 66021 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 67792 ns 69042 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66791 ns 66875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 119543.5 ns 120187.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3541492 ns 3913619.5 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1456541.5 ns 1439458.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 232457.5 ns 224532 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 462666 ns 502375 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 440917 ns 452542 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 448666.5 ns 441146 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442459 ns 444833 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 727570 ns 732944 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26439577 ns 29462542.5 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7463770.5 ns 7794083 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 787560 ns 779447 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 583 ns 1.14
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32374 ns 33084 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1183808 ns 1348590.5 ns 0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 293771 ns 458416.5 ns 0.64
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47460 ns 47291 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9125 ns 9209 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8687.5 ns 9500 ns 0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10021 ns 8666 ns 1.16
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8833 ns 9167 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 283836 ns 289186 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21930590 ns 24166950 ns 0.91
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5114333 ns 5210708.5 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 379495 ns 381324 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9834 ns 9834 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9833 ns 9792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23230 ns 23519 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2044307 ns 2258808.5 ns 0.91
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 219042 ns 221041.5 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 209443 ns 207272 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45792 ns 45959 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45875 ns 45959 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45958 ns 46041 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45708 ns 46375 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 289262 ns 292709.5 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 8827470 ns 13279604 ns 0.66
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 1045542 ns 963562.5 ns 1.09
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 601338 ns 601736 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56459 ns 56834 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56459 ns 57208 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 56458 ns 57000 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58125 ns 57791 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28542.5 ns 28797 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1178901 ns 1296667 ns 0.91
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 584250 ns 599375 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 203033 ns 214467.5 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 487250.5 ns 488583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 468833.5 ns 506875 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 512312 ns 467854 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 448625 ns 444854 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 245048 ns 247966 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32363620 ns 35422277.5 ns 0.91
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9380875 ns 9625250 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 884842 ns 889783 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 665833 ns 662791 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 655458 ns 645583 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 651562.5 ns 641458 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 641604 ns 654708 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 207953 ns 204631.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8553419 ns 9404311.5 ns 0.91
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1360709 ns 1366041 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 305729 ns 307612.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2241666 ns 2256146 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2263750 ns 2230917 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2261250 ns 2237292 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2258666 ns 2235916 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 954468 ns 983378 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49076465 ns 51532010 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6739354 ns 7223667 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1311013 ns 1360743 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20250 ns 21208 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19458 ns 21895.5 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22146 ns 24000 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19875 ns 18708 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111761.5 ns 113606 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3759777 ns 4029922 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1447333.5 ns 1470375 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78961 ns 81911 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 245042 ns 263833 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220646 ns 230917 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223250 ns 221375 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219375 ns 261833.5 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 721992 ns 732293.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25713634 ns 28666996 ns 0.90
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7609771 ns 7932292 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 556328 ns 557920 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 666 ns 583 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 22645 ns 23564 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1249581.5 ns 1402930.5 ns 0.89
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 464250 ns 479854.5 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47691 ns 49551 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10417 ns 10042 ns 1.04
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 9833 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9625 ns 9208 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9875 ns 8625 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 262905.5 ns 271175.5 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24154524 ns 27354439 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6051062 ns 5706584 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 404516 ns 399053 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10083 ns 9709 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9000 ns 9104 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11667 ns 9437.5 ns 1.24
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9292 ns 8375 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 117395 ns 122324.5 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3584331 ns 3848922 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 877416 ns 890083 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 70681 ns 69951 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7417 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7750 ns 7500 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7625 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7833.5 ns 7333 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 500398 ns 514534 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 16602320 ns 19594222 ns 0.85
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4523729 ns 4165479 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 323165 ns 320028 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1666 ns 1562.5 ns 1.07
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1666 ns 1708.5 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2209 ns 1833.5 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1500 ns 1333 ns 1.13
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20417 ns 21964 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1289220 ns 1238732.5 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 303041 ns 302542 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 189802 ns 188582 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3375 ns 3333 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3292 ns 3458 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3625 ns 3334 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3333 ns 3250 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 216794.5 ns 224397.5 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10346867 ns 10897600 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1614833 ns 1688875 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 579879 ns 578505.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 149188 ns 148875 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 106396 ns 132708 ns 0.80
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 107770.5 ns 130750 ns 0.82
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 233687 ns 225250 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23438 ns 24103 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1212311.5 ns 1297180 ns 0.93
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 293459 ns 269833 ns 1.09
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 39901 ns 40231 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 156812.5 ns 162604 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 114708.5 ns 127166 ns 0.90
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 108583.5 ns 112750 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 254000 ns 265229 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 213612 ns 219287 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10573257 ns 11195277.5 ns 0.94
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2022750 ns 1990375 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 267524 ns 267987.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7042 ns 7375 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 5959 ns 0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 6000 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10458 ns 10209 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32622 ns 33200 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1216978 ns 1323539 ns 0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 349791 ns 615604 ns 0.57
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50170 ns 50040 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 255041.5 ns 260750 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228146 ns 234833 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 242396 ns 265125 ns 0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213145.5 ns 221333 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 257391 ns 264591 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27184775 ns 29454390 ns 0.92
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8241166 ns 8466083 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 589609 ns 592630 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15709 ns 15750 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15583 ns 15667 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16791 ns 16167 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 16167 ns 14541 ns 1.11
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 137465 ns 140225 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5393762.5 ns 6115964 ns 0.88
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 765959 ns 798333 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 233163 ns 232492 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24125 ns 23708 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23375 ns 23479 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24583 ns 23562.5 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23958 ns 22667 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 855768 ns 872247 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 40529822 ns 42738683 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5795375 ns 5646770.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 680810 ns 676987 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10416 ns 10041 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9271 ns 10187.5 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11125 ns 11666 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9375 ns 8792 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 121395 ns 125357.5 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3528807.5 ns 3857738.5 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 857604.5 ns 898625 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73701 ns 75221 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13917 ns 14000 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13875 ns 13812.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14687.5 ns 14062.5 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13959 ns 14292 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 655994 ns 675390 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21978368 ns 23526980.5 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5304604 ns 5359958.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 366965.5 ns 365113 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10125 ns 10292 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9750.5 ns 9646 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11292 ns 10958 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9250 ns 8542 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 120411 ns 124246 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3481885.5 ns 3650341 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 896958 ns 890042 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73011 ns 72050 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12958 ns 13084 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12417 ns 12896 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13083 ns 12542 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12687 ns 12667 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 544787 ns 557269 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19530675 ns 20940364 ns 0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4633292 ns 4415208 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 344335 ns 341913.5 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 27604.5 ns 30438 ns 0.91
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33124.5 ns 32771 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 30833 ns 32145.5 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1750 ns 1875 ns 0.93
batchedmm(2, Bsize=128)/forward/GPU/CUDA 15866 ns 16382 ns 0.97
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 81132 ns 80651 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5333.5 ns 5375 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5104 ns 4937 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5188 ns 5208 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6375 ns 6292 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 137988 ns 141456.5 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 371616 ns 382544 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25253 ns 26188 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1186857 ns 1349689 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 410583 ns 455771 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 47471 ns 48850 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6667 ns 6583 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6375 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6250 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns 6250 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 183528 ns 190177 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22718575.5 ns 25715880 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5561041 ns 5628084 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 395306.5 ns 388664 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2042 ns 2042 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 2042 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2125 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 1958 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 26234 ns 26944 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1191724.5 ns 1363088 ns 0.87
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 468125 ns 471437.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 206183 ns 205032 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16729.5 ns 16958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16084 ns 16250 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 16749.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16750 ns 16250 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 269544.5 ns 278717.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 22860307 ns 26543319 ns 0.86
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6059500 ns 6143666 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 707392 ns 701356 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 176833 ns 193791 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 155875 ns 174166.5 ns 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 151750 ns 151875 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 152375 ns 161458 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 198561.5 ns 200117.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8216832 ns 8677326 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1419958 ns 1431250 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 221808.5 ns 224822 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1327687.5 ns 1332708 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1329500 ns 1313042 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1324959 ns 1321250 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1337687 ns 1320542 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 893954 ns 914262.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45120426 ns 52072722 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6332833 ns 6865145.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1110068.5 ns 1099471 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24583.5 ns 25270.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25833 ns 25750 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27895.5 ns 28167 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24792 ns 24645.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 233633.5 ns 236681 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7675411 ns 8520645.5 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1108375 ns 960167 ns 1.15
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115432 ns 114711 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 119062.5 ns 128833.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 130250 ns 184437.5 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 146791 ns 126541.5 ns 1.16
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 175812.5 ns 117313 ns 1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1046246.5 ns 1084581 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46302046 ns 48584064.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6122291 ns 6244708 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 614765 ns 609766 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22330 ns 23179.5 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1165828.5 ns 1352649.5 ns 0.86
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 383709 ns 470375 ns 0.82
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 47131 ns 47251 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6979.5 ns 6875 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6667 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7042 ns 6250 ns 1.13
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6770.5 ns 6604 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 199505 ns 206812 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25886485 ns 26430531.5 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6031292 ns 5939666 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 397337 ns 393154 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6542 ns 6750 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6687.5 ns 6416.5 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6792 ns 7042 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6791 ns 6750 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 143150 ns 147041 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5705092 ns 6204224 ns 0.92
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 446083.5 ns 711062.5 ns 0.63
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233934 ns 232702 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10209 ns 10250 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9959 ns 9875 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10458.5 ns 10250 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9791.5 ns 9792 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 886014.5 ns 908474 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 40692476 ns 42229280 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5868792 ns 6135833 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 674436.5 ns 665637 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 667 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22372 ns 22806 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2141517 ns 2183221 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 223812.5 ns 228667 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 207744 ns 206602 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4625 ns 4625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4584 ns 4666 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4792 ns 4625 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 222546.5 ns 229835 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10092882.5 ns 10794904 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1625125 ns 1685770.5 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 580171 ns 577495 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8459 ns 9042 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8292 ns 9083.5 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10208.5 ns 9354 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7875 ns 7834 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 119498 ns 124219 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3650353 ns 3899985 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 798291 ns 810375 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 74642 ns 74040.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 9000 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8854 ns 8291 ns 1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9500 ns 8750 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8875 ns 8375 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 581446.5 ns 596441 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21695216 ns 23179785 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4828146 ns 4819896 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 350806 ns 338953 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126666.5 ns 127000 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 95937.5 ns 131000 ns 0.73
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 96708.5 ns 129584 ns 0.75
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183500 ns 180958.5 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 45553 ns 46329 ns 0.98
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 101962 ns 104561 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 336291 ns 341167 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 178667 ns 333583 ns 0.54
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 193458 ns 325333 ns 0.59
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 618041.5 ns 588354 ns 1.05
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 190160.5 ns 194256.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 503774 ns 512055 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 399083 ns 399208 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215250 ns 288166.5 ns 0.75
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 215125 ns 287875 ns 0.75
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 757209 ns 755750 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43473 ns 43515 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1429246 ns 1420150 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 416375 ns 420292 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80291 ns 81701 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1413062.5 ns 1396437 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 862000 ns 1134500 ns 0.76
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 862479.5 ns 1133416.5 ns 0.76
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2361500 ns 2443791.5 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 244571 ns 250930 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10826132 ns 12447603 ns 0.87
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1747854.5 ns 1797500 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 351066 ns 352383.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 675417 ns 658917 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 661000 ns 647083.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 650312.5 ns 625729 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 662770.5 ns 629562.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191607 ns 202467 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8072960 ns 9193261 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1387000 ns 1344749.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 303135 ns 311273 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2467812.5 ns 2486625 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2472417 ns 2447229 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2475208 ns 2446229 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2490750 ns 2455167 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 976394 ns 999287 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 53431680.5 ns 61254580 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7206917 ns 10164208 ns 0.71
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1432035 ns 1302412 ns 1.10
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32812.5 ns 33437.5 ns 0.98
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34708.5 ns 35145.5 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32791.5 ns 33896 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 792 ns 875 ns 0.91
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15302 ns 15909 ns 0.96
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 79522 ns 84991 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3062.5 ns 3250 ns 0.94
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3416 ns 3083.5 ns 1.11
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3583 ns 3333 ns 1.08
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3166 ns 3041 ns 1.04
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136303 ns 139820.5 ns 0.97
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 339621 ns 335653 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406875 ns 409291 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 402791 ns 408167 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 401958 ns 408916 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420875 ns 420042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42510 ns 43861 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1479206.5 ns 1610692 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1158541.5 ns 1146937.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 239864 ns 241802 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3851125 ns 3890500 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3994000 ns 3991792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3993875 ns 3995938 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3792771 ns 3777541.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 237916 ns 245384 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36230288 ns 40053105 ns 0.90
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11479542 ns 11890208 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1431880.5 ns 1427303 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33442 ns 33956 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1218517.5 ns 1415999 ns 0.86
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 252896 ns 180646 ns 1.40
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 39781 ns 39530 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15417 ns 15583 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15708 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15667 ns 15708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15375 ns 15625 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 251521 ns 256980 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 10228549 ns 9741901 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 874875 ns 867771 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 170133 ns 177356.5 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405000 ns 403959 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221250 ns 295875 ns 0.75
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 220875 ns 295292 ns 0.75
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 761000 ns 760750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112902 ns 113403.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1067286 ns 1056307 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 489792 ns 458041 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89352 ns 89041 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1435479.5 ns 1445458 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 886666.5 ns 1158000 ns 0.77
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 884042 ns 1156604 ns 0.76
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386041 ns 2464729.5 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 234432 ns 241604 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 10695349 ns 12919628 ns 0.83
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1931229 ns 1936541.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 354056 ns 353843 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 584 ns 584 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 584 ns 459 ns 1.27
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25409 ns 26174 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1175972 ns 1343237.5 ns 0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 340875 ns 430334 ns 0.79
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 210769 ns 209062 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7875 ns 7875 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7562 ns 7708 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8083 ns 7625 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 7250 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 205553.5 ns 214822.5 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 29432470.5 ns 28436000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6069083 ns 5825750 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 697433 ns 684816 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 833666 ns 836604 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 468875 ns 618875 ns 0.76
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 472250 ns 620167 ns 0.76
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1541729 ns 1552792 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129094 ns 130046 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 232694 ns 229912 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2700166.5 ns 2694187.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1538208 ns 2000104.5 ns 0.77
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1535458 ns 1999042 ns 0.77
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4928542 ns 4936792 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 259306 ns 251857 ns 1.03
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 841325 ns 837543 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31418.5 ns 32688 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1174154.5 ns 1331487 ns 0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 282250 ns 447625 ns 0.63
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 47321 ns 46711 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6666 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6291 ns 6458 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 6208 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6709 ns 6417 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 219381 ns 232857 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 23531704 ns 24854567 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4812750 ns 5311167 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 367896 ns 359813.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2435625 ns 2405750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2436833 ns 2416666 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2396542 ns 2377375 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2408167 ns 2392666 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196586 ns 201638 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8033275 ns 8402298 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1439500 ns 1416500 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 378696 ns 372683.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4650084 ns 4654167 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4658666 ns 4665479 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4665604 ns 4644229.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4658000 ns 4648583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 886973.5 ns 902404.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 46758472 ns 52065462 ns 0.90
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6718062.5 ns 6861875 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1388836 ns 1391004 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7145.5 ns 6708.5 ns 1.07
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 16833 ns 7208 ns 2.34
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7208 ns 7645.5 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7771 ns 13396 ns 0.58
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22490.5 ns 23661 ns 0.95
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1201659 ns 1330674.5 ns 0.90
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 266125 ns 266208 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 39881 ns 39961 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 67854.5 ns 51604 ns 1.31
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 69000.5 ns 49000 ns 1.41
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 34000 ns 45750 ns 0.74
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 66938 ns 45375 ns 1.48
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 212927 ns 218958 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10377988 ns 11575244 ns 0.90
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2027979 ns 2067250 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 269205 ns 264843 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 21666.5 ns 21396 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24917 ns 25667 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22000 ns 24249.5 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5000 ns 7375 ns 0.68
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16349 ns 17124 ns 0.95
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 83941 ns 84151 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11896 ns 12229 ns 0.97
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9167 ns 10687 ns 0.86
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9625 ns 10229 ns 0.94
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18125 ns 17792 ns 1.02
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 225254.5 ns 229557 ns 0.98
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 371231.5 ns 371578.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406875 ns 406750 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223500 ns 297125 ns 0.75
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 222958 ns 296834 ns 0.75
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762709 ns 762417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 45942 ns 46955 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1367160.5 ns 1453711 ns 0.94
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 430125 ns 484187.5 ns 0.89
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88672 ns 88881 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1430000 ns 1431645.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 893542 ns 1166209 ns 0.77
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 895041 ns 1164750 ns 0.77
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2388417 ns 2471229 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 285687.5 ns 294082.5 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 12748458.5 ns 12353848 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2059583.5 ns 2093020.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 380437 ns 380814 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 434375 ns 434500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 430667 ns 437125 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 430167 ns 437250 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448083 ns 447542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 53382 ns 54894 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 999319 ns 1083139 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1124000 ns 1087416 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 235214 ns 233642 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3915375 ns 3902292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4016417 ns 4012625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4025417 ns 4016541 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3814625.5 ns 3808250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 258018 ns 266487.5 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30663314.5 ns 35233900 ns 0.87
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10573708 ns 10616978.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1363895.5 ns 1364063 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 6917 ns 7667 ns 0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 6875 ns 7667 ns 0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12417 ns 12375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23861 ns 24395 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2097838.5 ns 2388137.5 ns 0.88
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 221750 ns 229041 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210084 ns 209122 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 44750 ns 44875 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45084 ns 45000 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 44958 ns 45000 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45000 ns 45292 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 343533 ns 350021 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 13202058 ns 14581645 ns 0.91
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1729395.5 ns 1777208 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 655242.5 ns 655627 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88604 ns 124000 ns 0.71
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 107854 ns 96270.5 ns 1.12
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88666 ns 86562.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 149354 ns 86958.5 ns 1.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189838 ns 189446 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5846123.5 ns 6078785 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1995729 ns 1983729 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 219674 ns 221122 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2014958 ns 2025375 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026229.5 ns 2011792 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2021687.5 ns 2010229 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026875 ns 2013666.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 525687 ns 536819 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27091285.5 ns 29198754 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9642083.5 ns 9376375 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1042630 ns 967839 ns 1.08

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit a6c4a16 into main Sep 18, 2024
69 of 73 checks passed
@avik-pal avik-pal deleted the ap/up_test branch September 18, 2024 10:27