This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: task switching in AMDGPU complex batched_matmul (#178)
* ci(buildkite): add downstream testing for NeuralOperators
* perf: restore old batched_mul
* fix: disable threading for certain devices
* revert: "perf: restore old batched_mul"

  This reverts commit a8c0f3b.
Showing 3 changed files with 40 additions and 8 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
 | | @@ -1,7 +1,7 @@ |
1 | 1 | name = "LuxLib" |
2 | 2 | uuid = "82251201-b29d-42c6-8e01-566dec8acb11" |
3 | 3 | authors = ["Avik Pal <[email protected]> and contributors"] |
4 | | -version = "1.3.4" |
 | 4 | +version = "1.3.5" |
5 | 5 | |
6 | 6 | [deps] |
7 | 7 | ArrayInterface = "4fba245c-0d91-5ea0-9b3e-6abc04ee57a9" |
877ef96
@JuliaRegistrator register
877ef96
Registration pull request created: JuliaRegistries/General/118080
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:
877ef96
LuxLib Benchmarks
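Each benchmark entry under this heading lists two timings in nanoseconds followed by their ratio. The sketch below assumes the ratio is the first timing divided by the second, which is consistent with the published figures (for example 5000 / 6417 ≈ 0.78), and recomputes it for three entries copied from the results. The `THRESHOLD` value and the `classify` helper are hypothetical and exist only for illustration; they are not part of LuxLib or its CI configuration.

```python
# Hypothetical cut-off for calling a change significant; not taken from LuxLib's CI.
THRESHOLD = 1.2


def ratio(first_ns, second_ns):
    """Ratio as reported in the results: first timing divided by second timing."""
    return first_ns / second_ns


def classify(first_ns, second_ns, threshold=THRESHOLD):
    """Label an entry relative to the (assumed) threshold."""
    r = ratio(first_ns, second_ns)
    if r > threshold:
        return "slower"
    if r < 1 / threshold:
        return "faster"
    return "within noise"


# (first timing, second timing) pairs copied from the benchmark results.
SAMPLES = {
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)": (5000, 6417),
    "dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU": (243040, 396441),
    "layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)": (301375.5, 219479),
}

for name, (first_ns, second_ns) in SAMPLES.items():
    label = classify(first_ns, second_ns)
    print(f"{ratio(first_ns, second_ns):.2f}  {label:<12}  {name}")
```

Many of the small CPU cases run in only a few microseconds, so ratios a little above or below 1 are most likely measurement noise rather than a real change.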

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5000 | 6417 | 0.78 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5125 | 6041 | 0.85 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7375 | 7167 | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4833 | 5292 | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 108327 | 103542 | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 704958 | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 452318 | 637131 | 0.71 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10000 | 10166.5 | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9917 | 9958 | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10229.5 | 10291.5 | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9729.5 | 9979.5 | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 538089 | 494284 | 1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2390625 | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 709441 | 719725 | 0.99 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1792 | 1583 | 1.13 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1792 | 1542 | 1.16 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 2000.5 | 1666 | 1.20 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1584 | 1500 | 1.06 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 19729 | 20684 | 0.95 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 439229 | | |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 33851 | 33302 | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4375 | 3812.5 | 1.15 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3833.5 | 4125 | 0.93 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4250 | 4250 | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3520.5 | 4334 | 0.81 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 134838 | 134278.5 | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 2235354 | | |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 143632.5 | 143062.5 | 1.00 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56375 | 58000 | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46875 | 46417 | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46750 | 46875 | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78375 | 83750 | 0.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36801 | 37449 | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1444229 | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 84285 | 70883 | 1.19 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2037375.5 | 2037500 | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083500 | 2083416.5 | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090334 | 2090916.5 | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1999916 | 1996979.5 | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 215168.5 | 220080 | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 5415625 | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1280705 | 1213928 | 1.06 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148666.5 | 173708 | 0.86 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145833 | 146625 | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 152417 | 165062.5 | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 160792 | 172000 | 0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167254 | 167869.5 | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1500250 | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 172909 | 196051.5 | 0.88 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1133479.5 | 1113854.5 | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1112750 | 1110541 | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1115292 | 1118667 | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1109687.5 | 1124479.5 | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 623047 | 644177 | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10180459 | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1022168 | 899376 | 1.14 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4771 | 5333 | 0.89 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4708 | 4875 | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6666 | 6750 | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4167 | 4416 | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 80121.5 | 83066 | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 1222709 | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 56392.5 | 64020 | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8521 | 8584 | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8542 | 8750 | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 | 8875 | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8542 | 8584 | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 547974 | 552192.5 | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 7799104.5 | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 384758 | 372446 | 1.03 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18062.5 | 17229.5 | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16875 | 17250 | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21625 | 21542 | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17666.5 | 17208.5 | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 62259 | 63166 | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1327729 | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76443 | 79573.5 | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212542 | 220583 | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 217708 | 218875 | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222604.5 | 223125 | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 235416.5 | 219625 | 1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 326680 | 329089 | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5672875 | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 468011 | 423777 | 1.10 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 | 583 | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 | 625 | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 959 | 833 | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 792 | 834 | 0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 18885 | 19066 | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 446167 | | |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 31881 | 27311 | 1.17 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 | 1417 | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1375 | 1417 | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1667 | 1583 | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 | 1375 | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 117120.5 | 116071.5 | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 2151437.5 | | |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 135835 | 118732 | 1.14 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 | 7375 | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 | 6000 | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6083 | 6083 | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10166 | 10334 | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23630 | 24482 | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 838084 | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48897 | 52122 | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220042 | 229541.5 | 0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234750 | 268417 | 0.87 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 270833.5 | 241500 | 1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 253000.5 | 251250 | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 188891 | 189293 | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8581771 | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 612944.5 | 588480 | 1.04 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 | 3917 | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 | 3917 | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 | 3958 | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3958 | 4042 | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23120 | 23660.5 | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 433416 | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47491 | 43502 | 1.09 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16542 | 16833 | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17041 | 16834 | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17167 | 16959 | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16875 | 16666 | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 186342.5 | 188039 | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 2081000 | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 174571.5 | 166010.5 | 1.05 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 919250 | 929291 | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 828041 | 838708 | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 838917 | 841584 | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1258333 | 1269208 | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113235.5 | 113941 | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 452875 | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243040 | 396441 | 0.61 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2556167 | 2610729.5 | 0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2320333.5 | 2330541.5 | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2328916.5 | 2324458 | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3549104.5 | 3478334 | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 229235 | 232093 | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 2156125 | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 739658 | 630643.5 | 1.17 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6084 | 6000 | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5520.5 | 7042 | 0.78 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8354 | 7333.5 | 1.14 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5834 | 6584 | 0.89 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 83528.5 | 82915 | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 1131521 | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 58842 | 62131.5 | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11729.5 | 11875 | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11583 | 11417 | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11479.5 | 12417 | 0.92 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10999.5 | 9813 | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 596279 | 585345.5 | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 7505021 | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 402564 | 388046 | 1.04 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 | 500 | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 | 542 | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 | 542 | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 | 500 | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23594 | 23179.5 | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 436875 | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48301 | 41949 | 1.15 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 | 2083 | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2083 | 2250 | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 | 2167 | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 | 2083 | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 224089.5 | 226220 | 0.99 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2406437.5 | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 182056 | 166171 | 1.10 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8916 | 8583 | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8292 | 8542 | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11209 | 10709 | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8375 | 8833 | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 101414 | 100758 | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 1214500 | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73272.5 | 72575 | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18625 | 17228.5 | 1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17208.5 | 18583 | 0.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18667 | 18500 | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16771 | 17750 | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 555190.5 | 582511 | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5531208.5 | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 379272 | 371318.5 | 1.02 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 458 | 459 | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 | 625 | 0.80 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 | 583 | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 | 500 | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34468 | 34079 | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 654854 | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45552 | 44423 | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9854 | 9479 | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9250 | 9750 | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9458 | 10333 | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8562.5 | 9562.5 | 0.90 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 257386.5 | 262881 | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5553750 | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 366942 | 351422 | 1.04 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396542 | 396583 | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288042 | 288042 | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287541 | 287666 | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756167 | 756167 | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112104 | 112987 | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 519187.5 | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 76352 | 77780.5 | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1409875 | 1455709 | 0.97 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1132584 | 1130291 | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1126791.5 | 1133250 | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2436813 | 2358000 | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 199625 | 202802 | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1712834 | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322335 | 268682 | 1.20 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7083 | 7354.5 | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6874.5 | 8000 | 0.86 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8458 | 8687.5 | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6938 | 7750 | 0.90 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 134438.5 | 137305 | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 1132749.5 | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59441 | 64461 | 0.92 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16563 | 12812.5 | 1.29 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13917 | 15041.5 | 0.93 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16167 | 15353.5 | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15187.5 | 12333.5 | 1.23 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 880177 | 906003 | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 7959042 | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 418702.5 | 413373 | 1.01 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24146 | 26000 | 0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23791.5 | 27562.5 | 0.86 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28250 | 27042 | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24896 | 26021 | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 185908.5 | 186382.5 | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1653167 | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 114524 | 146484 | 0.78 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 152041 | 146500 | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 105395.5 | 157750 | 0.67 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 113125 | 129416 | 0.87 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104979 | 155812.5 | 0.67 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1011252 | 1016426 | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8155875 | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 577332 | 551090 | 1.05 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 79000 | 84667 | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76417 | 80167 | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76833 | 78063 | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 80250 | 80521 | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 190543 | 190829 | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1268166 | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 125494 | 124858.5 | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 301375.5 | 219479 | 1.37 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 295750 | 281750 | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231208 | 278146 | 0.83 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 209499.5 | 320791.5 | 0.65 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1046615 | 1021778 | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9187687.5 | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 689189 | 643542 | 1.07 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13333 | 13125 | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13334 | 13666.5 | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 15062.5 | 14041.5 | 1.07 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12750 | 13459 | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 137754.5 | 136741.5 | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 1170125 | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 233927 | 226473 | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 28270.5 | 27083.5 | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26542 | 26125 | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27166.5 | 27833.5 | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26062 | 26604.5 | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 912323.5 | 919419 | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 7923459 | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 689579 | 633979.5 | 1.09 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 15042 | 14000 | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 14625 | 14708.5 | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 17292 | 17583.5 | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13834 | 14792 | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 119657.5 | 119245 | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 1225791.5 | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239157 | 233827 | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26375 | 26875 | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26208 | 25958.5 | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26375 | 26583 | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26375 | 26541 | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 665016.5 | 676576 | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5755000 | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 674067.5 | 589361.5 | 1.14 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183750 | 182375 | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 181645.5 | 183208 | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 187833 | 185583 | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183666 | 183459 | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101191 | 102955 | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1353021 | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 235596.5 | 232900.5 | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 636291 | 583500 | 1.09 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 594625 | 595083 | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 592062.5 | 597520.5 | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 613458 | 624167 | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 491587 | 493717.5 | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6127021 | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 708249 | 657463 | 1.08 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7375 | 6750 | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8333 | 7645.5 | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9417 | 8167 | 1.15 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7229.5 | 7542 | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 137783 | 135360 | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 1110021 | | |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 57461 | 62767 | 0.92 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14812.5 | 15375 | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14791 | 14917 | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14875 | 16187.5 | 0.92 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 12896 | 15292 | 0.84 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 881205 | 885601 | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 7653313 | | |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399470 | 392428 | 1.02 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6156708 | 6153416.5 | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6375958.5 | 6381624.5 | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6373937.5 | 6371521 | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11907750 | 11926500 | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 347134 | 346494 | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/Metal | 1596208 | | |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 300417.5 | 392843 | 0.76 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19072062.5 | 19117208.5 | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19937292 | 19977084 | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19969000 | 19957021 | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36484084 | 36558729 | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1007983 | 1005649 | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/Metal | 7924354 | | |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1163329 | 1105996 | 1.05 |

Benchmark | Current (ns) | Previous (ns) | Ratio |
---|---|---|---|
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1750 | 1750 | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 | 1834 | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 | 1833 | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1792 | 1834 | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23636 | 23503 | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 431667 | | |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 208896 | 197739 | 1.06 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4792 | 4834 | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4875 | 4958 | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4959 | 4917 | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4833 | 4916 | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 270525.5 | 276337.5 | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2513333 | | |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 618686 | 502208 | 1.23 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 9416.5 ns | 8062.5 ns | 1.17 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7917 ns | 8416 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9625 ns | 9459 ns | 1.02 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7271 ns | 8145.5 ns | 0.89 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 116370.5 ns | 115989 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 1185875 ns | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 68072 ns | 71584 ns | 0.95 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11937.5 ns | 11562.5 ns | 1.03 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10958 ns | 12438 ns | 0.88 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12417 ns | 12541 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11083.5 ns | 12875 ns | 0.86 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 603718 ns | 604320 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5647937.5 ns | | |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 355648 ns | 353160 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22877 ns | 22648 ns | 1.01 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 443875 ns | | |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 46351 ns | 43592 ns | 1.06 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2916 ns | 2917 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3083 ns | 2917 ns | 1.06 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3250 ns | 3041 ns | 1.07 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2958 ns | 3000 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 196283.5 ns | 197848 ns | 0.99 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 2099292 ns | | |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 160444 ns | 146363.5 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14208.5 ns | 14604 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14375 ns | 15458.5 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17521 ns | 15896 ns | 1.10 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14729 ns | 15000.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 116923.5 ns | 117481 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 1146125 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 237206 ns | 236802 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25666 ns | 26500 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25500 ns | 25625 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25875 ns | 26041.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25791 ns | 25958 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 551650 ns | 561217 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5245875 ns | | |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 650325 ns | 566814 ns | 1.15 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4291 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4209 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4208 ns | 1 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4167 ns | 4375 ns | 0.95 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24277 ns | 24363 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 445125 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48561 ns | 44754 ns | 1.09 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 15917 ns | 16250 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16208 ns | 16125 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16250 ns | 16292 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16416 ns | 0.98 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 320460 ns | 321227 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 2478875 ns | | |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 206705 ns | 190786 ns | 1.08 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5625 ns | 5916 ns | 0.95 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5834 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5750 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35140 ns | 34700.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 657000 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 205735 ns | 200434 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20708 ns | 22292 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21146 ns | 21292 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 22208 ns | 21792 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21750 ns | 22208 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 281377 ns | 283315.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5995542 ns | | |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 679901 ns | 598489 ns | 1.14 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58583 ns | 59729 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65083 ns | 64229 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66334 ns | 66833 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51645.5 ns | 50958 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66570 ns | 66908 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/Metal | 14881125 ns | | |
| batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 95562 ns | 115781 ns | 0.83 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 181791.5 ns | 198937.5 ns | 0.91 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 125000 ns | 144625 ns | 0.86 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 149958.5 ns | 167291.5 ns | 0.90 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 310334 ns | 303249.5 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 209829 ns | 208882.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/Metal | 46762875 ns | | |
| batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 579958 ns | 529218 ns | 1.10 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82625 ns | 84291 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80750 ns | 83875 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86292 ns | 88125 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82500 ns | 81562.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192479 ns | 193291 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1995437.5 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 168164 ns | 182771 ns | 0.92 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923792 ns | 1875250 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1884271 ns | 1914792 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1888583 ns | 1928375 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1917291 ns | 1916625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 508617 ns | 505449 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8813959 ns | | |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 923511 ns | 857542 ns | 1.08 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 333 ns | 0.87 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21906 ns | 21535 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 450667 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41861 ns | 36788 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1916 ns | 1834 ns | 1.04 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1834 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 246989 ns | 243998 ns | 1.01 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 2172458.5 ns | | |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 186805 ns | 166221 ns | 1.12 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9979 ns | 11229 ns | 0.89 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8562.5 ns | 9791.5 ns | 0.87 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11458 ns | 11125 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8666.5 ns | 10479.5 ns | 0.83 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 114779 ns | 114440.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 1098750 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 238165 ns | 233386 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9771 ns | 10458 ns | 0.93 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 10250 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10291 ns | 9917 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9604.5 ns | 10145.5 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 492318 ns | 491014 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5055604 ns | | |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 634834 ns | 561274 ns | 1.13 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56541 ns | 58375 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46708 ns | 46917 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46792 ns | 46625 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 77500 ns | 83708 ns | 0.93 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38130.5 ns | 38960 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1203084 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79889 ns | 72876 ns | 1.10 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1937792 ns | 1897625 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1980021 ns | 1964750 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1936541.5 ns | 1985854 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886999.5 ns | 1899833 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 211665 ns | 212091 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11204125 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1008110 ns | 994598 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267979 ns | 266354 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 266375 ns | 269729 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 271000 ns | 271041.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268291.5 ns | 268271 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 193827.5 ns | 193629.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1446458.5 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 282897 ns | 271156 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 675542 ns | 693917 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 673792 ns | 692541 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589042 ns | 687708 ns | 0.86 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 681292 ns | 593833 ns | 1.15 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 994673.5 ns | 991006 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8996396 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 898667.5 ns | 863163 ns | 1.04 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2161437 ns | 2180687.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2211833 ns | 2214917 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2212042 ns | 2212041 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2215687.5 ns | 2208479 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 154115 ns | 154859 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1427083.5 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 406627 ns | 451844.5 ns | 0.90 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5581500 ns | 5453666 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5501104 ns | 5518208 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5517083.5 ns | 5522375 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5264333.5 ns | 5522209 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 937351 ns | 930442 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10010417 ns | | |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1552019 ns | 1495900 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 986917 ns | 999875 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 898250 ns | 913333 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 898500 ns | 912895.5 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1324292 ns | 1334562.5 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46763 ns | 46425 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 458458.5 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243438 ns | 399125 ns | 0.61 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2547916.5 ns | 2620166 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2324625 ns | 2328541 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2333583 ns | 2329395.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3548709 ns | 3468667 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 256534 ns | 247327 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2463833 ns | | |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 770755 ns | 658089 ns | 1.17 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56084 ns | 58083 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46250 ns | 46625 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46542 ns | 46542 ns | 1 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81750 ns | 84000 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27782 ns | 29007 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1193583 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 72909 ns | 73392 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2048500 ns | 2036000 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2090917 ns | 2096916 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2061417 ns | 2092208 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1996958.5 ns | 1992542 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 223774 ns | 225482 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11058874.5 ns | | |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1035585 ns | 1028937.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56458 ns | 58417 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46709 ns | 47208 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47084 ns | 47375 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78584 ns | 83541 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48280 ns | 48550 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1315916.5 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 71380 ns | 71593.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1903125 ns | 1926354.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1963666.5 ns | 1987291 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1961854 ns | 1972375 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1850771 ns | 1890375 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231382 ns | 231977 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9466667 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 913772 ns | 931260 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34209 ns | 33752 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 630896 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48489 ns | 44343 ns | 1.09 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6542 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 7187.5 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7208 ns | 7625 ns | 0.95 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6209 ns | 1.05 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 205122.5 ns | 203191.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5599333 ns | | |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 366869 ns | 350064 ns | 1.05 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32165 ns | 32755 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 385250 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40300 ns | 36558 ns | 1.10 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 3375 ns | 0.85 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3083 ns | 3333 ns | 0.92 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2959 ns | 3000 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3000 ns | 3208 ns | 0.94 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 183941 ns | 185298.5 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1836854.5 ns | | |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 164169.5 ns | 144480 ns | 1.14 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1427166.5 ns | 1465479.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1449750 ns | 1410667 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1417625 ns | 1427770.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1441604 ns | 1410417 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134383 ns | 136084 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2843875 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 355189 ns | 354201 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4996833 ns | 5012687.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5015708 ns | 5023959 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5020625 ns | 5034167 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4981250 ns | 5021667 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 673084.5 ns | 673868 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10662292 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1463829 ns | 1145811 ns | 1.28 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49772312.5 ns | 49876625 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35522417 ns | 35509791 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35489333 ns | 35514916 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96946583 ns | 97103375 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1601690 ns | 1608361 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/Metal | 10627562.5 ns | | |
| batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1042214.5 ns | 1576726 ns | 0.66 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154216458 ns | 154443875 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112301604.5 ns | 112320833.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112218667 ns | 112445042 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 294869708.5 ns | 296071750 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6475752.5 ns | 6483041.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/Metal | 70117375 ns | | |
| batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5557063.5 ns | 6222525 ns | 0.89 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 48417 ns | 48042 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47916 ns | 47667 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 48021 ns | 47916 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47541 ns | 47583 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19924.5 ns | 19626 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 496041 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 25680 ns | 28463 ns | 0.90 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 49792 ns | 50583.5 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50708.5 ns | 50167 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 51209 ns | 51000 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 51458 ns | 50667 ns | 1.02 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 245262 ns | 245482 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 2146500 ns | | |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 146160 ns | 140773 ns | 1.04 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 10209 ns | 8667 ns | 1.18 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8959 ns | 8750 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10750 ns | 11167 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9000 ns | 9666.5 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 118313 ns | 118847 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 1163542 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 237350.5 ns | 237489 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10708 ns | 10791 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10417 ns | 10458 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10833 ns | 10333 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10709 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 582997 ns | 584310 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5755625 ns | | |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 653411 ns | 572469 ns | 1.14 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8417 ns | 9125 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8979 ns | 9896 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11208 ns | 10667 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9875 ns | 9292 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 115767 ns | 115727.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 1146625 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72681 ns | 73908 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14833 ns | 13874.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 14584 ns | 13750 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14979.5 ns | 14333 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14125 ns | 14375.5 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 554958.5 ns | 559680.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5137041 ns | | |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 345660.5 ns | 337060 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 959 ns | 1.00 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 958 ns | 1083 ns | 0.88 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34204.5 ns | 33675 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 638979.5 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 207831 ns | 206546 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8291 ns | 8917 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8541 ns | 8437.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9292 ns | 8791 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 9250 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 223363.5 ns | 225862.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5901875 ns | | |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 657971.5 ns | 576667 ns | 1.14 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23500 ns | 23667 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23542 ns | 23292 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23834 ns | 23813 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23125 ns | 23666 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 20050 ns | 20529 ns | 0.98 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 448583.5 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 188301 ns | 187811 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 53770.5 ns | 53583.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 53042 ns | 52145.5 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 54042 ns | 53584 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 55020.5 ns | 53667 ns | 1.03 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 258832 ns | 260507 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 2415625 ns | | |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 588042 ns | 549086 ns | 1.07 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1448437.5 ns | 1444541.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1438125 ns | 1445459 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1405125 ns | 1414666.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1396021 ns | 1401396 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194395.5 ns | 195236 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2058625 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 346302 ns | 321861 ns | 1.08 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5024812.5 ns | 5007208 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5026125 ns | 5006958 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5011083 ns | 5015812.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5006958 ns | 5020500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 510089 ns | 510108 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9178458 ns | | |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1198365 ns | 1117899 ns | 1.07 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 779661000 ns | 828285625 ns | 0.94 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 541756209 ns | 541921375 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 545828709 ns | 542359625 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1513614750 ns | 1558200021 ns | 0.97 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22673094 ns | 22535776.5 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/GPU/Metal | 107171459 ns | | |
| batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14686436 ns | 12173703 ns | 1.21 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2975273958 ns | 3903695416 ns | 0.76 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 2889890291 ns | 1771980416 ns | 1.63 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1793050500 ns | 1773568584 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4711214375 ns | 5228367459 ns | 0.90 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118916960 ns | 119027931 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/Metal | 2622707250 ns | | |
| batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87900974 ns | 68450588 ns | 1.28 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76541 ns | 75916.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79375 ns | 87437.5 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79167 ns | 84417 ns | 0.94 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85583 ns | 81083 ns | 1.06 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 191949 ns | 192111.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1500104 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 105890.5 ns | 126607 ns | 0.84 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261583.5 ns | 282646 ns | 0.93 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232562.5 ns | 283042 ns | 0.82 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 196625 ns | 236875 ns | 0.83 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 192687.5 ns | 276458 ns | 0.70 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 996248 ns | 995625 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8743333 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 628158 ns | 612404 ns | 1.03 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 198984604 ns | 199947208.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139204167 ns | 139420500 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139144125 ns | 138954958 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 393236834 ns | 389188834 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5825572 ns | 5832800 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/Metal | 33344937.5 ns | | |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3611135.5 ns | 2958637.5 ns | 1.22 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 617564646 ns | 618298396 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 440013042 ns | 439277916 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 438881145.5 ns | 439303895.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1193608916 ns | 1200068000 ns | 0.99 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26745549.5 ns | 26614249.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/Metal | 110179542 ns | | |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21869093 ns | 16011697.5 ns | 1.37 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 7417 ns | 0.95 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6125 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6125 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9833 ns | 10125 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26360.5 ns | 26885 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 873478.5 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46220 ns | 54341 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213416.5 ns | 214083 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232437.5 ns | 232833 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222375 ns | 230000 ns | 0.97 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219250 ns | 207709 ns | 1.06 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 215332 ns | 215596 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8943333 ns | | |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 524234 ns | 546726.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8083 ns | 7417 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8291 ns | 8875.5 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10709 ns | 10750 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8500 ns | 10459 ns | 0.81 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 113094.5 ns | 111291 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 1123895.5 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 70651 ns | 72956 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 7792 ns | 1.14 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 7833.5 ns | 1.14 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8584 ns | 8125 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8375 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 492563 ns | 492517.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5073167 ns | | |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 317437.5 ns | 322723 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 417 ns | 1.10 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 500 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 459 ns | 1.18 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 25048 ns | 25272 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 713958 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 46561 ns | 45194 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10666.5 ns | 9646 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 11479 ns | 9541 ns | 1.20 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 11583 ns | 11104 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10354 ns | 10333 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 244034 ns | 247083 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 6283709 ns | | |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 383588 ns | 383457 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 353416 ns | 351000 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 353792 ns | 354459 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352021 ns | 352250 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 350958 ns | 351625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 22877.5 ns | 23168 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 312208 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 188432 ns | 198701 ns | 0.95 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 793000 ns | 826000 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 807333.5 ns | 820458 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 777437 ns | 822083.5 ns | 0.95 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 830979 ns | 827750 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 218580 ns | 214195.5 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2766209 ns | | |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 604914.5 ns | 578901 ns | 1.04 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5521 ns | 5229.5 ns | 1.06 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5479 ns | 5875 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7396 ns | 6958.5 ns | 1.06 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4166 ns | 4667 ns | 0.89 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17982 ns | 17091 ns | 1.05 |
| batchedmm(16, Bsize=32)/forward/GPU/Metal | 1438291.5 ns | | |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 71380 ns | 74219 ns | 0.96 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 12520.5 ns | 13458.5 ns | 0.93 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11521 ns | 10625 ns | 1.08 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11521 ns | 13041 ns | 0.88 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 18042 ns | 18542 ns | 0.97 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 207562.5 ns | 202239.5 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/GPU/Metal | 5079708 ns | | |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 368113 ns | 330217 ns | 1.11 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 38125 ns | 39833.5 ns | 0.96 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51291.5 ns | 51209 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52584 ns | 52458.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13500 ns | 13459 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20289 ns | 19993 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/GPU/Metal | 4978875 ns | | |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 84681 ns | 99666.5 ns | 0.85 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36896 ns | 38229.5 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 31458 ns | 35125 ns | 0.90 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31958 ns | 34187.5 ns | 0.93 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 66000 ns | 59417 ns | 1.11 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 184469 ns | 178995.5 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/GPU/Metal | 13432687 ns | | |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 412423 ns | 362888 ns | 1.14 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3583 ns | 3500 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3666 ns | 3667 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3958.5 ns | 3833 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3500 ns | 3709 ns | 0.94 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19634 ns | 19015 ns | 1.03 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 458041 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 28900 ns | 29645 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4291 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4375 ns | 4500 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4625 ns | 4458 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4167 ns | 4292 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 197467.5 ns | 194611 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 2168666 ns | | |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 138551.5 ns | 126757 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5208 ns | 5916 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4792 ns | 5062.5 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7250 ns | 6375 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3792 ns | 4625 ns | 0.82 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 142334.5 ns | 138395 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 1171167 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 58781 ns | 65944 ns | 0.89 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 9625 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8833 ns | 8500 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9125 ns | 9333 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 10666 ns | 0.77 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 822603 ns | 807046.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 7665708 ns | | |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 387763.5 ns | 378457 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204042 ns | 207583 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 212000 ns | 209042 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210875 ns | 213208 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200958 ns | 204125 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36985.5 ns | 35332 ns | 1.05 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 853417 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205912 ns | 203930.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 653187.5 ns | 603500 ns | 1.08 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 665958 ns | 623479.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622770.5 ns | 658604.5 ns | 0.95 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 585667 ns | 586375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 260510 ns | 254148 ns | 1.03 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8195083 ns | | |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 799653 ns | 767213 ns | 1.04 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3369291 ns | 3324167 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2332125 ns | 2328667 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2329166 ns | 2334417 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6307167 ns | 6324542 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205325 ns | 206559 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/Metal | 6066541 ns | | |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 212943 ns | 377105 ns | 0.56 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11648041 ns | 11496208.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8330687.5 ns | 8303562.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8348104 ns | 8348416.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21116042 ns | 21193020.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 734131.5 ns | 736080.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/Metal | 26082375 ns | | |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1069061 ns | 2044820.5 ns | 0.52 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4521 ns | 3917 ns | 1.15 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5208 ns | 5292 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7583 ns | 6292 ns | 1.21 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 7125 ns | 0.77 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 132826.5 ns | 129442 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 1175375 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 55421 ns | 57067 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9292 ns | 8500 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8334 ns | 7375 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9562.5 ns | 7833 ns | 1.22 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8604.5 ns | 8291.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 716825.5 ns | 711410 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 7184437.5 ns | | |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 369984 ns | 364581 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 98313 ns | 117312.5 ns | 0.84 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 125521 ns | 101437.5 ns | 1.24 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 100541 ns | 102687.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 103500 ns | 98458.5 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 149399 ns | 149616 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2228333.5 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 182342 ns | 210473 ns | 0.87 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2046104.5 ns | 2008250 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2031250 ns | 2022459 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1985791.5 ns | 2039937.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2021416.5 ns | 2036625 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 674153.5 ns | 661994.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10587167 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1250004 ns | 963831 ns | 1.30 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34188 ns | 33416 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36000 ns | 35459 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35021 ns | 34709 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 833 ns | 750 ns | 1.11 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15860 ns | 15265 ns | 1.04 |
| batchedmm(2, Bsize=4)/forward/GPU/Metal | 553417 ns | | |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 75761 ns | 78737 ns | 0.96 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 3083.5 ns | 3959 ns | 0.78 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3541 ns | 2917 ns | 1.21 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3625 ns | 4708 ns | 0.77 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 3375 ns | 3666 ns | 0.92 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 140010.5 ns | 136137.5 ns | 1.03 |
| batchedmm(2, Bsize=4)/zygote/GPU/Metal | 1942729.5 ns | | |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 353624 ns | 321796.5 ns | 1.10 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7000 ns | 7250 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6041 ns | 6042 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6083 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10042 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35885 ns | 34970 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 854042 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50330 ns | 56516 ns | 0.89 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223104 ns | 221584 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234125 ns | 220959 ns | 1.06 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221250 ns | 234583 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215667 ns | 207333 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 243422 ns | 237194 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8021021 ns | | |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 512516 ns | 540189 ns | 0.95 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3833 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3958 ns | 0.94 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22271.5 ns | 21681 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 468292 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 43460 ns | 39383 ns | 1.10 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14167 ns | 14458 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14541 ns | 14458 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14583 ns | 14541 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14500 ns | 14625 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 303531 ns | 297631.5 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 2253708.5 ns | | |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 200012.5 ns | 190215 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 99083 ns | 129834 ns | 0.76 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 128333.5 ns | 118271 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 103812 ns | 106750 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 103958.5 ns | 101666.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 150020 ns | 150106 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2875583 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 195772 ns | 241781 ns | 0.81 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1887875.5 ns | 1921708.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1929042 ns | 1924583 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1884833 ns | 1932000 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1894729 ns | 1922750 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 670688 ns | 653385 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10463500 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1065452 ns | 928325 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18959 ns | 18875 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17354.5 ns | 17292 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22208 ns | 20937 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17541.5 ns | 18459 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104525.5 ns | 104073.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1362312.5 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79351 ns | 91301 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 252250 ns | 239083.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 260833 ns | 224791 ns | 1.16 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219458 ns | 224958.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257937 ns | 218500 ns | 1.18 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 495429 ns | 493640.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6195583 ns | | |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 462125 ns | 439080 ns | 1.05 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24958.5 ns | 26166 ns | 0.95 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 32604.5 ns | 29167 ns | 1.12 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 27500 ns | 28958 ns | 0.95 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1208 ns | 1416 ns | 0.85 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16021 ns | 15781 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/GPU/Metal | 533959 ns | | |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 80071 ns | 72756 ns | 1.10 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5250 ns | 6208 ns | 0.85 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5854.5 ns | 5041 ns | 1.16 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5792 ns | 6875 ns | 0.84 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 6125 ns | 6417 ns | 0.95 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 201439.5 ns | 199155.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/Metal | 2014541.5 ns | | |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 376235 ns | 324216 ns | 1.16 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221583 ns | 221875 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222541.5 ns | 223375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 226291 ns | 225375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221875 ns | 223542 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 219232.5 ns | 216803 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1686583 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 271454 ns | 267771 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 559604 ns | 508542 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 548354 ns | 511042 ns | 1.07 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 500083.5 ns | 509500 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 498250 ns | 557354 ns | 0.89 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1034159 ns | 1017707.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8587229 ns | | |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 850955.5 ns | 811461 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19625 ns | 19104 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19313 ns | 19584 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23208 ns | 22063 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20583 ns | 19792 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111518.5 ns | 111072 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1475625 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 80186 ns | 90009 ns | 0.89 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215020.5 ns | 221854 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 250333 ns | 220250 ns | 1.14 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214500 ns | 218166.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221729.5 ns | 220146 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 708936 ns | 700847.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7292833 ns | | |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 539977 ns | 494855 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6166 ns | 6292 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6479 ns | 7000 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8042 ns | 7375 ns | 1.09 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6417 ns | 6834 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 133623 ns | 130925 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 1170916 ns | | |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 66921 ns | 63498 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12250 ns | 11041.5 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11729.5 ns | 9959 ns | 1.18 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13334 ns | 10895.5 ns | 1.22 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11645.5 ns | 10459 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 771416.5 ns | 770540.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 7239334 ns | | |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 391255 ns | 375452 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4500 ns | 4104 ns | 1.10 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5041.5 ns | 7041 ns | 0.72 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7166 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5500 ns | 6166 ns | 0.89 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 134989.5 ns | 131485.5 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 1146875 ns | | |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 58260 ns | 62607 ns | 0.93 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7416.5 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7750 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8125 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7709 ns | 8083 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 738275 ns | 737449 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 7536771 ns | | |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 386245 ns | 380902 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14664541 ns | 14481917 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10093041 ns | 10107542 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10106791 ns | 10094750 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27704625 ns | 27859959 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 529053 ns | 533975 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/Metal | 22466021 ns | | |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 401266 ns | 867906.5 ns | 0.46 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46793583 ns | 46387667 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33459958.5 ns | 33363354 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33523667 ns | 33478875 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85429125 ns | 85752792 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2854223 ns | 2651799 ns | 1.08 |
| batchedmm(128, Bsize=512)/zygote/GPU/Metal | 89341312.5 ns | | |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3309294 ns | 5191497.5 ns | 0.64 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 188000 ns | 185208.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 186250 ns | 185916 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 188667 ns | 188604 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185938 ns | 187271 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101713 ns | 117719.5 ns | 0.86 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1484500 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 235268 ns | 236051 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 641812.5 ns | 634875 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 636958 ns | 627937.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589208 ns | 601166 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 591771 ns | 587625 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 704450.5 ns | 694993 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7517417 ns | | |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 785986 ns | 698169.5 ns | 1.13 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 750 ns | 625 ns | 1.20 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 584 ns | 1.28 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 584 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32067 ns | 31826 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 651375 ns | | |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47241 ns | 48104.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9979 ns | 9541 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11521 ns | 9687.5 ns | 1.19 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10188 ns | 10542 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 10938 ns | 0.87 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 276358.5 ns | 276120 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5875459 ns | | |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 374075 ns | 371078 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26291 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26291 ns | 26333 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26500 ns | 26583 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26209 ns | 26458 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23479 ns | 22942 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 437083 ns | | |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 210433 ns | 206526 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67042 ns | 67125 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 68833 ns | 67333 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68917 ns | 68792 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67583 ns | 66875 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 274089 ns | 273858 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 2210459 ns | | |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 606899 ns | 554115 ns | 1.10 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204500 ns | 207166 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210417 ns | 211667 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211125 ns | 211167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200125 ns | 202875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27585 ns | 27563 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 861208 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205157.5 ns | 206546 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 652542 ns | 609937.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 671541 ns | 669750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 624208 ns | 664812.5 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 580625 ns | 609042 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 236486 ns | 233231.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9239500 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 837472 ns | 798562 ns | 1.05 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 650083 ns | 664875 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 650625 ns | 636687.5 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 550709 ns | 648791.5 ns | 0.85 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 652708 ns | 629792 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 186884 ns | 185894.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1405750 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 234974 ns | 349393 ns | 0.67 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2244125 ns | 2244229 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2249625 ns | 2225354 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2253687.5 ns | 2256708 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2232292 ns | 2271792 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 908141 ns | 900927 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9610291 ns | | |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1356860 ns | 1235829 ns | 1.10 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19479 ns | 19333 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20020.5 ns | 21166.5 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22000 ns | 22375 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20500 ns | 19958 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107405.5 ns | 106770.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1497959 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 82031 ns | 89387 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259687.5 ns | 227250 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234896 ns | 262312.5 ns | 0.90 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223354.5 ns | 231250 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222104 ns | 222770.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 701938 ns | 700957 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7694083.5 ns | | |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 552123 ns | 516550 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 750 ns | 584 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 584 ns | 1.28 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 667 ns | 584 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 22889 ns | 22928 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 713250.5 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47681 ns | 44243 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10833 ns | 9583 ns | 1.13 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 11458 ns | 9958.5 ns | 1.15 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10958 ns | 13229.5 ns | 0.83 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 11333 ns | 10875 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 258094.5 ns | 258192 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6601250 ns | | |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 398396 ns | 395479 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8021 ns | 8062.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7916.5 ns | 9208 ns | 0.86 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10479 ns | 10459 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7771 ns | 8333 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 114650.5 ns | 112863.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 1128833 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67611 ns | 72315 ns | 0.93 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 7500 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9459 ns | 7750 ns | 1.22 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9334 ns | 14875 ns | 0.63 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10083 ns | 8917 ns | 1.13 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 474110.5 ns | 472419 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4853125 ns | | |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 322085 ns | 321811 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2104.5 ns | 1979.5 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2375 ns | 2500 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2667 ns | 2542 ns | 1.05 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2125 ns | 2416 ns | 0.88 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 19503 ns | 19845 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 435896 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 189822 ns | 191508 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 7666.5 ns | 6666 ns | 1.15 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 7083 ns | 6459 ns | 1.10 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 7771 ns | 7292 ns | 1.07 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 8417 ns | 7292 ns | 1.15 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 209638.5 ns | 208409 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 2304438 ns | | |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 579508 ns | 543621 ns | 1.07 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 749167 ns | 754167 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 749833.5 ns | 751000 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747292 ns | 749375 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748521 ns | 747104 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 22733 ns | 22303 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 312604 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 37375.5 ns | 47829 ns | 0.78 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 778000 ns | 792250 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 807229 ns | 811750 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 774167 ns | 789500 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 776625 ns | 794229.5 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 207826 ns | 206590.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2597208 ns | | |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 220633 ns | 233541 ns | 0.94 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7250 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5917 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6000 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10209 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32931 ns | 32976 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 855708.5 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50540 ns | 57267 ns | 0.88 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 262833 ns | 228458.5 ns | 1.15 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 263396 ns | 269270.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229333 ns | 235021 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212854 ns | 213146 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 255573 ns | 254662 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8358834 ns | | |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 524047.5 ns | 552652 ns | 0.95 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12083 ns | 12417 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11959 ns | 13250 ns | 0.90 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13583 ns | 14458 ns | 0.94 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12771 ns | 13000 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 132456 ns | 131273.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 1189125 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 233113 ns | 231363 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25021 ns | 24854.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25500 ns | 24916 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25458 ns | 25542 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24792 ns | 24458 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 815326 ns | 813324 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 7701292 ns | | |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 681611 ns | 634495 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9562.5 ns | 8875 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9833 ns | 9958 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12000 ns | 11167 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9541.5 ns | 9542 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 118599 ns | 116553 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 1229416 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74341 ns | 74930 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14375 ns | 13770.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20917 ns | 14917 ns | 1.40 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17250 ns | 15916 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15562.5 ns | 16437.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 626256 ns | 621843 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5717062 ns | | |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 368145 ns | 356836 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9270.5 ns | 9145.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9208 ns | 9354 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11042 ns | 10750 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9145.5 ns | 10125 ns | 0.90 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 117653 ns | 116468 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 1158958 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73341 ns | 74383.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14062.5 ns | 12916 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 15125 ns | 12959 ns | 1.17 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 15125 ns | 20541 ns | 0.74 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 15146 ns | 14500 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 518369.5 ns | 515709 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5051833 ns | | |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 340775 ns | 328534 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 27708 ns | 31062 ns | 0.89 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33875 ns | 33146 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31792 ns | 30750 ns | 1.03 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2229.5 ns | 1833 ns | 1.22 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16522 ns | 16169 ns | 1.02 |
| batchedmm(2, Bsize=128)/forward/GPU/Metal | 4854041.5 ns | | |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 78412 ns | 77564 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5583 ns | 5562.5 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5917 ns | 5312.5 ns | 1.11 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 6084 ns | 7208 ns | 0.84 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 7770.5 ns | 7834 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 136257 ns | 134922 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/GPU/Metal | 13273333 ns | | |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 379326 ns | 340125 ns | 1.12 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24751 ns | 24307 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 682541.5 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48791 ns | 45845 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7520.5 ns | 6166.5 ns | 1.22 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8583 ns | 6708 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8625 ns | 8167 ns | 1.06 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7458.5 ns | 7083 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 181857 ns | 179926.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6285375 ns | | |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 389326 ns | 372385.5 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 5834 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6208 ns | 5833 ns | 1.06 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5875 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 5958 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25394 ns | 25187 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 714417 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 207474 ns | 201636 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26375 ns | 21041 ns | 1.25 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 23250 ns | 21709 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21459 ns | 23458 ns | 0.91 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 20250 ns | 26125 ns | 0.78 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262619.5 ns | 262884 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6644125 ns | | |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 695681 ns | 615780.5 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 145625 ns | 192083.5 ns | 0.76 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 178292 ns | 158917 ns | 1.12 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150417 ns | 154416.5 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 153812.5 ns | 146417 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188204 ns | 184640 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1588584 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 190633 ns | 215472.5 ns | 0.88 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1345771 ns | 1319792 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1331542 ns | 1328249.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1322333.5 ns | 1347250 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1167354 ns | 1337000 ns | 0.87 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 856737 ns | 844907 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9165250 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 997975 ns | 1041340 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24250 ns | 24292 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24458.5 ns | 24916 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27084 ns | 28000 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24417 ns | 24833.5 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 225455 ns | 224694.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1705354 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115742 ns | 130334 ns | 0.89 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 127500 ns | 117583 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 174187 ns | 131375 ns | 1.33 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 119042 ns | 160499.5 ns | 0.74 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130375 ns | 164750 ns | 0.79 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 984493 ns | 967206 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8679292 ns | | |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 591319 ns | 585053 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 250 ns | 1.16 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22641 ns | 22932 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 689208 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47290 ns | 47870 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7083.5 ns | 6292 ns | 1.13 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 6833 ns | 1.18 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958 ns | 9416 ns | 0.74 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 7500 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 197931.5 ns | 196587.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6549187.5 ns | | |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 395326.5 ns | 380031 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6333.5 ns | 5875 ns | 1.08 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5708 ns | 6292 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7541 ns | 7187.5 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6000 ns | 6562 ns | 0.91 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 137058.5 ns | 134586 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 1181916.5 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 232733 ns | 230170 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10833.5 ns | 9833 ns | 1.10 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10583 ns | 10000 ns | 1.06 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10416 ns | 11187.5 ns | 0.93 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9792 ns | 11083 ns | 0.88 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 841858 ns | 840176 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 8090729 ns | | |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 672580 ns | 631290 ns | 1.07 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1542 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1584 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22927 ns | 22272 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 429250 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 208003 ns | 204933 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5917 ns | 5750 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6375 ns | 6125 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6125 ns | 6417 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5750 ns | 5875 ns | 0.98 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 217549 ns | 216977 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 2167125 ns | | |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 581914.5 ns | 491814.5 ns | 1.18 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8562 ns | 8250 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8458 ns | 8562.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10291.5 ns | 9895.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8229.5 ns | 9209 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 116906 ns | 115063 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 1209583 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 77271.5 ns | 73999 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9104.5 ns | 8167 ns | 1.11 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 15417 ns | 9250 ns | 1.67 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8792 ns | 9833.5 ns | 0.89 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8084 ns | 10333 ns | 0.78 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 557267.5 ns | 548589 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5634417 ns | | |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 344656 ns | 340367 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125125 ns | 127271 ns | 0.98 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 130729 ns | 128750 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130250 ns | 131062 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 181042 ns | 181979.5 ns | 0.99 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46296.5 ns | 46303.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/GPU/Metal | 364354 ns | | |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 100232 ns | 102121 ns | 0.98 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 309333 ns | 338125 ns | 0.91 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 342125 ns | 339792 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 313833 ns | 346083 ns | 0.91 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 570709 ns | 595417 ns | 0.96 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 185266 ns | 181951 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/Metal | 1373875 ns | | |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 506148 ns | 410627.5 ns | 1.23 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396437.5 ns | 397708 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 289000 ns | 288375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288375 ns | 287937.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756250 ns | 756708 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43482.5 ns | 43092 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 434458 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 79761 ns | 85671 ns | 0.93 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1408916.5 ns | 1456291.5 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1136979 ns | 1133125 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1132062 ns | 1127937.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2443000.5 ns | 2360208 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 248184 ns | 248595.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1965375 ns | | |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 349476 ns | 266317 ns | 1.31 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 645500 ns | 643479.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 650562.5 ns | 654166 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 546541.5 ns | 652750 ns | 0.84 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 545645.5 ns | 650625 ns | 0.84 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 173484 ns | 172424.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1350375 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242424 ns | 315089 ns | 0.77 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2520666.5 ns | 2449417 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2473750 ns | 2455020.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2447792 ns | 2465625 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2452584 ns | 2469208.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 937381.5 ns | 922065 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10132041 ns | | |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1450713 ns | 1363193.5 ns | 1.06 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 30500 ns | 32917 ns | 0.93 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36187.5 ns | 35374.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34146 ns | 34417 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 958 ns | 1000 ns | 0.96 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15458 ns | 15534 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/GPU/Metal | 1293854 ns | | |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 71001 ns | 78366 ns | 0.91 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3084 ns | 2937.5 ns | 1.05 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3958 ns | 3375 ns | 1.17 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3333 ns | 5208 ns | 0.64 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3042 ns | 4625 ns | 0.66 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 135380 ns | 133935.5 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/GPU/Metal | 5260562.5 ns | | |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 340585.5 ns | 318886 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1460666 ns | 1464209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1503375 ns | 1500333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1503000 ns | 1501333 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1441729 ns | 1442563 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41871 ns | 41738 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1242250 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 239254 ns | 318625 ns | 0.75 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5151979 ns | 5128625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5296833.5 ns | 5291041 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5285437.5 ns | 5297084 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4980042 ns | 4998791.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230225 ns | 230499.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11359208.5 ns | | |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1233400 ns | 1198280 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3709
ns3709
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3750
ns3750
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3750
ns3750
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3750
ns3916
ns0.96
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
33654
ns33583
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal
352750
nsdense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU
39741
ns36778.5
ns1.08
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15041
ns15417
ns0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15709
ns15500
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15500
ns15791
ns0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15375
ns16000
ns0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
251748
ns252278
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal
1635667
nsdense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU
165632
ns161662
ns1.02
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
401812.5
ns404625
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
296666
ns296000
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
295167
ns295916
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
760709
ns760625
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113125
ns113161.5
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal
574187
nsdense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU
87471
ns95859
ns0.91
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1429500
ns1479249.5
ns0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1159833
ns1158584
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1157541
ns1160500
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2466395.5
ns2383354
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
235512
ns228888
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal
1507125
nsdense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU
353405
ns265922
ns1.33
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
959
ns958
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1083
ns1042
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1042
ns1042
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
958
ns1083
ns0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
24950
ns24404
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
692770.5
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
208254
ns207859
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7917
ns7917
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9916
ns8542
ns1.16
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8583
ns9917
ns0.87
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8042
ns12895.5
ns0.62
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
202658.5
ns202191
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
6448187.5
nsbatchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
697032
ns620871
ns1.12
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
831021
ns835834
ns0.99
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
619667
ns615542
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
618250
ns617791.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
1541417
ns1549375
ns0.99
batchedmm(128, Bsize=32)/forward/GPU/CUDA
131643
ns130350.5
ns1.01
batchedmm(128, Bsize=32)/forward/GPU/Metal
1716917
nsbatchedmm(128, Bsize=32)/forward/GPU/AMDGPU
166023
ns215532
ns0.77
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
2699312.5
ns2690375
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1995500
ns2000479.5
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1985791
ns2007416.5
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
4946958
ns4941104
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
234057
ns232712
ns1.01
batchedmm(128, Bsize=32)/zygote/GPU/Metal
6761458
nsbatchedmm(128, Bsize=32)/zygote/GPU/AMDGPU
852834
ns872871.5
ns0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
291
ns291
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
292
ns375
ns0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
32746
ns31625
ns1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
642249.5
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
47461
ns47950
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6208
ns6084
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9334
ns6708
ns1.39
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6708
ns7666
ns0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6229
ns8083
ns0.77
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
223155
ns221856.5
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
6000375
nsbatchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
361916
ns352319
ns1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1731292
ns1741791.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1754791
ns1752167
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1728874.5
ns1739042
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1745562.5
ns1719916
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
190073
ns183055.5
ns1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1502437.5
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
353886
ns415606.5
ns0.85
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4404625
ns4361125
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4422041
ns4365916.5
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4362625
ns4399333
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4346521
ns4394333
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
855907
ns827645.5
ns1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9512792
nslayernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1246280
ns1239667.5
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6875
ns7083
ns0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
17395.5
ns7395.5
ns2.35
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7250
ns7041
ns1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6834
ns6854.5
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
22751
ns22223.5
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal
272959
nsbias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU
37041
ns47178
ns0.79
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
33000
ns45292
ns0.73
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
68979.5
ns51167
ns1.35
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
33333
ns49250
ns0.68
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
45500
ns49437
ns0.92
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
212527.5
ns204846
ns1.04
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal
2608042
nsbias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU
221728.5
ns235841
ns0.94
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
23417
ns22125
ns1.06
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
25542
ns25125
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
23312.5
ns24833
ns0.94
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
5625
ns5458.5
ns1.03
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18456
ns17859
ns1.03
batchedmm(2, Bsize=512)/forward/GPU/Metal
14791020.5
nsbatchedmm(2, Bsize=512)/forward/GPU/AMDGPU
89826.5
ns82154
ns1.09
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
11917
ns11792
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
11125
ns10750
ns1.03
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
10625
ns12583
ns0.84
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
17958
ns19708.5
ns0.91
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
223372.5
ns216235
ns1.03
batchedmm(2, Bsize=512)/zygote/GPU/Metal
45999500
nsbatchedmm(2, Bsize=512)/zygote/GPU/AMDGPU
382947
ns331099
ns1.16
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
403917
ns406250
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
297500
ns297333
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
297375
ns296833.5
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
762334
ns762833
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
47041
ns46303.5
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal
533542
nsdense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU
89431
ns97252
ns0.92
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1426250
ns1477458
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
1164625
ns1164395.5
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1163125
ns1164416
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2468250
ns2386333
ns1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
281846
ns268961
ns1.05
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal
2244750
nsdense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU
378111.5
ns282959
ns1.34
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1487625
ns1488416
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1529979.5
ns1526958
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1529729.5
ns1529250
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1464667
ns1466395.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
54740
ns52650
ns1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1143667
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
235424
ns326982
ns0.72
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5146979
ns5119459
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5286395.5
ns5285084
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5251625
ns5297709
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4982541.5
ns4955208
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
258236.5
ns250192
ns1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
10236958
nsbatchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1218755
ns1186136
ns1.03
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
28375
ns28292
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
28125
ns28292
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
28250
ns28333
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
28333
ns28417
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24960
ns23514.5
ns1.06
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal
430583
nsdense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU
212483
ns207227
ns1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66375
ns66542
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66542
ns66750
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
67000
ns66500
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
66584
ns66208
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
344216.5
ns333506.5
ns1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal
2732875
nsdense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
652061
ns576948.5
ns1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
84500
ns124875
ns0.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
93000
ns81875
ns1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
85541
ns89166
ns0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
81042
ns86750
ns0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
190669
ns191648
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
2029208
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
183273
ns233116
ns0.79
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2023313
ns2025145.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2010958
ns2021978.5
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1979291.5
ns2030542
ns0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1995645.5
ns1995125
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
520209.5
ns506195
ns1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
9143521
nsgroupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1082408
ns881973
ns1.23
This comment was automatically generated by workflow using github-action-benchmark.