From 3db4431d0514d095e4f27a5e38d0f0a210dc4ae6 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Wed, 22 Mar 2023 09:23:22 +0100
Subject: [PATCH 1/5] docs for MultiHeadAttention

---
 .DS_Store                                    | Bin 0 -> 6148 bytes
 docs/.DS_Store                               | Bin 0 -> 6148 bytes
 docs/src/models/layers.md                    | 11 ++++++++-
 docs/src/models/nnlib.md                     | 24 ++++++++++++++-----
 docs/src/tutorials/2021-10-08-dcgan-mnist.md |  2 +-
 5 files changed, 29 insertions(+), 8 deletions(-)
 create mode 100644 .DS_Store
 create mode 100644 docs/.DS_Store

diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..b00850a7bb12ed26254a7dff76e4d40f362e727c
GIT binary patch
literal 6148
zcmeHK%}T>S5Z-O0O({YSDjpZS7EDVo#Y>3#0!H+pQWH}&7_+5G&7l->)EDwmd>&_Z
zH)1ho5jz9B-~8@oKgj+t#<;(T2aGw4F&i2pN2Ni~-56?_WJHc*L}fmUQW=5$Zeo8O
z@Y^jGGs*5*{{8!-S(@a9>wfT7+uGXh*d4oP-v*De42rN=uS4kGj(wB=Y&Q)!o19sQ$PMrPaa^Md8!d>;nay)eVVmunH
zR$Y7V@aXt_@{~Sj@>SExfo&x_25Wc&4?FdQL

literal 0
HcmV?d00001

diff --git a/docs/.DS_Store b/docs/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..915d5752021f0d258460c94df0d22b0d7ef28ee6
GIT binary patch
literal 6148
zcmeHKPfNov6i@cYbqt{g6^{Y01G}-y@KWmh0#@{(GFv*dSevo7?l1;D>KF2(_<4LU
zNnyi!6>;xD@_TuIlI91^OBiF^E205oHe<|!hR9K=5j5Aj8YUQ#t2v@@na#sQhBedt
zO%r~5n}w`kF-zFy_kV=*B+hcj`Q(jetG(B;I#$oR_n+j_&x5?kykK#Qqbn(su+oF@
zI-V`3_QAPK^B_)V3zZN@GYGl6iPK0fJz1ntrgDAlu)0=vY9Fmu183M5&blvFESiw6e8~b_n7il8XN3d2IRU{!XKnxHA#K5jH
zU@C&u+f@Q+-^2hh@FN3wJ_u-tuEA2HIy#`k>ofXWh$x`rTLMuSbPbjo!2`l|Dxgl~
z=83^|I@pDYa}Aanbvol}WthjTTs>a6S{>{{g){DIq@EZc2DTY!>YJaA|EH&aR
SXjkcgbP-U5P)7{>0s~(u(@IkS

literal 0
HcmV?d00001

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index c0e1c57307..b4667e2ef3 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -10,7 +10,7 @@ The `Dense` exemplifies several features:
 
 * It take an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size.
  Flux has [many such functions](@ref man-init-funcs) built-in. All make a CPU array, moved later with [`gpu`](@ref Flux.gpu) if desired.
-* The bias vector is always intialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
+* The bias vector is always initialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
 
 * It is annotated with [`@functor`](@ref Functors.@functor), which means that [`params`](@ref Flux.params) will see the contents, and [`gpu`](@ref Flux.gpu) will move their arrays to the GPU.
 
@@ -54,6 +54,15 @@ SamePad
 Flux.flatten
 ```
 
+## MultiHeadAttention
+
+The basic blocks needed to implement [Transformer](https://arxiv.org/abs/1706.03762) architectures. See also the functional counterparts
+documented in NNlib's [Attention](@ref) section.
+
+```@docs
+MultiHeadAttention
+```
+
 ### Pooling
 
 These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

diff --git a/docs/src/models/nnlib.md b/docs/src/models/nnlib.md
index 72b8481f56..cf2618eb97 100644
--- a/docs/src/models/nnlib.md
+++ b/docs/src/models/nnlib.md
@@ -2,9 +2,20 @@
 
 Flux re-exports all of the functions exported by the [NNlib](https://github.com/FluxML/NNlib.jl) package. This includes activation functions, described on [their own page](@ref man-activations). Many of the functions on this page exist primarily as the internal implementation of Flux layer, but can also be used independently.
+
+## Attention
+
+Primitives for the [`MultiHeadAttention`](@ref) layer.
+
+```@docs
+NNlib.dot_product_attention
+NNlib.dot_product_attention_scores
+NNlib.make_causal_mask
+```
+
 ## Softmax
 
-`Flux`'s `logitcrossentropy` uses `NNlib.softmax` internally.
+`Flux`'s [`logitcrossentropy`](@ref) uses [`NNlib.logsoftmax`](@ref) internally.
 ```@docs
 softmax
@@ -13,7 +24,8 @@ logsoftmax
 
 ## Pooling
 
-`Flux`'s `AdaptiveMaxPool`, `AdaptiveMeanPool`, `GlobalMaxPool`, `GlobalMeanPool`, `MaxPool`, and `MeanPool` use `NNlib.PoolDims`, `NNlib.maxpool`, and `NNlib.meanpool` as their backend.
+`Flux`'s [`AdaptiveMaxPool`](@ref), [`AdaptiveMeanPool`](@ref), [`GlobalMaxPool`](@ref), [`GlobalMeanPool`](@ref),
+[`MaxPool`](@ref), and [`MeanPool`](@ref) use [`NNlib.PoolDims`](@ref), [`NNlib.maxpool`](@ref), and [`NNlib.meanpool`](@ref) as their backend.
 
 ```@docs
 PoolDims
@@ -32,7 +44,7 @@ pad_zeros
 
 ## Convolution
 
-`Flux`'s `Conv` and `CrossCor` layers use `NNlib.DenseConvDims` and `NNlib.conv` internally.
+`Flux`'s [`Conv`](@ref) and [`CrossCor`](@ref) layers use [`NNlib.DenseConvDims`](@ref) and [`NNlib.conv`](@ref) internally.
 
 ```@docs
 conv
@@ -44,7 +56,7 @@ DenseConvDims
 
 ## Upsampling
 
-`Flux`'s `Upsample` layer uses `NNlib.upsample_nearest`, `NNlib.upsample_bilinear`, and `NNlib.upsample_trilinear` as its backend. Additionally, `Flux`'s `PixelShuffle` layer uses `NNlib.pixel_shuffle` as its backend.
+`Flux`'s [`Upsample`](@ref) layer uses [`NNlib.upsample_nearest`](@ref), [`NNlib.upsample_bilinear`](@ref), and [`NNlib.upsample_trilinear`](@ref) as its backend. Additionally, `Flux`'s [`PixelShuffle`](@ref) layer uses [`NNlib.pixel_shuffle`](@ref) as its backend.
 
 ```@docs
 upsample_nearest
@@ -60,7 +72,7 @@ pixel_shuffle
 
 ## Batched Operations
 
-`Flux`'s `Bilinear` layer uses `NNlib.batched_mul` internally.
+`Flux`'s [`Bilinear`](@ref) layer uses [`NNlib.batched_mul`](@ref) internally.
 
 ```@docs
 batched_mul
@@ -72,7 +84,7 @@ batched_vec
 
 ## Gather and Scatter
 
-`Flux`'s `Embedding` layer uses `NNlib.gather` as its backend.
+`Flux`'s [`Embedding`](@ref) layer uses [`NNlib.gather`](@ref) as its backend.
 ```@docs
 NNlib.gather

diff --git a/docs/src/tutorials/2021-10-08-dcgan-mnist.md b/docs/src/tutorials/2021-10-08-dcgan-mnist.md
index f56d47d52f..4da32e5f2c 100644
--- a/docs/src/tutorials/2021-10-08-dcgan-mnist.md
+++ b/docs/src/tutorials/2021-10-08-dcgan-mnist.md
@@ -101,7 +101,7 @@ We will be using the [relu](https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib
 We will also apply the weight initialization method mentioned in the original DCGAN paper.
 
 ```julia
-# Function for intializing the model weights with values
+# Function for initializing the model weights with values
 # sampled from a Gaussian distribution with μ=0 and σ=0.02
 dcgan_init(shape...) = randn(Float32, shape) * 0.02f0
 ```

From a43835ab0a52f9428238d47e9fec22b735c73593 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Wed, 22 Mar 2023 09:24:28 +0100
Subject: [PATCH 2/5] cleanup

---
 .DS_Store      | Bin 6148 -> 0 bytes
 .gitignore     |   1 +
 docs/.DS_Store | Bin 6148 -> 0 bytes
 3 files changed, 1 insertion(+)
 delete mode 100644 .DS_Store
 delete mode 100644 docs/.DS_Store

diff --git a/.DS_Store b/.DS_Store
deleted file mode 100644
index b00850a7bb12ed26254a7dff76e4d40f362e727c..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 6148
zcmeHK%}T>S5Z-O0O({YSDjpZS7EDVo#Y>3#0!H+pQWH}&7_+5G&7l->)EDwmd>&_Z
zH)1ho5jz9B-~8@oKgj+t#<;(T2aGw4F&i2pN2Ni~-56?_WJHc*L}fmUQW=5$Zeo8O
z@Y^jGGs*5*{{8!-S(@a9>wfT7+uGXh*d4oP-v*De42rN=uS4kGj(wB=Y&Q)!o19sQ$PMrPaa^Md8!d>;nay)eVVmunH
zR$Y7V@aXt_@{~Sj@>SExfo&x_25Wc&4?FdQL

diff --git a/.gitignore b/.gitignore
index 45b845a41b..ccb9aaf97f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,3 +8,4 @@ deps
 .vscode
 Manifest.toml
 LocalPreferences.toml
+.DS_Store

diff --git a/docs/.DS_Store b/docs/.DS_Store
deleted file mode 100644
index 915d5752021f0d258460c94df0d22b0d7ef28ee6..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 6148
zcmeHKPfNov6i@cYbqt{g6^{Y01G}-y@KWmh0#@{(GFv*dSevo7?l1;D>KF2(_<4LU
zNnyi!6>;xD@_TuIlI91^OBiF^E205oHe<|!hR9K=5j5Aj8YUQ#t2v@@na#sQhBedt
zO%r~5n}w`kF-zFy_kV=*B+hcj`Q(jetG(B;I#$oR_n+j_&x5?kykK#Qqbn(su+oF@
zI-V`3_QAPK^B_)V3zZN@GYGl6iPK0fJz1ntrgDAlu)0=vY9Fmu183M5&blvFESiw6e8~b_n7il8XN3d2IRU{!XKnxHA#K5jH
zU@C&u+f@Q+-^2hh@FN3wJ_u-tuEA2HIy#`k>ofXWh$x`rTLMuSbPbjo!2`l|Dxgl~
z=83^|I@pDYa}Aanbvol}WthjTTs>a6S{>{{g){DIq@EZc2DTY!>YJaA|EH&aR
SXjkcgbP-U5P)7{>0s~(u(@IkS

From 5f9e05782a08151dc756e82b6e3ac1f777d89c98 Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Wed, 22 Mar 2023 10:00:16 +0100
Subject: [PATCH 3/5] news

---
 NEWS.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/NEWS.md b/NEWS.md
index 9db14d47d5..9b82dc5347 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,14 +1,16 @@
 # Flux Release Notes
+## v0.13.15
+* Added [MultiHeadAttention](https://github.com/FluxML/Flux.jl/pull/2146) layer.
 
 ## v0.13.14
 * Fixed various deprecation warnings, from `Zygone.@nograd` and `Vararg`.
+* Initial support for `AMDGPU` via extension mechanism.
+* Add `gpu_backend` preference to select GPU backend using `LocalPreference.toml`.
+* Add `Flux.gpu_backend!` method to switch between GPU backends.
 
 ## v0.13.13
 * Added `f16` which changes precision to `Float16`, recursively.
-* Initial support for AMDGPU via extension mechanism.
-* Add `gpu_backend` preference to select GPU backend using `LocalPreference.toml`.
-* Add `Flux.gpu_backend!` method to switch between GPU backends.
 
 ## v0.13.12
 * CUDA.jl 4.0 compatibility.
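The release notes above mention the `gpu_backend` preference and the `Flux.gpu_backend!` switch. A minimal sketch of how selection is expected to work — the `"AMD"` value and the need for a session restart are assumptions drawn from the notes, not verified behaviour:

```julia
using Flux

# Persist the preferred GPU backend to LocalPreferences.toml.
# "CUDA" is the default; "AMD" is assumed to become available once the
# AMDGPU extension mentioned in the notes is loaded.
Flux.gpu_backend!("CUDA")

# The preference is read at package load time, so it takes effect in the
# next Julia session; `gpu` then moves arrays via the selected backend:
x = rand(Float32, 3, 3) |> gpu
```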
From feec5c9b7eace98dbf3be394ba80cfc23a1f22ba Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Wed, 22 Mar 2023 22:18:00 +0100
Subject: [PATCH 4/5] Update docs/src/models/nnlib.md

Co-authored-by: Saransh Chopra
---
 docs/src/models/nnlib.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/models/nnlib.md b/docs/src/models/nnlib.md
index cf2618eb97..2634990db5 100644
--- a/docs/src/models/nnlib.md
+++ b/docs/src/models/nnlib.md
@@ -72,7 +72,7 @@ pixel_shuffle
 
 ## Batched Operations
 
-`Flux`'s [`Bilinear`](@ref) layer uses [`NNlib.batched_mul`](@ref) internally.
+`Flux`'s [`Flux.Bilinear`](@ref) layer uses [`NNlib.batched_mul`](@ref) internally.
 
 ```@docs
 batched_mul

From 94d0a1c19cd6c153a9529f094b20ad733078231f Mon Sep 17 00:00:00 2001
From: Carlo Lucibello
Date: Wed, 22 Mar 2023 22:18:09 +0100
Subject: [PATCH 5/5] Update docs/src/models/nnlib.md

Co-authored-by: Saransh Chopra
---
 docs/src/models/nnlib.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/models/nnlib.md b/docs/src/models/nnlib.md
index 2634990db5..b308af4917 100644
--- a/docs/src/models/nnlib.md
+++ b/docs/src/models/nnlib.md
@@ -15,7 +15,7 @@ NNlib.make_causal_mask
 
 ## Softmax
 
-`Flux`'s [`logitcrossentropy`](@ref) uses [`NNlib.logsoftmax`](@ref) internally.
+`Flux`'s [`Flux.logitcrossentropy`](@ref) uses [`NNlib.logsoftmax`](@ref) internally.
 
 ```@docs
 softmax
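To illustrate the layer this patch series documents, here is a minimal usage sketch in the spirit of the new docs. The constructor keywords, call signature, and return values are assumed to match the `MultiHeadAttention` docstring added in patch 1, and the `nheads` keyword of the NNlib primitives is likewise an assumption:

```julia
using Flux, NNlib

# Embedding dimension 64 split across 8 heads; inputs are
# (embed_dim, sequence_length, batch_size) arrays.
mha = MultiHeadAttention(64; nheads=8)

q = rand(Float32, 64, 10, 32)
y, α = mha(q, q, q)   # self-attention; y should have size (64, 10, 32)

# Functional counterparts from NNlib's new Attention section:
yf, αf = NNlib.dot_product_attention(q, q, q; nheads=8)
mask = NNlib.make_causal_mask(q)   # boolean mask for autoregressive use
```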