
Transformer Architecture

A network architecture based solely on attention mechanisms.


Figure: the Transformer network architecture.

Input Embeddings

Learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$. In the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$.
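
A minimal PyTorch sketch of such an embedding layer (class and variable names here are illustrative, not necessarily those used in this repository):

```python
import math

import torch.nn as nn


class InputEmbeddings(nn.Module):
    """Token embeddings scaled by sqrt(d_model)."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x: (batch, seq_len) of token ids -> (batch, seq_len, d_model)
        return self.embedding(x) * math.sqrt(self.d_model)
```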

Positional Encoding

Since the model contains no recurrence or convolution, "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks to give the model information about the relative or absolute position of the tokens in the sequence.

The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.

Sine and cosine functions of different frequencies are used; in the implementation, the denominator term $10000^{2i/d_{model}}$ is computed in log space for numerical stability:

$$ PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}}) $$

$$ PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}}) $$

In the input vector, the sine function is applied to the even dimensions and the cosine function to the odd dimensions.
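
A common way to implement this, with the denominator computed in log space as noted above (names are illustrative):

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encodings added to the embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i/d_model), computed in log space for stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]
```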

Layer Normalization

Here the mean and variance are computed independently for each token over its feature dimension (not across the batch, as in batch normalization), and each value is normalized with them. Learnable parameters gamma (scale) and beta (shift) are also introduced so the network can adjust the normalized values.

$$ \hat{x_{j}}=\frac{x_{j}-\mu_{j}}{\sqrt{\sigma^2_{j}+\epsilon}}, \qquad y_{j}=\gamma\,\hat{x_{j}}+\beta $$
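
A minimal sketch of this in PyTorch, equivalent in spirit to `nn.LayerNorm` (names are illustrative):

```python
import torch
import torch.nn as nn


class LayerNormalization(nn.Module):
    """Normalizes each token over the feature dimension, then rescales."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))  # learnable shift

    def forward(self, x):
        # Mean and variance over the last (feature) dimension, per token
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```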

Feed Forward Network

Each of the layers in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.

$$ FFN(x)=\max(0,\,xW_{1}+b_{1})W_{2}+b_{2} $$

The dimensionality of input and output is $d_{model}=512$, and the inner layer has dimensionality $d_{ff}=2048$.
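
A sketch of the position-wise feed-forward network; the dropout layer is an assumption following the paper's use of dropout, not necessarily how this repository implements it:

```python
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> ReLU -> Linear."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # (batch, seq_len, d_model) -> (..., d_ff)
            nn.ReLU(),
            nn.Dropout(dropout),       # assumed, per the paper's regularization
            nn.Linear(d_ff, d_model),  # back to d_model
        )

    def forward(self, x):
        return self.net(x)
```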

Multi-head Attention

Multi-head attention takes the positionally encoded input and uses it three times: as the query $Q$, the key $K$, and the value $V$. These three input matrices are each multiplied by learned weight matrices and then split into $h$ matrices along the embedding dimension, one per head.

$$ Attention(Q,K,V)=softmax\Big(\frac{QK^T}{\sqrt{d_{k}}}\Big)V $$

$$ head_{i}=Attention(QW_{i}^Q,KW_{i}^K,VW_{i}^V) $$

Attention is then applied to each of the split matrices in parallel, and the resulting head outputs are concatenated and multiplied by $W^{O}$ to produce the final output.

$$ MultiHead(Q,K,V)=Concat(head_{1},\dots,head_{h})W^{O} $$

In this work, $h=8$ parallel attention layers, or heads, are employed. For each of these, $d_{k}=d_{v}=d_{model}/h=64$.
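
A sketch of multi-head attention along these lines (the `mask` argument is an assumption, used later for decoder masking; names are illustrative):

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Projects Q, K, V, splits into h heads, applies scaled dot-product
    attention per head, then concatenates and projects with W^O."""

    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by h"
        self.h = h
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        def split(t):
            # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
            return t.view(batch, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = scores.softmax(dim=-1) @ v

        # Concatenate the heads and apply the output projection W^O
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(out)
```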

Encoder

The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization.
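
A sketch of one encoder layer, reusing the modules sketched above (the paper also applies dropout to each sub-layer's output, which is omitted here for brevity):

```python
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward sub-layers,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNormalization(d_model)
        self.norm2 = LayerNormalization(d_model)

    def forward(self, x, src_mask=None):
        # Sub-layer 1: multi-head self-attention with Q = K = V = x
        x = self.norm1(x + self.self_attn(x, x, x, src_mask))
        # Sub-layer 2: position-wise feed-forward network
        x = self.norm2(x + self.ffn(x))
        return x
```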

Decoder

The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
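
A minimal sketch of the causal mask that enforces this restriction. Passed as the `mask` argument to the attention sketch above, positions where the mask is `False` are set to $-\infty$ before the softmax:

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


# For a 4-token sequence, row i marks the positions that i may attend to:
print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```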

About

A PyTorch implementation of the Transformer from Vaswani et al. (2017), "Attention Is All You Need".
