A generative adversarial network (GAN) is a machine learning framework designed by Ian Goodfellow and his colleagues in June 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).
It is hard to believe that what once started as a problem pondered over a beer by the father of GANs himself has now given machines the gift of imagination.
While deep-learning AIs can learn to *recognize* things, they have not been good at *creating* them. The goal of GANs is to give machines something akin to an *imagination*. In the future, computers will get much better at feasting on raw data and working out what they need to learn from it. Doing so wouldn't merely enable them to draw pretty pictures or compose music; it would make them less reliant on humans to instruct them about the world and the way it works.
When I first started this repository, I knew nothing about GANs, not even that *noise* was used as the input fed into the generator. Such an idea was completely shocking to me. Being a complete novice, I had to start from the very bottom. So I would first explore how to create a GAN that generates handwritten digits (0-9) from the MNIST dataset using fully connected hidden layers.
I would then move on to create a DCGAN (Deep Convolutional GAN) to generate people's faces based on the CelebA dataset. Trained for about 10,000 epochs, the images were still pixelated, but a discernible person's face was evident. This step was crucial in visualizing how noise is used as input to the generator and how, with each training step, the generator becomes better and better at fooling the discriminator.
Finally, VQGAN and CLIP would be used together to generate images from text prompts. It would have been hard to train a discriminator from scratch, as the labelled dataset needed would have been enormous; hence CLIP was used as the discriminator. The generated images are truly masterpieces, as it is quite evident that a human would never have come up with such peculiar design styles.
The left hemisphere of our brain mainly handles *numerical*, *analytical* and *logical* processing. Most of the AI I had worked on before was based on this part of the brain, with the aim of *classifying*, *predicting* or *recommending*. However, the ability to *imagine*, *create* or *design* is based on the right hemisphere of the brain. Hence, creating an AI based on these abilities required a different type of wiring and an understanding of the underlying concepts of imagination. The ability to create *something* from *nothing* is truly marvelous and remains to this day a mystery.
This project was inspired by the Generative Adversarial Networks (GANs) Specialization on Coursera taught by Sharon Zhou. Most of the material below was inspired by the course. Kudos and credit to this amazing teacher.
- CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations: 10,177 identities, 202,599 face images, 5 landmark locations and 40 binary attribute annotations per image. 10,000 of these images were used to train a WGAN to generate people's faces.
- The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. CLIP instead learns from text-image pairs that are already publicly available on the internet: it was trained on a vast (and unknown) dataset of random internet material. CLIP creates an encoding of its classes and is pre-trained on over 400 million text-image pairs. This allows it to leverage transformer models' ability to extract semantic meaning from text and make image classifications out of the box, without being fine-tuned on custom data (see the sketch below).
A DCGAN differs from a basic GAN as follows (see the sketch after this list):
- Use convolutions without any pooling layers
- Use batchnorm in both the generator and the discriminator
- Don't use fully connected hidden layers
- Use ReLU activation in the generator for all layers except for the output, which uses a Tanh activation.
- Use LeakyReLU activation in the discriminator for all layers except for the output, which does not use an activation
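A minimal PyTorch sketch of these guidelines, assuming a 64-dimensional noise vector and 64x64 RGB outputs (the sizes are illustrative, not the exact architecture used in this repository):

```python
import torch
import torch.nn as nn

Z_DIM, FEAT, CHANNELS = 64, 64, 3   # illustrative sizes

generator = nn.Sequential(
    # Transposed (fractionally strided) convolutions upsample the noise; no pooling layers.
    nn.ConvTranspose2d(Z_DIM, FEAT * 8, 4, 1, 0), nn.BatchNorm2d(FEAT * 8), nn.ReLU(),    # 1 -> 4
    nn.ConvTranspose2d(FEAT * 8, FEAT * 4, 4, 2, 1), nn.BatchNorm2d(FEAT * 4), nn.ReLU(), # 4 -> 8
    nn.ConvTranspose2d(FEAT * 4, FEAT * 2, 4, 2, 1), nn.BatchNorm2d(FEAT * 2), nn.ReLU(), # 8 -> 16
    nn.ConvTranspose2d(FEAT * 2, FEAT, 4, 2, 1), nn.BatchNorm2d(FEAT), nn.ReLU(),         # 16 -> 32
    nn.ConvTranspose2d(FEAT, CHANNELS, 4, 2, 1), nn.Tanh(),                               # 32 -> 64, Tanh output
)

discriminator = nn.Sequential(
    # Strided convolutions replace pooling; LeakyReLU on every layer except the output.
    nn.Conv2d(CHANNELS, FEAT, 4, 2, 1), nn.LeakyReLU(0.2),                                # 64 -> 32
    nn.Conv2d(FEAT, FEAT * 2, 4, 2, 1), nn.BatchNorm2d(FEAT * 2), nn.LeakyReLU(0.2),      # 32 -> 16
    nn.Conv2d(FEAT * 2, FEAT * 4, 4, 2, 1), nn.BatchNorm2d(FEAT * 4), nn.LeakyReLU(0.2),  # 16 -> 8
    nn.Conv2d(FEAT * 4, FEAT * 8, 4, 2, 1), nn.BatchNorm2d(FEAT * 8), nn.LeakyReLU(0.2),  # 8 -> 4
    nn.Conv2d(FEAT * 8, 1, 4, 1, 0),                                                      # 4 -> 1, no activation (raw logit)
)

z = torch.randn(16, Z_DIM, 1, 1)        # a batch of noise vectors
fake_images = generator(z)              # shape: (16, 3, 64, 64)
scores = discriminator(fake_images)     # shape: (16, 1, 1, 1)
```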
- Understanding a Basic GAN
  - Types of AI
  - Generative models vs Discriminative models
  - The Discriminator
  - The Generator
  - Cross Entropy Cost Function
- Wasserstein GANs with Gradient Penalty
  - Mode Collapse
  - Limitation of BCE Loss
  - Earth Mover Distance
  - Wasserstein Loss
  - Condition on Wasserstein Critic
  - 1-Lipschitz Continuity Enforcement
  - Coding a WGAN
- Controllable and Conditional GAN
  - Conditional GAN
  - Controllable GAN
  - Vector algebra in Z-space
  - Challenges with Controllable GAN
- Multimodal Generation
  - CLIP: Contrastive Language Image Pre-training
  - VQGAN: Vector Quantized Generative Adversarial Network
  - CLIP + VQGAN
- Application
If you have a crush on any of the people below, then I have bad news for you: these people are not real! The images were actually downloaded from the website thispersondoesnotexist.com. It is hard to believe that an AI can generate such realistic fake images of a person in a matter of seconds, but that is the reality we are living in. This AI face generator is powered by StyleGAN, a neural network from NVIDIA developed in 2018.
Fun fact: The main goal was to train the AI to recognize fake faces and faces in general. The company needed this to improve the performance of its video cards by automatically recognizing faces and applying other rendering algorithms to them. However, since the StyleGAN code is publicly available, an engineer at Uber was able to take it and create a random face generator that rocked the internet!
Another interesting application of GANs is the Deepfake. Have you ever wondered what the movie American Psycho would look like with Tom Cruise as the protagonist instead of Christian Bale? Here's a preview:
Pretty great, right? Now the most important question we must ask is: how do we recognize the fake from the real? It is almost impossible to recognise an image of a fake person. AI is so developed that 90% of fakes are not recognized by an ordinary person and 50% are not recognized by an experienced photographer. However, occasionally the neural network makes mistakes, which is why artifacts appear: an incorrectly bent pattern, a strange hair color, and so on.
whichfaceisreal.com was developed by Jevin West and Carl Bergstrom at the University of Washington as part of the Calling Bullshit project, which focuses on teaching people to be more analytical of potentially false portraits. I tested it and it is not that straightforward!
I would now like to take a step back and consider, fundamentally, what *type* of learning can occur when we train neural networks to perform tasks such as the ones shown above.

Supervised learning problems are instances in which we are given a set of *data* and a set of *labels* associated with that data, and our goal is to learn a *functional mapping* that goes from data to labels. These labels can take many different forms. We will take examples of supervised learning relating to images.
- Classification: our input is an image and we want to output Y, a class label for the category.
- Object detection: our input is still an image, but here we want to output the bounding boxes of instances of up to multiple dogs or cats.
- Semantic segmentation: we have a label for every pixel, i.e. the category that each pixel belongs to.
- Image captioning: our label is now a sentence, so it is in the form of natural language.
In unsupervised learning we are given only data, no labels, and our goal is to understand or build up a representation of the hidden underlying structure in that data to extract insights into the foundational structure of the data itself.
- Clustering: the goal is to find groups within the data that are similar through some type of metric.
- Dimensionality Reduction: we start off with data in three dimensions, find two axes of variation, and project our data down to 2D.
- Feature Extraction: with autoencoders we try to reconstruct the input data in order to learn features, so we learn a feature representation without using any additional external labels.
- Density Estimation: we try to estimate and model the density of the data. We want to fit a model such that the density is higher where more points are concentrated.
To summarize, in *supervised* learning we want to use *labeled data* to learn a function mapping from X to Y, while in *unsupervised* learning we use *no labels* and instead try to learn some *underlying hidden structure* of the data.
Generative models are a class of models for *unsupervised learning* in which, given training data, our goal is to *generate new samples* from the same distribution. We have training data generated from some distribution p_data(x), and we want to learn a model p_model(x) that generates samples from the same distribution, i.e. we want p_model(x) to be similar to p_data(x). Hence, the model has the capability of creating data similar to the training data it received, since it has learnt the distribution from which that data was drawn.

We can use generative models to do *explicit density estimation*, where we explicitly define and solve for p_model(x), or we can do *implicit density estimation*, where we learn a model that can produce samples from p_model(x) without explicitly defining it.
Our generative model takes in *noise*, which represents a random set of values, as input. The generative model can also sometimes take in a class Y, such as dog. From these inputs, its goal is to generate a set of features X (a wet nose or a tongue sticking out) that look like a realistic dog. But why do we need this noise in the first place? The noise is there to ensure that what is generated isn't the same image each time; otherwise, what would be the point of generating the same image again and again? As explained above, generative models try to capture the probability distribution of X, the different features of having a wet nose, the tongue sticking out, maybe pointy ears sometimes but not all the time, given that class Y of a dog. With the added noise, these models generate realistic and diverse representations of this class Y. Note: if we are only generating one class Y of a dog, then we don't need this conditioning on Y, P(X|Y), and instead it's just the probability over all the features X, P(X). If we continue to run our model multiple times without any restrictions, then we'll end up getting more pictures representing the dataset our generative model was trained on.
There are many types of generative models. The most popular ones are Variational Autoencoders (VAE)
or GANs
.
Variational autoencoders are related to a type of unsupervised learning model called *autoencoders*. With autoencoders we don't generate data; it's an unsupervised approach for learning a *lower dimensional* feature representation from unlabeled training data. We feed in raw data as input, for example an image, which is passed through many successive deep neural network layers. At the output of that succession of layers we generate a low dimensional latent space, a *feature representation*. We call this portion of the network an *encoder*, since it maps the data x into an encoded vector of latent variables z.
Note: It is important to ensure the low dimensionality of this latent space z
so that we are able to compress the data into a small latent vector where we can learn a very compact and rich feature representation. We want to learn features that can capture meaningful factors of variation in the data.
To train such a model we need to learn a decoder network that will actually reconstruct the original image. For the decoder we basically use the same types of layers as the encoder, so the architecture is usually somewhat symmetric. We call the reconstructed output x̂ because it's our prediction, an imperfect reconstruction of our input x. The way we can train this network is by looking at the original input x and the reconstructed output x̂, comparing the two, and minimizing the distance between these two images using an *L2 loss function*.
Note: Notice that by using this reconstruction loss - the difference between the reconstructed output and our original input - we do not require any labels for our data beyond the data itself. It is just using the raw data to supervise itself.
In practice, the lower the dimensionality of our latent space, the poorer and worse quality reconstruction we're going to get out. These autoencoder structures use this sort of bottlenecking hidden layer to learn a compressed latent representation of the data and we can self-supervise the training of this network by using a reconstruction loss that forces the autoencoder network to encode as much information about the data as possible into a lower dimensional latent space while still being able to build up faithful reconstructions.
To sum up, we take our input data and pass it through the encoder first, which can be, say, a three-layer convolutional network, to get the features; we then pass those features through a decoder, a three-layer upconvolutional network, and get reconstructed data out at the end. The reason we have a convolutional network for the encoder and an upconvolutional network for the decoder is that the encoder takes the high dimensional input down to lower dimensional features, and we then want to go the other way, from the low dimensional features back out to a high dimensional reconstructed input.
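A minimal PyTorch sketch of such an autoencoder, assuming 28x28 grayscale inputs (for example MNIST) and an illustrative 32-dimensional latent space:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # high-dimensional image -> low-dimensional code z
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28 -> 14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 14 -> 7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),
        )
        self.decoder = nn.Sequential(            # code z -> reconstructed image x_hat
            nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),      # 7 -> 14
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),    # 14 -> 28
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.rand(8, 1, 28, 28)                     # a dummy batch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)          # L2 reconstruction loss: no labels needed
```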
In autoencoders, this latent layer is just a normal layer in a neural network just like any other layer. It is deterministic
. If we're going to feed in a particular input to this network we're going to get the same output so long as the weights are the same. Therefore, effectively a traditional autoencoder learns this deterministic encoding which allows for reconstruction and reproduction of the input.
Variational autoencoders impose a *stochastic* or variational twist on this architecture. The idea is to generate smoother representations of the input data and improve the quality not only of the reconstructions but also of newly generated images that are similar to the input dataset without being direct reconstructions of it. Variational autoencoders replace that deterministic layer z with a stochastic sampling operation. Instead of learning the latent variables z directly, for each latent variable the variational autoencoder learns a *mean* and a *variance*, and the mean and variance parameterize a *probability distribution* for that latent variable. We can then generate new data instances by *sampling* from the distribution defined by these means and variances to obtain a latent sample and get probabilistic representations of the latent space.
Our encoder now tries to learn a probability distribution of the latent space z given the input data x, while the decoder takes that learned latent representation and computes a new probability distribution of the input x given that latent distribution z. These networks, the encoder and the decoder, are defined by separate sets of weights (call them φ and θ), and we can train this variational autoencoder by defining a loss function that is a function of the data x as well as these sets of weights φ and θ. The reconstruction loss, just as before, forces the latent space to learn and represent faithful representations of the input data, ultimately resulting in faithful reconstructions.
To sum up: In Variational Autoencoders we inject some noise
into this whole model and training process. Instead of having the encoder encode the image into a single point in that latent space, the encoder actually encodes the image onto a whole distribution and then samples a point on that distribution to feed into the decoder to then produce a realistic image. This adds a little bit of noise since different points can be sampled on this distribution.
A discriminative model is one typically used for *classification* in machine learning. Such models learn how to distinguish between classes such as dogs and cats, and are often called *classifiers*. Discriminative models take a set of features X, such as having a wet nose or whether it purrs, and from these features determine a category: whether the image is of a dog or a cat. In other words, they try to model the probability of class Y given a set of features X: P(Y|X).
In simple words, a discriminative model makes predictions on the unseen data based on conditional probability and can be used either for classification or regression problem statements. These models are not capable of generating new data points. Therefore, the ultimate objective of discriminative models is to separate one class from another.
Below are examples of generative and discriminative classifiers:
Generative classifiers
- Naïve Bayes
- Bayesian networks
- Markov random fields
- Hidden Markov Models (HMM)
Discriminative Classifiers
- Logistic regression
- Support Vector Machine (SVM)
- Traditional neural networks
- K-Nearest Neighbour (KNN)
- Conditional Random Fields (CRF)
To summarise:
- Generative models model the distribution of individual classes.
- Discriminative models learn the boundaries between classes.
- With Generative models, we have less chance of overfitting if our data distribution is similar to real data distribution. However, outliers can affect our model performance.
- With Discriminative models, we can work with small dataset but we should be careful of overfitting.
Another instance of generative models is the GAN, where we don't want to explicitly model the density or distribution underlying some data, but instead just learn a *representation* that can be successful in generating new instances that are similar to the data. What we care about is being able to sample from a complex, high dimensional *training distribution*. However, there's no direct way to do this, so we build up some approximation of this distribution: we sample from a simpler distribution, for example *random noise*, and learn a transformation from this *simple distribution* directly to the training distribution that we want. To model this kind of complex transformation we use a *neural network*. To sum up, we start from something extremely simple, random noise, and try to build a generative neural network that learns a functional transformation from noise to the data distribution; by learning this generative mapping we can then sample from it to generate fake, synthetic instances that are as close as possible to the real data distribution.
GANs are composed of two neural network models: a *generator*, which generates images much like the decoder, and a *discriminator*, which is actually a *discriminative* model hidden inside of it. The generator and discriminator are effectively competing against each other, which is why they're called *adversarial*. The generator G is trained to go from random noise to produce an imitation of the data, and the discriminator then takes that synthetic *fake data* as well as *real data* and is trained to distinguish between fake and real. If our generator network is able to generate fake images that can successfully *fool* this discriminator, then we have a good generative model: we are generating images that look like images from the training set.
With time we reach a point where we don't need the discriminator anymore: the generator can take in any random noise and produce a realistic image. Note that the generator's role is in some sense very similar to the decoder in the *VAE*. What's different is that there's no guiding encoder this time that determines what the noise vector fed into the generator should look like. Instead, there's a discriminator looking at fake and real images and simultaneously trying to figure out which ones are real and which ones are fake. Overall, the effect is that the discriminator gets better and better at classifying real and fake data, and the better it becomes at doing that, the more it forces the generator to produce better and better synthetic data to try to fool it, and so on.
As explained above, the generator learns to generate fakes that look real, to fool the discriminator, while the discriminator learns to distinguish between what's real and what's fake. So you can think of the generator as a painting forger and the discriminator as an art inspector: the generator forges fake images to look as realistic as possible, in the hope of fooling the discriminator.
The video below really depicts how a GAN works. Geoffrey Rush plays an *art inspector* who can detect fake portraits in a split second in the movie The Best Offer. Geoffrey Rush can be seen as the discriminator in our GAN.
P.S. Sound on for the video below.
Project.Name.mp4
- The generator starts from some completely *random noise* and produces *fake data*. At the beginning of this game, the generator isn't very sophisticated: it doesn't know how to produce real-looking artwork. Additionally, the generator isn't allowed to see the real images, so it doesn't even know what a painting should look like. At the very beginning, the elementary generator just paints a masterpiece of scribbles.
- The discriminator sees the fake data from the generator as well as *real data* that we feed in, and it is trained to output a probability that the data it sees is real or fake.
- If it decides an image is real, we can actually tell it *yes, that's real* or *no, that's fake*. This way we get a discriminator that's able to differentiate a poorly drawn image from the ones that are slightly better, and eventually also from the real ones.
- In the beginning it's not going to be trained accurately, so the predictions are going to be mediocre, but we train it until it starts increasing the probabilities of real versus not real appropriately, such that we get a clean separation where the discriminator is able to distinguish what is real and what is fake.
- The generator then takes instances of where the real data lies as guidance to train, and it tries to improve its imitation of the data, moving the fake data it generates closer and closer to the real data.
- When the generator produces a batch of paintings, it knows in which direction to improve by looking at the *scores* assigned to its work by the discriminator.
- Once again the discriminator receives these new points, estimates a probability that each of them is real, and again learns to decrease the probability of the fake points being real further and further.
- Eventually, the generator starts moving these fake points closer and closer to the real data, such that the fake data almost follows the distribution of the real data. It becomes really hard for the discriminator to effectively distinguish between what is real and what is fake, while the generator continues to try to create fake data instances to fool it.
- The discriminator also improves over time because it receives more and more realistic images at each round from the generator. Essentially it develops a keener and keener eye as these images get better.
- When the discriminator says that an image created by the generator is 60% real, we actually tell it that it's wrong: it's not real, it's fake. Then, after many rounds, the generator will start producing paintings that are harder and harder, if not impossible, for the discriminator to distinguish from the real ones. Training ends when the first neural network begins to constantly deceive the second.
To summarize how we train GANs: the generator is going to try to synthesize fake instances to fool the discriminator which is going to be trained to identify the synthesized instances and discriminate these as fake.
We will now explore in more depth the *discriminator* part of the GAN. The discriminator is a classifier whose goal is to distinguish between different classes. Given the image of a cat, the classifier should be able to tell whether it's a cat or a dog. We can have a more complex case where we want to differentiate a cat from multiple classes, or the simplest case where we just want to predict cat or not cat.
In the video below, Jian Yang builds quite a good *binary classifier* which can differentiate between hot dog and not hot dog, much to the despair of Erlich.
New.video.mp4
One type of model for a classifier is a neural network, and this neural network can take some features X and a set of labels Y associated with each of our classes. It computes a series of nonlinearities and outputs the probabilities for a set of categories. The neural network learns a set of parameters, or weights, theta (θ). These parameters try to map the features X to those labels Y, and the resulting predictions are called Ŷ because they're not exactly the true Y labels; they're trying to approximate the Y labels. The goal is to reach a point where the difference between the true values Y and the predictions Ŷ is minimized.

A cost function is computed by comparing how closely Ŷ is to Y. It tells the discriminative model, the neural network, how close it is to predicting the correct class. From this cost function we update the parameters, the nodes in the neural network, according to the gradient of this cost function. This just indicates in which direction those parameters should move to get a Ŷ that's as close as possible to Y. We then repeat this process until our classifier is in good shape.
The goal of the discriminator is to model the probability of each class, and this is a conditional probability distribution because it's predicting the probability of class Y conditioned on a certain set of features. In the GAN context, the discriminator is a classifier that inspects both fake and real examples and determines whether they belong to the real or the fake class. The discriminator models the probability of an example being fake given the set of input features X: P(Fake|Features). In the example below, the discriminator looks at the fake Mona Lisa and determines that with 85% probability this isn't the real one: 0.85 fake. So in this case it is classified as fake, and that information of being fake, along with the fakeness probability 0.85, is given to the generator to improve its efforts. That is, the output probabilities from the discriminator are what help the generator learn to produce better-looking examples over time.
Now let's look at the whole training process of the discriminator, which incorporates the output from the generator as well (a code sketch of this step follows the list):

- To train the discriminator we take a noise vector Z and pass it through the generator to obtain fake images.
- We also take a set of real images from the original dataset, and input both into the discriminator.
- The discriminator receives this set of both fake and real images and produces output predictions.
- The output has a range from 0 to 1, where 0 represents the event of a fake image and 1 the maximum probability of a real image.
- We then pass these outputs to a mathematical function that calculates the loss, where we compare fake inputs to the number 0 and real inputs to the number 1.

Note: the discriminator wants to be able to predict that the fake inputs are fake, i.e. have a probability of 0, and that the real inputs are real, with a probability of 1.

- Once we have calculated the loss, we use *backpropagation* to update the parameters of the discriminator only.
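A hedged PyTorch sketch of one such discriminator update, assuming illustrative `gen`, `disc` and `disc_opt` objects (a generator, a discriminator that outputs logits, and its optimizer) and a batch of `real` images:

```python
import torch
import torch.nn.functional as F

def discriminator_step(gen, disc, disc_opt, real, z_dim=64):
    disc_opt.zero_grad()
    noise = torch.randn(real.size(0), z_dim)           # noise vector Z (reshape if gen expects 4-D input)
    fake = gen(noise).detach()                          # detach: only the discriminator updates here
    fake_pred = disc(fake)
    real_pred = disc(real)
    # Compare fake predictions to 0 and real predictions to 1 with BCE.
    loss = (F.binary_cross_entropy_with_logits(fake_pred, torch.zeros_like(fake_pred)) +
            F.binary_cross_entropy_with_logits(real_pred, torch.ones_like(real_pred))) / 2
    loss.backward()                                     # backpropagation
    disc_opt.step()                                     # update the discriminator's parameters only
    return loss.item()
```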
The generator in a GAN is like its *heart*. It's the model used to generate examples, and it is the one we should be investing in and helping achieve really high performance by the end of the training process.

The generator's final goal is to be able to produce examples from a certain class. So if we train it on the class of cats, the generator will do some computations and output a representation of a cat that looks real. Ideally, the generator won't output the same cat at every run; to ensure it's able to produce different examples every single time, we input different sets of random values, i.e. a noise vector. This noise vector is then fed as input, sometimes together with our class Y for cat, into the generator's neural network. The generator network computes a series of nonlinearities from those inputs and returns some variables, for example three million nodes at the end, that do not represent classes but rather each pixel's value, which together represent the image of a cat.
- We begin with Z, a noise vector made of random numbers (a code sketch of the generator update follows this list).
- We pass this into the generator, represented by a neural network, to produce a set of features that can pose as an image of a cat, or at least an attempt at a cat. This output image is fake: it doesn't belong to the original real training data, and we want to use it to fool the discriminator.
- This image is fed into the discriminator, which determines how real or how fake it thinks it is based on its inspection of it.
- The discriminator's output, which is in the range of 0 to 1, is used to compute a *cost function* that looks at how far the examples produced by the generator are from being considered real by the discriminator, because the generator wants them to seem as real as possible. That is, how good is the performance of the generator?
- The generator wants the discriminator's output to be as close to 1, meaning *real*, as possible, whereas the discriminator is trying to push it to 0, meaning *fake*. Hence, the predictions are compared using the loss function with all the labels set to real, because the generator is trying to get its fake images labeled as real (a label of 1) as closely as possible.
- The cost function uses the difference between these two to update the parameters of the generator using backpropagation. The generator improves over time and learns in which direction to move its parameters to generate something that looks more real and will fool the discriminator.
- The difference between the output of the discriminator and the value 1 becomes smaller and smaller, and the loss becomes smaller and smaller. As such, the performance of the generator keeps improving.
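A hedged PyTorch sketch of one generator update, reusing the same illustrative `gen`, `disc` and optimizer objects as in the discriminator sketch:

```python
import torch
import torch.nn.functional as F

def generator_step(gen, disc, gen_opt, batch_size, z_dim=64):
    gen_opt.zero_grad()
    noise = torch.randn(batch_size, z_dim)             # random noise vector Z
    fake = gen(noise)                                   # generated (fake) images
    fake_pred = disc(fake)
    # The generator is scored against the "real" label (1): it wants to fool the discriminator.
    loss = F.binary_cross_entropy_with_logits(fake_pred, torch.ones_like(fake_pred))
    loss.backward()                                     # gradients flow through disc into gen
    gen_opt.step()                                      # update the generator's parameters only
    return loss.item()
```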
So once we get a generator that looks pretty good, we can save its parameters theta and then *sample* from this saved generator. Sampling basically means that we feed random noise vectors into the saved generator and it can generate all sorts of different examples.
More generally, the generator is trying to model the probability of features X given the class Y: P(X|Y). Note: if we have only one class then we don't need that class Y, so we just model P(X). For the example above, the generator models the probability of features X without any additional conditions, because the class Y will always be cat, so it's implicit for all probabilities X. In this case, it tries to approximate the real distribution of cats. The most common cat breeds will have more chance of being generated because they're more common in the dataset. Certain features, such as having pointy ears, will be extra common because most cats have them. But rarer breeds, the Sphynx for example, will have a smaller chance of being sampled.
To summarise:
- the generator produces fake data that tries to look real.
- It learns to mimic that distribution of features X from the class of your data.
- In order to produce different outputs each time it takes random features as input.
To understand the *Binary Cross Entropy* cost function, we will first explore what entropy is.
First, *information* is defined as the number of bits required to encode and transmit an event.
- Low probability event (surprising): *more information*.
- Higher probability event (unsurprising): *less information*.

The information h(x) of an event x, given the probability of the event P(x), is h(x) = -log(P(x)). Using base-10 logarithms:
- Low probability event: P(x) = 0.1, h(x) = -log(0.1) = 1: more information
- Higher probability event: P(x) = 0.9, h(x) = -log(0.9) ≈ 0.046: less information
The figure below shows the -log(x) graph.
Note: Imagine that we are encoding a particular event. If the probability of that event happening is low, this means that it is more surprising, because we are not sure when it is going to happen. And we will also need to use more bits to encode it because we need to encode a more surprising pattern, which has more variation and requires more bits to be expressed.
Conversely, if we know that the event happens very often, with high probability, then it will be less surprising because we are almost sure that it is going to happen next time we check. And that high probability event has less information because we require less bits to express a pattern that happens almost always or always in contrast to a pattern that is more unexpected and complex.
Now let's explore entropy. *Entropy* is the number of bits required to transmit a randomly selected event from a probability distribution.
- Skewed probability distribution (unsurprising): *low entropy*.
- Balanced probability distribution (surprising): *high entropy*.

The entropy H(X) of a random variable with a set of discrete states x in X and their probabilities P(x) is H(X) = -Σ P(x) · log(P(x)). For example (the snippet after this list checks these numbers):
- Skewed distribution, one high probability event (0.9) and two low probability events (0.05 each): low entropy, unsurprising.
- Balanced distribution, all events with the same probability (0.33): high entropy, surprising.
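A quick sanity check of these two cases in Python, using log base 2 so the result is in bits (the numbers are illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

print(entropy([0.9, 0.05, 0.05]))   # ~0.569 bits -> low entropy, unsurprising
print(entropy([1/3, 1/3, 1/3]))     # ~1.585 bits -> high entropy, surprising
```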
Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.
The intuition for this definition comes from considering a target or underlying probability distribution P and an approximation Q of that target distribution: the cross-entropy of Q from P is the number of additional bits needed to represent an event using Q instead of P.
The cross-entropy between two probability distributions, such as Q from P, can be stated formally as H(P, Q) = -Σ P(x) · log(Q(x)), where H() is the *cross-entropy* function, P is the *target* distribution and Q is the *approximation* of the target distribution.
We will use Binary Cross Entropy because the discriminator wants to predict two things: that real images are real and that fake images are fake.
Recall: cross entropy sums, over events, the target probability times the logarithm of the approximating probability, with a negative sign in front.
Here y is the target label, which is 1 for real and 0 for fake, and ŷ represents the prediction (the approximating probability), i.e. the output of the discriminator. So we can write our cost function for a single example as -[y·log(ŷ) + (1-y)·log(1-ŷ)], averaged over the examples in a batch.
We notice that our cost function has two parts: one more focused on the real images and one more focused on the fake. We will now look at the function when y = 1 and when y = 0.
y = 1:
When the label is equal to 1, only the first part of the equation is active, y·log(ŷ), which reduces to log(ŷ):
- When y = 0, this term is simply 0.
- If the label is 1 and we have a really high prediction close to 1, say 0.99, then this term is also close to 0.
- In the case where it actually is real, i.e. y = 1, but our prediction is terrible and close to 0 (far from 1, so we think it's fake even though it's actually real), then this term becomes extremely negative.
This term only matters when the label is 1: it evaluates to 0 if our prediction is good, and it approaches negative infinity if our prediction is bad.
In this plot, we have the prediction value on the x-axis and the loss associated with that training example on the y-axis. In this case, the loss simplifies to the negative log of the prediction. When the prediction is close to 1, out at the tail, the loss is close to 0 because the prediction is close to the label. However, when the prediction is close to 0, the loss approaches infinity, a really high value, because the prediction and the label are very different.
y = 0:
When the label is equal to 0, only the second part of the equation is active, (1-y)·log(1-ŷ), which reduces to log(1-ŷ):
- If the label is 1, then 1-y = 0 and this term evaluates to 0 regardless of the prediction.
- If the prediction is close to 0 and the label is 0, then this term is close to 0.
- However, if it's fake but our prediction is really far off and thinks it's real, then this term approaches negative infinity.
When the label is 0, the loss function reduces to the negative log of 1 minus the prediction. Hence, when the prediction is close to 0, the loss is also close to 0, which means we're doing great. But when the prediction is closer to 1 while the ground truth is 0, the loss approaches infinity again.
Basically, each of these terms, y·log(ŷ) and (1-y)·log(1-ŷ), approaches negative infinity when the prediction is really bad for its relevant label.
Why do we have a negative sign in front of our cost function?
If either of these terms evaluates to something very large in the negative direction, the negative sign is crucial to turning it into a large positive number, since for a cost function we typically want high values to be bad, with the neural network trying to reduce the value as much as possible. Predictions that are close to the label make the cost evaluate to 0, which makes sense because we want to minimize the cost as we learn.
In summary, one term in the cost function is relevant when the label is 0, the other when it is 1, and in either case the logarithm of a value between 0 and 1 is calculated, which returns a negative result. That's why we need the negative sign at the beginning, to make sure the cost is greater than or equal to 0. When the prediction and the label are similar, the BCE loss is close to 0; when they're very different, the BCE loss approaches infinity. The BCE loss is computed across a mini-batch of n examples and then averaged over those n examples.
Our cost function needs to define a *global optimum* at which the generator perfectly reproduces the true data distribution, so that the discriminator absolutely cannot tell what's synthetic and what's real.
If we consider the loss from the perspective of the discriminator, we want to *maximize* the probability that fake data is identified as fake and real data is identified as real.
We train D to maximize the probability of assigning the correct label to both training examples and samples from G.
Therefore, the discriminator wants to maximize the average of the log probability for real images and the log of the inverted probabilities of fake images: max_D E_x[log(D(x))] + E_z[log(1 - D(G(z)))].
- D(x) is the discriminator's output for real data x: its estimate of the likelihood that real data from the data distribution is real. So we want it to output 1, making log(D(x)) equal to 0, its maximum.
- G(z) is the generator's output, so D(G(z)) is the discriminator's estimate of the probability that a fake instance is real. The discriminator wants it to output 0, so that 1 - D(G(z)) becomes 1 and we are left with log(1), which also equals 0.
Therefore, the discriminator wants to *maximize* this objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake).
The generator seeks to minimize the log of the inverse probability predicted by the discriminator for fake images. This has the effect of encouraging the generator to generate samples that have a low probability of being predicted as fake.
- G(z) is the generator's output, so D(G(z)) is the discriminator's estimate of the probability that a fake instance is real. The generator wants it to output 1 (it wants to fool the discriminator into thinking the data is real), in which case 1 - D(G(z)) becomes 0 and we are left with log(0), which approaches negative infinity.
Therefore, the generator wants to *minimize* this objective such that D(G(z)) is close to 1 (the discriminator is fooled into thinking the generated G(z) is real).
For the GAN, the generator and discriminator are the two players, and they take turns updating their model weights. The min and max refer to the minimization of the generator's loss and the maximization of the discriminator's loss.
Now we have these two players, and we train them jointly in a *minimax* game formulation: a minimum over θ_g, the parameters of our generator network G, and a maximum over θ_d, the parameters of our discriminator network D.
In order to train this, we alternate between gradient ascent on the discriminator to maximize this objective and gradient descent on the generator to minimize it.
- Gradient ascent on the discriminator: max over θ_d of E_x[log(D_θd(x))] + E_z[log(1 - D_θd(G_θg(z)))]
- Gradient descent on the generator: min over θ_g of E_z[log(1 - D_θd(G_θg(z)))]
In practice, this loss function for the generator saturates. This means that if the generator cannot learn as quickly as the discriminator, the discriminator wins, the game ends, and the model cannot be trained effectively. Let's see how:
- When plotting the graph of log(1 - D(G(z))), we see that the slope of this loss is higher towards the right, that is, when D(G(z)) is large and our generator is doing a good job of fooling the discriminator.
- On the other hand, when we have bad samples, i.e. when our generator has not yet learned to do a good job and the discriminator can easily tell the data is fake, the gradient is close to zero (the flat region on the left of the x-axis). This means our gradient signal is dominated by the region where the samples are already pretty good, whereas we actually want the generator to learn a lot when the samples are bad. This makes it hard to learn.
In order to improve learning, we define a different objective function where we now do gradient ascent on the generator instead. In the previous case, the generator sought to minimize the probability of its images being predicted as fake; here, the generator seeks to maximize the probability of its images being predicted as real, i.e. max over θ_g of E_z[log(D_θd(G_θg(z)))]. So instead of seeing the glass half empty, we want to see it half full.
If we plot this function on the right here, then we have a high gradient signal in this region on the left where we had bad samples, and now the flatter region is to the right where we would have good samples. So now we're going to learn more from regions of bad samples and so this has the same objective of fooling the discriminator but it actually works much better in practice.
We need to alternate their training: only one model is trained at a time, while the other one is held constant.
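A minimal sketch of this alternation, reusing the illustrative `discriminator_step` and `generator_step` functions defined earlier (the `dataloader`, the optimizers and `num_epochs` are assumed to already exist):

```python
for epoch in range(num_epochs):
    for real, _ in dataloader:                                        # labels are ignored
        d_loss = discriminator_step(gen, disc, disc_opt, real)        # generator held constant
        g_loss = generator_step(gen, disc, gen_opt, real.size(0))     # discriminator held constant
    print(f"epoch {epoch}: d_loss={d_loss:.3f}, g_loss={g_loss:.3f}")
```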
Note: it's important to keep in mind that both models should *improve together* and should be kept at *similar* skill levels from the beginning of training. The reasoning behind this is that if we had a discriminator that is superior to the generator, we would get predictions telling us that all the fake examples are 100% fake, and the generator wouldn't know how to improve: everything just looks super fake, and there's nothing telling it which direction to go in.
On the other hand, if we had a superior generator that completely outskills the discriminator, we would get predictions telling us that all the generated images are 100% real. The discriminator has the much easier task; it's just trying to figure out which images are real and which are fake, as opposed to modeling the entire space of what a class could look like. Output from the discriminator like 0.87 fake or 0.2 fake, as opposed to just 100% fake (a probability of one of being fake), is much more informative to the generator for updating its weights and learning to generate realistic images over time.
After training we can actually use the generator network which is now fully trained to produce new data instances that have never been seen before.
When the trained generator of a GAN synthesizes new instances, it's effectively learning a transformation from a distribution of noise to a target data distribution and that transformation - that mapping is going to be what's learned over the course of training. If we consider one point from a latent noise distribution it's going to result in a particular output in the target data space and if we consider another point of random noise and feed it through the generator, it is going to result in a new instance. That new instance is going to fall somewhere else on the data manifold.
A major issue with GANs is when a GAN generates the same thing each time. For example, a GAN trained on all the different cat breeds will only generate a Sphynx cat. This issue happens because, as the discriminator improves, its outputs get pushed towards saying an image of a cat looks either *extremely fake* or *extremely real*.
The discriminator, being a *classifier*, is encouraged to output 1 (real) or 0 (fake) as it gets better. But in a single round of training, if the discriminator thinks one of the generator's outputs looks real, even if it doesn't actually look that real, then the generator will cling on to that image and only produce that type of data.
When the discriminator then learns that this data is fake in the next round of training, the generator won't know where to go, because there's really nothing else in its arsenal of different images, and that's the end of learning. Digging one level deeper, this happens because of the binary cross-entropy loss, where the discriminator is forced to produce a value between zero and one, and even though there's an infinite number of decimal values between zero and one, it will approach zero and one as it gets better.
What is a mode? The mode is the value that appears most often in a set of data values. If X is a discrete random variable, the mode is the value x at which the probability mass function takes its *maximum* value. In other words, it is the value that is *most likely* to be sampled.
As shown above, the mean of a normal distribution is the single mode of that distribution. There are also probability density distributions with two modes (*bimodal*), and the mean does not necessarily have to be one of them. More intuitively, any peak of the probability density distribution over the features is a mode of that distribution.
The figure below shows handwritten digits represented by two features, x1 and x2. The probability density distribution in this case will be a surface with many peaks, one for each digit. This is *multimodal*, with 10 different modes, one for each number from 0 to 9.
We can imagine each of these peaks coming out at us in a 3D representation, where the darker circles represent higher altitudes. So an average-looking 7, represented in red, will be at the mode of that digit's distribution.
To understand mode collapse, let's take for example a discriminator that can perfectly classify every handwritten digit except ones and sevens.
Eventually the discriminator will probably catch on and learn to catch the generator's fake handwritten ones by getting out of that local minimum. But the generator could then migrate to another mode of the distribution and collapse again onto a different mode, or it might not be able to figure out where else to diversify at all.
To sum up:
- Modes are peaks of the probability distribution of our features.
- Real-world datasets have many modes related to each possible class within them.
- Mode collapse happens when the generator learns to fool the discriminator by producing examples from a
single class
from the whole training dataset like handwritten number ones. This is unfortunate because, while the generator is optimizing to fool the discriminator, that's not what we ultimately want our generator to do.
Recall that the BCE loss function is just an average of the cost to the discriminator for misclassifying real and fake observations. The first term is for the reals and the second term is for the fakes. The higher this cost value is, the worse the discriminator is doing.
The generator wants to *maximize* this cost, because that means the discriminator is doing poorly and is classifying its fake values as real, whereas the discriminator wants to *minimize* this cost function, because that means it's classifying things correctly. Note that the generator only sees the fake side of things; it doesn't see anything about the reals. This maximization and minimization is often called a *minimax game*.
The discriminator only needs to output a single prediction between 0 and 1, whereas the generator has to produce a rather complex output composed of multiple features to try to fool the discriminator. As a result, the discriminator's job tends to be a little easier. To put it another way: criticizing is more straightforward. As such, during training it's possible for the discriminator to outperform the generator.
We have two distributions: the real distribution
and the generator distribution
. The objective of the GAN is to bring them together, i.e, to make the generator distribution be as close as possible to the real distribution so that the fake images are as similar as possible to the real images.
At the beginning of training the discriminator has trouble distinguishing the generated and real distributions. There is some overlap and it is not quite sure. As a result, it's able to give useful feedback in the form of a non-zero gradient
back to the generator.
As it gets better at training, it starts to delineate the generated and real distributions a little bit more such that it can start distinguishing them much more. The real distribution will be centered around 1
and the generated distribution will start to approach 0
. As a result, when the discriminator is getting better, it will start giving less informative feedback. In fact, it might give gradients closer to zero, and that becomes unhelpful for the generator because then the generator doesn't know how to improve. This is how the vanishing gradient
problem will arise.
To sum up:
- GANs try to make real and generated distribution look similar.
- When the discriminator improves too much, the function approximated by BCE loss will contain flat regions.
- These flat regions cause vanishing gradient whereby the generator stops improving.
When using BCE loss to train a GAN, we often encounter mode collapse
and vanishing gradient
problems due to the underlying cost function of the whole architecture. Even though there is an infinite number of decimal values between 0
and 1
, the discriminator, as it improves, will be pushing towards those ends.
The Earth Mover's distance measures how different these two distributions are by estimating the amount of effort
it takes to make the generated distribution equal to the real. Recall that the objective of the GAN is to make the generator distribution as equal as possible to the real distribution. The function depends on both the distance
and the amount that the generated distribution needs to be moved
. In terms of an analogy, the generated distribution can be considered a pile of dirt and the Earth mover's distance means how difficult would it be to move that pile of dirt and mold it into the shape and location of the real distribution.
The problem with BCE loss is that as the discriminator improves, it starts giving values ever closer to the extremes of 0 and 1. As a result, its feedback becomes less helpful and the generator stops learning due to vanishing gradients. With Earth Mover's distance, however, there's no such ceiling at 0 and 1: the cost function continues to grow regardless of how far apart the distributions are. The gradient of this measure won't approach 0, and as a result GANs are less prone to vanishing gradient problems and, through them, to mode collapse.
In summary:
- Earth mover’s distance is a measure of how different two distributions are by estimating the effort it takes to make the generated distribution equal to the real one.
- Earth mover’s distance does not have flat regions when the distributions are different.
An alternative loss function called Wasserstein Loss
- W-Loss
approximates the Earth Mover's Distance. Instead of using a discriminator to classify or predict the probability of generated images as being real or fake, the WGAN changes or replaces the discriminator model with a critic
that scores the realness
or fakeness
of a given image. Specifically, the lower the loss of the critic when evaluating generated images, the higher the expected quality of the generated images.
The discriminator is no longer bounded between 0
and 1
, i.e, it is no longer discriminating between these two classes. And so, our neural network cannot be called a discriminator because it doesn't discriminate between the classes. And so, for W-Loss, the equivalent to a discriminator is called a critic
, and what the Wasserstein loss function seeks to do is increase
the gap between the scores for real and generated images.
We can summarize the function as it is described in the Wasserstein GAN paper as follows:
- Critic Loss = [average critic score on real images] – [average critic score on fake images]
- Generator Loss = -[average critic score on fake images]
Where the average scores are calculated across a mini-batch of samples.
So the discriminator wants to maximize
the distance between its thoughts on the reals versus its thoughts on the fakes. So it's trying to push away these two distributions to be as far apart as possible.
In the case of the critic, what matters is the gap between its average score on real images and its average score on fake images: the larger the gap, the better the critic is doing. For example, average scores of 50 for real images and 20 for fake images give a gap of 30, while 50 for real images and 10 for fake images give a gap of 40, which is better for the critic, and so on.
Meanwhile, the generator wants to minimize
this difference because it wants the discriminator to think that its fake images are as close as possible to the reals.
In the case of the generator, a larger critic score on its fake images results in a smaller loss for the generator, encouraging the generator to produce fakes that the critic scores as highly as the reals. For example, an average score of 10 on the fakes gives a generator loss of -10, while an average score of 50 gives -50, which is smaller, and so on.
Note that the overall sign convention does not matter in this case; what matters is that the critic's scores for real images and for fake images are pushed apart. The Wasserstein loss encourages the critic to separate these two sets of numbers.
In these functions:
- C(x) is the critic's output for a real instance.
- G(z) is the generator's output when given noise z.
- C(G(z)) is the critic's output for a fake instance.
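A hedged PyTorch sketch of these two losses for one batch, following the convention used above where the critic maximizes the gap (implementations differ in sign, which does not change the behaviour); `critic`, `gen`, `real` and `noise` are illustrative names:

```python
import torch

def critic_loss(critic, gen, real, noise):
    fake = gen(noise).detach()                       # critic update only, so detach the generator
    # Critic wants to maximize E[C(x)] - E[C(G(z))], i.e. minimize its negative.
    return -(critic(real).mean() - critic(fake).mean())

def generator_loss(critic, gen, noise):
    fake = gen(noise)
    # Generator wants the critic to score its fakes highly: minimize -E[C(G(z))].
    return -critic(fake).mean()
```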
The discriminator model is a neural network that learns a binary classification problem, using a *sigmoid activation function* in the output layer, and is fit using a *binary cross entropy* loss function. As such, the model predicts the probability that a given input is real (or fake, as 1 minus the predicted value) as a value between 0 and 1. *W-Loss*, however, doesn't have that requirement at all, so we can have a *linear layer* at the end of the critic's neural network, which can produce any real-valued output. We can interpret that output as how real the critic considers an image to be.
Note: Some of the explanations above are based from the blog of machinelearningmastery.
In summary:
- The discriminator under BCE loss outputs a value between 0 and 1, while the critic in W-Loss can output *any number*.
- Because it's not bounded, the critic is allowed to improve without degrading its feedback to the generator.
- It doesn't have a vanishing gradient problem, and this mitigates against mode collapse, because the generator always gets useful feedback.
- The *generator* tries to *minimize* the W-Loss, trying to get the generated examples to be as close as possible to the real examples, while the *critic* wants to *maximize* this expression, because it wants to differentiate between the reals and the fakes; it wants the distance to be as large as possible.
Recall that W-Loss is a simple expression that computes the difference between the expected value of the critic's output on the real examples x and its expected output on the fake examples G(z). The generator tries to minimize this expression, trying to get the generated examples as close as possible to the real examples, while the critic wants to maximize it: it wants to differentiate between the reals and the fakes, so it wants the distance to be as large as possible.
However, the condition is that the critic needs to be 1-Lipschitz Continuous (1-L Continuous), which means that the norm of its gradient needs to be at most 1 at every point; the slope can never be greater than 1. To check that a function is 1-Lipschitz Continuous, we would have to go along every point of the function and make sure the norm of its slope or gradient is at most 1.
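Written out explicitly (a standard statement of the condition, not a formula specific to this project), 1-Lipschitz continuity means the critic's output can change no faster than its input; for a differentiable critic this is equivalent to the gradient-norm bound:

$$
\lvert C(x_1) - C(x_2)\rvert \le \lVert x_1 - x_2 \rVert \quad \forall\, x_1, x_2
\qquad\Longleftrightarrow\qquad
\lVert \nabla_x C(x) \rVert_2 \le 1 \quad \forall\, x
$$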
To check that graphically, we draw two lines of slope 1 and -1 at a point and make sure that the growth of the function never goes out of the bounds set by these lines; staying within the lines means that the function grows at most linearly. The function above is not 1-Lipschitz Continuous because it does not stay within the green area, which means it grows faster than linearly.
Above is a smooth curve. We again check every single point on this function before we can determine whether or not it is 1-Lipschitz Continuous. At every value the function never grows faster than linearly, hence this function is 1-Lipschitz Continuous.
This condition on the critic's neural network is important for W-Loss because it ensures that the W-Loss function is not only continuous and differentiable but also that it doesn't grow too much, maintaining some stability during training. This is what keeps the underlying Earth Mover's Distance, on which W-Loss is founded, valid. It is required for training both the critic's and the generator's neural networks, and it also increases stability because the variation as the GAN learns will be bounded.
Two common ways of ensuring this condition are weight clipping and gradient penalty:
With weight clipping, the weights of the critic's neural network are forced to stay within a fixed interval. After we update the weights during gradient descent, we clip any weights outside of the desired interval, i.e., weights that are too high or too low are set to the maximum or the minimum value allowed (see the sketch after the list below). However, this has a couple of downsides:
- Forcing the weights of the critic into a limited range of values could limit the critic's ability to learn. If its weights can't take on many different values, it might not be able to improve easily or find a good optimum.
- On the other hand, it might limit the critic too little if we don't clip the weights aggressively enough.
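A minimal sketch of weight clipping (here crit stands for the critic model defined later in this section, and the 0.01 bound is the value used in the original WGAN paper rather than something tuned in this project):

```python
# After each optimizer step, clamp every critic weight into [-c, c].
clip_value = 0.01  # assumption: the bound used in the WGAN paper
for p in crit.parameters():
    p.data.clamp_(-clip_value, clip_value)
```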
The gradient penalty is a much softer way to enforce 1-Lipschitz continuity on the critic. All we need to do is add a regularization term to our loss function which penalizes the critic when its gradient norm is higher than 1.
where reg
is the regularization term and lambda
is just a hyperparameter value of how much to weigh this regularization term against the main loss function.
Checking the critic's gradient at every possible point of the feature space is virtually impossible, or at least not practical. Instead, with the gradient penalty we sample some points by interpolating between real and fake examples using a random number epsilon. It is on this interpolated image that we want the norm of the critic's gradient to be 1.
Note: Since checking the critic's gradient at each possible point of the feature space is virtually impossible, we approximate this by using the interpolated images.
We take the gradient of the critic's prediction on the interpolated image, compute the norm of that gradient, and penalize any deviation of that norm from 1. With this method we're not strictly enforcing 1-L continuity, just encouraging it. This has proven to work well, and much better than weight clipping.
The complete expression of the loss function that we use for training with W-Loss and gradient penalty now has two components:
- First, we approximate the Earth Mover's distance with the main W-Loss component. This makes the GAN less prone to mode collapse and vanishing gradients.
- The second part is a regularization term that enforces the condition the critic needs to satisfy for the main term to be valid. It is a soft constraint that pushes the critic towards 1-Lipschitz continuity, so that the loss function stays continuous and differentiable.
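Putting the two components together gives the full training objective (a standard way of writing the WGAN-GP objective, consistent with the training code below, where C is the critic, G the generator, x̂ the interpolated image and λ the regularization weight):

$$
\min_{G}\,\max_{C}\;\;\mathbb{E}\big[C(x)\big]-\mathbb{E}\big[C(G(z))\big]-\lambda\,\mathbb{E}\Big[\big(\lVert\nabla_{\hat{x}}C(\hat{x})\rVert_2-1\big)^2\Big]
$$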
Recall that a GAN consists of two networks that train together:
- Generator — Given a vector of random values (noise) as input, this network generates data with the same distribution as the training data. We train the generator to generate data that "fools" the discriminator.
- Discriminator — Given batches of data containing observations from both the training data and generated data from the generator, this network attempts to classify the observations as real or fake. We train the discriminator to distinguish between real and generated data.
Ideally, we want a generator that generates convincingly realistic data and a discriminator that has learned strong feature representations characteristic of the training data.
We will use the CelebA Dataset to create a GAN that generates people's faces. We will build a Generator and a Critic using Transposed Convolutions and Convolutions respectively. More explanation of convolutions can be found at this link: Lane-Detection-with-Semantic-Segmentation
We will first define the generator network architecture, which generates images from 1x1x200 arrays of random values. The network:
- Converts the random vectors of size 200 to 1x1x200 arrays (200 channels of spatial size 1x1) with a reshape in the forward function.
- Upscales the resulting arrays to 128x128x3 arrays using a series of transposed convolution layers, batch normalization and ReLU layers.
- For the transposed convolution layers, we specify 4x4 filters (F) with a decreasing number of filters for each layer, a stride (S) of 2 and a padding (P) of 1; the first layer uses a stride of 1 and no padding to expand the 1x1 input to 4x4.
- For the final transposed convolution layer, we specify 4x4 filters producing 3 output channels, corresponding to the RGB channels of the generated images.
- At the end of the network, we include a tanh layer so the output lies in the range -1 to 1.
# generator model
class Generator(nn.Module):
    def __init__(self, z_dim=200, d_dim=16):
        super(Generator, self).__init__()
        self.z_dim = z_dim
        self.gen = nn.Sequential(
            ## ConvTranspose2d: in_channels, out_channels, kernel_size, stride=1, padding=0
            ## Calculating new width and height: (n-1)*stride -2*padding +ks
            ## n = width or height
            ## ks = kernel size
            ## we begin with a 1x1 image with z_dim number of channels (200) - initialized z_dim = 200 | 1x1x200
            ## - we decrease no. of channels but increase size of image
            nn.ConvTranspose2d(z_dim, d_dim * 32, 4, 1, 0), ## 4x4 image (ch: 200 to 512) | 4x4x512
            nn.BatchNorm2d(d_dim*32),
            nn.ReLU(True),
            nn.ConvTranspose2d(d_dim*32, d_dim*16, 4, 2, 1), ## 8x8 image (ch: 512 to 256) | 8x8x256
            nn.BatchNorm2d(d_dim*16),
            nn.ReLU(True),
            nn.ConvTranspose2d(d_dim*16, d_dim*8, 4, 2, 1), ## 16x16 image (ch: 256 to 128) | 16x16x128
            #(n-1)*stride -2*padding +ks = (8-1)*2-2*1+4=16
            nn.BatchNorm2d(d_dim*8),
            nn.ReLU(True),
            nn.ConvTranspose2d(d_dim*8, d_dim*4, 4, 2, 1), ## 32x32 image (ch: 128 to 64) | 32x32x64
            nn.BatchNorm2d(d_dim*4),
            nn.ReLU(True),
            nn.ConvTranspose2d(d_dim*4, d_dim*2, 4, 2, 1), ## 64x64 image (ch: 64 to 32) | 64x64x32
            nn.BatchNorm2d(d_dim*2),
            nn.ReLU(True),
            nn.ConvTranspose2d(d_dim*2, 3, 4, 2, 1), ## 128x128 image (ch: 32 to 3) | 128x128x3
            nn.Tanh() ### produce result in the range from -1 to 1
        )

    #--- Project and reshape the noise, then run it through the network
    def forward(self, noise):
        x = noise.view(len(noise), self.z_dim, 1, 1) # 128 batch x 200 channels x 1 x 1 | len(noise) = batch size = 128
        print('Noise size: ', x.shape)
        return self.gen(x)
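As a quick sanity check of the architecture above (a usage sketch, not part of the original training script), we can push a batch of random noise through the generator and confirm the output shape:

```python
import torch

gen = Generator(z_dim=200, d_dim=16)
noise = torch.randn(128, 200)   # a batch of 128 random noise vectors
fake = gen(noise)               # forward() reshapes the noise to 128 x 200 x 1 x 1
print(fake.shape)               # expected: torch.Size([128, 3, 128, 128])
```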
For the critic (the discriminator's counterpart under W-Loss), we create a network that takes 128x128x3 images and returns a scalar prediction score, using a series of convolution layers with Instance Normalization and Leaky ReLU layers.
- For the convolution layers, we specify 4x4 filters with an increasing number of filters for each layer, a stride of 2 and a padding of 1.
- For the Leaky ReLU layers, we use a negative slope of 0.2.
- For the final convolution layer, we specify a single 4x4 filter with a stride of 1 and no padding, producing a 1x1x1 output.
## critic model
class Critic(nn.Module):
    def __init__(self, d_dim=16):
        super(Critic, self).__init__()
        self.crit = nn.Sequential(
            # Conv2d: in_channels, out_channels, kernel_size, stride=1, padding=0
            ## New width and height: (n+2*pad-ks)//stride +1
            ## we decrease size of image and increase number of channels
            #-- we start with image of 128x128x3
            nn.Conv2d(3, d_dim, 4, 2, 1), #(n+2*pad-ks)//stride +1 = (128+2*1-4)//2+1=64x64 (ch: 3 to 16) | 64x64x16
            nn.InstanceNorm2d(d_dim),
            nn.LeakyReLU(0.2),
            nn.Conv2d(d_dim, d_dim*2, 4, 2, 1), ## 32x32 (ch: 16 to 32) | 32x32x32
            nn.InstanceNorm2d(d_dim*2), # normalization applied to the previous layer's output
            nn.LeakyReLU(0.2),
            nn.Conv2d(d_dim*2, d_dim*4, 4, 2, 1), ## 16x16 (ch: 32 to 64) | 16x16x64
            nn.InstanceNorm2d(d_dim*4),
            nn.LeakyReLU(0.2),
            nn.Conv2d(d_dim*4, d_dim*8, 4, 2, 1), ## 8x8 (ch: 64 to 128) | 8x8x128
            nn.InstanceNorm2d(d_dim*8),
            nn.LeakyReLU(0.2),
            nn.Conv2d(d_dim*8, d_dim*16, 4, 2, 1), ## 4x4 (ch: 128 to 256) | 4x4x256
            nn.InstanceNorm2d(d_dim*16),
            nn.LeakyReLU(0.2),
            nn.Conv2d(d_dim*16, 1, 4, 1, 0), #(n+2*pad-ks)//stride +1=(4+2*0-4)//1+1= 1x1 (ch: 256 to 1) | 1x1x1
            #-- we end with a 1x1x1 output - a single unbounded score per image
        )

    def forward(self, image):
        # image: 128 x 3 x 128 x 128: batch x channels x width x height
        crit_pred = self.crit(image) # 128 x 1 x 1 x 1: batch x channel x width x height | one single value per image in the batch
        return crit_pred.view(len(crit_pred), -1) ## 128 x 1
The gradient penalty improves stability by penalizing gradients with large norm values. The lambda value controls the magnitude of the gradient penalty added to the critic loss. Recall that we need to create an interpolated image from the real and fake images, weighted by epsilon. Based on the gradient of the critic's prediction on that interpolated image, we then add a regularization term to our loss function.
## gradient penalty calculation
def get_gp(real, fake, crit, epsilon, lam=10):  # 'lambda' is a reserved word in Python, so we use 'lam'
    interpolated_images = real * epsilon + fake * (1-epsilon) # 128 x 3 x 128 x 128 | linear interpolation
    interpolated_scores = crit(interpolated_images) # 128 x 1 | prediction of the critic

    # gradient of the critic's scores with respect to the interpolated images
    gradient = torch.autograd.grad(
        inputs = interpolated_images,
        outputs = interpolated_scores,
        grad_outputs=torch.ones_like(interpolated_scores),
        retain_graph=True,
        create_graph=True,
    )[0] # 128 x 3 x 128 x 128

    gradient = gradient.view(len(gradient), -1) # 128 x 49152
    gradient_norm = gradient.norm(2, dim=1) # L2 norm per image
    gp = lam * ((gradient_norm-1)**2).mean() # penalize any deviation of the norm from 1
    return gp
We will now train the critic using the following steps:
- Initialize gradients to 0.
- Create noise using the gen_noise function.
- Pass the noise into our Generator model (which projects and reshapes it) to create a fake image.
- Get the critic's predictions on the fake and real images.
- Generate a random epsilon and calculate the gradient penalty.
- Calculate the critic loss using the gradient penalty.
- Use backpropagation to update the critic's parameters.
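Before the training loop below can run, the models, the optimizers and the gen_noise helper have to exist, and real / cur_bs come from iterating over the CelebA data loader (omitted here). A minimal setup sketch, using the hyperparameters listed after the training code; the Adam betas and the standard-normal noise are assumptions, since only Adam and the learning rate are specified:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
z_dim, lr, batch_size, crit_cycles = 200, 1e-4, 128, 5

gen = Generator(z_dim=z_dim).to(device)
crit = Critic().to(device)

# betas=(0.5, 0.9) is an assumption (common WGAN-GP practice); the document only specifies Adam
gen_opt = torch.optim.Adam(gen.parameters(), lr=lr, betas=(0.5, 0.9))
crit_opt = torch.optim.Adam(crit.parameters(), lr=lr, betas=(0.5, 0.9))

gen_losses, crit_losses = [], []

def gen_noise(num, z_dim, device=device):
    # assumption: standard-normal noise vectors, matching how the noise is used above
    return torch.randn(num, z_dim, device=device)
```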
'''Critic Training'''
mean_crit_loss = 0
for _ in range(crit_cycles):
    crit_opt.zero_grad()

    #--- Create noise
    noise = gen_noise(cur_bs, z_dim)

    #--- Create fake image from noise
    fake = gen(noise)

    #--- Get prediction on fake and real image
    crit_fake_pred = crit(fake.detach())
    crit_real_pred = crit(real)

    #--- Calculate gradient penalty
    epsilon = torch.rand(len(real), 1, 1, 1, device=device, requires_grad=True) # 128 x 1 x 1 x 1
    gp = get_gp(real, fake.detach(), crit, epsilon)

    #--- Calculate loss
    crit_loss = crit_fake_pred.mean() - crit_real_pred.mean() + gp
    mean_crit_loss += crit_loss.item() / crit_cycles

    #--- Backpropagation
    crit_loss.backward(retain_graph=True)

    #--- Update parameters of the critic
    crit_opt.step()

#--- Append critic loss
crit_losses += [mean_crit_loss]
The training of the generator is much simpler:
- Initialize gradients to 0.
- Create a noise vector.
- Generate a fake image from the noise vector.
- Get the critic's prediction on the fake image.
- Calculate the generator's loss.
- Use backpropagation to update the generator's parameters.
'''Generator Training'''
#--- Initialize Gradients to 0
gen_opt.zero_grad()
#---Create Noise Vector
noise = gen_noise(cur_bs, z_dim)
#---Create Fake image from Noise vector
fake = gen(noise)
#---Critic's prediction on fake image
crit_fake_pred = crit(fake)
#--- Calculate Generator Loss
gen_loss = -crit_fake_pred.mean()
#---Backpropagation
gen_loss.backward()
#--- Update generator's parameters
gen_opt.step()
#--- Append Generator Loss
gen_losses+=[gen_loss.item()]
We begin training our GAN with the following hyperparameters:
- Number of images: 10000
- Number of epochs: 50000
- Batch size: 128
- Number of steps per epoch: Number of images / Batch size = 10000/128 ≈ 78
- Learning rate: 0.0001
- Dimension of noise vector: 200
- Optimizer: Adam
- Critic cycles: 5 (we train the critic 5 times for every training step of the generator, so that the critic is not overpowered by the generator)
We plot the graph of the generator loss and critic loss w.r.t. the number of steps. Some important features of the graph are:
- The critic loss (red) is initially positive (not clearly shown on the graph). This follows from the critic's loss function:
Remember that the critic's loss is the average score on fakes minus the average score on reals (plus the gradient penalty). For this to be positive, the critic must initially be scoring the fake images at least as high as the real ones, i.e., it cannot yet tell reals from fakes.
With time, the loss of the critic drops and becomes negative, i.e., it scores real images higher than fakes, which means it starts to correctly separate reals from fakes.
- The loss of the generator is positive because the critic is still assigning low (negative) scores to the fake images; since the generator's loss is the negative of the critic's average score on fakes, a larger critic score on fakes would push this loss down.
- The absolute value of the critic's loss (average 7) is much lower than the absolute value of the generator's loss (average 22). This is because we train the critic 5 times for every training step of the generator, which prevents the critic from being overpowered by the generator.
- Unfortunately, neither the critic loss nor the generator loss approaches zero. We observe that the loss of the generator approaches its minimum at about 6000 steps, while the loss of the critic remains mostly constant.
Below are the results of the training. For the first 200 steps the outputs are just noise with no particular structure. With time, we can clearly see some facial features appearing in the noise at about 800 steps. By the end of 6000 steps, we successfully generate faces, although they are not very high definition.
Gan.Results.mp4
With our model saved, we keep the generator and discard the critic, and use the generator alone to generate new faces from noise.
#### Generate new faces
noise = gen_noise(batch_size, z_dim)
fake = gen(noise)
show(fake)
Note that we display 25 images at a time, although the generator produces one image per noise vector. The pictures are quite pixelated, yet it is still hard to tell at a glance whether they are real or fake.
We can also interpolate between two points in the latent space to see how one generated face morphs into another (a sketch of this is shown below). Note that the model has only been trained on 10000 images, and since our loss did not get very close to zero, the interpolation is quite rudimentary. Yet, it is still impressive to see the result.
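A minimal sketch of this latent-space interpolation, reusing the generator, gen_noise and show from above (the number of interpolation steps is arbitrary):

```python
import torch

# two random points in the latent (Z) space
z1 = gen_noise(1, z_dim)
z2 = gen_noise(1, z_dim)

# linearly interpolate between them and decode every intermediate vector
steps = 9
alphas = torch.linspace(0, 1, steps, device=z1.device).view(-1, 1)
zs = (1 - alphas) * z1 + alphas * z2      # steps x z_dim
with torch.no_grad():
    faces = gen(zs)                       # steps x 3 x 128 x 128
show(faces)
```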
Although we are generating quite good fake images with our WGAN, we do not really have control over the type of faces generated. For example, if I wanted to generate only women's faces, I would not be able to do that. What we have been doing is called unconditional generation. What we want to achieve is conditional generation, that is, we tell our model which class of items to generate (the condition) and we adapt the training process so that it actually does that. There is also controllable generation, where we figure out how to adapt the inputs to our model without changing the model itself.
We will now control the output and get examples from a particular class, or make those examples take on certain features. Below are the key differences between an unconditional and a conditional GAN:
- With unconditional generation we get examples from random classes, whereas with a conditional GAN we get examples from the classes we specify.
- The training dataset does not need to be labelled for unconditional generation, whereas for a conditional GAN it must be labelled, and the labels are the different classes we want.
With unconditional generation, the generator only needs a noise vector to produce random examples. For conditional generation, we also need a vector that tells the generator which class the generated examples should come from. Usually this is a one-hot vector, which means that there are zeros in every position except for a one in the position corresponding to the class we want. In the example below, we place a one at Sphinx cat because that's the class we want the generator to create images of.
The noise vector is what adds randomness to the generation, as before, letting us produce a diverse set of examples. But now it is a diverse set within the chosen class: conditioned on, and restricted by, the class vector. The input to the generator in a conditional GAN is actually a concatenated vector of both the noise and the one-hot class information.
In the example below, we generate a Sphinx cat from one noise vector but when we change that noise vector while the class information stays the same, it produces another picture of a Sphinx cat.
The discriminator similarly takes the examples, but now each example is paired with the class information as input, to determine whether the example is a real or a fake representation of that particular class. For the discriminator to predict that an example is real, it needs to look like the examples from that class in the training dataset.
The image is fed in as 3 different channels (RGB), or just one channel if it's a gray-scale image. The one-hot class information is then fed in as additional channels: the channel corresponding to the target class is filled with ones, while all the other class channels are filled with zeros. In contrast to the one-hot vector used by the generator, these are typically much larger matrices, one per class, each full of zeros except for the channel of the class in question.
In summary:
- The class is passed to the generator as one-hot vectors.
- The class is passed to the discriminator as one-hot matrices.
- The size of the vector and the number of matrices represent the number of classes.
A minimal sketch of how these conditional inputs can be constructed is shown below.
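The sketch below shows how the class information could be combined with the generator and discriminator inputs (the shapes follow the 128x128 setup above; the number of classes and the helper names are hypothetical examples, not part of the project code):

```python
import torch
import torch.nn.functional as F

n_classes, z_dim, img_size = 10, 200, 128   # n_classes is a hypothetical example value

def get_gen_input(noise, labels):
    # generator input: noise vector concatenated with a one-hot class vector
    one_hot = F.one_hot(labels, n_classes).float()     # batch x n_classes
    return torch.cat([noise, one_hot], dim=1)          # batch x (z_dim + n_classes)

def get_disc_input(images, labels):
    # discriminator input: image channels concatenated with one-hot class channels
    one_hot = F.one_hot(labels, n_classes).float()     # batch x n_classes
    class_maps = one_hot[:, :, None, None].repeat(1, 1, img_size, img_size)
    return torch.cat([images, class_maps], dim=1)      # batch x (3 + n_classes) x 128 x 128

# usage
noise = torch.randn(4, z_dim)
labels = torch.randint(0, n_classes, (4,))
images = torch.randn(4, 3, img_size, img_size)
print(get_gen_input(noise, labels).shape)    # torch.Size([4, 210])
print(get_disc_input(images, labels).shape)  # torch.Size([4, 13, 128, 128])
```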
While conditional generation leverages labels during training, controllable GANs focus on controlling which features we want in the output examples, even after the model has been trained. For instance, with a GAN that performs face generation, we could control the apparent age of the person in the image, whether they have sunglasses, the direction they're looking in, or even their perceived gender.
We do this by tweaking the input noise vector Z that is fed to the generator after we have trained the model. With one input noise vector Z we might get a picture of a woman with red hair; if we tweak the part of that noise vector that corresponds to hair color, we might get the same woman but with blue hair.
Let's see how controllable generation differs from conditional generation:
- With controllable generation we get examples with the features we want, while with conditional generation we get examples from the classes we want.
- We do not need a labelled dataset for controllable generation, but we do need one for conditional generation.
- In controllable generation, we tweak the input noise vector Z, while in conditional generation we append additional information representing the desired class to that noise vector.
Note, however, that controllable generation can sometimes include conditional generation.
Recall that controllable generation is achieved by manipulating the noise vector z that's fed into the generator. Earlier we showed how we could morph between two faces generated by the generator; here is the idea behind it:
Consider the Z-space, the vector space of noise vectors, and two concrete vectors in that space, one for each image. If we want to get intermediate images between the two, we can make a linear interpolation between their two input vectors in the Z-space.
Controllable generation also uses changes in the Z-space and takes advantage of how modifications to the noise vectors are reflected on the output from the generator. For example, with the noise vector, we could get a picture of a woman with red hair and then with another noise vector, we could get a picture of the same woman but with blue hair. The difference
between these two noise vectors is the direction
in which we have to move in Z-space to modify the hair color of our generated images. In controllable generation, our goal is to find these directions
for different features
we care about.
This means that if we generate an image of a woman with red hair from an input noise vector z, we can modify the hair color of the woman in the image by adding that direction vector d to the noise vector, creating a new noise vector z + d. Passing that into our generator results in an image where the hair is now blue.
To sum up, controllable generation works by moving the noise vector in different directions in that Z-space.
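A minimal sketch of this idea, reusing the generator above (the direction vector d here is a random placeholder; in practice the direction for a given feature has to be discovered, for example with a pretrained feature classifier, which is beyond the scope of this project):

```python
import torch

z = gen_noise(1, z_dim)            # original point in Z-space
d = torch.randn(1, z_dim)          # hypothetical "feature direction" in Z-space
d = d / d.norm()                   # normalize the direction

with torch.no_grad():
    original = gen(z)              # e.g. a red-haired face
    modified = gen(z + 2.0 * d)    # the same point moved along d; 2.0 is an arbitrary step size
show(torch.cat([original, modified]))
```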
A lot of the material in the section below was inspired by the medium post of Alexa Steinbrück. A truly marvelous explanation that depicts in the simplest way possible the architecture of VQGAN and CLIP. Major credit goes to the author.
The model which we will use connects two existing (open-source, pretrained) models: CLIP (OpenAI) and VQGAN (Esser et al. from Heidelberg University). VQGAN+CLIP is a text-to-image
model that generates images of variable size given a set of text prompts
(and some other parameters).
In essence, the way they work is that VQGAN generates the images, while CLIP judges how well an image matches the text prompt. This interaction guides the generator (VQGAN) to produce more accurate images.
- a model trained to determine which caption from a set of captions best fits with a given image
- CLIP = Contrastive Language–Image Pre-training
- it also uses Transformers
- proposed by OpenAI in January 2021
- Paper: “Learning transferable visual models from natural language supervision”
- Git Repository: https://github.com/openai/CLIP
The revolutionary thing about CLIP is that it is capable of zero-shot learning: it performs well on tasks and datasets it was not explicitly trained on, without additional fine-tuning.
- a type of neural network architecture
- VQGAN = Vector Quantized Generative Adversarial Network
- was first proposed in the paper “Taming Transformers” by Heidelberg University (2020)
- it combines convolutional neural networks (traditionally used for images) with Transformers (traditionally used for language)
- it’s great for high-resolution images
Although VQGAN involves Transformers, the model is not trained on text but on pure image data. It simply applies the Transformer architecture, previously used for text, to images, which is an important innovation.
CLIP guides
VQGAN towards an image that is the best match to a given text.
Note that contrary to VQGAN, CLIP is not a generative model. CLIP is “just” trained to represent both text and images very well.
CLIP is a model that was originally intended for doing things like searching for the best match to a description like “a dog playing the violin” among a number of images. By pairing a network that can produce images (a “generator” of some sort) with CLIP, it is possible to tweak the generator’s input to try to match a description.
So how does it work?
- VQGAN: like all GANs, VQGAN takes in a noise vector and outputs a (realistic) image.
- CLIP, on the other hand, takes in either (a) an image, and outputs image features, or (b) a text, and outputs text features.
The similarity between image and text can be represented by the cosine similarity
of the learnt feature vectors.
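A rough sketch of this similarity score using OpenAI's CLIP package (linked above); the image path and the prompts are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)       # placeholder image
text = clip.tokenize(["a photo of a person", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # 1 x 512 feature vector
    text_features = model.encode_text(text)      # 2 x 512 feature vectors

# cosine similarity between the image and each prompt
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)          # higher = better match
```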
We can use CLIP to guide a search through VQGAN’s latent space
to find images that match a text prompt very well according to CLIP.
Note: Even though both the VQGAN and CLIP models are pretrained, when you use them in VQGAN+CLIP you basically "train" again for every prompt you give it. That is different from "normal" GANs, where you train once (or use a pretrained model) and then just run inference to generate an image.
The VQGAN-CLIP architecture kind of blurs the distinction between training and inference, because when we "run" VQGAN-CLIP we're kind of doing inference, but we're also optimizing. This special case of inference has been called "inference-by-optimization". That's why we need a GPU to run VQGAN-CLIP.
We're not training a VQGAN model and we're also not training a CLIP model. Both models are already pretrained and their weights are frozen during the run of the notebook. What's being optimised (or "trained") is Z (the noise), the latent image vector that is passed as an input to VQGAN.
Forward pass: We start with a noise vector z (a VQGAN-encoded image vector), pass it to VQGAN to synthesize/decode an actual image out of it, then we cut the image into pieces, encode these pieces with CLIP, calculate the distance to the text prompt and get out some loss(es).
Backward pass: We backpropagate through CLIP and VQGAN all the way back to the latent vector z, and then use gradient ascent to update z.
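A highly simplified, schematic sketch of this "inference-by-optimization" loop. The VQGAN part is replaced by a hypothetical stand-in decoder (vqgan_decode), since loading the real VQGAN is beyond the scope of this note; only the structure of the loop (frozen CLIP, frozen generator, and a latent z optimized against the text prompt) reflects the description above:

```python
import torch
import torch.nn.functional as F
import clip

# Load CLIP on CPU (float32) for this illustration; its weights stay frozen.
model, _ = clip.load("ViT-B/32", device="cpu")
for p in model.parameters():
    p.requires_grad_(False)

# Hypothetical stand-in for VQGAN's decoder: any differentiable map from the latent z to an
# image-sized tensor is enough to illustrate the loop; the real VQGAN decoder is far richer.
def vqgan_decode(z):
    return torch.sigmoid(F.interpolate(z, size=(224, 224), mode="bilinear", align_corners=False))

z = torch.randn(1, 3, 16, 16, requires_grad=True)   # the latent being optimized
optimizer = torch.optim.Adam([z], lr=0.1)

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(["a painting of a sunset"]))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

for step in range(100):
    optimizer.zero_grad()
    image = vqgan_decode(z)                                  # forward pass: synthesize an image from z
    image_features = model.encode_image(image)               # encode it with frozen CLIP
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    loss = 1 - (image_features @ text_features.T).mean()     # distance to the text prompt
    loss.backward()                                          # backward pass: through CLIP back to z
    optimizer.step()                                         # only z gets updated
```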
try_on_video.mp4
- https://www.youtube.com/watch?v=xkqflKC64IM&t=489s
- https://www.youtube.com/watch?v=CDMVaQOvtxU
- https://www.whichfaceisreal.com/
- https://this-person-does-not-exist.com/en
- https://www.youtube.com/watch?v=HHNESCbZqUg
- https://machinelearningmastery.com/generative-adversarial-network-loss-functions/
- https://www.analyticsvidhya.com/blog/2021/07/deep-understanding-of-discriminative-and-generative-models-in-machine-learning/
- http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
- https://medium.com/@mlengineer/generative-and-discriminative-models-af5637a66a3
- https://www.youtube.com/watch?v=z5UQyCESW64
- https://machinelearningmastery.com/cross-entropy-for-machine-learning/
- https://towardsdatascience.com/keywords-to-know-before-you-start-reading-papers-on-gans-8a08a665b40c#:~:text=Latent%20space%20is%20simply%20any,dataset%20it%20was%20trained%20on).
- https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d
- https://medium.com/swlh/how-i-would-explain-gans-from-scratch-to-a-5-year-old-part-1-ce6a6bccebbb
- https://machinelearningmastery.com/how-to-implement-wasserstein-loss-for-generative-adversarial-networks/
- https://developers.google.com/machine-learning/gan/loss
- https://arxiv.org/abs/1701.07875
- https://arxiv.org/abs/1704.00028
- https://lilianweng.github.io/posts/2017-08-20-gan/
- https://arxiv.org/abs/1411.1784
- https://alexasteinbruck.medium.com/explaining-the-code-of-the-popular-text-to-image-algorithm-vqgan-clip-a0c48697a7ff
- https://alexasteinbruck.medium.com/vqgan-clip-how-does-it-work-210a5dca5e52
- https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/
- https://www.vice.com/en/article/n7bqj7/ai-generated-art-scene-explodes-as-hackers-create-groundbreaking-new-tools
- https://arxiv.org/pdf/2012.09841.pdf