# GANs_vs_VAEs_analysis

Examining the stability of training and the quality of image generation, as well as the interpolatability of the latent space, with three types of GAN training methods and VAEs.

## Software setup

Use Python 3.8.

Please run `pip install -r requirements.txt` to install the necessary dependencies. Then run:

```shell
cd gan/
mkdir datasets/
gdown https://drive.google.com/uc\?id\=1hbzc_P1FuxMkcabkgn9ZKinBwW683j45 -O datasets/
tar zxvf datasets/CUB_200_2011.tgz
mv CUB_200_2011/ datasets/
python resize_dataset.py --input_folder datasets/CUB_200_2011/images --output_folder datasets/CUB_200_2011_32/ --res 32
rm -rf datasets/CUB_200_2011.tgz
rm -rf datasets/CUB_200_2011_32/Mallard_0130_76836.jpg datasets/CUB_200_2011_32/Brewer_Blackbird_0028_2682.jpg datasets/CUB_200_2011_32/Clark_Nutcracker_0020_85099.jpg datasets/CUB_200_2011_32/Ivory_Gull_0040_49180.jpg datasets/CUB_200_2011_32/Pelagic_Cormorant_0022_23802.jpg datasets/CUB_200_2011_32/Western_Gull_0002_54825.jpg datasets/CUB_200_2011_32/Ivory_Gull_0085_49456.jpg datasets/CUB_200_2011_32/White_Necked_Raven_0070_102645.jpg
cp cub_clean_custom_na.npz /path/to/python_env/lib/python3.8/site-packages/cleanfid/stats/cub_clean_custom_na.npz
```

## Generative Adversarial Networks

### Simple GAN loss

In this section, we test the performance of the original GAN losses for the generator and discriminator, as described in Algorithm 1 of [1].

The final FID attained is 146. However, as the graph below shows, the FID had dropped below 60 around iterations 23k-25k before worsening again, which suggests some form of overfitting. The samples below also show signs of mode collapse.

fid_vs_iterations.png

Samples at iteration 30k

samples_30000.png

Samples at iteration 23k

samples_23000.png

As we can see, the results at iteration 23k are better than those at 30k; however, many of the generated images look birdlike from a distance but appear unnatural upon zooming in.

Below I have plotted interpolations at iterations 30k and 23k, and it is clear that the interpolations at 23k are better. The colours are closer to what would be seen in nature with birds, and there is more variation from varying just the first two dimensions of the latent space. I would say that at 30k there is hardly any disentangling, while at 23k the disentangling is slightly better.

Interpolations at iteration 30k

interpolations_30000.png

Interpolations at iteration 23k

interpolations_23000.png
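The interpolation grids above follow a standard recipe: fix a random latent vector and sweep two of its dimensions over a range, decoding each variant. A minimal numpy sketch of that recipe (the function name, latent size of 128, and grid size here are illustrative assumptions, not this repo's actual code):

```python
import numpy as np

def latent_grid(z, dims=(0, 1), lo=-1.0, hi=1.0, steps=5):
    """Return a (steps*steps, len(z)) batch: copies of a base latent
    vector z with two chosen dimensions swept over [lo, hi]."""
    values = np.linspace(lo, hi, steps)
    batch = []
    for a in values:
        for b in values:
            v = z.copy()
            v[dims[0]] = a
            v[dims[1]] = b
            batch.append(v)
    return np.stack(batch)

z = np.zeros(128)                # base latent code (size is an assumption)
grid = latent_grid(z, steps=5)   # 25 latent codes; feed each through G
```

Each row of `grid` would then be passed through the generator, and the decoded images tiled into the figures shown above.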

Discussion: Since the loss function is binary cross-entropy, a sigmoid is involved, which makes vanishing gradients a problem. While in this round of training I was lucky enough to witness a period of stable training, the model eventually did run into vanishing gradients, causing training to worsen. The problem could have arisen in the generator's loss (directly producing poorer samples) or in the discriminator's (weaker feedback leading to poorer samples from the generator).
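For reference, the two objectives from Algorithm 1 of [1] can be written out numerically. This is a minimal numpy sketch of the formulas with made-up discriminator outputs, not the training code from this repo:

```python
import numpy as np

def d_loss_vanilla(d_real, d_fake, eps=1e-8):
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # written here as a loss to minimize.
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def g_loss_vanilla(d_fake, eps=1e-8):
    # Non-saturating generator loss: maximize log D(G(z)).
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8, 0.95])  # D confident on real images (made up)
d_fake = np.array([0.1, 0.2, 0.05])  # D confident fakes are fake (made up)
```

When the discriminator saturates (outputs very near 0 or 1), the sigmoid that produced these probabilities has near-zero slope, which is exactly the vanishing-gradient failure mode described above.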

### LSGAN loss

The LSGAN loss is implemented as per equation (2) of [2] as the loss for the generator and discriminator, with c = 1.
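As a sketch of that objective: with least-squares targets b = 1 for real and a = 0 for fake (an assumption of this sketch) and c = 1 as above, the losses reduce to the following illustrative numpy, not the repo's implementation:

```python
import numpy as np

def d_loss_lsgan(d_real, d_fake, a=0.0, b=1.0):
    # Least-squares targets: real -> b, fake -> a. No sigmoid is
    # applied, so the scores are the discriminator's raw outputs.
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)

def g_loss_lsgan(d_fake, c=1.0):
    # Generator pushes D's score on fakes toward c.
    return 0.5 * np.mean((d_fake - c) ** 2)
```

Because the penalty is quadratic in the raw score, gradients stay informative even for samples the discriminator classifies confidently, unlike the saturating BCE case.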

The final FID attained is 52.57.

fid_vs_iterations.png

The FID is much better than with the simple GAN loss. I also verified through unit tests that my upsample and downsample operations were inverses of each other, so I'm confident the implementation is correct.

Samples at iteration 30k

samples_30000.png

These samples are much better than those generated by the previous model at 30k iterations, though some strange samples remain. The ones that do look like birds are more natural-looking and believable.

Samples at iteration 29k

samples_29000.png

Interpolations at iteration 30k

interpolations_30000.png

At iteration 30k, but varying two latent dims over (-3, 3) instead of (-1, 1)

interpolations_14.png

The latent space swept over (-1, 1) (first image) looks somewhat birdlike but does not seem particularly disentangled. That said, the choice of the first two latent dimensions is arbitrary; unlike a deterministic dimensionality-reduction algorithm such as PCA, there is no guarantee that the first two dimensions are the most meaningful.

Discussion: This version of the GAN loss was more stable because it uses an MSE loss without any sigmoid activation, yielding useful gradients more consistently, since the vanishing-gradient problem is avoided.

### WGAN-GP loss

Here I use the generator and discriminator losses from Algorithm 1 of the WGAN-GP paper [3].

The final FID obtained is 38.6.

fid_vs_iterations.png

Samples at iteration 30k

samples_30000.png

Interpolations at iteration 30k

interpolations_30000.png

At iteration 30k, but varying two latent dims over (-3, 3) instead of (-1, 1)

interpolations_14.png

Here we got a little lucky! Although the first two latent dimensions are altered in both attempts, the rest of the randomly generated vector would not have been the same between the two images above. Nevertheless, the generations clearly show a birdlike-to-batlike variation while keeping the sky constant. This suggests that this model is better at latent-space disentanglement than the previous two.

Discussion: The previous loss function overcame the vanishing-gradient problem by using an MSE loss, but two issues remain unaddressed:

  • MSE loss can still produce exploding gradients, as gradients are unclipped, leading to training instability.
  • MSE loss never truly converges, as its gradients vanish as the prediction approaches the target (which is not a problem with BCE loss).

Hence the WGAN-GP loss drops the sigmoid altogether: the critic outputs an unbounded score, and an additional penalty pushes the norm of the critic's gradient, taken at points interpolated between real and fake samples, toward 1.0. Because the gradient norm is explicitly driven toward 1, this loss avoids both vanishing and exploding gradients, giving the most stable training we have seen among the three losses.
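As a formula-level sketch of Algorithm 1 of [3]: the critic loss is a Wasserstein term plus the gradient penalty. In the real algorithm the gradient norms come from autograd on samples interpolated between real and fake batches; here they are passed in as precomputed numbers to keep the sketch dependency-light (an assumption of this illustration, not the repo's training code):

```python
import numpy as np

def critic_loss_wgan_gp(d_real, d_fake, grad_norms, lam=10.0):
    # Wasserstein critic term plus a gradient penalty that pushes the
    # critic's gradient norm toward 1 on interpolated samples.
    wasserstein = np.mean(d_fake) - np.mean(d_real)
    penalty = lam * np.mean((grad_norms - 1.0) ** 2)
    return wasserstein + penalty

def g_loss_wgan(d_fake):
    # Generator maximizes the critic's score on fakes.
    return -np.mean(d_fake)
```

When the critic already satisfies the unit-gradient-norm constraint, the penalty term contributes nothing and only the Wasserstein term drives training.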

## Variational Autoencoders

### AutoEncoders

loss_curve_final.png

#### Latent Dim 16

Reconstruction quality at epoch 19 of training:

epoch_19_recons.png

#### Latent Dim 128

Reconstruction quality at epoch 19 of training:

epoch_19_recons.png

#### Latent Dim 1024

Reconstruction quality at epoch 19 of training:

epoch_19_recons.png

The latent size 1024 performs best, primarily because the image details are subjected to less compression. Much as when we reduce image file sizes the first things lost are high-frequency details, the smaller the latent dimension, the more the model must prioritize low-frequency information, such as the general shape of an airplane rather than the exact texture on its body.
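To make the compression argument concrete: a 32x32 RGB input has 3072 values, so each latent size implies a fixed compression factor (this assumes the bottleneck is the only compression, as in this sketch):

```python
# A 32x32 RGB image has 32 * 32 * 3 = 3072 values.
input_dim = 32 * 32 * 3
for latent_dim in (16, 128, 1024):
    factor = input_dim / latent_dim
    # 16 -> 192x, 128 -> 24x, 1024 -> 3x compression
    print(f"latent {latent_dim:4d}: {factor:.0f}x compression")
```

At 192x compression the model can only afford coarse, low-frequency structure, which matches the blur seen in the latent-16 reconstructions.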

### Variational Auto-Encoders

loss_curve.png

loss_curve_kl.png

Reconstruction and sample plots from epoch 19

epoch_19_samples.png

epoch_19_recons.png
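The curves above come from optimizing a reconstruction term plus a KL term between the encoder's Gaussian and a standard normal prior. The reparameterization trick and the closed-form KL can be sketched generically (following [4]; the function names and latent size of 16 are illustrative assumptions, not this repo's code):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(16), np.zeros(16), rng)  # a sample from the prior
```

The KL term is what lets the trained decoder generate the "samples" plots from pure prior noise, something a plain autoencoder cannot do.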

### Beta Variational Auto-Encoder

#### Tuning beta

Comparing the performance of models with beta values 0.8, 1.0, and 1.2.

Recon Loss

| Beta = 0.8 | Beta = 1.0 | Beta = 1.2 |
| --- | --- | --- |
| loss_curve.png | loss_curve.png | loss_curve.png |

loss_curve_vae_final.png

KL Loss

| Beta = 0.8 | Beta = 1.0 | Beta = 1.2 |
| --- | --- | --- |
| loss_curve.png | loss_curve.png | loss_curve.png |

loss_curve_vae_final_kl.png

Samples

| Beta = 0.8 | Beta = 1.0 | Beta = 1.2 |
| --- | --- | --- |
| loss_curve.png | loss_curve.png | loss_curve.png |

Reconstructions

| Beta = 0.8 | Beta = 1.0 | Beta = 1.2 |
| --- | --- | --- |
| loss_curve.png | loss_curve.png | loss_curve.png |

Discussion: The reconstruction loss of beta = 0.8 is the best, while the KL loss of beta = 1.2 is the best. This makes sense, as the beta parameter controls the weight of the KL term in the total loss. Regarding sample quality, all three yield fairly blurry results; that said, beta = 0.8 gives more intricate reconstructions with better detail, while beta = 1.2 gives results that better respect the overall distribution of examples, though you really have to squint hard and tilt your head to make out any of the CIFAR-10 classes.

Note: For Beta = 0.0 the VAE reduces to an auto-encoder.
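The weighting described above is a one-line change to the VAE objective; a minimal sketch, where `recon_loss` and `kl_loss` stand in for already-computed scalar terms (an assumption of this illustration):

```python
def beta_vae_loss(recon_loss, kl_loss, beta):
    # beta scales the KL term: beta = 0 recovers a plain autoencoder
    # objective, beta = 1 the standard VAE ELBO, and beta > 1 trades
    # reconstruction fidelity for a better-matched latent distribution.
    return recon_loss + beta * kl_loss
```

This directly explains the trends in the tables: lowering beta to 0.8 lets the optimizer spend more capacity on reconstruction, while raising it to 1.2 tightens the KL at reconstruction's expense.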

#### Linear schedule for beta

epoch_19_samples.png

loss_curve.png

loss_curve_kl.png

Discussion (comparison with the vanilla VAE): Both the quality of the reconstructions and the reconstruction loss value are much better than the vanilla VAE's. There is more detail and better high-frequency information, despite the additional KL loss, which could have compromised reconstruction quality.
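A linear warm-up for beta can be written as a small helper; the 20-epoch ramp and target value below are illustrative assumptions, not necessarily the exact schedule used here:

```python
def linear_beta(epoch, num_epochs=20, beta_max=1.0):
    # Ramp beta linearly from 0 at epoch 0 to beta_max at the final epoch,
    # so early training behaves like a plain autoencoder (sharp
    # reconstructions) before the KL term is weighted in fully.
    return beta_max * epoch / max(num_epochs - 1, 1)
```

Starting near beta = 0 is consistent with the improved reconstructions observed above: the network first learns to encode detail, then gradually regularizes the latent space.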

## Relevant papers

[1] Generative Adversarial Nets (Goodfellow et al, 2014): https://arxiv.org/pdf/1406.2661.pdf

[2] Least Squares Generative Adversarial Networks (Mao et al, 2016): https://arxiv.org/pdf/1611.04076.pdf

[3] Improved Training of Wasserstein GANs (Gulrajani et al, 2017): https://arxiv.org/pdf/1704.00028.pdf

[4] Tutorial on Variational Autoencoders (Doersch, 2016): https://arxiv.org/pdf/1606.05908.pdf

[5] Understanding disentangling in β-VAE (Burgess et al, 2018): https://arxiv.org/pdf/1804.03599.pdf

This work was done toward the completion of Visual Learning and Recognition (16-824) at the CMU Robotics Institute.
