date: 2022-12-01
htmlwidgets: true

# Anonymize when submitting
authors:
  - name: Anonymous

# authors:
#   - name: Patrick Timons
#     affiliations:
#       name: MIT

# must be the exact same name as your blogpost
bibliography: 2023-12-12-Recovering Latent Variables with VAEs despite Training Bias.bib

toc:
  subsections:
    - name: Training Observations
    - name: Evaluation
    - name: Conclusion and Future Work


# Below is an example of injecting additional post-specific styles.
In particular, we will choose the setting in which our training data is biased,

## Background

VAEs are useful both as encoders for downstream tasks and as generative models. Compared to vanilla autoencoders, they offer significant advantages, since they provide some assurances regarding the distribution of their latent variables. Unlike VAEs, standard autoencoders can have arbitrarily distributed embeddings, making them poor generative models, since there is no straightforward way to sample in latent space such that the generated samples are in distribution with the training data. VAEs are similar to standard autoencoders; however, they are trained with a modified loss function that regularizes the learned embedding space towards an isotropic Gaussian (alternative priors such as Gaussian mixture models exist, but the isotropic Gaussian remains the most popular choice due to its simple parameterization and empirical success). Additionally, instead of simply compressing the input with a neural network during the forward pass, the encoder of a VAE outputs a mean and covariance, defining a distribution from which we sample to obtain our latent variables.
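
To make this concrete, below is a minimal sketch of such an encoder and decoder in PyTorch. It is an illustrative assumption rather than the architecture used in our experiments: the class name `ToyVAE`, the layer sizes, and the choice of a single hidden layer are all hypothetical. The key points it shows are that the encoder outputs a mean and log-variance, and that the reparameterization trick ($$z = \mu + \sigma \odot \epsilon$$) keeps the sampling step differentiable.

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Illustrative VAE: names, sizes, and layers are assumptions, not the experimental setup."""

    def __init__(self, input_dim=441, latent_dim=9, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable with respect to mu and logvar.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar
```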

Since the VAE loss function regularizes our latent variables towards an isotropic Gaussian, encoded data is both disentangled and interpretable. To use a trained VAE as a generative model, we simply sample latent variables i.i.d. from the Gaussian prior and pass them through the VAE decoder to generate samples in distribution with our training data. VAEs also offer significant advantages as encoders, since regularization encourages them to learn factored, disentangled representations of the data. Finally, VAEs are particularly well-suited for interpretability, since regularization encourages each latent variable to capture a unique aspect of the data.
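
Under the same illustrative assumptions as the sketch above, generation amounts to sampling from the prior and decoding:

```python
vae = ToyVAE()
vae.eval()
with torch.no_grad():
    z = torch.randn(64, 9)    # 64 latent codes drawn i.i.d. from the isotropic Gaussian prior
    samples = vae.decoder(z)  # 64 generated images, flattened to shape (64, 441)
```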

## Related Work

There has been significant prior work studying regularization and the choice of prior in VAEs. Notably, $$\beta$$-VAE <d-cite key="higgins2017betavae"></d-cite> introduces the $$\beta$$ parameter to control the degree to which the VAE loss function penalizes the KL divergence between the latent variable distribution and the chosen prior (an isotropic Gaussian in their case). Higgins et al. demonstrate that introducing the $$\beta$$ parameter allows the VAE encoder to learn quantitatively more disentangled latent variables. They introduce a novel quantitative metric to evaluate the disentanglement of latent space and show that $$\beta$$-VAE improves on existing methods. Furthermore, they train a $$\beta$$-VAE on a dataset of faces (CelebA) and qualitatively show that $$\beta$$ regularization allows for the factorization of previously entangled latent variables such as azimuth and emotion.

There have been several iterations on $$\beta$$-VAE, such as Factor-VAE <d-cite key="kim2019disentangling"></d-cite>. Kim and Mnih point out that although $$\beta$$ regularization improves disentanglement in embedding space, it does so at the cost of reconstruction quality. To reduce this trade-off while still encouraging disentanglement, they add a term to the VAE loss function that penalizes the KL divergence between the joint distribution of the latents and the product of their marginals, rather than the divergence from an isotropic Gaussian as in $$\beta$$-VAE.
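
For reference, the $$\beta$$-weighted objective that both works build on can be sketched as follows. This is a common textbook formulation (closed-form KL between a diagonal-Gaussian posterior and an isotropic Gaussian prior), not code from either paper; our own experiments, described later, approximate the KL term with a single Monte Carlo sample instead.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how faithfully the decoder reproduces the input.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta = 1 recovers the vanilla VAE; beta > 1 penalizes deviation from the prior more heavily.
    return recon + beta * kl
```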

Selecting an appropriate data prior is fundamental when performing Bayesian inference. In vanilla VAEs, we often assume an isotropic Gaussian prior for our latent variables; however, this is not always a good assumption, and it can make convergence difficult <d-cite key="miao2022on"></d-cite>. Miao et al. propose InteL-VAE, a VAE architecture capable of learning more flexible latent variables that can satisfy properties such as sparsity even when the data differs significantly from a Gaussian distribution. Their contributions allow for greater customizability of the latent variables while bypassing many of the convergence issues commonplace with other methods that assume non-Gaussian priors.

Since, under ideal conditions, VAEs recover factorized latent variables, causal inference has become a standard setting for their application. Madras et al. propose structured causal models to recover hidden "causal effects" with the aim of improving fairness when presented with biased data <d-cite key="10.1145/3287560.3287564"></d-cite>. They specify a framework in which we want to recover the latent factors so that decision making in applications such as loan assignment and school admissions can be approached fairly. Admittedly, structural causal modeling (SCM) is arguably a better setting for further work on our proposed research question. However, this field is largely outside the scope of the course, so we will only observe that Madras et al. utilize a model where causal factors, which are analogous to our ground-truth latent variables, affect a decision and an outcome, and that they use a Bayesian framework to perform variational inference. Future iterations of our research should borrow methods from this field for maximum impact. Louizos et al. propose the Causal Effect VAE <d-cite key="louizos2017causal"></d-cite>, marrying the adjacent fields and setting the stage for future research.

Although there is plenty of research adjacent to our particular question of interest, $$\beta$$-VAE investigates how $$\beta$$-regularization affects disentanglement, but not robustness to training bias. Other works that investigate the ability of latent variable models to recover the ground truth in the presence of training bias are not concerned with $$\beta$$-regularization. $$\beta$$-regularization has been shown to be effective, in addition to being extremely simple to implement, compared to other regularization techniques. Thus it is an ideal candidate for directed research on how regularization affects VAE robustness to training bias. Our question is novel, supported by adjacent research, and reasonable to implement with the resources available to an undergraduate student.

## Set-up and Methods

### Data

More concretely, suppose that there exists a data-generating function $$\mathcal{G}: Z \to X$$ that generates our training dataset given random variables $$Z \sim p_{\text{data}}$$. For simplicity, our data will be $$n \times n$$ grids of squares, where the intensity of each square is deterministically proportional to its respective random variable. To create our training dataset, we sample $$n^2$$ random variables from an isotropic Gaussian distribution with mean $$\mu$$ and covariance $$I$$. We then apply a sigmoid activation to the random variables so that their values lie in the range [0, 1]. We then create an $$mn \times mn$$ image with an $$m \times m$$ block of pixels for each random variable. Finally, we add Gaussian noise to the image. We choose $$n=3$$, $$m=7$$, and train a VAE for each value of $$\mu$$ in the set $$\{0, 1/2, 1, 3/2, \ldots, 5\}$$.
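
A minimal sketch of this generator is shown below (NumPy; the function name `generate_image` and the noise scale are assumptions, since the exact implementation details are not reproduced here):

```python
import numpy as np

def generate_image(mu, n=3, m=7, noise_std=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Sample n^2 latent variables from an isotropic Gaussian with mean mu and covariance I.
    z = rng.normal(loc=mu, scale=1.0, size=(n, n))
    # Sigmoid activation squashes intensities into [0, 1].
    intensities = 1.0 / (1.0 + np.exp(-z))
    # Expand each latent into an m x m block of pixels, giving an (mn x mn) image.
    image = np.kron(intensities, np.ones((m, m)))
    # Add pixel-level Gaussian noise.
    image = image + rng.normal(scale=noise_std, size=image.shape)
    return image, z  # image: (n*m, n*m) array; z: the ground-truth latent variables
```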


#### Training Data
We train with the Adam optimizer.
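
A sketch of the training loop, reusing the `ToyVAE`, `beta_vae_loss`, and `generate_image` sketches from earlier; the learning rate, batch size, dataset size, and $$\beta$$ value are placeholders, not the settings used in our experiments:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical training set of flattened 21x21 images generated as described above.
images = torch.stack([
    torch.tensor(generate_image(mu=0.5)[0], dtype=torch.float32).flatten()
    for _ in range(10_000)
])
loader = DataLoader(TensorDataset(images), batch_size=128, shuffle=True)

vae = ToyVAE()
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)  # assumed learning rate

for (x,) in loader:
    x_recon, mu, logvar = vae(x)
    loss = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```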

### Training Observations

During the unsupervised training phase, where we train the various VAE models on their respective training sets, we observe that the dataset choice and the penalization of the KL divergence (the $$\beta$$ hyperparameter) have consistent effects on the training curves. The following charts demonstrate that increased penalization of the KL divergence results in higher training loss, as well as noisier training loss and longer convergence times. This is expected, since higher regularization directly increases the loss and its associated noise. We approximate the KL divergence by drawing one sample, which is highly variable but tends to work empirically. We also observe that higher training bias (i.e. a higher mean of the pre-activation data-generating latent variables) results in higher training loss. As we increase this training bias, it becomes harder and harder to disambiguate latent features from noise. Thus models learn uninterpretable latent variables and poor decoders that learn to trivially output the dominating color (white).
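
As an aside, the single-sample estimate referred to above can be sketched as follows (an illustrative formulation, not the exact training code): for one draw $$z \sim q(z \mid x)$$, the estimate is $$\log q(z \mid x) - \log p(z)$$, which is unbiased but high-variance. The training curves follow below.

```python
import math
import torch

def kl_one_sample(mu, logvar):
    # Draw a single z ~ q(z|x) = N(mu, diag(exp(logvar))).
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)
    # Log-densities of the diagonal-Gaussian posterior and the standard-Gaussian prior.
    log_qzx = -0.5 * (((z - mu) / std) ** 2 + logvar + math.log(2 * math.pi)).sum(dim=-1)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
    # Unbiased but noisy estimate of KL(q(z|x) || p(z)), averaged over the batch.
    return (log_qzx - log_pz).mean()
```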

<div class="row mt-3">
<div class="col-md mt-3 mt-md-0">
Another heuristic that we can utilize to estimate the mutual information between



## Conclusion and Future Work

From the collected data, it is visually clear that there exists a relationship between $$\beta$$-regularization and training bias. In both heat maps, the level surfaces are diagonal, indicating that there is some relationship between regularization towards an isotropic Gaussian prior and robustness to training bias. Validation and further experiments are required to legitimize this conclusion; however, these experiments are an indication that conscious regularization can be a useful technique for mitigating training biases of a particular form. At this point, further work is required to interpret the results, since it is not clear why we seem to observe inverse relationships between $$\beta$$-regularization and training bias when we involve the decoder.

It is also worth noting that during pretraining, the VAEs were trained for a fixed number of training steps, not until convergence. Thus it is highly plausible that models with higher $$\beta$$-regularization (i.e. models with $$\beta > 1$$) were not trained to completion, and therefore cannot be fairly evaluated with mutual information estimators without further training. Given our computational and temporal constraints, it was not reasonable to run longer training experiments. Future work will have to validate these findings by pretraining for longer and testing a finer resolution of $$\beta$$ values. Finally, it will be interesting to expand this work to richer datasets such as CelebA and inject training bias by resampling the dataset according to variables such as hair color or skin tone. Once we move beyond the assumptions guaranteed by our toy data, we can reevaluate which relationships hold true as we gradually add the complexity inherent to the real world.
