We assume our dataset has been corrupted by a set of transformations parameterized by a vector
We assume there is some underlying true representation of our data
For each training sample, we know that it belongs to a transformation class parameterized by a single index, where all samples in this class share the same
At test time, we want to learn a function
Note: We may also include the true
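As a rough formalization of the setup above (the symbols here are placeholders of my own, since the original notation is not reproduced): let $x$ be an uncorrupted sample with label $y$ and let $T_\theta$ denote a transformation with parameter vector $\theta$. Each training image is observed in corrupted form, all samples in transformation class $i$ share the same $\theta_i$, and the test-time goal, as I read it from the camera example below, is a function that classifies a new image given a few example images sharing its $\theta_i$:

    \tilde{x} = T_{\theta_i}(x), \qquad
    f\big(\tilde{x}_{\text{new}};\; \{\tilde{x}_1, \dots, \tilde{x}_K\}\big) \approx y_{\text{new}},
    \quad \text{where } \tilde{x}_1, \dots, \tilde{x}_K, \tilde{x}_{\text{new}} \text{ share the same } \theta_i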
To solve this problem, we propose implicitly learning a useful representation of
Define
Note: This may be trained end to end as it is fully differentiable, but
We suggest that
Note: While we could define a new model
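A minimal sketch of how the pieces could fit together (the names g and f, the layer sizes, grayscale inputs, and the conditioning-by-concatenation choice are my assumptions for illustration; only the gamma combination function is named later in these notes): an encoder g maps each example image to a transformation embedding, gamma merges the K example embeddings, and a classifier f is conditioned on the result. Because every step is differentiable, the whole thing can be trained end to end with ordinary cross-entropy on f's output.

    import torch
    import torch.nn as nn

    class TransformationEncoder(nn.Module):
        """g: maps one transformed example image to a transformation embedding."""
        def __init__(self, embed_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )

        def forward(self, x):                  # x: (B, K, 1, H, W) example images
            b, k = x.shape[:2]
            z = self.net(x.flatten(0, 1))      # one embedding per example image
            return z.view(b, k, -1)            # (B, K, D)

    def gamma(z):
        """Combination rule: here just the mean over the K examples (symmetric)."""
        return z.mean(dim=1)                   # (B, D)

    class ConditionedClassifier(nn.Module):
        """f: classifies a new image, conditioned on the combined transformation embedding."""
        def __init__(self, embed_dim=64, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(64 + embed_dim, n_classes)

        def forward(self, x_new, z_combined):  # x_new: (B, 1, H, W)
            h = self.features(x_new)
            return self.head(torch.cat([h, z_combined], dim=1))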
For an example of where this may be applicable, imagine the set of all cameras with all their varying parameters (for an extreme case we could even introduce wildly divergent cameras, such as some function of event-based signals, or cameras scrambled by a filter). We now want to perform a classification task with any of these cameras: we get a few pictures taken with one of them as examples and now need to classify images from it as cats or dogs.
In order to test this method, we first modify the standard MNIST task by defining a set of transformations applied to the images.
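As a concrete sketch of what such a transformation set could look like (the particular transformations and parameter ranges are illustrative assumptions, not fixed choices), each transformation class i draws one parameter vector theta_i and applies it to every image assigned to that class:

    import torch
    import torchvision.transforms.functional as TF

    def sample_theta():
        """Draw one parameter vector theta for a transformation class (illustrative ranges)."""
        return {
            "angle":      float(torch.empty(1).uniform_(-90.0, 90.0)),
            "scale":      float(torch.empty(1).uniform_(0.7, 1.3)),
            "brightness": float(torch.empty(1).uniform_(0.5, 1.5)),
            "invert":     bool(torch.rand(1) < 0.5),
        }

    def apply_theta(img, theta):
        """Apply a fixed theta to one MNIST image tensor of shape (1, 28, 28) in [0, 1]."""
        img = TF.affine(img, angle=theta["angle"], translate=[0, 0],
                        scale=theta["scale"], shear=0.0)
        img = (img * theta["brightness"]).clamp(0.0, 1.0)
        if theta["invert"]:
            img = 1.0 - img
        return img

The K example images and the test images of one episode would all pass through apply_theta with the same theta_i, so the corruption is consistent within a class.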
We then increase the complexity by performing classification of ImageNet through transformations.
As a test of effectiveness, we also train a normal network with a similar number of parameters that does not use any example inputs, varying the severity and variation of the transformations.
I think this proof of concept is better served by starting with just a plain contrastive loss on the transformation, without any gamma function. Once we know how effective that is, we can then introduce combination rules that allow us to take multiple examples into account. We can explore plain combinations with means or other symmetric functions.
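A minimal sketch of that plain contrastive variant (the InfoNCE-style loss and the temperature are my assumptions; the notes do not pin down an exact loss): two images corrupted with the same theta form a positive pair, images under different thetas act as negatives, so the encoder is pushed to cluster embeddings by transformation rather than by image content.

    import torch
    import torch.nn.functional as F

    def transformation_contrastive_loss(z_a, z_b, temperature=0.1):
        """InfoNCE-style loss: z_a[i] and z_b[i] embed two images sharing the same
        transformation parameters theta_i; every other row serves as a negative."""
        z_a = F.normalize(z_a, dim=1)                 # (N, D)
        z_b = F.normalize(z_b, dim=1)                 # (N, D)
        logits = z_a @ z_b.t() / temperature          # (N, N) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)

If this alone separates transformation classes cleanly, the combination rules can then be layered on top of the same encoder.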
We could also explore more advanced combination rules such as attention.
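One way such an attention-based rule could look (a learned-query pooling layer is an assumption; any other attention variant would do) is a small module that replaces the mean in gamma:

    import torch
    import torch.nn as nn

    class AttentionCombiner(nn.Module):
        """gamma via attention: a learned query attends over the K example embeddings,
        so informative examples can be weighted more heavily than a plain mean allows."""
        def __init__(self, embed_dim=64, n_heads=4):
            super().__init__()
            self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
            self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

        def forward(self, z):                         # z: (B, K, D) example embeddings
            q = self.query.expand(z.size(0), -1, -1)  # (B, 1, D)
            combined, _ = self.attn(q, z, z)          # (B, 1, D)
            return combined.squeeze(1)                # (B, D)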
We can also explore this separately in something like learning embeddings of videos. Better yet, we could use a masking method to remove most of the information in the image and then use the combination step to recover it, as a way to validate the method, which could then be applied to things like video scene representation or more complex cases like transformation representation.
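A rough sketch of that masking validation (the patch-masking scheme, the per-image encoder, and the reconstruction decoder are all assumptions): embed several heavily masked copies of the same image, combine them, and train a decoder to reconstruct the original; reconstruction should only succeed if the combination step actually pools information across the copies.

    import torch
    import torch.nn.functional as F

    def random_patch_mask(img, patch=4, keep_frac=0.25):
        """Zero out most patches of an image tensor (C, H, W), keeping a random fraction."""
        c, h, w = img.shape
        keep = (torch.rand(h // patch, w // patch) < keep_frac).float()
        keep = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
        return img * keep

    def masked_recovery_loss(encoder, gamma, decoder, img, n_views=4):
        """Embed several masked views, combine them with gamma, reconstruct the image.
        Assumes encoder: (K, C, H, W) -> (K, D) and decoder: (1, D) -> (1, C, H, W)."""
        views = torch.stack([random_patch_mask(img) for _ in range(n_views)])
        z = encoder(views)                       # one embedding per masked view
        recon = decoder(gamma(z.unsqueeze(0)))   # combine across views, then decode
        return F.mse_loss(recon, img.unsqueeze(0))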
For this task it would be interesting to explore where to exchange information between the representations. For example, in a convnet, would it be useful to have an intermediate layer, before reaching the final representation, where information can be exchanged?
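A sketch of one place such an exchange could live (the cross-example mean exchange and the 1x1 mixing convolution are assumptions; cross-attention would be the heavier alternative): after an intermediate conv block, each example's feature map is concatenated with statistics pooled over the other examples before the final representation is computed.

    import torch
    import torch.nn as nn

    class ExchangeLayer(nn.Module):
        """Inserted mid-convnet: mixes information across the K examples of one episode
        by concatenating each example's features with the mean over all K examples."""
        def __init__(self, channels):
            super().__init__()
            self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, h):                     # h: (B, K, C, H, W) intermediate features
            shared = h.mean(dim=1, keepdim=True).expand_as(h)   # pooled across examples
            mixed = torch.cat([h, shared], dim=2)               # (B, K, 2C, H, W)
            b, k, c2, H, W = mixed.shape
            out = self.mix(mixed.view(b * k, c2, H, W))
            return out.view(b, k, -1, H, W)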
I guess this task is something like multi-instance representation learning or fusion representation learning?
There's also the idea of grounding the system more in Bayesian statistics by treating the representation as a transformation from a uniform distribution to an unknown one: we sample the space and treat the result as a system we can perform filtering on. I'm not exactly sure of the training process. Maybe we generate a mixture model and then have the contrastive objective approximate a KL divergence, maximizing it for the positive example and minimizing it for the negative example. I guess the number of samples in the mixture model would be a hyperparameter. This is most useful if the mixture model can sometimes be well approximated by a completely disjoint set of distributions, as otherwise we could just view this as manually defining part of the latent-space manifold. Although in this case we are also trying to get the network to quantify its uncertainty by predicting the standard deviations of the transformed Gaussians, so maybe even then it couldn't be approximated by a different latent-space geometry.
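For reference, if each embedding parameterizes a diagonal Gaussian $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ as suggested, the KL divergence such an objective would need has a closed form between two Gaussians (how the positive and negative terms are combined into a loss is left open above, so this is only the building block):

    \mathrm{KL}\big(\mathcal{N}(\mu_1, \mathrm{diag}(\sigma_1^2)) \,\|\, \mathcal{N}(\mu_2, \mathrm{diag}(\sigma_2^2))\big)
      = \sum_d \left[ \ln\frac{\sigma_{2,d}}{\sigma_{1,d}}
        + \frac{\sigma_{1,d}^2 + (\mu_{1,d} - \mu_{2,d})^2}{2\,\sigma_{2,d}^2}
        - \frac{1}{2} \right]

Between mixtures of Gaussians the divergence has no closed form, so it would have to be approximated, for example by sampling, which ties back to the sampling-based view above.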