\section{Methodology}\label{methodology}
A straightforward approach to categorization and to measuring object-to-category
distances would be to use a k-NN classifier with a Euclidean
metric. However, with few data points a comparison-based approach such as
k-NN is sensitive to outliers, and comparisons in high dimensions are
complicated by the curse of dimensionality. In addition, the intrinsic
metric of the data might not be reflected by the Euclidean metric.
A more sensible approach is therefore to learn the metric from the
data, and to do so such that irrelevant features are discounted. To this end
we employ a specific family of metric learning techniques, called
Mahalanobis metric learning, that learns a transform of the data such
that similar items are close and dissimilar items are far apart.
\input{metric_learning_explanation}
\subsection{Mahalanobis Metric
Learning}\label{mahalanobis-metric-learning}
Mahalanobis metric learning algorithms are a family of metric learning
algorithms that try to learn a positive semidefinite (PSD) matrix,
\(\mathbf{M}\), such that a metric can be specified through the squared distance
\begin{equation}
D(x,y) = (x-y)^{T} \mathbf{M} (x-y) = \parallel \mathbf{L}(x-y) \parallel^{2}_{2} \enspace ,
\end{equation}
where the second equality follows since \(\mathbf{M}\) is PSD and can be
considered a Gram matrix, that is,
\(\mathbf{M}=\mathbf{L}^{T}\mathbf{L}\). Applying \(\mathbf{M}\) in the
Euclidean metric is thus equivalent to applying a linear transform and
then computing the Euclidean distance. The common theme of these metric
learning algorithms is that they work on class-labeled data, trying to
find a matrix that pulls similar items closer while pushing dissimilar
items further away. See Kulis \cite{Kulis:2012tx} for an
excellent survey.
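As a concrete illustration of this equivalence, the short Python sketch below (illustrative only, not part of our implementation; all names and sizes are placeholders) computes the squared distance once through \(\mathbf{M}\) and once through the factor \(\mathbf{L}\).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
D = 5
L = rng.standard_normal((3, D))   # any real d x D matrix L
M = L.T @ L                       # M = L^T L is positive semidefinite

x, y = rng.standard_normal(D), rng.standard_normal(D)
diff = x - y

d_M = diff @ M @ diff             # (x - y)^T M (x - y)
d_L = np.sum((L @ diff) ** 2)     # ||L (x - y)||_2^2
assert np.isclose(d_M, d_L)       # both forms give the same value
\end{verbatim}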
In our previous work \cite{Hjelm:2015hw} we used the Large Margin
Nearest Neighbor (LMNN) model of Weinberger et al.
\cite{Weinberger:2009to} to learn metrics over grasp affordance
features. We forced the learned matrix, \(\mathbf{M}\), to be diagonal,
meaning that it discarded information about correlations between
features in favor of reducing overfitting and improving accuracy. The
diagonal entries can be interpreted as feature weights, and although this gave
favorable results in terms of feature selection and grasp execution, it
was unsatisfactory for modeling dependencies between features.
Here we instead use the Large Margin Component Analysis (LMCA) model of
Torresani et al. \cite{Torresani:2006wb}, who formulate an objective
function similar to that of \cite{Weinberger:2009to} but carry out the
optimization directly over \(\mathbf{L}\) using gradient descent:
\small
\begin{equation}
\label{lossfun}
\begin{aligned}
& \epsilon(\mathbf{L}) = \sum\limits_{i,i \leadsto j} ||\mathbf{L} (x_i-x_j)||^{2} \\
& + c \sum\limits_{i,i \leadsto j,l}(1-y_{il})h(||\mathbf{L} (x_i-x_j)||^{2} - || \mathbf{L} (x_i-x_l)||^{2} +1 ) \\
\end{aligned}
\end{equation}
\begin{equation}
\label{lossfunderivative}
\begin{aligned}
& \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} = 2 \mathbf{L} ( \sum\limits_{i,i \leadsto j} (x_i-x_j)(x_i-x_j)^{T} \\
& + c \sum\limits_{i,i \leadsto j,l}(1-y_{il}) [(x_i-x_j)(x_i-x_j)^{T} - (x_i-x_l)(x_i-x_l)^{T} ] \\
& h'(||\mathbf{L} (x_i-x_j)||^{2} - || \mathbf{L} (x_i-x_l)||^{2} +1 ) ) \\
\end{aligned}
\end{equation}
\normalsize
Here \(i \leadsto j\) means that \(j\) is among the \(k\) nearest neighbors of \(i\)
belonging to the same class, \(y_{il}\) is a binary variable that is one
if \(i\) and \(l\) have the same label and zero otherwise, \(c\) is a
constant, and \(h\) is the smooth hinge loss \cite{Rennie:2005ds}. The
first term thus penalizes large distances to same-class neighbors, while the
second term penalizes any non-class member that is not at least a unit margin
farther away than the \(k\) target neighbors.
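For concreteness, a minimal and unoptimized Python sketch of Eq.~\ref{lossfun} follows; the function and variable names are ours, and the exact parameterization of the smooth hinge is an assumption (one common smoothed version of \(\max(s,0)\)), not a prescription from the references.
\begin{verbatim}
import numpy as np

def smooth_hinge(s):
    # One common smooth approximation of max(s, 0); the exact
    # parameterization is an assumption, not taken from the paper.
    if s <= 0.0:
        return 0.0
    if s < 1.0:
        return 0.5 * s * s
    return s - 0.5

def lmca_loss(L, X, labels, target_neighbors, c=1.0):
    # X: (N, D) data matrix; labels: (N,) class labels;
    # target_neighbors[i]: indices j with i ~> j (same-class k-NN of i).
    pull, push = 0.0, 0.0
    for i, neighbors in enumerate(target_neighbors):
        for j in neighbors:
            d_ij = np.sum((L @ (X[i] - X[j])) ** 2)
            pull += d_ij                          # first term
            for l in range(len(X)):
                if labels[l] != labels[i]:        # (1 - y_il) = 1
                    d_il = np.sum((L @ (X[i] - X[l])) ** 2)
                    push += smooth_hinge(d_ij - d_il + 1.0)
    return pull + c * push
\end{verbatim}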
By optimizing over \(\mathbf{L}\) instead of \(\mathbf{M}\), for a
dataset of dimension \(D\), we can let \(\mathbf{L}\) be \(d \times D\),
where \(d\) is a target dimension that can be set by cross-validation. Learning
\(\mathbf{L}\) then amounts to learning \(dD\) variables instead of the
\(D^2\) required when learning \(\mathbf{M}\). This reduces overfitting for small but
high-dimensional datasets, and in addition the dimensionality reduction is
learned simultaneously, whereas in \cite{Weinberger:2009to} the
dimensionality reduction is done using PCA before learning
\(\mathbf{M}\).
The drawback of this approach is that the objective function is no
longer convex, so gradient descent may end up in local minima. Torresani et al.
\cite{Torresani:2006wb} showed empirically that this is not a problem in
practice: the algorithm consistently converged to adequate solutions, and
the kernelized version outperformed k-NN and LMNN on several benchmark datasets.
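Schematically, and under the assumption of a helper \texttt{lmca\_gradient} implementing the analytic gradient of Eq.~\ref{lossfunderivative} (a placeholder name of ours, not defined here), the optimization over a rectangular \(\mathbf{L}\) can be sketched as plain gradient descent:
\begin{verbatim}
import numpy as np

def fit_lmca(X, labels, target_neighbors, d, steps=200, lr=1e-3, c=1.0):
    # Gradient descent on a rectangular d x D matrix L; only d*D
    # parameters are learned instead of D*D. The target dimension d
    # would be chosen by cross-validation.
    D = X.shape[1]
    L = 0.01 * np.random.default_rng(0).standard_normal((d, D))
    for _ in range(steps):
        # lmca_gradient is a placeholder for the analytic gradient.
        L -= lr * lmca_gradient(L, X, labels, target_neighbors, c)
    return L
\end{verbatim}
Since the objective is non-convex, the result can depend on the initialization.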
Learning a matrix that re-represents the input data opens up analysis
of the correlations between different features by considering
\(\mathbf{M}\). The magnitudes of the column vectors of \(\mathbf{L}\) also serve as an
importance measure for each of the features. If we are interested in
learning which features are important for a specific affordance, we
want all columns of \(\mathbf{L}\) connected to unimportant features
to have zero or close to zero magnitude.
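In code this importance measure is nothing more than the column norms of \(\mathbf{L}\); a minimal sketch (names are ours):
\begin{verbatim}
import numpy as np

def feature_importance(L):
    # Column j of L maps input feature j into the learned space, so
    # its Euclidean norm can be read as the weight given to feature j.
    # Features with (near-)zero column norm are effectively discarded.
    return np.linalg.norm(L, axis=0)
\end{verbatim}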
We therefore introduce the elastic net regularization of
\cite{Cimpoi:2015eg}, which penalizes the column vectors of \(\mathbf{L}\) using
the sum of their \(L_1\) and \(L_2\) norms. If we view the column vectors
of \(\mathbf{L}\) in Eq. \ref{lossfun} as a form of \(D\) linear regression
functions that we optimize simultaneously, it makes sense to
regularize them against cancellation effects and to curb any blow-up of a
specific column. In our experience LMNN is quite sensitive to the
parameters and easily overfits when there are few data points compared to
the number of parameters, that is, \(N \ll Dd\), and
regularization helps to avoid some of this. In addition, the elastic net
regularization has the benefit of keeping groups of correlated variables,
whereas \(L_1\) regularization alone usually keeps only one of them
\cite{Cimpoi:2015eg}.
The modified elastic net regularized loss function now has the form,
\small
\begin{equation}
\label{lossfunreg}
\begin{aligned}
& \epsilon_{new}(\mathbf{L}) = \beta \left( \lambda_1 \sum_{j=1}^{D} \left\lVert L_j \right\rVert_1 + \lambda_2 \sum_{j=1}^{D} \left\lVert L_j \right\rVert_{2}^{2} \right) + \epsilon(\mathbf{L}) \\
\end{aligned}
\end{equation}
\begin{equation}
\label{lossfunregderivative}
\begin{aligned}
& \frac{d\epsilon_{new}(\mathbf{L})}{d\mathbf{L}} = \beta(\lambda_1 \tilde{L}_1 + \lambda_2 \tilde{L}_2) + \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \\
& (\tilde{L}_1)_{ij} = \operatorname{sign}(L_{ij}), \; \; \; (\tilde{L}_2)_{ij} = 2 L_{ij},
\end{aligned}
\end{equation}
\normalsize
where \(\lambda_1 + \lambda_2 = 1\). When \(L_{ij}\) is zero the
derivative of the \(L_1\) norm is singular. We define its derivative at zero
as
\begin{equation}
\left( \tilde{L}_1 \right)_{ij} =
\begin{cases}
1 & \left( \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \right)_{ij} > \lambda \\
-1 & \left( \frac{d\epsilon(\mathbf{L})}{d\mathbf{L}} \right)_{ij} < \lambda \\
\end{cases}
\end{equation}
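A direct transcription of the regularized gradient and this zero-entry case into Python reads as follows (a sketch; the use of \(\lambda_1\) as the threshold \(\lambda\) and all names are our assumptions):
\begin{verbatim}
import numpy as np

def elastic_net_gradient(L, grad_eps, beta, lam1, lam2):
    # grad_eps: the unregularized gradient d eps(L) / d L.
    L1_tilde = np.sign(L)              # derivative of the L1 term
    # Where L_ij == 0 the L1 derivative is singular; following the
    # case definition above we set it to +/-1 depending on how the
    # loss gradient compares to a threshold (lam1 is an assumption).
    zero = (L == 0)
    L1_tilde[zero & (grad_eps > lam1)] = 1.0
    L1_tilde[zero & (grad_eps < lam1)] = -1.0
    L2_tilde = 2.0 * L                 # derivative of the squared L2 term
    return beta * (lam1 * L1_tilde + lam2 * L2_tilde) + grad_eps
\end{verbatim}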
\begin{comment}
[We should maybe investigate if the penalization has the same effect on the eigenvalues as ridge regression ]
\end{comment}
\subsection{Features}\label{features}
We want to use features that have at least a weak semantic meaning to a
human observer and that are as generic as possible. Affordances can, from
a human perspective, be understood in terms of both global and local
properties of the object, so we try to cover both. Due to lack of space
we provide only a short list, with motivations where needed.
\small
\begin{itemize}
\tightlist
\item
\textbf{Object Volume} - The volume of the convex hull enclosing the
object.
\item
\textbf{Primitive Shape} - A scoring function for how well the object
resembles a spherical, cylindrical, or cuboid shape.
\item
\textbf{Elongatedness} - The ratio of the minor axes of the object to
the major axis.
\item
\textbf{Opening} - Whether our opening detection algorithm detects an
opening.
\item
\textbf{Intensity and Gradient Filters} - We apply first- and
second-order gradient filters over the 2D window containing the object
and, in addition, extract the intensity values of the image. Since
pixels correspond to points in the point cloud of the object, we map the
values onto each point. We then compute a count-normalized histogram
(CNH) over all the points of the object for each of the filter responses
and the intensity values (see the sketch after this list).
\item
\textbf{Color Quantization} - We segment the image using the
segmentation algorithm of \cite{Felzenszwalb:2004vi} and encode each
patch with its closest color from a chart of 15 different colors, in
CIELab space. Each point in the point cloud of the object thus gets a
quantized color assigned to it. We then form a CNH over all the points
of the object.
\item
\textbf{Color Uniformity} - We compute the entropy, mean and variance
of the color quantization.
\item
\textbf{Bag of Words Histogram over Fast Point Feature Histogram
(FPFH)} \cite{Rusu:2009hf} - We encode the local curvature of the
object with a Bag of Words model over the FPFH features of the point
cloud, using a small set of 40 keywords. We then extract the histogram
of the keywords for the object.
\item
\textbf{Texture} - We employ an approach similar to
\cite{Cimpoi:2015eg} by learning a Fisher vector representation from
dense SIFTs over a set of labeled images containing the following
textures: glass, carton, porcelain, metal, plastic and wood. We also
extract the output of the fifth layer of a GoogLeNet CNN and
concatenate it with the Fisher vectors. These two features
complement each other: the Fisher vector is trained on textures, while
the GoogLeNet output carries object-level information, and certain
objects are more likely to be made of a specific material. We use the
final feature vector to train 6 1-against-\(K\) linear SVMs. The
outputs of the 6 SVMs when classifying an object can be considered
scores of how similar the object is to each of the textures, and we use
these as the feature vector.
\end{itemize}
\normalsize
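Several of the features above are count-normalized histograms (CNHs) over per-point values. The sketch below fixes the interpretation we assume: a histogram of per-point values normalized by the number of points, so that objects of different sizes are comparable (bin count and value range are placeholders, not values from our pipeline).
\begin{verbatim}
import numpy as np

def count_normalized_histogram(values, bins, value_range):
    # Histogram of one value per 3D point, normalized by the number
    # of points of the object.
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    return counts / max(len(values), 1)

# Example usage for intensity values mapped onto an object's points:
# cnh = count_normalized_histogram(intensities, bins=16,
#                                  value_range=(0, 255))
\end{verbatim}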