

Convolutional Neural Networks

These networks are commonly used to identify the subject of an image, but can also be applied to natural language processing problems. This class of neural network uses the concept of a 'window' (receptive field) of a fixed size; each neuron takes as input the features that fall inside that window. In a fully connected feed-forward network, a neuron receives input from all the neurons in the previous layer. In a convolutional neural network, a neuron only receives input from those neurons in the previous layer that fall inside its receptive field window.

When the input has a very large number of features, this is a powerful way to avoid a rapidly increasing number of connections between layers, because a neuron in a later layer is only fully connected to everything in its window instead of everything in the entire previous layer. Like traditional neural networks, a neuron applies a function to its inputs. In the CNN case this function uses a weight vector (with a weight for each value in the window) and a bias. Together these are called a filter and represent a feature of the input. In CNNs it is common to reuse a filter for many neurons.

Each filter gives a measure of how much the inputs resemble a particular feature (e.g. a sentiment-bearing phrase). Filters are initialised with random numbers and are learned by the training algorithm as it trains on data. The filter is convolved (slid) through the input: at each stop the dot product between the filter and the receptive field is calculated and recorded in an activation map.
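A minimal sketch of this sliding dot product in NumPy (the input values, filter length and bias below are arbitrary illustrative choices):

```python
import numpy as np

def convolve_1d(inputs, filter_weights, bias, stride=1):
    # Slide the filter over the input and record the dot product (plus bias)
    # at each stop in an activation map.
    window = len(filter_weights)
    stops = (len(inputs) - window) // stride + 1
    activation_map = np.empty(stops)
    for i in range(stops):
        receptive_field = inputs[i * stride : i * stride + window]
        activation_map[i] = np.dot(filter_weights, receptive_field) + bias
    return activation_map

inputs = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
filter_weights = np.random.randn(3)   # filters start as random numbers
bias = 0.0
print(convolve_1d(inputs, filter_weights, bias))  # 4 stops for a window of 3
```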

Layers

Convolutional Layer

The parameters of this layer are the filters that are applied to the neurons' inputs. The receptive fields are the inputs to each neuron. Each neuron has a filter, and receives one stop of the sliding window as input. The neuron finds the dot product of the filter with the receptive field input, which is where the name 'convolutional' comes from. The outputs are assembled into a feature map.

The feature maps are 2-dimensional. A map shows where a particular feature occurs in the input. The maps are stacked on top of each other to form a 3-dimensional output. The total number of maps is equivalent to the total number of filters or features.
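As a sketch, the whole layer can be written as a loop over filters, each producing its own 2-dimensional feature map; the image size and filter count below are arbitrary illustrative choices:

```python
import numpy as np

def conv_layer(image, filters, biases):
    # image: H x W, filters: K x f x f, biases: K
    k, f, _ = filters.shape
    h, w = image.shape
    out = np.empty((h - f + 1, w - f + 1, k))   # one feature map per filter
    for idx in range(k):
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                receptive_field = image[i:i + f, j:j + f]
                out[i, j, idx] = np.sum(receptive_field * filters[idx]) + biases[idx]
    return out

image = np.random.randn(8, 8)
filters = np.random.randn(4, 3, 3)   # 4 filters -> 4 stacked feature maps
print(conv_layer(image, filters, np.zeros(4)).shape)  # (6, 6, 4)
```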

Non Linearity Layer

CNNs use an activation function just like other networks. This layer takes the feature map generated by the convolutional layer and creates an activation map from it. It applies an activation function to each element in the feature map, meaning the output has the same dimensions as the input.

The activation function can be the sigmoid function or the hyperbolic tangent function, although these can be subject to the vanishing gradient problem. ReLU can also be used here (more on ReLU later). These activation functions are what make the layer's output non-linear.
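For example, applying tanh or ReLU element-wise to a stack of feature maps leaves the dimensions unchanged (the shape below is an arbitrary illustrative choice):

```python
import numpy as np

feature_maps = np.random.randn(6, 6, 4)        # output of a convolutional layer
tanh_maps = np.tanh(feature_maps)              # hyperbolic tangent activation
relu_maps = np.maximum(0.0, feature_maps)      # ReLU activation
assert relu_maps.shape == feature_maps.shape   # element-wise, so same dimensions
```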

Pooling Layer

Pooling layers are a form of downsampling. The number of activation maps output by a pooling layer is the same as in the previous layer, however the size of each map is smaller. Again, we take a section or window of the previous layer's results as the input. In this case the section is a window of the activation map.

The input values are processed in some way that converts them to a single output. The output of the entire layer is an activation map built from the output of each pool.

The input sections are non-overlapping. This is an aggressive way to downsample features, and using large input sections is becoming less popular.

Pooling helps us reduce variance, extract low level features and reduce computational overhead.

  • Average Pooling: Outputs the arithmetic mean of all values in the input section.
  • Max Pooling: Outputs the maximum of all values in the input section. This has become preferred because it tends to perform better in practice than average pooling.
  • L2 Norm Pooling: Outputs the L2 norm of the input section.
  • RoI pooling: Divides the input section into a number of non-overlapping rectangles and performs pooling on each of these sections. Only the proposed region is included in the rectangles. The region is scaled to a predefined size, e.g. 5x5, where each rectangle contributes a value in this output. Each rectangle can produce its value by finding the maximum value contained in it.

Global pooling takes all of the values across the height and width dimensions and produces a downsampled output. For average or max pooling this produces a single value per map, resulting in a 1 x 1 x depth output.

Global pooling is a technique used to avoid overfitting caused by the fully connected layers that would normally come at the end of the network. It is more common to use global pooling in natural language processing.
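A sketch of non-overlapping max pooling and of global max pooling over a stack of activation maps (the window size and map shape are arbitrary illustrative choices):

```python
import numpy as np

def max_pool(activation_maps, pool=2):
    # Non-overlapping pool x pool windows; each window contributes its maximum.
    h, w, depth = activation_maps.shape
    out = np.empty((h // pool, w // pool, depth))
    for i in range(h // pool):
        for j in range(w // pool):
            window = activation_maps[i * pool:(i + 1) * pool, j * pool:(j + 1) * pool, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

maps = np.random.randn(6, 6, 4)
print(max_pool(maps).shape)          # (3, 3, 4): smaller maps, same depth
print(maps.max(axis=(0, 1)).shape)   # global max pooling -> one value per map (1 x 1 x 4)
```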

Rectification Layer / ReLU

This layer is important to achieve a high accuracy when non-linearity layers are present. It performs a transformation that eliminates cancellation effects, such as negative values cancelling out positive values, as happens with average pooling. The output of this layer is a rectified map that has the same dimensions as the input map.

ReLU is an implementation of a Rectification Layer. It uses a common activation function consisting of two linear pieces: f(x) = max(0, x). This means the function simply removes negative values from the input activation map. Despite its simplicity it tends to work well in practice.

ReLU is good for three main reasons:

  • It thresholds negative values at 0, creating sparse data that is robust to small changes in the input, such as noise.
  • It is not computationally intensive.
  • It propagates the gradient efficiently, reducing the likelihood of the vanishing gradient problem.

Fully Connected Layer

This is the same idea as a fully connected layer in other neural networks. Its purpose is to map its inputs to a probability distribution. The difference is that here the fully connected layer receives activation maps as input.

It can be argued that there are no fully connected layers in a convolutional network, and that instead these layers are convolutional layers with a 1 x 1 receptive field.
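A sketch of such a layer: the activation maps are flattened and mapped to class scores, and a softmax turns the scores into a probability distribution (the two-class weight shape below is a hypothetical choice):

```python
import numpy as np

def fully_connected(activation_maps, weights, bias):
    # Flatten the stacked activation maps, map them to class scores,
    # then convert the scores into a probability distribution with softmax.
    x = activation_maps.reshape(-1)
    scores = weights @ x + bias
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

maps = np.random.randn(3, 3, 4)
weights = np.random.randn(2, maps.size)   # 2 output classes (hypothetical)
print(fully_connected(maps, weights, np.zeros(2)))   # probabilities sum to 1
```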

Loss Layer

This is usually the final layer of a Convolutional Network. It uses a function such as Euclidean loss, sigmoid loss or softmax loss to penalise inaccurate predictions during training.
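For example, a softmax (cross-entropy) loss penalises a prediction more heavily the less probability it assigns to the true class:

```python
import numpy as np

def softmax_loss(scores, true_class):
    # Cross-entropy over softmax probabilities: small when the true class
    # gets high probability, large when the prediction is inaccurate.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return -np.log(probs[true_class])

print(softmax_loss(np.array([2.0, 0.1]), true_class=0))  # low loss: confident and correct
print(softmax_loss(np.array([0.1, 2.0]), true_class=0))  # higher loss: inaccurate prediction
```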

Hyperparameters

  • Number of filters: This is equivalent to the number of neurons. It depends on the problem's complexity: the more features an input might have, the higher the number of filters.
  • Filter shape: This should be set to find the right level of granularity. A larger filter can include more data; in image processing, however, a 3 x 3 filter is common.
  • Stride: Indicates how far the window shifts each time it slides. It is normally set so that the window stops at the edge of the input.
  • (Zero) Padding: This is the amount of extra volume added around the input as zeros. If we don't want our input map to be shrunk too much by the convolution, we can increase this parameter (see the sketch after this list).
  • Pooling shape: Used to reduce the input size for a number of benefits. The bigger the size, the more aggressive the impact of pooling. See pooling layer above.
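These hyperparameters interact through the usual output-size relation for a convolution, output = (W - F + 2P) / S + 1, sketched below:

```python
def conv_output_size(input_size, filter_size, padding, stride):
    # Output width for input width W, filter width F, zero padding P and stride S:
    # (W - F + 2P) / S + 1
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, 0, 1))   # 30: the map shrinks without padding
print(conv_output_size(32, 3, 1, 1))   # 32: zero padding of 1 preserves the size
```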

Text classification

  • Using word embeddings is important, as it has been shown to give large performance improvements. We want to preserve the order of our text, so filter sizes should match the word embedding dimension. The filter then does not slide through the word embedding dimension, but through the word dimension (see the sketch after this list).
  • The choice between word2vec and GloVe has not been found to make much of a difference.
  • No activation function, ReLU and tanh have all been shown to be effective choices. ReLU has separate merits, and so is a good choice. [2]
  • When CNNs are not applied to image data, it is common for the input data to be preprocessed into an image-like matrix format.
  • Different datasets have their own optimal filter size. [2] finds that for sentence classification a reasonable range is 1-10.
  • It is possible to improve performance slightly by combining several filters.
  • Increasing the number of feature maps increases accuracy up to a limit, after which the returns are not worthwhile. 100-600 is a good searchable range for sentence classification.
  • In [2] global max pooling over feature maps outperformed all local max pooling sizes.
  • Dropout regularisation can possibly mitigate overfitting when the number of feature maps is large.
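Putting the points above together, a minimal forward-pass sketch of a text-classification CNN (the sentence length, embedding size, filter count and two-class output below are all hypothetical choices):

```python
import numpy as np

def text_cnn_forward(embeddings, filters, fc_weights):
    # embeddings: (num_words, embed_dim) - one row per word
    # filters:    (num_filters, filter_width, embed_dim) - each filter spans the
    #             full embedding dimension, so it slides only along the word axis
    num_words, embed_dim = embeddings.shape
    num_filters, width, _ = filters.shape
    feature_maps = np.empty((num_filters, num_words - width + 1))
    for k in range(num_filters):
        for i in range(num_words - width + 1):
            feature_maps[k, i] = np.sum(embeddings[i:i + width] * filters[k])
    pooled = np.maximum(0.0, feature_maps).max(axis=1)   # ReLU + global max pooling
    scores = fc_weights @ pooled                          # fully connected layer
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                                # class probabilities

sentence = np.random.randn(20, 50)      # 20 words, 50-dim embeddings (hypothetical)
filters = np.random.randn(100, 3, 50)   # 100 filters of width 3
fc = np.random.randn(2, 100)            # 2 output classes (hypothetical)
print(text_cnn_forward(sentence, filters, fc))
```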

Previous Work

Extra points

  • Convolutional layers can slide in different ways depending on how the input is structured. Different layer types slide through the input volume along one, two or three dimensions (i.e. 1D, 2D or 3D convolutions).

Proof of Concept

A proof of concept of using Convolutional Networks on review data can be seen here (currently pending review).
