Build a Traffic Sign Recognition Project
The goals / steps of this project are the following:
- Load the data set (see below for links to the project data set)
- Explore, summarize and visualize the data set
- Design, train and test a model architecture
- Use the model to make predictions on new images
- Analyze the softmax probabilities of the new images
- Summarize the results with a written report
You're reading it! And here is a link to my project code. Most figures shown in this writeup can also be found in the notebook.
I used the basic numpy API on the training, validation and test arrays to get some statistics of the data set (a sketch of the calls is shown below):
- The size of training set is 34799
- The size of the validation set is 4410
- The size of test set is 12630
- The shape of a traffic sign image is (32, 32, 3)
- The number of unique classes/labels in the data set is 43
(Note: the number of unique classes was cross-checked against the file `signnames.csv` to make sure the training set covers all classes.)
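A minimal sketch of these calculations, assuming the arrays are named `X_train`, `y_train`, `X_valid` and `X_test` (the notebook may use different names):

```python
import numpy as np

# Basic statistics of the data set using plain numpy
n_train = len(X_train)               # 34799
n_validation = len(X_valid)          # 4410
n_test = len(X_test)                 # 12630
image_shape = X_train[0].shape       # (32, 32, 3)
n_classes = len(np.unique(y_train))  # 43
```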
Three basic analyses were performed on the dataset (always looking only at the training set); a sketch of the first two is given after the list:
- A histogram of samples per class
- A histogram of lightness/darkness of the training examples.
- Plotting a random image of each class
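A minimal sketch of the two histograms, reusing the variable names from the previous sketch. The lightness measure shown here (mean pixel value per image) is an assumption about the exact metric used:

```python
import numpy as np
import matplotlib.pyplot as plt

# Histogram of samples per class in the training set
plt.hist(y_train, bins=np.arange(n_classes + 1) - 0.5, rwidth=0.8)
plt.xlabel('Class id')
plt.ylabel('Number of training examples')
plt.show()

# Histogram of lightness/darkness (mean pixel value per image)
mean_brightness = X_train.reshape(len(X_train), -1).mean(axis=1)
plt.hist(mean_brightness, bins=50)
plt.xlabel('Mean pixel value')
plt.ylabel('Number of training examples')
plt.show()
```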
I used the Keras `ImageDataGenerator` class to generate more training data. During experimentation, I found that this improved the generalization of the network. To include the generated data in the training process, I duplicated every batch, using the basic training data as the first part of the batch and a second random batch from the image generator as the second part. The reason for this is discussed in the section about training the network.
The `ImageDataGenerator` was configured with the following parameters:
- `rotation_range=15`
- `width_shift_range=2.0`
- `height_shift_range=2.0`
- `shear_range=5.0`
- `zoom_range=[0.9, 1.1]`
- `fill_mode='reflect'`
- `data_format='channels_last'`
To find a good set of parameters, I experimented manually and visually inspected the results. Some transformation options should obviously not be used for this application, such as mirroring the image. Some examples of the transformations are shown in the figures below:
The only pre-processing steps used are conversion to grayscale (as suggested in [LeCun:2011]) and image normalization.
To avoid errors (e.g. forgetting to apply the preprocessing steps), the preprocessing of each image was embedded in the TensorFlow graph, using the functions `tf.image.rgb_to_grayscale` and `tf.image.per_image_standardization`. This is quite convenient, though slightly slower for training (because the images are re-converted every epoch).
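A minimal sketch of embedding the preprocessing in the graph (TensorFlow 1.x style; the placeholder name and the exact wiring into the grayscale and color branches are assumptions):

```python
import tensorflow as tf

# Placeholder for a batch of raw RGB images
x = tf.placeholder(tf.float32, shape=(None, 32, 32, 3), name='x')

# Grayscale branch: convert, then standardize each image individually
gray = tf.image.rgb_to_grayscale(x)
gray_std = tf.map_fn(tf.image.per_image_standardization, gray)

# Color branch: per-image standardization of the RGB input
color_std = tf.map_fn(tf.image.per_image_standardization, x)
```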
The effect of per-image standardization can be seen in the example below (using the same image #16149 shown in the class examples).
The first layer of the network uses a separate convolution for the grayscale feature (42 filters) and for the color features (16 filters). The number of filters was found through a lot of trial and error; in general the network shows similar performance with up to 100 filters on the first layer (beyond that, training becomes too slow).
After a lot of experimentation (using ideas mostly from [LeCun:2011] and [Szegedy:2015]), I arrived at the architecture below. The most important characteristic of the network is the multi-scale feed into the first fully connected layer: it receives as input the outputs of all convolutional layers, with heavier pooling. An overview of the network structure is shown in the picture below:
The two basic building blocks of the network are a convolutional layer and a fully connected layer (a sketch of both is given after the lists below). A convolutional layer consists of the following operations:
- A linear convolution kernel (with bias)
- A tanh activation function
- A local response normalization function
All fully connected layers have this structure:
- Linear combination of the previous inputs (with bias)
- A tanh activation function (except on the logits layer)
- A dropout operation on the two hidden layers (applied only during training)
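A minimal sketch of these building blocks and of the multi-scale feed in TensorFlow 1.x. Kernel sizes, initializers, pooling sizes and variable names are assumptions, not the exact notebook code:

```python
import tensorflow as tf

def conv_layer(x, kernel_size, n_filters, name):
    """Convolution with bias, tanh activation and local response normalization."""
    in_channels = x.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        w = tf.get_variable('w', [kernel_size, kernel_size, in_channels, n_filters],
                            initializer=tf.truncated_normal_initializer(stddev=0.1))
        b = tf.get_variable('b', [n_filters], initializer=tf.zeros_initializer())
        conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='VALID') + b
        return tf.nn.local_response_normalization(tf.tanh(conv))

def fc_layer(x, n_out, name, keep_prob=None, activation=True):
    """Linear combination with bias, tanh (except on the logits) and optional dropout."""
    n_in = x.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        w = tf.get_variable('w', [n_in, n_out],
                            initializer=tf.truncated_normal_initializer(stddev=0.1))
        b = tf.get_variable('b', [n_out], initializer=tf.zeros_initializer())
        out = tf.matmul(x, w) + b
        if activation:
            out = tf.tanh(out)
        if keep_prob is not None:
            out = tf.nn.dropout(out, keep_prob)  # dropout only during training
        return out

def multi_scale_input(conv_outputs, pool_sizes):
    """Pool each convolutional output, flatten and concatenate them as input
    to the first fully connected layer (the multi-scale feed)."""
    flat = []
    for tensor, k in zip(conv_outputs, pool_sizes):
        pooled = tf.nn.max_pool(tensor, ksize=[1, k, k, 1],
                                strides=[1, k, k, 1], padding='VALID')
        flat.append(tf.contrib.layers.flatten(pooled))
    return tf.concat(flat, axis=1)
```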
To develop the network, I started from the LeNet architecture presented in the lessons and, from there, experimented with different activation functions, layer compositions and connections (following the directions provided in the aforementioned papers). I found that:
- Applying the multi-scale feed made the accuracy jump from 89% to around 92%.
- Using tanh instead of ReLU greatly improved training convergence and accuracy (from around 92% to 96%), without any noticeable change in the time to execute each training epoch.
- The next jump from around 96% to 97% accuracy came from using local response normalization and a layer of 1x1 convolution kernels in the 2nd convolutional layer, forcing a compression of the number of filters.
- Finally, experimenting with dropout rates on the fully connected layers improved the accuracy by around another percentage point. I tested dropout rates from 0 to 60%, in steps of 10%. Rates above 30% generally gave better results, so I chose 50%, also for the theoretical reasons behind it.
During development, a systematic experiment was done to find the influence of the number of neurons in the fully connected layers. The experiment is not recorded in the Python notebook, but basically 36 combinations of neuron counts for the two hidden layers were tried (from 300 down to 100 on the first layer and from 200 down to 50 on the second layer). It was found that the network is not exceptionally sensitive to these parameters, as long as the second layer has between 100 and 150 neurons.
To train the model, I used the suggested Adam optimizer, as it seems relatively robust. I experimented with learning rates from 0.001 to 0.000005, including a step reduction of the learning rate after an accuracy of 0.97 was reached, and also progressively reducing it every epoch. Neither method produced any perceptible gain in speed or in the quality of the final result, so I used a fixed learning rate for the final training.
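A minimal sketch of the training setup, assuming `logits` is the output of the network and `one_hot_y` is the one-hot encoded label tensor; the constant name and the exact rate value are illustrative, not the notebook's values:

```python
import tensorflow as tf

LEARNING_RATE = 0.001  # fixed learning rate; the exact value used is an assumption here

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_y, logits=logits)
loss_operation = tf.reduce_mean(cross_entropy)
training_operation = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(loss_operation)
```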
To define the batch size, I ran experiments with batch sizes of 256, 128, 64, 32 and 16. I found that 32 offered the best trade-off between accuracy improvement and epoch time. With a batch size of 16, the time to complete an epoch became significantly larger without significantly improving the learning speed.
Most of the experiments were run with between 10 and 20 epochs of training (on my computer this takes between 5 and 30 minutes, depending on the experiment). When I settled on a suitable architecture, I ran the training for 200 epochs. The decrease in error rate (1 - accuracy) can be seen below:
It can be seen that the training converges fairly quickly to an accuracy between 0.98 and 0.99, becoming noisy after that. The best accuracy obtained for this network during training was 0.991, but my training loop did not checkpoint the best-accuracy model.
When I added the generation of variants of the original training set, I did it in a slightly hackish way. As I wanted to make sure that all images in the training set were shown to the network, I made the main training loop (the one that gets a batch and trains on it) actually run twice: once with a batch from the training set and a second time with a same-sized batch of generated variants; see the sketch below. Thus, when the training script reports an "EPOCH", the network was actually trained on a set twice the size of the original training set. No randomization of images was applied to the validation set, as it is only used as a reference to track how well the training is going.
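A minimal sketch of this "double batch" loop, assuming `X_train`/`y_train` are the shuffled training arrays, `datagen` is the `ImageDataGenerator` configured with the parameters listed earlier, and `x`, `y`, `keep_prob`, `training_operation` and `sess` come from the previous sketches (all names illustrative):

```python
BATCH_SIZE = 32

# Endless stream of randomly transformed variants of the training images
augmented = datagen.flow(X_train, y_train, batch_size=BATCH_SIZE)

for offset in range(0, len(X_train), BATCH_SIZE):
    # First pass: a batch of original training images
    batch_x = X_train[offset:offset + BATCH_SIZE]
    batch_y = y_train[offset:offset + BATCH_SIZE]
    sess.run(training_operation, feed_dict={x: batch_x, y: batch_y, keep_prob: 0.5})

    # Second pass: a same-sized batch of generated variants
    aug_x, aug_y = next(augmented)
    sess.run(training_operation, feed_dict={x: aug_x, y: aug_y, keep_prob: 0.5})
```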
Another thing that, in retrospect, I should have done while experimenting with the structure of the network was to keep the convolutional layers untrained (i.e. use random features) until I had a good structure, and only start training them after being happy with the other parameters. That would have greatly sped up the investigation: as mentioned in [LeCun:2011], random features give a reasonable estimate of the performance of the network during exploratory phases.
Finally, to allow me to use the computer while training was running, I implemented a "pausing" function in the training loop. It simply checks for the existence of a file called `pause` and, while the file exists, pauses the training. The time spent paused is not counted in the estimates of time per epoch.
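A minimal sketch of this mechanism (the helper name and polling interval are illustrative):

```python
import os
import time

def wait_if_paused(pause_file='pause', poll_seconds=5):
    """Block while a file named 'pause' exists; return the seconds spent waiting
    so they can be subtracted from the per-epoch time estimate."""
    waited = 0.0
    while os.path.exists(pause_file):
        time.sleep(poll_seconds)
        waited += poll_seconds
    return waited
```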
My final model results were:
- training set accuracy of 99.99 % (without including randomly generated examples)
- validation set accuracy of 98.62 %
- test set accuracy of 97.82 %
Most of the analysis of how I got to these results is given in the previous discussion. An important remark is that I did not look at test set results until all parameters of the network were fixed, so there is no "leakage" of the test results into the network architecture or hyperparameters.
As I happen to live in Germany, instead of getting traffic sign images from the Internet I took some pictures on my way from work to home. All pictures were taken at dusk, on the same day.
From the pictures, I manually cropped several traffic signs of different resolutions and kept 52 images. Due to the characteristics of the place where I took the pictures, there is not much variation (I have examples of only 13 of the 43 classes). But I did collect two interesting kinds of examples:
- LED speed limit signs, which belong to a class present in the training set but in a configuration that is not
- Two signs whose classes are not in the training set.
All new test images, together with their classification results, can be seen in the image below. My original expectations for the classification results were:
- Most of the images whose classes are present in the training set should be correctly classified, as they are not particularly hard (from a human perspective, of course; I found some images in the training set quite hard to identify)
- The LED speed signs should be classified as speed signs, even if the model cannot recognize the text.
- The two images not in the original categories should be classified as visually similar categories (e.g. I expect the left arrow over a blue field to be classified as one of the blue signs).
Overall, I was generally satisfied with the results. I was surprised by the number of LED signs that were misclassified. After inspecting the first internal layer of the network and the normalized images, it struck me that the network probably expects the numbers to be darker than their surroundings, which explains the rather poor results.
The classification of the new signs (the last two in the picture) was exactly as I expected, with both being assigned to similar categories. I was surprised that the right arrow was classified as a "Turn left ahead", though (in tests with a less trained network, it was classified as a "Keep right").
Since the set had some cases that were never seen by the network, I'll analyse them separately.
Set | Accuracy | Comments |
---|---|---|
Already seen signs | 100% | Accuracy in line with expectations |
LED signs | 0% | Most were classified as a different speed limit, except one that was classified as a stop sign. |
Unknown cases | - | Both were classified as similar signs, as expected |
Without considering the unknown cases (but including the LED speed signs), the overall model accuracy on the new images was 86%.
For each of the 52 example images, the top 5 predictions were calculated. As most of them were correct, they are not shown in this writeup; the reader should refer to the Python notebook (the table is near the end of the notebook). Nevertheless, the two unknown signs and one example of the LED speed signs are analysed here, as they are a good example of how the network was able to generalize some features.
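A minimal sketch of how the top 5 predictions can be computed with `tf.nn.top_k`, assuming `logits`, `x` and `keep_prob` are the graph tensors from the earlier sketches and `X_new` holds the 52 preprocessed new images (names and checkpoint path illustrative):

```python
import tensorflow as tf

softmax_probs = tf.nn.softmax(logits)
top_5 = tf.nn.top_k(softmax_probs, k=5)

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, './trained_model')  # checkpoint path is illustrative
    top_values, top_classes = sess.run(
        top_5, feed_dict={x: X_new, keep_prob: 1.0})  # no dropout at inference
```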
For both unknown signs, it can be seen that the top 5 hypotheses produced by the network were consistent with the visual appearance of the signs. I take this as a sign of good generalization, in the sense that the network ended up creating features that somehow map to human intuition.
On the LED speed signs, the network typically failed to "read" the speed, but was usually able to detect that it was a speed sign (and most hypotheses were consistent with that). Nevertheless, one of them was (weakly) classified as a stop sign.
I'll use one of the 60 km/h misclassification cases as an example for visualizing the network features. I'll concentrate on the first convolutional layer, as I found it easier to interpret (to be honest, I couldn't make sense of the 3rd convolutional layer at all); nevertheless, its feature maps are quite different for the two inputs.
On the first layer, on the other hand, it becomes clear that the number information is less distorted in the correctly classified sign than in the wrongly classified one. Analysing the color-only filters and comparing them to the normalized image is quite informative (the color filters start at Feature Map 42 and run to the end).
All plots can be found in the Python notebook. In this writeup I include only the color channel with standardized colors and the first-layer feature maps; a sketch of how these plots were generated is given after the figures.
For the LED sign:
For the normal sign:
For the LED sign:
For the normal sign:
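For reference, a minimal sketch of how such feature-map plots can be generated, assuming `conv1` is the first-layer activation tensor, `img` is one preprocessed image and `x`, `keep_prob` and `sess` come from the earlier sketches (names illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Run one image through the network and grab the first-layer activations
activations = sess.run(conv1, feed_dict={x: img[np.newaxis, ...], keep_prob: 1.0})

n_maps = activations.shape[-1]
cols = 8
rows = int(np.ceil(n_maps / float(cols)))
plt.figure(figsize=(cols * 1.5, rows * 1.5))
for i in range(n_maps):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(activations[0, :, :, i], cmap='gray')
    plt.title('Feature Map %d' % i, fontsize=6)
    plt.axis('off')
plt.show()
```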
[LeCun:2011]: Pierre Sermanet and Yann LeCun, "Traffic sign recognition with multi-scale Convolutional Networks"
[Szegedy:2015]: Szegedy, Christian and Vanhoucke, Vincent and Ioffe, Sergey and Shlens, Jonathon and Wojna, Zbigniew, "Rethinking the Inception Architecture for Computer Vision"