Visualizing convolutional neural networks with the CIFAR-10 dataset
3/4/2019
Introduction
In this post we will train a convolutional neural network on the CIFAR-10 dataset and investigate what the network is really learning. To do so, we will find image patches that maximally activate various filters. This post was inspired by Andrew Ng's video on the topic. All the code I use here is available on GitHub.
The data
The CIFAR-10 dataset is widely used in machine learning and computer vision. It contains 60,000 32x32 RGB images belonging to 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images in each class.
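If you want to follow along, the dataset is easy to load. Here is a quick sketch using tensorflow.keras (the code linked above may load it differently):

```python
# Load CIFAR-10 via Keras; the dataset is also available directly from
# https://www.cs.toronto.edu/~kriz/cifar.html.
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print(x_train.shape)  # (50000, 32, 32, 3): 50,000 training images
print(x_test.shape)   # (10000, 32, 32, 3): 10,000 test images

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
```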
The model
We will use a model with six convolutional layers, based on the architecture described here. The model consists of two convolutional layers with 32 filters each, followed by a maxpool layer, then two more convolutional layers with 64 filters each, followed by a second maxpool layer, and finally two more convolutional layers with 128 filters each, followed by a softmax layer with 10 neurons, one for each class. All of the convolutional filters are 3x3 with a stride of 1 and same padding, and both maxpool layers have kernel size 2x2 and stride 2. Here is a quick visualization of the model:
Input (32x32x3) -> Conv1 (32 filters) -> Conv2 (32 filters) -> Maxpool1 -> Conv3 (64 filters) -> Conv4 (64 filters) -> Maxpool2 -> Conv5 (128 filters) -> Conv6 (128 filters) -> Softmax (10 neurons)
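For concreteness, here is a minimal Keras sketch of that layer structure. The ReLU activations, the layer names, and the flatten step before the final softmax are choices I've made for the sketch, and the batch norm and dropout layers mentioned in the training section below are omitted here; the full model is on GitHub.

```python
from tensorflow.keras import layers, models

def build_model():
    # Six 3x3 convolutions (stride 1, same padding) with two maxpool
    # layers, followed by a 10-way softmax classifier.
    return models.Sequential([
        layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                      input_shape=(32, 32, 3), name="conv1"),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu", name="conv2"),
        layers.MaxPooling2D((2, 2), strides=2, name="maxpool1"),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu", name="conv3"),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu", name="conv4"),
        layers.MaxPooling2D((2, 2), strides=2, name="maxpool2"),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu", name="conv5"),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu", name="conv6"),
        layers.Flatten(),  # collapse the spatial dimensions before the classifier
        layers.Dense(10, activation="softmax"),
    ])
```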
Training
In order to achieve a high level of accuracy, a number of training techniques are used, including learning rate decay, dropout, L2 regularization, batch norm layers, and image preprocessing. I won't go into the details of training the network here, but after training for 128 epochs, the network achieved 88.4% accuracy on the test data: a pretty nice result, and certainly good enough for our purposes, though far from cutting edge.
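As a rough sketch of how these pieces might be wired up in Keras: a conv block with L2 weight decay, batch norm, and dropout attached (blocks like this would replace the plain conv layers in the sketch above), simple preprocessing with light augmentation, and a step decay on the learning rate. The hyperparameters below are placeholders, not the settings behind the 88.4% figure.

```python
from tensorflow.keras import callbacks, layers, optimizers, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One way to attach L2 weight decay, batch norm, and dropout to a
# convolutional block.
def conv_block(filters):
    return [
        layers.Conv2D(filters, (3, 3), padding="same",
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(0.3),
    ]

# Simple image preprocessing and light augmentation.
datagen = ImageDataGenerator(rescale=1.0 / 255,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# Step-wise learning rate decay: halve the rate every 20 epochs.
def lr_schedule(epoch):
    return 1e-3 * (0.5 ** (epoch // 20))

model = build_model()
model.compile(optimizer=optimizers.Adam(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(datagen.flow(x_train, y_train, batch_size=64),
          epochs=128,
          validation_data=(x_test / 255.0, y_test),
          callbacks=[callbacks.LearningRateScheduler(lr_schedule)])
```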
What are the filters learning?
One way to understand what the filters are learning is to identify image patches from our test (or training) data that maximize their activations. What size should the image patches be? That depends on how deep we go in the network. For the first convolutional layer, the image patches should be the same size as the filters, so a 3x3 array of pixels in our case. For the next convolutional layer, each activation neuron corresponds to a 3x3 array of activation neurons from the first layer. Each of those activation neurons in turn corresponds to a 3x3 array of pixel values, but because the filter stride is 1, these 3x3 regions overlap substantially, and the combined image patch ends up being 5x5.
For the next convolutional layer the calculation is slightly more complicated since we pass through a maxpool layer with kernel size 2x2 and stride 2 along the way. Each activation neuron in the third layer corresponds to a 3x3 patch of activation neurons in the maxpool layer. Because the maxpool layer has kernel size 2x2 and stride 2, this patch corresponds to a 6x6 patch of activation neurons from the second convolutional layer with no overlap, which corresponds to an 8x8 patch from the first convolutional layer, and a 10x10 patch of pixel values. For the fourth layer I'll just give you the calculation in shorthand (traversing backwards through the network):
1x1 (Conv4 activation) -> 3x3 (Conv3 activation) -> 5x5 (Maxpool1 activation) -> 10x10 (Conv2 activation) -> 12x12 (Conv1 activation) -> 14x14 (pixel array).
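If you'd rather not chase these numbers through the network by hand, the same receptive field sizes can be computed with a short loop over the layers (a quick sketch in Python):

```python
# Compute the receptive field (image patch size) of a single activation
# in each layer, walking forward through the network.
# Each entry is (layer name, kernel size, stride).
layer_specs = [
    ("Conv1", 3, 1), ("Conv2", 3, 1), ("Maxpool1", 2, 2),
    ("Conv3", 3, 1), ("Conv4", 3, 1), ("Maxpool2", 2, 2),
    ("Conv5", 3, 1), ("Conv6", 3, 1),
]

rf = 1     # receptive field of one activation, in pixels
jump = 1   # distance in pixels between adjacent activations in the current layer
for name, kernel, stride in layer_specs:
    rf += (kernel - 1) * jump
    jump *= stride
    print(f"{name}: {rf}x{rf}")
```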
You can check that the complete list of image patch sizes for each of the six layers is: 3x3, 5x5, 10x10, 14x14, 24x24, 32x32. This works nicely since our input was 32x32. Here we see a general trend: convolutional filters tend to cover bigger and bigger image patches as you get deeper into the network. In what follows, I've selected six interesting filters from each convolutional layer and shown the 16 image patches from the test set that achieve the highest activations.
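Here is a sketch of how the patch extraction itself works, assuming the Keras model sketched earlier (with conv layers named conv1 through conv6) and the patch sizes computed above. The mapping from an activation back to its pixel patch is approximate near the image borders, which is fine for visualization purposes.

```python
import numpy as np
import tensorflow as tf

def top_patches(model, layer_name, filter_index, images, patch_size, k=16):
    """Return the k image patches whose receptive fields give the highest
    activations for one filter in one convolutional layer."""
    # Sub-model mapping input images to the chosen layer's activations.
    activation_model = tf.keras.Model(
        inputs=model.input,
        outputs=model.get_layer(layer_name).output)

    # Activations have shape (num_images, height, width, num_filters);
    # keep only the filter we care about.
    acts = activation_model.predict(images, batch_size=256)[..., filter_index]

    # Indices of the k largest activations across all images and positions.
    # (In practice you may want to keep at most one patch per image to
    # avoid near-duplicate patches from neighbouring positions.)
    flat = np.argsort(acts, axis=None)[-k:][::-1]
    img_idx, rows, cols = np.unravel_index(flat, acts.shape)

    # Map each activation back (approximately) to the pixel patch it sees.
    # With same padding, activation (r, c) sits roughly at pixel
    # (r * downsample, c * downsample); patches are clipped at the borders.
    downsample = images.shape[1] // acts.shape[1]
    patches = []
    for i, r, c in zip(img_idx, rows, cols):
        r0 = max(r * downsample - patch_size // 2, 0)
        c0 = max(c * downsample - patch_size // 2, 0)
        patches.append(images[i, r0:r0 + patch_size, c0:c0 + patch_size])
    return patches
```

For example, calling top_patches(model, "conv4", filter_index=7, images=x_test / 255.0, patch_size=14) collects 16 test patches that strongly activate one filter in the fourth convolutional layer.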
Layer 1
Layer 2
Layer 3
Layer 4
Layer 5
Layer 6
We can see that the filters in the first convolutional layer are picking up mostly on colors, corners, and simple edges. In the second layer, longer edges, stripes, and lines start to be detected. In the third layer, more complex shapes begin to be detected, such as two lines meeting at an angle. The fourth layer seems to start to recognize things like wheels or the wings of a plane. Finally, in the last two layers we start to detect larger features such as cars, or the faces of dogs and cats.
The exact patterns observed above won't necessarily hold for other convolutional networks, but there is good reason to expect many of the general trends to be similar. It's worth noting that each filter gets to combine the activations of all filters in the previous layer. So if the first layer or two pick up on simple things like colors, corners, and edges, the later filters get to take into account all the different color, corner, and edge combinations in their larger image patch scope. That said, the later layers sometimes appear to indicate a noticeable preference for certain colors and/or the presence of basic edges and shapes as well.
I also noticed that the filters, at least in the later layers, often pick up on objects from the same class. While this may help produce correct classifications, there's nothing inherently wrong with filters being maximally activated by images from different classes, since the network's predictions are determined by a final softmax layer that gets to take into account all filters in the preceding layer. It's also worth pointing out some exceptions: as you can see above in layer 6, the antlers of a deer seem to cause similar activations to the wings of a plane, and the bodies of frogs are confused with boats. I think it's worth showing some more filters whose image patches are mixed across classes but where it is still possible to identify a common feature that they share:
I also found a few amusing examples of "imposters" in some of the collections of image patches:
So do convolutional networks really put small features together into larger and larger features, finally figuring out what a car or a cat looks like? Well, only sort of. While convolutional networks can achieve performance on par with or even superior to humans in certain situations, they are generally more limited in scope and can't generalize as well to situations they weren't trained for. Even though the network trained in this post did not overfit to the training data relative to the test data, it's worth keeping in mind that the test data follows the same distribution as the training data. There is no reason to believe the network would be able to recognize a car or a cat in pictures that differed more substantially from those it was trained on. Further, as demonstrated in this post by Francois Chollet, generating an input image that maximizes the network's predicted probability for a specific class produces an image that looks more or less like random noise to a human. So although in some sense convolutional networks really do learn simple features in the early layers and combine them in the later layers in a way that generalizes well to the test data, we should be careful not to conclude that they end up learning the true essence of what a car or a cat looks like.