3×3 convolution filters - A popular choice
In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.
In this article, here are some conventions that we are following —
A convolution filter passes over all the pixels of the image in such a manner that, at a given time, we take 'dot product' of the convolution filter and the image pixels to get one final value output. We do this hoping that the weights (or values) in the convolution filter, when multiplied with corresponding image pixels, gives us a value that best represents those image pixels. We can think of each convolution filter as extracting some kind of feature from the image.
Therefore, convolutions are done usually keeping these two things in mind -
Now that we have convolution filter sizes as one of the hyper-parameters to choose from, the choice is can be made between smaller or larger filter size.
Here are the things to consider while choosing the size —
Smaller Filter Sizes
Larger Filter Sizes
We only look at very few pixels at a time. Therefore, there is a smaller receptive field per layer.
We look at lot of pixels at a time. Therefore, there is a larger receptive field per layer.
The features that would be extracted will be highly local and may not have a more general overview of the image. This helps capture smaller, complex features in the image.
The features that would be extracted will be generic, and spread across the image. This helps capture the basic components in the image.
The amount of information or features extracted will be vast, which can be further useful in later layers.
The amount of information or features extracted are considerably lesser (as the dimension of the next layer reduces greatly) and the amount of features we procure is greater.
In an extreme scenario, using a 1x1 convolution is like considering that each pixel can give us some useful feature independently.
In an extreme scenario, if we use a convolution filter equal to the size of the image, we will have essentially converted a convolution to a fully connected layer.
Here, we have better weight sharing, thanks to the smaller convolution size that is applied on the complete image.
Here, we have poor weight sharing, due to the larger convolution size.
Now that you have a general idea about the extraction using different sizes, we will follow this up with an experiment convolution of 3X3 and 5X5 —
Smaller Filter Sizes
Larger Filter Sizes
If we apply 3x3 kernel twice to get a final value, we actually used (3x3 + 3x3) weights. So, with smaller kernel sizes, we get lower number of weights and more number of layers.
If we apply 5x5 kernel once, we actually used 25 (5x5) weights. So, with larger kernel sizes, we get a higher number of weights but lower number of layers.
Due to the lower number of weights, this is computationally efficient.
Due to the higher number of weights, this is computationally expensive.
Due to the larger number of layers, it learns complex, more non-linear features.
Due to the lower number of layers, it learns simpler non linear features.`
With more number of layers, it will have to keep each of those layers in the memory to perform backpropogation. This necessitates the need for larger storage memory.
With lower number of layers, it will use less storage memory for backpropogation.
Based on the points listed in the above table and from experimentation, smaller kernel filter sizes are a popular choice over larger sizes.
Another question could be the preference for odd number filters or kernels over 2X2 or 4X4.
The explanation for that is that though we may use even sized filters, odd filters are preferable because if we were to consider the final output pixel (of next layer) that was obtained by convolving on the previous layer pixels, all the previous layer pixels would be symmetrically around the output pixel. Without this symmetry, we will have to account for distortions across the layers. This will happen due to the usage of an even sized kernel. Therefore, even sized kernel filters aren’t preferred.
1X1 is eliminated from the list as the features extracted from it would be fine grained and local, with no consideration for the neighboring pixels. Hence, 3X3 works in most cases, and it often is the popular choice.