

3×3 convolution filters - A popular choice

In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.

Here are some conventions we follow in this article:

  • We are specifically referring to 2D convolutions, which are usually applied to two-dimensional objects such as images. These concepts also apply to 1D and 3D convolutions, but may not translate directly.
  • When applying a 2D convolution such as a 3×3 convolution to an image, the 3×3 filter in general also has a third dimension, equal to the number of channels of the input image. So we apply a 3×3×1 filter to a grayscale image (number of channels = 1), whereas we apply a 3×3×3 filter to a colour image (number of channels = 3).
  • We will refer to all convolutions by their first two dimensions, irrespective of the number of channels. (We assume zero padding.)

A convolution filter passes over all the pixels of the image in such a way that, at each position, we take the dot product of the filter and the underlying image pixels to get one output value. We do this hoping that the weights (or values) in the convolution filter, when multiplied with the corresponding image pixels, give us a value that best represents those pixels. We can think of each convolution filter as extracting some kind of feature from the image.
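To make the dot-product idea concrete, here is a minimal NumPy sketch (the image, kernel values, and function name are illustrative, not from the article) that slides a 3×3 kernel over a grayscale image and produces one output value per position:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, taking the dot product at each position.

    `image` is a 2D array (grayscale, so the 3x3x1 filter collapses to 3x3);
    for an RGB image the kernel would carry a matching third dimension.
    No padding is applied, so the output shrinks by (k - 1) in each dimension.
    """
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k]
            output[i, j] = np.sum(patch * kernel)  # element-wise product then sum = dot product
    return output

image = np.random.rand(6, 6)             # toy 6x6 grayscale image
edge_kernel = np.array([[-1, 0, 1],      # a classic 3x3 vertical-edge filter
                        [-1, 0, 1],
                        [-1, 0, 1]])
print(convolve2d(image, edge_kernel).shape)  # (4, 4)
```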

Therefore, convolutions are usually done keeping these two things in mind -

  • Most of the features in an image are usually local. Therefore, it makes sense to take a few local pixels at once and apply convolutions to them.
  • Most of the features may be found in more than one place in an image. This means that it makes sense to use a single kernel all over the image, hoping to extract that feature in different parts of the image.

Now that the convolution filter size is one of the hyper-parameters to choose, the choice can be made between smaller and larger filter sizes.

Here are the things to consider while choosing the size —

| Smaller Filter Sizes | Larger Filter Sizes |
| --- | --- |
| We look at only a few pixels at a time, so the receptive field per layer is small. | We look at many pixels at a time, so the receptive field per layer is large. |
| The extracted features are highly local and may lack a more general overview of the image. This helps capture smaller, more complex features. | The extracted features are more generic and spread across the image. This helps capture the basic components of the image. |
| The amount of information or features extracted is vast, which can be useful in later layers. | The amount of information or features extracted is considerably less, as the dimensions of the next layer shrink greatly. |
| In the extreme case, a 1×1 convolution treats each pixel as providing a useful feature on its own. | In the extreme case, a filter the size of the image effectively turns the convolution into a fully connected layer. |
| Weight sharing is better, since the same small filter is applied across the complete image. | Weight sharing is poorer, due to the larger filter size. |
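The two extreme cases in the table can be checked directly. Below is a hedged sketch using PyTorch layer classes (the 28×28 input size and the number of output features are illustrative, not from the article) showing that a convolution whose kernel covers the whole image has exactly the same number of parameters as a fully connected layer over all pixels, while a 1×1 convolution looks at each pixel independently:

```python
import torch.nn as nn

# Illustrative 28x28 single-channel input. A kernel covering the whole image
# produces one value per filter, exactly like a fully connected layer.
conv_full = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=28)
fc        = nn.Linear(in_features=28 * 28, out_features=10)

print(sum(p.numel() for p in conv_full.parameters()))  # 7850 = 10 * 28 * 28 + 10 biases
print(sum(p.numel() for p in fc.parameters()))         # 7850, the same mapping written differently

# The other extreme: a 1x1 convolution considers each pixel on its own.
conv_1x1 = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=1)
print(sum(p.numel() for p in conv_1x1.parameters()))   # 20 = 10 weights + 10 biases
```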

Now that you have a general idea of what different filter sizes extract, let's follow this up by comparing a 3×3 and a 5×5 convolution:

| Smaller Filter Sizes | Larger Filter Sizes |
| --- | --- |
| Applying a 3×3 kernel twice to obtain one final value uses 3×3 + 3×3 = 18 weights. With smaller kernels we get fewer weights and more layers. | Applying a 5×5 kernel once uses 5×5 = 25 weights. With larger kernels we get more weights but fewer layers. |
| With fewer weights, this is computationally efficient. | With more weights, this is computationally expensive. |
| With more layers, the network learns more complex, more non-linear features. | With fewer layers, the network learns simpler non-linear features. |
| With more layers, each layer must be kept in memory to perform backpropagation, which requires more storage memory. | With fewer layers, less storage memory is needed for backpropagation. |
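The weight comparison above can be verified directly. The sketch below (single-channel layers chosen purely for simplicity, using PyTorch layer classes as an assumed framework) counts the weights of two stacked 3×3 convolutions versus one 5×5 convolution; both arrangements cover a 5×5 receptive field on the input:

```python
import torch.nn as nn

# Two stacked 3x3 convolutions: each final output value "sees" a 5x5 input patch.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
# One 5x5 convolution with the same 5x5 receptive field.
single_5x5 = nn.Conv2d(1, 1, kernel_size=5, bias=False)

print(sum(p.numel() for p in stacked_3x3.parameters()))  # 18 = 3*3 + 3*3
print(sum(p.numel() for p in single_5x5.parameters()))   # 25 = 5*5

# In practice a non-linearity (e.g. ReLU) sits between the two stacked layers,
# which is where the "more complex, more non-linear features" claim comes from.
```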

Based on the points listed in the table above and on experimentation, smaller kernel sizes are a popular choice over larger ones.

Another question is the preference for odd-sized filters or kernels over even-sized ones such as 2×2 or 4×4.

The explanation is that, although even-sized filters can be used, odd-sized filters are preferable: if we consider the output pixel (in the next layer) obtained by convolving over the previous layer's pixels, an odd-sized kernel places all of those previous-layer pixels symmetrically around the output pixel. Without this symmetry, which is what happens with an even-sized kernel, we have to account for distortions across the layers. Therefore, even-sized kernel filters are not preferred.
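One way to see this symmetry argument is through "same" padding: an odd kernel of size k can be padded with (k - 1) / 2 pixels on every side so the output pixel sits exactly at the centre of its input patch, while an even kernel cannot. A small illustrative check (the function name is made up for this example):

```python
def same_padding(kernel_size):
    """Padding on each side that keeps the output the same size as the input (stride 1)."""
    total = kernel_size - 1
    if total % 2 == 0:
        return total // 2, total // 2   # odd kernel: symmetric padding, centred output pixel
    return total // 2, total // 2 + 1   # even kernel: one side needs an extra pixel, so the
                                        # output is shifted half a pixel off-centre

for k in (2, 3, 4, 5):
    print(k, same_padding(k))
# 2 (0, 1)   3 (1, 1)   4 (1, 2)   5 (2, 2)
```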

1×1 is eliminated from the list because the features it extracts are fine-grained and purely local, with no consideration of the neighbouring pixels. Hence, 3×3 works in most cases, and it is often the popular choice.


Everything you need to know about Convolutional Neural Nets

Machine Learning has been around for a while now and we are all aware of its impact in solving everyday problems. Initially, it was about solving simple problems of statistics, but with the advancements in technology over time, it picked up pace to give bigger and better results. It has grown to solve bigger problems such as image recognition and now even possesses the ability to distinguish a cat from a dog.

In this article, we will briefly touch upon how information is represented and manipulated through a network to solve some of the toughest problems around image recognition.

Prologue: a troublesome story of Real Estate Agents

Let’s start right at the beginning. Say we have input vectors — specifications of a house, and outputs like the price of the house. Not delving deeper into the details, visualize it as though we have information described as a set of concepts such as kitchen size, number of floors, location of the house and we need to represent information pertinent to another set of concepts such as the price of house, architecture quality, etc. This is basically conversion from one conceptual representation to another conceptual representation. Let’s now look at a human converting this –

He (say Alex) would probably have a mathematical way to convert this from one conceptual representation to another through some ‘if-else’ condition to start off.

If he (say Bob) was slightly smarter, he would have converted input concepts into some intermediary scores like simplicity, floor quality, noise in the neighbourhood, etc. He would also cleverly map these scores to the corresponding final output, say price of the house.

What changed between an ordinary real estate agent (Alex) and a slightly smarter real estate agent (Bob) is that Bob mapped the input-output information flow in more detail. In other words, he changed the framework in which he thought he could best represent the underlying architecture.

Lesson 1: The ‘Framework of thinking’ is everything

So the difference between Alex's and Bob's thought processes was that Bob could figure out that secondary concepts are easy to calculate, and he combined them to represent the final desired output, whereas Alex tried to apply an entire 'if-else' logic to each of the input variables and map it to each of the output variables. Bob represented the same mapping in a more systematic way by breaking it into smaller concepts, and so he had to remember fewer concepts. Meanwhile, Alex had to remember how every input is connected to every output without breaking it into smaller concepts. So the big lesson here is that the 'framework of thinking' is everything.
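As a rough illustration (the house features, score names, and numbers below are invented for this example), Bob's approach composes small intermediate mappings instead of one giant rule table; this decomposition into intermediate concepts is exactly what layered networks do:

```python
# Alex: one big if-else rule from every input combination straight to a price.
def alex_price(kitchen_size, floors, location):
    if location == "downtown" and floors > 2:
        return 900_000
    if location == "downtown":
        return 600_000
    # ... one branch for every combination of inputs
    return 300_000

# Bob: map the inputs to a few intermediate concepts, then combine those.
def bob_price(kitchen_size, floors, location):
    comfort = kitchen_size / 20.0                          # intermediate score
    neighbourhood = 1.0 if location == "downtown" else 0.4 # intermediate score
    simplicity = 1.0 / (1 + floors)                        # intermediate score
    return 200_000 + 500_000 * (0.3 * comfort + 0.5 * neighbourhood + 0.2 * simplicity)

print(alex_price(18, 2, "downtown"), bob_price(18, 2, "downtown"))
```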

This is what most researchers have realized. Every researcher faces the same problem; take, for instance, the cat vs dog image.

Researchers have to convert information from one conceptual representation (pixels) to another (is-cat is True/False). They also have roughly the same computational resources (memory, complexity, etc.), so the only way to solve the problem is to introduce a framework of thinking that decodes the inputs with minimum resources and converts them from one form to another. You will already have heard about a lot of 'frameworks of thinking'. When people say Convolutional Networks, they simply mean a framework for representing a particular mapping function. Most statistical models that predict house prices are also just mapping functions. They all try to best approximate a universal mapping function from input to output.

Lesson 2: Universal Mapping function like Convolutional Neural Networks

Convolutional Neural Networks, or CNNs, are a family of functions that exploit certain properties of images, such as positional invariance. That means the network can re-use the same sub-mapping function from the bottom part of the image to the top part of the image. This greatly reduces the number of parameters needed to represent the universal mapping function, and it is why CNNs are cone shaped: we move from concepts that are space oriented (pixels) to concepts that are space independent (cat-or-not, has-face). That's it. It's that simple. Information is smartly converted from one form to another.
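A minimal sketch of this cone shape, assuming a PyTorch-style model and illustrative layer sizes (none of these numbers come from the article): spatial dimensions shrink while channel dimensions grow, and the same 3×3 kernels are re-used at every position in the image.

```python
import torch
import torch.nn as nn

# Pixels in -> a space-independent concept out ("cat-or-not").
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 16x16 -> 8x8
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 1),  # a single "is-cat" score
)

x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB image
print(cnn(x).shape)            # torch.Size([1, 1])
```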

Lesson 3: Convolutional Neural Networks and the Brain

Recent advancements in neuroscience essentially say the same thing about how we decode information in the visual cortex: we first decode lines, then objects such as boxes, circles, and curves, and then combine them into faces, headphones, and so on.

Conclusion

A lot of Machine Learning/Deep Learning/AI technologies have very simple conceptual frameworks. Their ability to solve gargantuan problems lies in the complexity that arises from many simple conceptual frameworks attached end-to-end. The result is so complex that we can't really predict whether these networks can solve any given kind of problem, yet we implement them on a day-to-day basis based on some sort of assumption. It's very similar to the human brain: we know its underlying structure and framework, which we discovered half a century ago, yet we've not been able to decipher its complexity, and we are still unsure when we'll reach such an understanding.