How does convolution work?

CNNs have one more arrow in their quiver. Fully connected layers take the high-level filtered images and translate them into votes. In our case, we only have to decide between two categories, X and O. Fully connected layers are the primary building block of traditional neural networks. Instead of treating the input as a two-dimensional array, a fully connected layer treats it as a single list, and every value in that list is treated identically.

Every value gets its own vote on whether the current image is an X or an O. Some values are much better than others at knowing when the image is an X, and some are particularly good at knowing when the image is an O. These get larger votes than the others. These votes are expressed as weights, or connection strengths, between each value and each category. When a new image is presented to the CNN, it percolates through the lower layers until it reaches the fully connected layer at the end.

Then an election is held. The answer with the most votes wins and is declared the category of the input. Fully connected layers, like the rest, can be stacked, because their outputs (a list of votes) look a whole lot like their inputs (a list of values). In effect, each additional layer lets the network learn ever more sophisticated combinations of features that help it make better decisions.
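
As a rough sketch of this voting picture (the shapes, numbers, and function name below are purely illustrative), a fully connected layer boils down to a weighted sum of the flattened input values, one sum per category:

```python
import numpy as np

def fully_connected(values, weights):
    """Tally the votes: each input value backs each category in proportion
    to its connection weight.

    values:  1D array, one entry per high-level feature value
    weights: 2D array of shape (number of values, number of categories)
    returns: 1D array of vote totals, one per category
    """
    return values @ weights

# Purely illustrative numbers: four filtered values voting between X and O.
values = np.array([0.9, 0.2, 0.4, 0.8])
weights = np.array([[ 1.0, -0.2],   # strongly favours X
                    [-0.3,  0.6],   # mildly favours O
                    [ 0.1,  0.1],   # not very informative either way
                    [ 0.8, -0.5]])  # strongly favours X
votes = fully_connected(values, weights)
winner = ["X", "O"][int(np.argmax(votes))]   # the category with the most votes wins
```

Stacking another fully connected layer simply means feeding the vote totals back in as the new list of values.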

Our story is filling in nicely, but it still has a huge hole: where do features come from? If these all had to be chosen by hand, CNNs would be a good deal less popular than they are. Luckily, a bit of machine learning magic called backpropagation does this work for us. To make use of backpropagation, we need a collection of images that we already know the answer for.

This means that some patient soul flipped through thousands of images and assigned them a label of X or O. We use these with an untrained CNN, which means that every pixel of every feature and every weight in every fully connected layer is set to a random value. Then we start feeding images through it, one after another. Each image the CNN processes results in a vote. The amount of wrongness in the vote, the error, tells us how good our features and weights are.

The features and weights can then be adjusted to make the error less. Each value is adjusted a little higher and a little lower, and the new error computed each time. Whichever adjustment makes the error less is kept. After doing this for every feature pixel in every convolutional layer and every weight in every fully connected layer, the new weights give an answer that works slightly better for that image.
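The passage above describes a nudge-and-check procedure; real backpropagation arrives at the same kind of adjustment analytically, using gradients, but a toy sketch of the idea exactly as stated (the function names, step size, and error function here are my own) could look like this:

```python
import numpy as np

def nudge_weights(weights, error_fn, step=0.01):
    """Try moving each weight a little higher and a little lower, keeping
    whichever change reduces the error for the current image. Real
    backpropagation computes these adjustments far more efficiently."""
    for idx in np.ndindex(weights.shape):
        best_error = error_fn(weights)
        for delta in (+step, -step):
            weights[idx] += delta
            if error_fn(weights) < best_error:
                best_error = error_fn(weights)   # keep the improvement
            else:
                weights[idx] -= delta            # undo the change
    return weights

# Toy usage: drive the weights toward a made-up target by repeated nudging.
target = np.array([[0.5, -0.5], [0.2, 0.3]])
w = np.zeros_like(target)
for _ in range(100):
    w = nudge_weights(w, lambda w: float(np.sum((w - target) ** 2)))
```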

This is then repeated with each subsequent image in the set of labeled images. Quirks that occur in a single image are quickly forgotten, but patterns that occur in lots of images get baked into the features and connection weights.

If you have enough labeled images, these values stabilize to a set that works pretty well across a wide variety of cases. As is probably apparent, backpropagation is another expensive computing step, and another motivator for specialized computing hardware. Unfortunately, not every aspect of CNNs can be learned in so straightforward a manner. There is still a long list of decisions that a CNN designer must make. In addition to these, there are also higher-level architectural decisions to make: how many of each layer to include?

In what order? Some deep neural networks can have over a thousand layers, which opens up a lot of possibilities. With so many combinations and permutations, only a small fraction of the possible CNN configurations have been tested.

CNN designs tend to be driven by accumulated community knowledge, with occasional deviations showing surprising jumps in performance. CNNs are not limited to images, though. The trick, whatever data type you start with, is to transform it to make it look like an image. For instance, audio signals can be chopped into short time chunks, and then each chunk broken up into bass, midrange, treble, or finer frequency bands.

This can be represented as a two-dimensional array where each column is a time chunk and each row is a frequency band. CNNs work well on this. Note that when you apply a convolution to an image, you decrease the image size while bringing all the information in the filter's field of view together into a single pixel.
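
Returning to the audio example, here is a rough sketch of turning a one-dimensional signal into such a two-dimensional time-frequency array (the chunk size, the use of the FFT, and the fake signal are illustrative choices, not the only way to do it):

```python
import numpy as np

def audio_to_image(signal, chunk_size=256):
    """Chop a 1D audio signal into short time chunks and break each chunk
    into frequency bands, giving a 2D array a CNN can treat like an image.
    Columns are time chunks, rows are frequency bands."""
    n_chunks = len(signal) // chunk_size
    columns = []
    for i in range(n_chunks):
        chunk = signal[i * chunk_size:(i + 1) * chunk_size]
        columns.append(np.abs(np.fft.rfft(chunk)))   # energy per frequency band
    return np.stack(columns, axis=1)                 # shape: (bands, time chunks)

# One second of a fake 440 Hz tone sampled at 8 kHz, just to have some input.
signal = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
image_like = audio_to_image(signal)                  # 2D array, ready for a CNN
```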

The final output of the network's convolutional layers, once flattened, is a vector. Based on the type of problem we need to solve and on the kind of features we are looking to learn, we can use different kinds of convolutions.

The 2D Convolution Layer

The most common type of convolution is the 2D convolution layer, usually abbreviated as conv2D. The kernel slides over the 2D input data, performing an element-wise multiplication with the part of the input it is currently on, and then summing the results into a single output pixel. The kernel performs the same operation for every location it slides over, transforming a 2D matrix of features into a different 2D matrix of features.

The Dilated or Atrous Convolution

This operation expands the window size without increasing the number of weights, by inserting zero-values into the convolution kernel.
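
A minimal single-channel sketch of both operations just described: with dilation=1 it behaves as a plain conv2D, and with a larger dilation the kernel taps are spread apart, which is equivalent to inserting zeros into the kernel (padding, stride, and multiple channels are omitted for brevity, and, as in deep learning libraries, the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel, dilation=1):
    """Slide the kernel over the image; at each location multiply element-wise
    and sum into a single output pixel. dilation > 1 widens the window
    without adding any weights."""
    kh, kw = kernel.shape
    eff_h = (kh - 1) * dilation + 1      # effective window height
    eff_w = (kw - 1) * dilation + 1      # effective window width
    out = np.zeros((image.shape[0] - eff_h + 1, image.shape[1] - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = np.sum(window * kernel)   # element-wise multiply, then sum
    return out

image = np.random.rand(7, 7)
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging kernel
plain = conv2d(image, kernel)                     # 5x5 feature map
dilated = conv2d(image, kernel, dilation=2)       # 3x3 feature map, wider field of view
```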

Dilated or atrous convolutions can be used in real-time applications and in applications with limited processing power, since their memory requirements are less intensive.

Separable Convolutions

There are two main types of separable convolutions: spatial separable convolutions and depthwise separable convolutions.
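
As a small illustration of the first kind, a spatially separable convolution factors one 2D kernel into a column and a row that are applied one after the other, which cuts the number of multiplications per output pixel. The Sobel edge kernel is a standard example of a kernel that factors exactly:

```python
import numpy as np

# One 3x3 convolution costs 9 multiplications per output pixel; the factored
# 3x1 and 1x3 passes cost 3 + 3 = 6 for the same result.
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]])
column = np.array([[1], [2], [1]])
row = np.array([[-1, 0, 1]])
assert np.array_equal(sobel, column @ row)   # the factorization is exact
```

Depthwise separable convolutions split the work differently: a per-channel spatial convolution is followed by a 1x1 convolution that mixes the channels.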

Convolutional Neural Networks take their name from one of the most important operations in the network: the convolution. They are also inspired by the brain. Research in the 1950s and 1960s by D. H. Hubel and T. N. Wiesel on the brains of mammals suggested a new model for how mammals perceive the world visually. They showed that cat and monkey visual cortexes include neurons that respond exclusively to stimuli in a small region of the visual field.

In their paper, they described two basic types of visual neuron cells in the brain, each acting in a different way: simple cells (S cells) and complex cells (C cells). The simple cells activate, for example, when they identify basic shapes, such as lines, in a fixed area and at a specific angle.

The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field. The complex cells continue to respond to a certain stimulus, even though its absolute position on the retina changes.

Complex, in this case, refers to being more flexible. In vision, the receptive field of a single sensory neuron is the specific region of the retina in which something will affect the firing of that neuron, that is, will activate the neuron. Every sensory neuron cell has a similar receptive field, and their fields overlap.

Further, the concept of hierarchy plays a significant role in the brain. Information is stored in sequences of patterns, in sequential order. The neocortex , which is the outermost layer of the brain, stores information hierarchically. It is stored in cortical columns, or uniformly organised groupings of neurons in the neocortex.

In 1980, a researcher called Fukushima proposed a hierarchical neural network model. He called it the neocognitron. This model was inspired by the concepts of the Simple and Complex cells. The neocognitron was able to recognise patterns by learning about the shapes of objects.

Building on these ideas, Yann LeCun and his collaborators introduced LeNet-5 in 1998, a Convolutional Neural Network that was able to classify digits from hand-written numbers. For the entire history of Convolutional Neural Nets, you can go here.

In the remainder of this article, I will take you through the architecture of a CNN and show you the Python implementation as well.

Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers.

Every layer is made up of a set of neurons, where each layer is fully connected to all the neurons in the layer before. Finally, there is a last fully connected layer, the output layer, that represents the predictions. Convolutional Neural Networks are a bit different.

First of all, the layers are organised in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.

In this part, the network will perform a series of convolutions and pooling operations during which the features are detected.

If you had a picture of a zebra, this is the part where the network would recognise its stripes, two ears, and four legs. Here, the fully connected layers will serve as a classifier on top of these extracted features. They will assign a probability that the object in the image is what the algorithm predicts it is.
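
One common way to express this two-part structure in Python is with Keras; the sketch below is only illustrative (the input size, layer widths, and two-class output are assumptions, not a prescription):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),              # e.g. 64x64 grayscale images
    # Part 1: feature extraction -- convolutions and pooling detect the features.
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Part 2: classification -- fully connected layers assign a probability
    # to each possible category.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),        # two categories, as in the X/O example
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```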

Convolution is one of the main building blocks of a CNN. The term convolution refers to the mathematical combination of two functions to produce a third function. It merges two sets of information. In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to produce a feature map. We execute a convolution by sliding the filter over the input. At every location, an element-wise multiplication is performed and the result is summed onto the feature map.

In the animation below, you can see the convolution operation. You can see the filter (the green square) sliding over our input (the blue square), and the sum of the convolution goes into the feature map (the red square).
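
Since the animation cannot be reproduced here, a tiny numeric stand-in for a single filter position (the values are chosen arbitrarily):

```python
import numpy as np

patch = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])      # the 3x3 region of the input under the filter
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])     # the filter / kernel
value = np.sum(patch * kernel)     # 5 -- this single number goes into the feature map
```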

The area of our filter is also called the receptive field, named after the neuron cells!


