Convolutional Neural Networks (CNNs) are one important subset of deep learning models that are particularly useful for processing image input data. Recall that more broadly, in a machine learning task, we are given some type of input data and wish to use it to build a machine learning model that can predict some output variable of interest. In many cases, the input data of interest is an image.
Images as Input Data
Images are typically represented as 3-dimensional arrays of pixel values (Height x Width x Channel). For a standard RGB image, this corresponds to an array of shape (H x W x 3), where red, green, and blue make up the 3 color channels. For a satellite image from the Sentinel-2 mission, this would correspond to an array of shape (H x W x 13), where the 13 channels correspond to the 13 multispectral bands.
Whereas with a standard vanilla (non-convolutional) neural network, these 3D image arrays must be flattened in order to be used as input into the model, a CNN is able to process this 3D input directly. In this way, CNNs are able to better preserve important spatial relations between pixels in an image than standard neural networks are.
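To make the shapes above concrete, here is a minimal NumPy sketch (the image sizes are hypothetical, chosen only for illustration) showing an RGB array, a 13-band satellite array, and the flattening a non-convolutional network would require:

```python
import numpy as np

# A hypothetical 64x64 RGB image: height x width x channels.
rgb_image = np.random.rand(64, 64, 3)
print(rgb_image.shape)        # (64, 64, 3)

# A hypothetical 64x64 Sentinel-2 tile with 13 spectral bands.
satellite_image = np.random.rand(64, 64, 13)
print(satellite_image.shape)  # (64, 64, 13)

# A standard (non-convolutional) network needs a flat vector,
# which discards the spatial layout of the pixels.
flat = rgb_image.reshape(-1)
print(flat.shape)             # (12288,) = 64 * 64 * 3
```

A CNN would consume the (64, 64, 3) volume directly, with no flattening step until the very end of the network.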
CNN Model Basics
CNNs, like other deep learning models, come in various shapes and sizes. But in general, most CNN models are composed of some sequence of the following types of layers:
- Convolutional layer — A convolutional layer uses a set of filters that perform convolution operations on the input volume. These convolution operations, which involve “sliding” a filter over the input region and, at each position, performing an elementwise multiplication and summation, produce output feature maps. At a high level, the network learns filters that activate when they find some important visual feature, such as an edge or a specific shape. Shown on the right is a 2-D example of a convolution operation: a 3×3 filter convolved with a 4×4 image produces a 2×2 feature map.
- Pooling layer — A pooling layer often follows a convolutional layer, and performs a downsampling operation to reduce the spatial size of the representation and thereby the complexity of the network. Pooling layers come in two main flavors, “max” pooling and “average” pooling. In a “max pooling” operation, the maximum value over an area is taken as the value to be used in the next layer, whereas in an “average pooling” operation, the average value over that area is used instead.
- Fully-connected layer — These layers process flattened input, and are the same type of layers that make up standard vanilla neural networks. These layers, if present, will generally be at the end of a CNN architecture, with the last one connecting to the final output variable.
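The convolution and max-pooling operations described above can be sketched in a few lines of NumPy. This is a simplified illustration (valid padding, stride 1, single channel, hypothetical helper names), not a production implementation; it reproduces the 3×3-filter-on-4×4-image example, which yields a 2×2 feature map:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D convolution, stride 1: slide the kernel over the
    image and, at each position, multiply elementwise and sum."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in
    each size x size window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = feature_map[i:i+size, j:j+size].max()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
fmap = convolve2d(image, kernel)
print(fmap.shape)    # (2, 2): a 3x3 filter over a 4x4 image
pooled = max_pool2d(fmap)
print(pooled.shape)  # (1, 1)
```

Note that, like most deep learning libraries, this sketch slides the filter without flipping it, which is technically cross-correlation; the distinction does not matter in practice because the filter weights are learned.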
A CNN is simply a sequence of these layers, each of which transforms a given input volume into an output volume through an operation such as a convolution or pooling.
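As a rough illustration of how such a sequence transforms volumes, here is a toy NumPy forward pass with untrained random weights (all layer names, sizes, and filter counts are hypothetical; a real CNN would learn its weights by training):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(volume, n_filters=4, k=3):
    """Convolve a (H, W, C) volume with n_filters random k x k x C
    filters (valid padding, stride 1), then apply a ReLU."""
    h, w, c = volume.shape
    filters = rng.standard_normal((n_filters, k, k, c))
    out = np.zeros((h - k + 1, w - k + 1, n_filters))
    for f in range(n_filters):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[i, j, f] = np.sum(volume[i:i+k, j:j+k, :] * filters[f])
    return np.maximum(out, 0)  # ReLU activation

def pool_layer(volume, size=2):
    """2x2 max pooling applied independently to each channel."""
    h, w, c = volume.shape
    out = np.zeros((h // size, w // size, c))
    for i in range(h // size):
        for j in range(w // size):
            out[i, j] = volume[i*size:(i+1)*size,
                               j*size:(j+1)*size].max(axis=(0, 1))
    return out

def dense_layer(volume, n_out=10):
    """Flatten the volume, then apply a fully-connected layer."""
    flat = volume.reshape(-1)
    weights = rng.standard_normal((n_out, flat.size))
    return weights @ flat

x = rng.random((16, 16, 3))   # toy 16x16 RGB input
x = conv_layer(x)             # -> (14, 14, 4)
x = pool_layer(x)             # -> (7, 7, 4)
scores = dense_layer(x)       # -> (10,) class scores
print(scores.shape)
```

Each layer shrinks the spatial dimensions while (in the convolutional step) expanding the channel dimension, until the fully-connected layer maps the flattened volume to the output.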
CNNs are able to learn and extract features from images implicitly. Early on in the network, the model is able to learn and pick up low-level features, such as edges and blocks of color. In later layers of the network, the model can abstract higher-level features: for a model trained on satellite imagery, these may be things such as buildings, or specific types of cropland. This gives CNNs a notable advantage over more traditional machine learning models, as little preprocessing or explicit feature construction is needed when using CNNs.
CNN techniques can solve much harder tasks involving imagery and video, allowing problem solvers to tackle more complex things like autonomous driving, or (what I’d argue is even cooler!) automatic analysis of satellite imagery. Now that you are equipped with the tools, go out there and find a problem you are passionate about. You never know what kind of exciting, impactful discoveries you might uncover!