Deep Learning and Convolutional Neural Network

Mar 28, 2020

This post summarizes the deep learning classes from the Udacity Data Science Nanodegree program. GitHub repo: https://github.com/jl4730/DeepLearning

1 Introduction to neural networks

Perceptron algorithm:

The video above shows a simple 0/1 classification problem and how the algorithm gradually gets to the optimal classification line.
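To make the idea concrete, here is a minimal sketch of the perceptron update rule. The toy dataset, the learning rate of 0.1 and the function names are illustrative assumptions, not taken from the course material:

```python
# Hypothetical toy data: points (x1, x2) with a 0/1 label.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [0, 0, 0, 1]  # linearly separable (logical AND)

def predict(w, b, x):
    # Step function: 1 if the point lies on the positive side of the line.
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0

def train_perceptron(points, labels, lr=0.1, epochs=100):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):
            diff = y - predict(w, b, x)  # +1, 0 or -1
            if diff != 0:
                # Move the line toward the misclassified point.
                w[0] += lr * diff * x[0]
                w[1] += lr * diff * x[1]
                b += lr * diff
                errors += 1
        if errors == 0:  # every point is classified correctly
            break
    return w, b

w, b = train_perceptron(points, labels)
predictions = [predict(w, b, x) for x in points]
```

Each misclassified point nudges the line a little, which is the gradual movement toward a separating line described above.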

But most of the time we need to deal with more complicated problems with multiple classes and complicated classification boundaries (not a line, but a curve or something higher dimensional).

One-hot encoding is used to label the multiple classes before classification. The softmax function then turns the scores the model produces into a probability for each category.
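A small sketch of both ideas, assuming three made-up class names and made-up raw scores:

```python
import math

def one_hot(label, classes):
    # Put a 1 in the position of the true class and 0 everywhere else.
    return [1 if c == label else 0 for c in classes]

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["duck", "beaver", "walrus"]  # hypothetical class names
target = one_hot("beaver", classes)
probs = softmax([2.0, 1.0, 0.0])        # hypothetical raw model scores
```

The probabilities sum to 1, and the largest score gets the largest probability.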

Then the error function comes into play: it tells us how far we are from the “solution” and which direction to move to get there. For gradient descent to work, the error function should be continuous (and differentiable), so that small changes in the weights produce small changes in the error. To achieve that, the sigmoid function is introduced to turn the model’s scores into probabilities instead of discrete 0/1 predictions.
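The difference between a discrete prediction and a continuous one can be sketched as follows (the scores are made up for illustration):

```python
import math

def step(x):
    # Discrete prediction: jumps from 0 to 1, so it gives no slope to follow.
    return 1 if x >= 0 else 0

def sigmoid(x):
    # Continuous prediction: a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

scores = [-2.0, 0.0, 2.0]  # hypothetical linear scores Wx + b
hard = [step(s) for s in scores]
soft = [sigmoid(s) for s in scores]
```

A score of 0 sits exactly on the boundary, so the sigmoid returns a probability of 0.5 there.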

After introducing the basic concepts, the core question is the error function. The idea comes from maximum likelihood: the best model is the one that makes the events that actually happened most likely. If I win the lottery, the model that said I would win is the best one. Otherwise, the model that said I would not win is the best.

Bringing this maximum likelihood concept to the classification problem gives us cross-entropy, the negative log of the probability the model assigns to the actual outcomes. Good models have low cross-entropy and bad models have high cross-entropy.
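As a rough illustration, the sketch below compares two hypothetical models on the same three outcomes; the probabilities are invented for the example:

```python
import math

def cross_entropy(labels, probs):
    # Negative log-likelihood of the observed 0/1 labels under the model.
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )

labels = [1, 1, 0]            # what actually happened
good_model = [0.9, 0.8, 0.1]  # hypothetical predicted P(label = 1)
bad_model = [0.3, 0.4, 0.7]

good_ce = cross_entropy(labels, good_model)
bad_ce = cross_entropy(labels, bad_model)
```

The model that assigns high probability to what actually happened ends up with the lower cross-entropy.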

For multi-class classification, the cross-entropy will become:
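Written out in its standard form, with notation assumed here rather than taken from the course slides ($m$ samples, $n$ classes, $y_{ij} = 1$ if sample $i$ belongs to class $j$ and 0 otherwise, $p_{ij}$ the predicted probability of class $j$ for sample $i$):

```latex
\text{Cross-entropy} = -\sum_{i=1}^{m} \sum_{j=1}^{n} y_{ij} \ln(p_{ij})
```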

Combining everything above and applying it to the simple two-class classification problem, we get the error function:
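A common way to write it, assuming $m$ training points $x_i$ with labels $y_i \in \{0, 1\}$, weights $W$, bias $b$ and the sigmoid function $\sigma$ (the notation is an assumption; averaging over $m$ is a convention that keeps the error comparable across dataset sizes):

```latex
E(W, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right],
\qquad \hat{y}_i = \sigma(W x_i + b)
```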

The above shows the full process of getting a linear classification model. What about non-linear boundaries? That’s where neural networks show their full potential: by combining many “perceptrons” we can build a powerful model:

How do we train a neural network? The high-level answer is feedforward and backpropagation. Feedforward is the process neural networks use to turn the input into an output. In a nutshell, backpropagation consists of:

Doing a feedforward operation.
Comparing the output of the model with the desired output.
Calculating the error.
Running the feedforward operation backward (backpropagation) to spread the error to each of the weights.
Using this to update the weights and get a better model.
Continuing this until we have a model that is good.
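The steps above can be sketched on a tiny network. Everything specific here is an assumption made for illustration: the XOR data, three hidden units, sigmoid activations, a cross-entropy error, a learning rate of 0.5 and 5000 passes over the data.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR: a boundary that no single straight line can draw.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0.0, 1.0, 1.0, 0.0]

random.seed(0)
H = 3  # number of hidden units (assumed)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def feedforward(x):
    hidden = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    output = sigmoid(sum(W2[j] * hidden[j] for j in range(H)) + b2)
    return hidden, output

def mean_error():
    # Cross-entropy averaged over the four training points.
    total = 0.0
    for x, y in zip(X, Y):
        _, out = feedforward(x)
        total -= y * math.log(out) + (1 - y) * math.log(1 - out)
    return total / len(X)

initial_error = mean_error()
lr = 0.5
for _ in range(5000):
    for x, y in zip(X, Y):
        hidden, out = feedforward(x)  # feedforward
        # Compare with the desired output. For a sigmoid output with
        # cross-entropy, this difference is the error signal at the output.
        delta_out = out - y
        # Spread the error backward to each hidden unit.
        delta_hidden = [delta_out * W2[j] * hidden[j] * (1 - hidden[j]) for j in range(H)]
        # Update every weight against its share of the error.
        for j in range(H):
            W2[j] -= lr * delta_out * hidden[j]
            W1[j][0] -= lr * delta_hidden[j] * x[0]
            W1[j][1] -= lr * delta_hidden[j] * x[1]
            b1[j] -= lr * delta_hidden[j]
        b2 -= lr * delta_out
final_error = mean_error()
```

Comparing `final_error` with `initial_error` shows whether the repeated updates actually produced a better model.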

The video below will show us a conceptual interpretation of what backpropagation is:

2 Training neural networks

There are two types of mistakes we can make: 1) trying to kill Godzilla with a flyswatter; 2) trying to kill a fly with a bazooka. In machine learning, the former is called underfitting and the latter overfitting. The model complexity graph can help us find the Goldilocks point between the two.

An issue with our previous error function is that it does not penalize large coefficients. Models with large coefficients tend to overfit, as “they are so confident in themselves and hard to adjust”. We can use L1 or L2 regularization to deal with this issue. L1 generally gives sparse results and is hence good for feature selection. L2 tends to keep all the parameters homogeneously small and tends to perform better.
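A minimal sketch of the two penalty terms, assuming a regularization strength `lam` of 0.1, a made-up base error of 0.5 and two made-up coefficient vectors:

```python
def l1_penalty(weights, lam):
    # lambda times the sum of absolute values of the weights
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # lambda times the sum of squared weights
    return lam * sum(w * w for w in weights)

small = [1.0, 1.0]    # hypothetical coefficients
large = [10.0, 10.0]  # same direction, but much larger coefficients
lam = 0.1             # regularization strength (assumed)

base_error = 0.5      # hypothetical error before the penalty is added
errors_l1 = [base_error + l1_penalty(w, lam) for w in (small, large)]
errors_l2 = [base_error + l2_penalty(w, lam) for w in (small, large)]
```

With the penalty added, the model with large coefficients has a clearly higher error, and L2 punishes it much harder than L1 does.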

Another way to improve the performance of the model is dropout. By randomly dropping some nodes during training, the result is more robust, as each node gets a chance to shape the model.
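One common way to implement this is sketched below. The rescaling of the surviving activations ("inverted" dropout) is an assumption added here, as are the drop probability and the activation values:

```python
import random

def dropout(activations, drop_prob, rng):
    # Zero each activation with probability drop_prob and scale the survivors
    # by 1 / (1 - drop_prob) so the expected value stays the same
    # ("inverted" dropout, an assumption in this sketch).
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)
activations = [0.5, 1.2, 0.8, 2.0, 0.3]  # hypothetical hidden-layer outputs
dropped = dropout(activations, drop_prob=0.2, rng=rng)
```

Dropout is applied only during training; at prediction time all nodes are used.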

Random restarts are often used to avoid getting stuck in local minima. We also use different activation functions, for instance ReLU, to help with the vanishing gradient problem.
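A rough sketch of why the activation function matters, assuming a chain of ten layers that all see the same input value of 2 (a deliberately simplified setup):

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # never larger than 0.25

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Backpropagation multiplies one derivative per layer.
layers = 10
sigmoid_chain = sigmoid_grad(2.0) ** layers
relu_chain = relu_grad(2.0) ** layers
```

The product of sigmoid derivatives shrinks toward zero as layers are added, while the ReLU derivative stays at 1 for positive inputs.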

3 Convolutional neural network

CNNs are widely applied in many different fields. For instance, Google used CNNs to build the WaveNet model (https://deepmind.com/blog/article/wavenet-generative-model-raw-audio), which can read any text aloud. Given enough samples of your own voice, the model can read the text in a voice that sounds just like you.

To understand CNNs, we need the concept of the MLP. The key building block of an MLP (multilayer perceptron) is the perceptron, which mimics the function of a neuron in the human body:

To classify an image, an MLP first converts the 2-dimensional picture into a vector. Then, by connecting multiple layers of fully connected perceptrons, we get an MLP as illustrated below:
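The conversion to a vector can be sketched in a couple of lines, assuming a made-up 3x3 grayscale image:

```python
# Hypothetical 3x3 grayscale image, pixel intensities in [0, 1].
image = [
    [0.0, 0.5, 1.0],
    [0.2, 0.7, 0.9],
    [0.1, 0.4, 0.6],
]

# Row by row, the 2D grid becomes one long input vector.
vector = [pixel for row in image for pixel in row]
```

Note that pixels 0.5 and 0.7 are vertical neighbours in the image but end up three positions apart in the vector.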

The two biggest issues with MLPs are: 1) they only use fully connected layers; 2) they only accept vectors as input. Hence the number of parameters in an MLP is huge, and the key structural information in the image is lost.

Instead of working on a flattened vector, a CNN is built to elucidate the patterns in multidimensional data. Unlike MLPs, CNNs take advantage of the fact that image pixels close to each other are more heavily related than pixels that are far apart. CNNs and MLPs are similar in that both are composed of a stack of layers. But CNNs introduce two different types of hidden layers: convolutional layers and pooling layers.

A CNN connects its layers only locally, which reduces the number of parameters significantly and makes it less prone to overfitting.
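A back-of-the-envelope comparison shows the size of the difference. The layer sizes below (a 28x28 grayscale image, 512 hidden units, 16 filters of 3x3) are assumptions chosen only for illustration:

```python
def dense_params(n_inputs, n_units):
    # Every input connects to every unit, plus one bias per unit.
    return n_inputs * n_units + n_units

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    # Each filter shares one small set of weights across the whole image.
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

mlp_first_layer = dense_params(28 * 28, 512)
cnn_first_layer = conv_params(3, 3, 1, 16)
```

Under these assumptions the fully connected layer needs 401,920 parameters, while the convolutional layer needs 160.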

We can add more patterns to the model where each is still confined to analyzing a single small region within the image.

The convolutional layers are calculated in a very straightforward way. After defining the stride and padding (https://iamaaditya.github.io/2016/03/one-by-one-convolution/), the output of the convolutional layer can be computed through matrix multiplication. We often add a ReLU function to help with the vanishing gradients problem. We usually use many different filters to capture different patterns in the image. We can generally visualize the filters to understand which patterns the CNN is trying to capture.
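A minimal sketch of a single filter sliding over an image, with stride, zero padding and a ReLU. The image, the filter values and the function name `conv2d` are illustrative assumptions; real libraries implement this far more efficiently:

```python
def conv2d(image, kernel, stride=1, padding=0):
    # Surround the image with `padding` rings of zeros.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    padded += [[0.0] * width for _ in range(padding)]

    k = len(kernel)  # square kernel assumed
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            total = sum(
                kernel[a][b] * padded[i * stride + a][j * stride + b]
                for a in range(k) for b in range(k)
            )
            row.append(max(0.0, total))  # ReLU
        output.append(row)
    return output

# Hypothetical 4x4 image with a vertical edge, and a vertical-edge filter.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d(image, kernel, stride=1, padding=0)
same_size = conv2d(image, kernel, stride=1, padding=1)
```

Without padding the 4x4 image shrinks to a 2x2 feature map; with one ring of zero padding the output keeps the 4x4 size.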

The pooling layers are used to reduce the dimensionality of the convolutional layers. The reason is that a large number of parameters in the convolutional layers can lead to overfitting. There are many types of pooling strategies, and the graph below illustrates the idea of the max-pooling layer.
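Max pooling can be sketched as follows, assuming a 2x2 window with stride 2 and a made-up 4x4 feature map:

```python
def max_pool(feature_map, size=2, stride=2):
    # Keep only the largest value in each size x size window.
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [  # hypothetical 4x4 feature map
    [1, 9, 6, 4],
    [5, 4, 7, 8],
    [5, 1, 2, 9],
    [6, 7, 6, 0],
]
pooled = max_pool(feature_map)
```

Here `pooled` is `[[9, 8], [7, 9]]`: the width and height are halved while the strongest response in each window is kept.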

A CNN discovers the spatial patterns contained in the image through multiple layers. The combination of convolutional layers and max-pooling layers accomplishes the goal of attaining an array that is quite deep with very small spatial dimensions.

Because CNNs are often hard to train, we can use transfer learning to take advantage of groundbreaking CNN architectures like VGG and ResNet that were trained on the best GPUs on the planet. This reduces training time and often achieves good results.

4 Groundbreaking CNN architectures

Check out the AlexNet paper!
Read more about VGGNet here.
The ResNet paper can be found here.
Here’s the Keras documentation for accessing some famous CNN architectures.
Read this detailed treatment of the vanishing gradients problem.
Here’s a GitHub repository containing benchmarks for different CNN architectures.
Visit the ImageNet Large Scale Visual Recognition Competition (ILSVRC) website.

Written by Jingying Liu