Supervised Learning

Jingying Liu
Feb 18, 2020 · 5 min read

Github Repo: https://github.com/jl4730/SupervisedLearning

1 Linear Regression

Code link: https://github.com/jl4730/SupervisedLearning/blob/master/batch_graddesc.py

Regularization

Add a penalty for the model's complexity to the error function to avoid overfitting.

L1 regularization penalizes the absolute values of the coefficients; L2 regularization penalizes their squares.
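
As a rough sketch, both penalties are available in scikit-learn (the alpha value and the x_train/y_train variables here are just placeholders):

>>> from sklearn.linear_model import Lasso, Ridge
>>> # L1 (absolute-value penalty): tends to drive some coefficients to exactly zero
>>> lasso = Lasso(alpha=0.1)
>>> lasso.fit(x_train, y_train)
>>> # L2 (squared penalty): shrinks all coefficients smoothly toward zero
>>> ridge = Ridge(alpha=0.1)
>>> ridge.fit(x_train, y_train)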

Feature Scaling
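
Scale the features so that no single feature dominates the error just because of its units. A minimal sketch, assuming scikit-learn's StandardScaler and placeholder variable names:

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> x_train_scaled = scaler.fit_transform(x_train)  # fit the scaler on the training data only
>>> x_test_scaled = scaler.transform(x_test)        # apply the same scaling to the test data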

2 Classification

2.1 Neural network building blocks: the perceptron

Perceptron algorithm:
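
A rough sketch of one perceptron update pass (not the exact code from the repo; X, y, W, b, and the learning rate are illustrative, with labels in {0, 1}):

>>> import numpy as np
>>> def perceptron_step(X, y, W, b, learn_rate=0.01):
...     # For every misclassified point, nudge the weights toward the correct side
...     for i in range(len(X)):
...         y_hat = 1 if np.dot(X[i], W) + b >= 0 else 0
...         if y[i] - y_hat == 1:        # positive point predicted as negative
...             W = W + learn_rate * X[i]
...             b = b + learn_rate
...         elif y[i] - y_hat == -1:     # negative point predicted as positive
...             W = W - learn_rate * X[i]
...             b = b - learn_rate
...     return W, b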

2.2 Entropy, which will be used for developing decision trees (error function)

The more rigid (homogeneous) a set is, the lower its entropy; the more knowledge we have about the outcome, the lower the entropy.

Information gain:
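
As a sketch (my own helper functions, not from the repo), entropy and the information gain of a binary split can be computed like this:

>>> import numpy as np
>>> def entropy(labels):
...     # Shannon entropy of a list of class labels
...     _, counts = np.unique(labels, return_counts=True)
...     p = counts / counts.sum()
...     return -np.sum(p * np.log2(p))
...
>>> def information_gain(parent, left, right):
...     # Entropy of the parent minus the weighted entropy of the two children
...     w = len(left) / len(parent)
...     return entropy(parent) - w * entropy(left) - (1 - w) * entropy(right)
...
>>> information_gain([0, 0, 1, 1], [0, 0], [1, 1])  # a perfect split gains a full bit: 1.0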

Hyperparameters for decision trees:

In order to create decision trees that will generalize to new problems well, we can tune a number of different aspects of the trees. We call the different aspects of a decision tree “hyperparameters”. These are some of the most important hyperparameters used in decision trees:

maximum depth of a decision tree
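
For instance, a hedged sketch of setting these hyperparameters in scikit-learn (the values are arbitrary; min_samples_leaf and min_samples_split are other common knobs):

>>> from sklearn.tree import DecisionTreeClassifier
>>> model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, min_samples_split=20)
>>> model.fit(x_train, y_train)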

2.3 Naive Bayes

The idea behind using Naive Bayes to classify spam and ham:
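
A minimal sketch of that idea with scikit-learn, assuming the messages are turned into word counts and fed to a multinomial Naive Bayes model (variable names are placeholders):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(messages_train)    # word counts for each message
>>> model = MultinomialNB()
>>> model.fit(counts, labels_train)                       # labels: spam or ham
>>> model.predict(vectorizer.transform(messages_test))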

2.4 SVM (support vector machine)

The error function for SVM is unique in that it contains two parts: the classification error plus the margin error. Small margins are penalized, since we want the margin to be as big as possible so that the model generalizes better.

The C parameter is attached to the classification error. Depending on the need, we can decide how large a C we want in our error function: a large C puts more emphasis on classifying every point correctly, while a small C puts more emphasis on a wide margin.

C is a hyperparameter, so we will need to use a technique such as grid search to find the best value. The other hyperparameter is the kernel used. The polynomial kernel lifts the points into a higher-dimensional space, where a surface can separate them better than a straight line. The RBF (radial basis function) kernel is another choice; its gamma parameter determines how wide or narrow we want the "mountains" to be.
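
As an illustration of these hyperparameters in scikit-learn (the values are arbitrary):

>>> from sklearn.svm import SVC
>>> # A large C punishes classification errors more; a large gamma makes the RBF
>>> # "mountains" narrower and the boundary more wiggly
>>> model = SVC(kernel='rbf', C=10, gamma=0.5)
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)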

2.5 Ensemble methods and random forest (bagging and boosting)

Take several models’ results and build a better model.

Bagging: combine each model’s result.

Boosting: similar to bagging, but the weak learners are combined in a more deliberate way, with each learner weighted by how well it performs.

A single decision tree tends to overfit, as if it were memorizing the data. We can build many different trees by randomly picking the columns for each one and then letting the trees vote: this is a random forest.
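
A minimal scikit-learn sketch of a random forest (the number of trees is arbitrary; max_features='sqrt' makes each split look at a random subset of the columns):

>>> from sklearn.ensemble import RandomForestClassifier
>>> model = RandomForestClassifier(n_estimators=100, max_features='sqrt')
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)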

AdaBoost is one of the most powerful boosting algorithms. It puts extra weight on the points the previous learners got wrong, so each subsequent learner keeps refining the ensemble's performance.

Building an AdaBoost model in sklearn is no different than building any other model. You can use scikit-learn’s AdaBoost Classifier class. This class provides the functions to define and fit the model to your data.

>>> from sklearn.ensemble import AdaBoostClassifier
>>> model = AdaBoostClassifier()
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)

In the example above, the model variable is an AdaBoost model that has been fitted to the data x_train and y_train. The functions fit and predict work exactly as before.

When we define the model, we can specify the hyperparameters. In practice, the most common ones are

  • base_estimator: The model utilized for the weak learners (Warning: Don't forget to import the model that you decide to use for the weak learner).
  • n_estimators: The maximum number of weak learners used.

For example, here we define a model that uses decision trees of max_depth 2 as the weak learners, and it allows a maximum of 4 of them.

>>> from sklearn.tree import DecisionTreeClassifier
>>> model = AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=2), n_estimators = 4)

2.6 Evaluation Metrics

Confusion matrix, precision, and recall:

F1 is used when you want to balance out precision and recall.
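
As a sketch, all of these metrics are available in scikit-learn (assuming y_test holds the true labels and y_pred the predictions):

>>> from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
>>> y_pred = model.predict(x_test)
>>> confusion_matrix(y_test, y_pred)
>>> precision_score(y_test, y_pred)  # of the points predicted positive, how many really are
>>> recall_score(y_test, y_pred)     # of the truly positive points, how many were found
>>> f1_score(y_test, y_pred)         # harmonic mean of precision and recall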

2.7 Training and Tuning

In the model complexity graph, the horizontal axis is the complexity of the model.

k-fold cross-validation: break the data into k folds and train the model k times, each time holding out a different fold for validation.
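
A hedged sketch with scikit-learn's cross_val_score, using k = 5 (model, X, and y are placeholders):

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(model, X, y, cv=5)  # train and score on 5 different folds
>>> scores.mean()                                # average score across the folds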

Using the learning curve to determine whether the model is underfitting or overfitting:

Models that underfit behave like a random guess: no matter how many points you put in the training set, you end up with a high error on both the training and cross-validation sets. On the other hand, models that overfit do a great job on the training set, but they are essentially memorizing the data points, so they do not generalize well out of sample and the cross-validation error stays high.
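
A rough sketch of computing a learning curve with scikit-learn (model, X, y, and the training sizes are placeholders):

>>> import numpy as np
>>> from sklearn.model_selection import learning_curve
>>> train_sizes, train_scores, test_scores = learning_curve(
...     model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
>>> train_scores.mean(axis=1)  # training score as the training set grows
>>> test_scores.mean(axis=1)   # cross-validation score as the training set grows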

Boundary curves for each of the models

If the model has more than one hyperparameter, we can use grid search to find the best set of parameters.
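
For example, a sketch of tuning an SVM's C and gamma with GridSearchCV (the grid values are arbitrary):

>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import SVC
>>> parameters = {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}
>>> grid = GridSearchCV(SVC(kernel='rbf'), parameters, cv=5)
>>> grid.fit(x_train, y_train)
>>> grid.best_params_  # the best combination found by the search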
